Changes between Version 7 and Version 8 of VDEchp
- Timestamp: 10/04/11 00:58:18 (13 years ago)
VDEchp
the checkpointing service. It doesn't need to be deployed on a privileged guest system such as Domain 0 in Xen. When VDEchp starts to record the globally consistent checkpoint, the Initiator broadcasts the checkpoint request and waits for acknowledgements from all the recipients. Upon receiving a checkpoint request, each VM checks its latest recorded in-disk stable-copy (not the in-memory checkpoint), marks this stable-copy as part of the global checkpoint, and sends a "success" acknowledgement back to the Initiator. The algorithm terminates when the Initiator has received the acknowledgements from all the VMs. For example, if the Initiator sends a request (marked as rn) to checkpoint the entire VDE, a VM named VM1 in the VDE records a stable-copy named "vm1 global rn". The stable-copies from all the VMs together compose a globally consistent checkpoint for the entire VDE. Moreover, if the VDEchp Initiator sends the checkpoint request at a user-specified frequency, the correct state of the entire VDE is recorded periodically.
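The broadcast-and-acknowledge round described above can be summarized in a short sketch. The following is illustrative only, not VDEchp's actual implementation: the class names, the stable-copy tag format, and the use of in-process threads in place of real network messages are all assumptions.

{{{#!python
# Minimal sketch of the broadcast-and-acknowledge checkpoint round.
import threading


class VMAgent:
    """Per-VM side: on request rn, mark the latest in-disk stable-copy
    as part of global checkpoint rn and acknowledge."""
    def __init__(self, name):
        self.name = name
        self.latest_stable_copy = f"{name}_local"  # most recent in-disk copy

    def handle_request(self, round_id):
        # Tag the existing stable-copy; no new snapshot is taken here.
        self.latest_stable_copy = f"{self.name}_global_r{round_id}"
        return self.name, "success"                # acknowledgement


class CheckpointInitiator:
    """Runs on an unprivileged VM: broadcast the request, then wait for
    a "success" acknowledgement from every VM before the global
    checkpoint is considered complete."""
    def __init__(self, agents):
        self.agents = agents

    def global_checkpoint(self, round_id):
        acks, lock = {}, threading.Lock()

        def ask(agent):
            name, status = agent.handle_request(round_id)
            with lock:
                acks[name] = status

        threads = [threading.Thread(target=ask, args=(a,)) for a in self.agents]
        for t in threads:
            t.start()
        for t in threads:
            t.join()                               # wait for all acknowledgements
        assert all(s == "success" for s in acks.values())
        return acks


if __name__ == "__main__":
    vms = [VMAgent(f"vm{i}") for i in range(1, 4)]
    print(CheckpointInitiator(vms).global_checkpoint(8))
}}}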
== Evaluation Results ==
=== Downtime Evaluation for Solo VM ===
Table I shows the downtime results under the different mechanisms. We compare VDEchp with Remus and the VNsnap-memory daemon under the same checkpoint interval. We measure the downtime of all three mechanisms, with the same VM (512MB of RAM), for three cases: a) when the VM is idle, b) when the VM runs the NPB-EP benchmark program [5], and c) when the VM runs the Apache web server workload [2].

Several observations are in order regarding the downtime measurements.

First, the downtime of all three mechanisms is short and very similar for the idle case. This is not surprising: memory updates are rare during an idle run, so there is little dirty memory left to copy.

Second, the downtime of both VDEchp and Remus remains almost the same when running NPB-EP and Apache. This is because the downtime depends on the amount of memory remaining to be copied when the guest VM is suspended; since both VDEchp and Remus checkpoint at high frequency, the number of dirty pages in the last round is almost the same for the two workloads.

Third, when running the NPB-EP program, VDEchp has lower downtime than the VNsnap-memory daemon (a reduction of more than 20%). NPB-EP is a computationally intensive workload, so the guest VM's memory is updated at a high rate. Because the VNsnap-memory daemon transfers memory at a lower frequency than high-frequency checkpointing solutions, it accumulates more dirty data and takes longer to save it when the checkpoint is taken.

Finally, the Apache workload updates memory less intensively than NPB-EP, but clearly more than the idle run. The results show that VDEchp again has lower downtime than the VNsnap-memory daemon (roughly a 16% reduction).

=== VDE Downtime ===
The VDE downtime is the time from when a failure is detected in the VDE until the entire VDE resumes from the last globally consistent checkpoint. We conducted experiments to measure this downtime. To induce failures in the VDE, we developed an application program that causes a segmentation fault after executing for a while. This program is launched on several VMs to generate a failure while the distributed application workload is running in the VDE. The protected VDE is then rolled back to the last globally consistent checkpoint. We use the NPB-EP program (an MPI task in the VDE) and the Apache web server benchmark as the distributed workloads on the protected VMs.

Figure 7 shows the results. From the figure, we observe that, in our 36-node (VM) environment, the measured VDE downtime under VDEchp ranges from 2.46 seconds to 4.77 seconds, with an average of 3.54 seconds. Another observation is that the VDE downtime under VDEchp increases slightly as the checkpoint interval grows. This is because the VDE downtime depends on the number of memory pages restored during recovery: as the checkpoint interval grows, the checkpoint size also grows, and so does the number of pages restored during recovery.
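For reference, the failure-injection program mentioned in this subsection can be as simple as the sketch below. The fixed 60-second delay, the busy-loop placeholder workload, and the use of SIGSEGV to trigger the crash are assumptions for illustration only, not the program actually used in the experiments.

{{{#!python
# Illustrative failure injector: run a placeholder workload for a while,
# then crash with a segmentation fault so the failure detector triggers
# a rollback to the last globally consistent checkpoint.
import os
import signal
import time

RUN_SECONDS = 60              # assumed delay before the injected fault


def main():
    deadline = time.time() + RUN_SECONDS
    counter = 0
    while time.time() < deadline:
        counter += 1          # stand-in for the real distributed workload
    os.kill(os.getpid(), signal.SIGSEGV)   # deliver SIGSEGV to simulate a segfault


if __name__ == "__main__":
    main()
}}}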