Changes between Version 10 and Version 11 of VDEchp
- Timestamp:
- 10/04/11 01:05:30 (13 years ago)
Legend:
- Unmodified
- Added
- Removed
- Modified
-
VDEchp
v10 v11 1 1 = [wiki:VDEchp VDEchp] = 2 2 3 A virtual machine (VM) running solo can be stopped and resumed easily, and has much less running cost than physical machines. All of the successful VMs (e.g., Xen, VMware, KVM) have similarly good resume-ability. Our previous ([wiki:LLM LLM], [wiki:FGBI FGBI]) focus on achieving high availability (HA) by using solo VM. However, when considering the virtual distributed environment (VDE) which consists of multiple VMs running on different physical hosts, interconnected by a virtual network, things become different. In a VDE, multiple VMs are distributed as computing nodes across different hosts, so the failure of one VM can affect other related VMs and may cause them to also fail. In VDEchp we develop a lightweight, globally consistent checkpoint mechanism, which checkpoints the VDE for immediate restoration after VM’s failures. (For a full description and evaluation, please see our [wiki:Publications VDEchp Technical Report]).3 A virtual machine (VM) running solo can be stopped and resumed easily, and has much less running cost than physical machines. All of the successful VMs (e.g., Xen, VMware, KVM) have similarly good resume-ability. Our previous ([wiki:LLM LLM], [wiki:FGBI FGBI]) focus on achieving high availability (HA) by using solo VM. However, when considering the virtual distributed environment (VDE) which consists of multiple VMs running on different physical hosts, interconnected by a virtual network, things become different. In a VDE, multiple VMs are distributed as computing nodes across different hosts, so the failure of one VM can affect other related VMs and may cause them to also fail. In VDEchp we develop a lightweight, globally consistent checkpoint mechanism, which checkpoints the VDE for immediate restoration after VM’s failures. 4 4 5 5 == Problem When Checkpointing A VDE == … … 55 55 The VDE downtime is the time from when the failure was detected in the VDE until the entire VDE resumes from the last globally consistent checkpoint. We conducted experiments to measure the downtime. To induce failures in the VDE, we developed an application program that causes a segmentation failure after executing for a while. This program is launched on several VMs to generate a failure while the distributed application workload is running in the VDE. The protected VDE is then rolled back to the last globally consistent checkpoint. We use the NPB-EP program (MPI task in the VDE) and the Apache web server benchmark as the distributed workload on the protected VMs. 56 56 57 Figure 7 shows the results. From the figure, we observe that, in our 36-node (VM) environment, the measured VDE downtime under VDEchp ranges from 2.46 seconds to 4.77 seconds, with an average of 3.54 seconds. Another observation is that the VDE downtime in VDEchp slightly increases as the checkpoint interval grows. This is because, the VDE downtime depends on the number of memory pages restored during recovery. Thus, as the checkpoint interval grows, the checkpoint size also grows, so does the number of restored pages during recovery. 57 Figure 7 shows the results. From the figure, we observe that, in our 36-node (VM) environment, the measured VDE downtime under VDEchp ranges from 2.46 seconds to 4.77 seconds, with an average of 3.54 seconds. Another observation is that the VDE downtime in VDEchp slightly increases as the checkpoint interval grows. This is because, the VDE downtime depends on the number of memory pages restored during recovery. Thus, as the checkpoint interval grows, the checkpoint size also grows, so does the number of restored pages during recovery. 58 59 For a full description and evaluation, please see our [wiki:Publications VDEchp Technical Report].