3 | | (setup soon based on the IPDPS submission) |
| 3 | A virtual machine (VM) running solo can be stopped and resumed easily, and costs much less to run than a physical machine. The widely used virtualization platforms (e.g., Xen, VMware, KVM) all provide similarly good resumability. Our previous works ([wiki:LLM LLM], [wiki:FGBI FGBI]) focus on achieving high availability (HA) for a solo VM. The situation changes, however, in a virtual distributed environment (VDE), which consists of multiple VMs running on different physical hosts and interconnected by a virtual network. In a VDE, the VMs are distributed as computing nodes across different hosts, so the failure of one VM can affect other related VMs and may cause them to fail as well. In VDEchp, we develop a lightweight, globally consistent checkpoint mechanism that checkpoints the entire VDE so that it can be restored immediately after a VM failure. |
| 4 | |
| 5 | == Problem When Checkpointing A VDE == |
| 6 | A straightforward way of recovering a VDE is to restart all of its VMs upon a failure. However, this can lead to significant data loss: before the failure happens, no correct state has been saved, either for the individual VMs or for the VDE as a whole. Thus, if the entire VDE crashes (due to the failure of one VM) just before a long-running task completes, all the VMs must be restarted from scratch, and the computation time and data are lost. Checkpointing each VM independently is not sufficient either. For example, assume that two virtual machines, VM1 and VM2, are running in a virtual network. Suppose VM2 sends some messages to VM1 and then fails. These messages may have been correctly received by VM1 and may have changed its state. Thus, when VM2 is rolled back to its latest correct state, VM1 must also be rolled back to a state before it received the messages from VM2. In other words, the VMs (and thus the entire VDE) must be checkpointed at globally consistent states. |
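The VM1/VM2 scenario can be made concrete with a small consistency check. The following is only an illustrative sketch, not the VDEchp mechanism itself: it assumes each checkpoint records per-peer message counters, and the names `Checkpoint`, `orphan_messages`, and `is_consistent` are hypothetical. A set of checkpoints (one per VM) is treated as globally consistent when no VM has recorded receiving more messages from a peer than that peer's checkpoint has recorded sending.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Checkpoint:
    """Hypothetical per-VM checkpoint carrying message counters (an assumption)."""
    vm: str
    sent: Dict[str, int] = field(default_factory=dict)      # peer -> messages sent to peer
    received: Dict[str, int] = field(default_factory=dict)  # peer -> messages received from peer


def orphan_messages(cut: Dict[str, Checkpoint]) -> List[Tuple[str, str, int]]:
    """Return (sender, receiver, count) for messages received but never sent
    according to the checkpoints in `cut`. Assumes every VM that appears as a
    sender also has a checkpoint in `cut`."""
    orphans = []
    for receiver, cp in cut.items():
        for sender, n_received in cp.received.items():
            n_sent = cut[sender].sent.get(receiver, 0)
            if n_received > n_sent:
                orphans.append((sender, receiver, n_received - n_sent))
    return orphans


def is_consistent(cut: Dict[str, Checkpoint]) -> bool:
    """A cut is globally consistent if it contains no orphan messages."""
    return not orphan_messages(cut)


# VM2 was checkpointed before it sent anything, then sent 2 messages and failed.
vm2_last = Checkpoint("VM2", sent={"VM1": 0})
# VM1's most recent state already reflects both messages from VM2.
vm1_latest = Checkpoint("VM1", received={"VM2": 2})
# An earlier VM1 checkpoint, taken before the messages arrived.
vm1_earlier = Checkpoint("VM1", received={"VM2": 0})

bad_cut = {"VM1": vm1_latest, "VM2": vm2_last}
good_cut = {"VM1": vm1_earlier, "VM2": vm2_last}

print(is_consistent(bad_cut))   # VM1 must also be rolled back
print(is_consistent(good_cut))  # both VMs restored to matching states
```

In this sketch, restoring VM2 alone leaves VM1 holding two messages that the restored VM2 never sent, which is exactly the inconsistency described above; pairing VM2's checkpoint with the earlier VM1 checkpoint removes it.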