3 | | A virtual machine (VM) running solo can be stopped and resumed easily, and has much less running cost than physical machines. All of the successful VMs (e.g., Xen, VMware, KVM) have similarly good resume-ability. Our previous ([wiki:LLM LLM], [wiki:FGBI FGBI]) focus on achieving high availability (HA) by using solo VM. However, when considering the virtual distributed environment (VDE) which consists of multiple VMs running on different physical hosts, interconnected by a virtual network, things become different. In a VDE, multiple VMs are distributed as computing nodes across different hosts, so the failure of one VM can affect other related VMs and may cause them to also fail. In VDEchp we develop a lightweight, globally consistent checkpoint mechanism, which checkpoints the VDE for immediate restoration after VM’s failures. |
| 3 | A virtual machine (VM) running solo can be stopped and resumed easily, and costs much less to run than a physical machine. All of the widely used virtualization platforms (e.g., Xen, VMware, KVM) offer similarly good resumability. Our previous works ([wiki:LLM LLM], [wiki:FGBI FGBI]) focus on achieving high availability (HA) for a solo VM. However, things become different in a virtual distributed environment (VDE), which consists of multiple VMs running on different physical hosts and interconnected by a virtual network. In a VDE, the VMs are distributed as computing nodes across different hosts, so the failure of one VM can affect other related VMs and may cause them to fail as well. In VDEchp, we develop a lightweight, globally consistent checkpoint mechanism, which checkpoints the VDE so that it can be restored immediately after a VM failure. (For a full description and evaluation, please see our [wiki:Publications VDEchp Technical Report].) |
4 | 4 | |
5 | 5 | == Problem When Checkpointing A VDE == |
6 | 6 | A straightforward way of handling failures in a VDE is to restart all VMs in the VDE upon a failure. However, this can lead to significant data loss. Before the failure happens, no correct state is saved, either for each VM or for the entire VDE. Thus, if the entire VDE were to crash (due to the failure of one VM) just before the completion of a long-running task, all the VMs would have to be restarted immediately after the failure, resulting in loss of computational time and data. For example, assume that we have two virtual machines, named VM1 and VM2, running in a virtual network. Say VM2 sends some messages to VM1 and then fails. These messages may be correctly received by VM1 and may change the state of VM1. Thus, when VM2 is rolled back to its latest correct state, VM1 must also be rolled back to a state before the messages were received from VM2. In other words, the VMs (and thus the entire VDE) must be checkpointed at globally consistent states. |
| 7 | |
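
The VM1/VM2 example above can be checked mechanically. The following is a minimal sketch, not the VDEchp implementation: it assumes each VM numbers its own events with a local counter, and the names `Message` and `is_consistent` are hypothetical. A set of rollback points is treated as globally consistent when no VM has already received a message that its sender, after rollback, has not yet sent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """One message on the virtual network (illustrative model only)."""
    sender: str
    send_time: int   # sender's local event counter when the message was sent
    receiver: str
    recv_time: int   # receiver's local event counter when it was received


def is_consistent(cut, messages):
    """Return True if the rollback points in `cut` are globally consistent.

    `cut` maps each VM name to the local event counter of the state that
    VM is rolled back to. The cut is inconsistent if some message was
    received at or before the receiver's rollback point but sent after
    the sender's rollback point.
    """
    return not any(
        m.recv_time <= cut[m.receiver] and m.send_time > cut[m.sender]
        for m in messages
    )


# VM2 checkpoints at local time 5, sends a message at time 7, then fails.
# VM1 receives that message at its own local time 4 and keeps running to 6.
log = [Message(sender="VM2", send_time=7, receiver="VM1", recv_time=4)]

# Rolling back only VM2 leaves VM1 holding a message VM2 never sent.
only_vm2_rolled_back = is_consistent({"VM1": 6, "VM2": 5}, log)

# Rolling VM1 back to before the receive restores global consistency.
both_rolled_back = is_consistent({"VM1": 3, "VM2": 5}, log)
```

In this sketch `only_vm2_rolled_back` is `False` and `both_rolled_back` is `True`, which matches the argument that VM1 must also be rolled back to a state before it received the messages from VM2.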
| 8 | == Lightweight Checkpoint Implementation == |
| 9 | We deploy a VDEchp agent, which encapsulates our checkpointing mechanism, on every host. The agent shares the host’s memory with the running VMs: for each VM on a host, in addition to the memory space assigned to its guest OS, we assign a small amount of additional memory for its VDEchp agent to use. During system initialization, we save the initial state of each VM on disk. To distinguish this state from a “checkpoint,” we call it the “stablecopy.” After the VMs start execution, the VDEchp agents begin saving the correct state of the VMs. For each VM, at the start of each checkpoint interval, all memory pages are set as read-only. Thus, any write to a page triggers a page fault. Because we leverage the shadow-paging feature of Xen, we are able to control whether a page is read-only and to trace whether a page is dirty. When there is a write to a |
| 10 | read-only page, a page fault is triggered and reported to the VMM, and we save the current state of this page. |
| 11 | |
| 12 | When a page fault occurs, the faulting memory page is set as writeable, but we do not save the modified page at this time, because there may be further writes to the same page within the same interval. Instead, at the end of each checkpoint interval, we copy the “final” state of each modified page to |
| 13 | the agent’s memory, and reset all pages to read-only again. Therefore, each checkpoint consists of only the pages that were updated within that checkpoint interval. Because each checkpoint interval is very short, the number of updated pages in an interval is small as well, so it is unnecessary to assign a large amount of memory to each VDEchp agent. |
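
The interval logic described above can be illustrated with a small user-space simulation. This is only a sketch under stated assumptions: the real mechanism works on guest memory through Xen shadow paging, whereas here a VM's memory is a Python list of page contents and a "page fault" is just the first write to a page in an interval. The class `CheckpointedVM` and its methods are hypothetical names, and the `restore` step (replaying checkpoints over the stablecopy) is our assumption about how saved state could be recombined, not something this page specifies.

```python
class CheckpointedVM:
    """Toy model of one VM whose memory is a list of page contents."""

    def __init__(self, pages):
        self.pages = list(pages)
        self.stablecopy = list(pages)  # initial state saved at initialization
        self.writable = set()          # pages that faulted in this interval
        self.checkpoints = []          # one {page_index: content} per interval
        self.faults = 0                # number of simulated page faults

    def write(self, index, content):
        """Write to a page; the first write per interval acts as a fault."""
        if index not in self.writable:
            # Page is read-only: simulate the fault, then make it writable.
            # The modified content is NOT copied yet, since later writes in
            # the same interval may change the page again.
            self.faults += 1
            self.writable.add(index)
        self.pages[index] = content

    def end_interval(self):
        """Copy the final state of each modified page, then reset to read-only."""
        checkpoint = {i: self.pages[i] for i in sorted(self.writable)}
        self.checkpoints.append(checkpoint)
        self.writable.clear()
        return checkpoint

    def restore(self):
        """Assumed restore: stablecopy plus every checkpoint, in order."""
        pages = list(self.stablecopy)
        for checkpoint in self.checkpoints:
            for index, content in checkpoint.items():
                pages[index] = content
        self.pages = pages
        self.writable.clear()
        return list(self.pages)


vm = CheckpointedVM(["a0", "b0", "c0", "d0"])
vm.write(1, "b1")
vm.write(1, "b2")             # second write to page 1: no new fault
vm.write(3, "d1")
first = vm.end_interval()     # holds only pages 1 and 3, in final state
vm.write(0, "a1")             # written after the last checkpoint
restored = vm.restore()       # the uncheckpointed write to page 0 is lost
```

Only two of the four pages end up in the checkpoint, and page 1 is stored once despite being written twice, which is why the per-agent memory can stay small when intervals are short.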