VDEchp
== Different Execution Cases Under VDEchp ==
[[Image()]]
In the VDEchp design, the state of each VM's stable-copy is always one checkpoint interval behind the VM's current state, except for the initial state. This means that when a new checkpoint is generated, it is not copied to the stable-copy immediately; instead, the previous checkpoint is copied to the stable-copy. The reason is that there is latency between when an error occurs and when the failure caused by that error is detected.

...

== The Definition Of The Global Checkpoint ==
[[Image()]]
To compose a globally consistent state of all the VMs, the checkpoint of each VM must be coordinated. Besides checkpointing each VM's correct state, it is also essential to guarantee the consistency of all communication states within the virtual network. In Figure 4, the messages exchanged among the VMs are marked by arrows going from the sender to the receiver. Each VM's execution line is divided by its checkpoints: the part above a checkpoint corresponds to the state before that checkpoint, and the part below it corresponds to the state after it. A global checkpoint (consistent or not) is marked as the “cut” line, which separates each VM's timeline into two parts. The messages exchanged in the virtual network can be classified into three categories:

(1) The source state and the destination state of the message are on the same side of the cut line. For example, in Figure 4, both the source state and the destination state of message m1 are above the cut line; similarly, both the source state and the destination state of message m2 are below the cut line.

...

== Evaluation Results ==

=== Downtime Evaluation for Solo VM ===
[[Image()]]
Table I shows the downtime results under different mechanisms. We compare VDEchp with Remus and the VNsnap-memory daemon under the same checkpoint interval. We measure the downtime of all three mechanisms, with the same VM (512MB of RAM), for three cases: a) when the VM is idle, b) when the VM runs the NPB-EP benchmark program [5], and c) when the VM runs the Apache web server workload [2].

...

=== VDE Downtime ===
[[Image()]]
The VDE downtime is the time from when a failure is detected in the VDE until the entire VDE resumes from the last globally consistent checkpoint. We conducted experiments to measure this downtime. To induce failures in the VDE, we developed an application program that causes a segmentation fault after executing for a while. This program is launched on several VMs to generate a failure while the distributed application workload is running in the VDE. The protected VDE is then rolled back to the last globally consistent checkpoint. We use the NPB-EP program (an MPI task in the VDE) and the Apache web server benchmark as the distributed workload on the protected VMs.
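The failure-injection program itself is not listed on this page. The following is a minimal sketch of such a program, assuming all it needs to do is behave like a normal workload for a configurable period and then crash with a segmentation fault that the failure detector can observe; the `RUN_SECONDS` knob, the busy loop, and the null-pointer dereference are illustrative choices, not the authors' actual code.

{{{
#!c
/* Illustrative failure-injection sketch: do dummy work for a fixed
 * period, then deliberately dereference a NULL pointer so the process
 * dies with SIGSEGV. RUN_SECONDS is a hypothetical knob; the trigger
 * condition of the real program is not described on this page. */
#include <stdio.h>
#include <time.h>

#define RUN_SECONDS 30   /* how long to run normally before failing */

int main(void)
{
    time_t start = time(NULL);
    volatile unsigned long work = 0;

    /* Behave like an ordinary workload for RUN_SECONDS. */
    while (time(NULL) - start < RUN_SECONDS)
        work++;

    printf("injecting failure after %lu iterations\n", work);
    fflush(stdout);

    /* Trigger the failure: write through a NULL pointer -> SIGSEGV. */
    int *bad = NULL;
    *bad = 1;

    return 0;            /* never reached */
}
}}}

Launching a program like this on a few of the protected VMs lets the failure appear while NPB-EP or the Apache workload is still running, without modifying the workload itself.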
Figure 7 shows the results. From the figure, we observe that, in our 36-node (VM) environment, the measured VDE downtime under VDEchp ranges from 2.46 seconds to 4.77 seconds, with an average of 3.54 seconds. Another observation is that the VDE downtime under VDEchp increases slightly as the checkpoint interval grows. This is because the VDE downtime depends on the number of memory pages restored during recovery: as the checkpoint interval grows, the checkpoint size grows, and so does the number of pages restored during recovery.
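To make the last point concrete: a longer checkpoint interval lets more pages be dirtied per interval, so the checkpoint, and hence the number of pages restored at recovery, grows roughly in proportion to the interval. The sketch below only illustrates that proportionality; the dirty-page rate, page size, and restore bandwidth in it are arbitrary assumptions, not values measured in the VDEchp experiments.

{{{
#!c
/* Back-of-the-envelope illustration of the trend behind Figure 7:
 * restore work grows with the checkpoint interval because more dirtied
 * pages must be restored. All constants are illustrative assumptions. */
#include <stdio.h>

int main(void)
{
    const double dirty_pages_per_sec = 20000.0;  /* assumed dirty-page rate   */
    const double page_size_bytes     = 4096.0;   /* 4 KB pages                */
    const double restore_bw          = 200e6;    /* assumed restore bandwidth */

    for (int interval_s = 1; interval_s <= 8; interval_s *= 2) {
        double pages   = dirty_pages_per_sec * interval_s;
        double restore = pages * page_size_bytes / restore_bw;
        printf("interval %ds -> ~%.0f pages -> ~%.2fs of page restoration\n",
               interval_s, pages, restore);
    }
    return 0;
}
}}}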