Changes between Version 8 and Version 9 of VDEchp


Timestamp:
10/04/11 00:59:48 (13 years ago)
Author:
lvpeng
  • VDEchp

    v8 v9  
    1414
    1515== Different Execution Cases Under VDEchp ==
     16[[Image()]]
    1617In the VDEchp design, the state of each VM's stable-copy is always one checkpoint interval behind the VM's current state, except for the initial state. This means that when a new checkpoint is generated, it is not copied to the stable-copy immediately. Instead, the previous checkpoint is copied to the stable-copy. The reason is that there is latency between when an error occurs and when the failure caused by this error is detected.
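
The lagging stable-copy described above can be sketched as follows. This is a minimal illustration, not the VDEchp implementation: the class name `CheckpointStore`, its fields, and the choice to discard the newest checkpoint on failure are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointStore:
    """Hypothetical per-VM store; names are illustrative, not from VDEchp."""
    stable_copy: bytes               # starts as the initial VM state
    pending: Optional[bytes] = None  # newest checkpoint, held back one interval

    def on_new_checkpoint(self, checkpoint: bytes) -> None:
        # Promote the previous checkpoint to the stable-copy,
        # then hold the new checkpoint back for one interval.
        if self.pending is not None:
            self.stable_copy = self.pending
        self.pending = checkpoint

    def on_failure_detected(self) -> bytes:
        # Assumption: the newest checkpoint may already contain the
        # not-yet-detected error, so it is dropped and the stable-copy is used.
        self.pending = None
        return self.stable_copy
```

With this layout, a checkpoint only reaches the stable-copy after a full interval has passed without a detected failure, which is the property the paragraph above motivates.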
    1718
     
    1920
    2021== The Definition Of The Global Checkpoint ==
     22[[Image()]]
    2123To compose a globally consistent state of all the VMs, the checkpoints of the VMs must be coordinated. Besides checkpointing each VM’s correct state, it is also essential to guarantee the consistency of all communication states within the virtual network. In Figure 4, the messages exchanged among the VMs are marked by arrows going from the sender to the receiver. The execution line of each VM is divided by its corresponding checkpoint. The upper part corresponds to the state before the checkpoint and the lower part corresponds to the state after the checkpoint. A global checkpoint (consistent or not) is marked as the “cut” line, which separates each VM’s timeline into two parts. We can classify the messages exchanged in the virtual network into three categories:
    2224(1) The source state and the destination state of the message are on the same side of the cut line. For example, in Figure 4, both the source state and the destination state of message m1 are above the cut line. Similarly, both the source state and the destination state of message m2 are below the cut line.
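
As an illustration of category (1), the check below tests whether a message's source and destination states fall on the same side of a cut. It is a sketch under stated assumptions: the `Message` type, the per-VM cut positions, and the use of numeric timestamps are hypothetical, and messages that fail the check are simply reported as not belonging to category (1).

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Message:
    src: str        # sending VM
    dst: str        # receiving VM
    sent_at: float  # send time on the sender's timeline
    recv_at: float  # receive time on the receiver's timeline


def side(vm: str, t: float, cut: Dict[str, float]) -> str:
    """Return 'before' if the event precedes the VM's checkpoint in the cut."""
    return "before" if t < cut[vm] else "after"


def same_side(msg: Message, cut: Dict[str, float]) -> bool:
    """True when source and destination states lie on one side of the cut."""
    return side(msg.src, msg.sent_at, cut) == side(msg.dst, msg.recv_at, cut)


# Made-up cut positions and messages, for illustration only.
cut = {"vm1": 5.0, "vm2": 6.0}
m1 = Message("vm1", "vm2", sent_at=1.0, recv_at=2.0)        # both before the cut
m2 = Message("vm2", "vm1", sent_at=7.0, recv_at=8.0)        # both after the cut
crossing = Message("vm1", "vm2", sent_at=4.0, recv_at=7.0)  # crosses the cut
```

Here `same_side(m1, cut)` and `same_side(m2, cut)` hold, while `same_side(crossing, cut)` does not.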
     
    3638== Evaluation Results ==
    3739=== Downtime Evaluation for Solo VM ===
     40[[Image()]]
    3841Table I shows the downtime results under different mechanisms. We compare VDEchp with Remus and the VNsnap-memory daemon, under the same checkpoint interval. We measure the downtime of all three mechanisms, with the same VM (with 512MB of RAM), for three cases: a) when the VM is idle, b) when the VM runs the NPB-EP benchmark program [5], and c) when the VM runs the Apache web server workload [2].
    3942
     
    4952
    5053=== VDE Downtime ===
     54[[Image()]]
    5155The VDE downtime is the time from when the failure is detected in the VDE until the entire VDE resumes from the last globally consistent checkpoint. We conducted experiments to measure this downtime. To induce failures in the VDE, we developed an application program that causes a segmentation fault after executing for a while. This program is launched on several VMs to generate a failure while the distributed application workload is running in the VDE. The protected VDE is then rolled back to the last globally consistent checkpoint. We use the NPB-EP program (an MPI task in the VDE) and the Apache web server benchmark as the distributed workloads on the protected VMs.
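
A failure-inducing program of the kind described above might look like the sketch below. It is not the program used in the experiments: the function names, the delay parameter, and the use of `os.kill` with `SIGSEGV` (POSIX) to produce the fault are assumptions for illustration.

```python
import os
import signal
import time
from typing import Callable


def raise_segfault() -> None:
    # Deliver SIGSEGV to the current process (POSIX);
    # the default action terminates the process.
    os.kill(os.getpid(), signal.SIGSEGV)


def run_then_fail(delay_s: float,
                  fail: Callable[[], None] = raise_segfault,
                  sleep: Callable[[float], None] = time.sleep) -> None:
    """Stand in for 'executing for a while', then trigger the failure."""
    sleep(delay_s)
    fail()
```

Calling `run_then_fail(30.0)` inside a protected VM would terminate the process with a segmentation fault after roughly 30 seconds (an arbitrary value). The `fail` and `sleep` parameters are injectable so the timing logic can be exercised without crashing the caller.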
    5256
     57Figure 7 shows the results. In our 36-node (VM) environment, the measured VDE downtime under VDEchp ranges from 2.46 seconds to 4.77 seconds, with an average of 3.54 seconds. The VDE downtime also increases slightly as the checkpoint interval grows. This is because the VDE downtime depends on the number of memory pages restored during recovery: as the checkpoint interval grows, the checkpoint size grows, and so does the number of pages restored during recovery.