Changes between Version 29 and Version 30 of VDEchp


Timestamp:
10/06/11 23:22:05 (13 years ago)
Author:
lvpeng
Comment:

--

  • VDEchp

    v29 v30  
    44
    55== Checkpointing a VDE ==
    6 A straightforward way of recovering a VDE from a failure is to restart all of its VMs. However, this can lead to significant data loss. Before the failure happens, no correct state is saved, either for the individual VMs or for the entire VDE. Thus, if the entire VDE crashes (due to the failure of one VM) just before the completion of a long-running task, all the VMs must be restarted after the failure, losing computation time and data. Checkpointing each VM independently is not sufficient either. For example, assume that two virtual machines, VM1 and VM2, run in a virtual network. Suppose VM2 sends some messages to VM1 and then fails. These messages may be correctly received by VM1 and may change its state. Thus, when VM2 is rolled back to its latest correct state, VM1 must also be rolled back to a state before the messages were received from VM2. In other words, the VMs (and thus the entire VDE) must be checkpointed at globally consistent states.
     6A straightforward way of recovering a VDE from a failure is to restart all of its VMs. However, this can lead to significant data loss. Before the failure happens, no correct state is saved, either for the individual VMs or for the entire VDE. Thus, if the entire VDE crashes (due to the failure of one VM) just before the completion of a long-running task, all the VMs must be restarted after the failure, losing computation time and data. Checkpointing each VM independently is not sufficient either. For example, assume that two virtual machines, VM,,1,, and VM,,2,,, run in a virtual network. Suppose VM,,2,, sends some messages to VM,,1,, and then fails. These messages may be correctly received by VM,,1,, and may change its state. Thus, when VM,,2,, is rolled back to its latest correct state, VM,,1,, must also be rolled back to a state before the messages were received from VM,,2,,. In other words, the VMs (and thus the entire VDE) must be checkpointed at globally consistent states.
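
The VM,,1,,/VM,,2,, example can be sketched as a small consistency check. This is only an illustration under a common assumption: a set of per-VM checkpoints is treated as globally consistent when no VM has recorded receiving more messages on a channel than the sender's checkpoint records as sent. The function name `is_consistent`, the channel tuples, and the message counts are hypothetical and are not part of VDEchp.

```python
def is_consistent(sent, received):
    """Return True if no checkpoint records a message that was never sent.

    sent[(src, dst)]     -> messages that src's checkpoint records as sent to dst
    received[(src, dst)] -> messages that dst's checkpoint records as received from src
    (Hypothetical bookkeeping, used only for this illustration.)
    """
    return all(count <= sent.get(channel, 0)
               for channel, count in received.items())


channel = ("VM2", "VM1")

vm2_restored = {channel: 0}   # VM2 rolled back to a state before it sent anything
vm1_current = {channel: 3}    # VM1 still reflects three messages from VM2
vm1_restored = {channel: 0}   # VM1 rolled back to before the messages arrived

print(is_consistent(vm2_restored, vm1_current))   # prints False
print(is_consistent(vm2_restored, vm1_restored))  # prints True
```

Rolling back only VM,,2,, fails the check, while rolling back both VMs passes it, which is the situation the paragraph describes.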
    77
    88== Lightweight Checkpoint Implementation ==
     
    1919In the [wiki:VDEchp VDEchp] design, the stable copy of each VM is always one checkpoint interval behind the VM’s current state, except at the initial state. This means that when a new checkpoint is generated, it is not copied to the stable copy immediately. Instead, the previous checkpoint is copied to the stable copy. The reason is that there is a latency between the moment an error occurs and the moment the failure caused by this error is detected.
    2020
    21 For example, in Figure 1, an error happens at time t0 and causes the system to fail at time t1. Since error latency is usually small, in most cases t1 - t0 < Te. In Case A, the latest checkpoint is chp1, and the system needs to roll back to state S1 by resuming from chp1. In Case B, however, an error happens at time t2, and then a new checkpoint chp3 is saved. After the system moves to state S3, this error causes the system to fail at time t3. Here, we assume that t3 - t2 < Te. If we chose chp3 as the latest correct checkpoint and rolled the system back to state S3, the system would fail again after resuming, because chp3 was taken after the error occurred. In this case, the latest correct checkpoint is chp2, so when the system crashes, we should roll it back to state S2 by resuming from chp2.
     21For example, in Figure 1, an error happens at time t,,0,, and causes the system to fail at time t,,1,,. Since error latency is usually small, in most cases t,,1,, - t,,0,, < T,,e,,. In Case A, the latest checkpoint is chp,,1,,, and the system needs to roll back to state S,,1,, by resuming from chp,,1,,. In Case B, however, an error happens at time t,,2,,, and then a new checkpoint chp,,3,, is saved. After the system moves to state S,,3,,, this error causes the system to fail at time t,,3,,. Here, we assume that t,,3,, - t,,2,, < T,,e,,. If we chose chp,,3,, as the latest correct checkpoint and rolled the system back to state S,,3,,, the system would fail again after resuming, because chp,,3,, was taken after the error occurred. In this case, the latest correct checkpoint is chp,,2,,, so when the system crashes, we should roll it back to state S,,2,, by resuming from chp,,2,,.
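
The lagging stable copy can be sketched as follows. This is a minimal illustration, not the VDEchp implementation: the class `LaggedCheckpointStore`, its method names, and the use of plain strings for states are all assumptions made for the example. It assumes that recovery always resumes from the stable copy and discards the newest checkpoint, which may have been taken after the error occurred.

```python
class LaggedCheckpointStore:
    """Keeps the stable copy one checkpoint behind the newest checkpoint.

    Hypothetical helper for illustration only.
    """

    def __init__(self, initial_state):
        # At the start, the stable copy is simply the initial state.
        self.stable = initial_state
        # Newest checkpoint; not yet trusted because an error may be latent.
        self.pending = None

    def checkpoint(self, state):
        # Promote the previous checkpoint to the stable copy,
        # then hold the new checkpoint back for one interval.
        if self.pending is not None:
            self.stable = self.pending
        self.pending = state

    def rollback(self):
        # On failure, drop the untrusted newest checkpoint
        # and resume from the stable copy.
        self.pending = None
        return self.stable


# Case B: chp3 is saved after the error at t2 but before the failure at t3.
store = LaggedCheckpointStore("S0")
for state in ("S1", "S2", "S3"):
    store.checkpoint(state)

print(store.rollback())  # prints S2
```

Because chp,,3,, never reaches the stable copy before the failure is detected, the rollback lands on S,,2,, rather than on the corrupted S,,3,,.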
    2222
    2323== Definition of Global Checkpoint ==