Changes between Version 30 and Version 31 of VDEchp


Ignore:
Timestamp:
10/06/11 23:24:13 (13 years ago)
Author:
lvpeng
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • VDEchp

    v30 v31  
    1919In the [wiki:VDEchp VDEchp] design, for each VM, the state of its stable copy is always one checkpoint interval behind the current VM’s state except the initial state. This means that, when a new checkpoint is generated, it is not copied to the stable copy immediately. Instead, the last checkpoint is copied to the stable copy. The reason is that, there is a latency between when an error occurs and when the failure caused by this error is detected.
    2020
    21 For example, in Figure 1, an error happens at time t0 and causes the system to fail at time t,,1,,. Since most error latency is small, in most cases, t,,1,, - t,,0,, < T,,e,,. In the Case A, the latest checkpoint is chp,,1,,, and the system needs to roll back to the state S,,1,, by resuming from the checkpoint chp,,1,,. However, in the Case B, an error happens at time t,,2,,, and then a new checkpoint chp,,3,, is saved. After the system moves to the state S,,3,,, this error causes the system to fail at time t,,3,,. Here, we assume that t,,3,, - t,,2,, < T,,e,,. But, if we choose chp,,3,, as the latest correct checkpoint and roll the system back to state S,,3,,, after resuming, the system will fail again. We can see that, in this case, the latest checkpoint should be chp,,2,,, and when the system crashes, we should roll it back to state S,,2,,, by resuming from the checkpoint chp,,2,,.
     21For example, in Figure 1, an error happens at time t,,0,, and causes the system to fail at time t,,1,,. Since most error latency is small, in most cases, t,,1,, - t,,0,, < T,,e,,. In the Case A, the latest checkpoint is chp,,1,,, and the system needs to roll back to the state S,,1,, by resuming from the checkpoint chp,,1,,. However, in the Case B, an error happens at time t,,2,,, and then a new checkpoint chp,,3,, is saved. After the system moves to the state S,,3,,, this error causes the system to fail at time t,,3,,. Here, we assume that t,,3,, - t,,2,, < T,,e,,. But, if we choose chp,,3,, as the latest correct checkpoint and roll the system back to state S,,3,,, after resuming, the system will fail again. We can see that, in this case, the latest checkpoint should be chp,,2,,, and when the system crashes, we should roll it back to state S,,2,,, by resuming from the checkpoint chp,,2,,.
    2222
    2323== Definition of Global Checkpoint ==
     
    2828To compose a globally consistent state of all the VMs, the checkpoint of each VM must be coordinated. Besides checkpointing each VM’s correct state, it’s also essential to guarantee the consistency of all communication states within the virtual network. In Figure 2, the messages exchanged among the VMs are marked by arrows going from the sender to the receiver. The execution line of the VMs is separated by their corresponding checkpoints. The upper part of each checkpoint corresponds to the state before the checkpoint and the lower part of each checkpoint corresponds to the state after the checkpoint. A global checkpoint (consistent or not) is marked as the “cut” line, which separates each VM’s timeline into two parts. We label the messages exchanged in the virtual network into three categories:
    2929
    30 (1) The state of the message’s source and the destination are on the same side of the cut line. For example, in Figure 2, both the source state and the destination state of message m1 are above the cut line. Similarly, both the source state and the destination state of message m2 are under the cut line.
     30(1) The state of the message’s source and the destination are on the same side of the cut line. For example, in Figure 2, both the source state and the destination state of message m,,1,, are above the cut line. Similarly, both the source state and the destination state of message m,,2,, are under the cut line.
    3131
    32 (2) The message’s source state is above the cut line while the destination state is under the cut line, like message m3.
     32(2) The message’s source state is above the cut line while the destination state is under the cut line, like message m,,3,,.
    3333
    34 (3) The message’s source state is under the cut line while the destination state is above the cut line, like message m4.
     34(3) The message’s source state is under the cut line while the destination state is above the cut line, like message m,,4,,.
    3535
    36 For these three types of messages, we can see that a globally consistent cut must ensure the delivery of type (1) and type (2) messages, but avoid type (3) messages. For example, consider the message m4. In VM3’s checkpoint saved on the cut line, m4 is already recorded as being received. However, in VM4’s checkpoint saved on the same cut line, it has no record that m4 has been sent out. Therefore, the state saved on VM4’s global cut is inconsistent, because in VM4’s view, VM3 receives a message m4, which is sent by no one.
     36For these three types of messages, we can see that a globally consistent cut must ensure the delivery of type (1) and type (2) messages, but avoid type (3) messages. For example, consider the message m,,4,,. In VM3’s checkpoint saved on the cut line, m,,4,, is already recorded as being received. However, in VM,,4,,’s checkpoint saved on the same cut line, it has no record that m4 has been sent out. Therefore, the state saved on VM,,4,,’s global cut is inconsistent, because in VM,,4,,’s view, VM,,3,, receives a message m,,4,,, which is sent by no one.
    3737
    3838== Distributed Checkpoint Algorithm in [wiki:VDEchp VDEchp] ==