Changes between Version 43 and Version 44 of FGBI


Timestamp: 10/10/11 01:28:24 (13 years ago)
Author: lvpeng

Figure 1. Primary-Backup model and the downtime problem.

Downtime is the primary factor in estimating the high availability of a system, since long downtime experienced by clients may result in lost loyalty and, in turn, lost revenue. Under the Primary-Backup model (Figure 1), there are two types of downtime: I) the time from when the primary host crashes until the VM resumes from the last checkpointed state on the backup host and starts to handle client requests (D,,1,, = T,,3,, - T,,1,,); and II) the time from when the VM pauses on the primary (to save the checkpoint) until it resumes (D,,2,,). From the [wiki:Publications SSS'10] paper, we observe that for memory-intensive workloads running on guest VMs (such as the HighSys workload), [wiki:LLM LLM] endures a much longer type I downtime than [http://nss.cs.ubc.ca/remus/ Remus]. This is because such workloads update guest memory at high frequency, whereas [wiki:LLM LLM] migrates the guest VM image update (mostly from memory) at low frequency and uses input replay as an auxiliary mechanism. Thus, when a failure happens, a significant number of memory updates must be applied to resynchronize the primary and backup hosts, so [wiki:LLM LLM] needs considerably more time for the input replay process before the VM can resume on the backup host and begin handling client requests.
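
To make the two definitions concrete, the fragment below simply evaluates both windows from event timestamps. It is an illustrative sketch; the timestamp names are assumptions made for this example, not identifiers from FGBI, [wiki:LLM LLM], or Remus.

{{{#!python
# Illustrative sketch only: the timestamp names below are assumptions,
# not identifiers from FGBI, LLM, or Remus.

def type1_downtime(t1_primary_crash, t3_backup_resume):
    """D1 = T3 - T1: from the crash on the primary host until the VM
    resumes on the backup host and serves client requests again."""
    return t3_backup_resume - t1_primary_crash

def type2_downtime(t_pause, t_resume):
    """D2: from pausing the VM on the primary (to save the checkpoint)
    until it resumes."""
    return t_resume - t_pause

# Example: primary crashes at t = 100.0 s and the backup resumes at
# t = 102.5 s; a checkpoint pause lasts from t = 50.0 s to t = 50.25 s.
print(type1_downtime(100.0, 102.5))  # D1 = 2.5 s
print(type2_downtime(50.0, 50.25))   # D2 = 0.25 s
}}}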

There are several migration epochs between two checkpoints, and the newly updated memory data is copied to the backup host at each epoch. At the last epoch, the VM running on the primary host is suspended and the remaining memory state is transferred to the backup host. The type II downtime therefore depends on the amount of memory that remains to be transferred when the VM is paused on the primary host. If we reduce the dirty data that must be transferred at the last epoch, we reduce the type II downtime. Moreover, if we reduce the dirty data transferred at every epoch, keeping the memory state of the primary and backup hosts synchronized throughout, then at the last epoch there will not be many new memory updates left to transfer, and we also reduce the type I downtime. A sketch of this epoch-based scheme follows.
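
As a rough illustration of why shrinking the per-epoch dirty set shortens both downtime windows, the sketch below models epoch-based pre-copy checkpointing with a toy workload whose writable working set halves each epoch. All names here are assumptions made for the example; this is not code from FGBI, [wiki:LLM LLM], or Remus.

{{{#!python
# Minimal, self-contained model of epoch-based pre-copy checkpointing.
# Every name (toy_dirty_sets, precopy_checkpoint, ...) is an assumption
# for illustration; this is not FGBI, LLM, or Remus code.

def toy_dirty_sets(pages=1024):
    """Toy workload: the set of pages dirtied per epoch halves each time.
    A shrinking writable working set is what lets pre-copy converge."""
    rate = pages // 2
    while True:
        yield set(range(rate))
        rate = max(rate // 2, 1)

def precopy_checkpoint(dirty_sets, stop_threshold=16, max_epochs=10):
    """Stream dirty pages to the backup while the VM keeps running, and
    suspend the VM only for the small remainder, so the type II downtime
    (D2) is proportional to the size of the final dirty set."""
    backup = set()
    dirty = set()
    for _, dirty in zip(range(max_epochs), dirty_sets):
        if len(dirty) <= stop_threshold:
            break                    # remainder is small: take the final pause
        backup.update(dirty)         # copied while the VM is still running

    # A real system would pause the VM here: type II downtime begins.
    backup.update(dirty)             # flush the last few memory updates
    # ...and resume it here: type II downtime ends.
    return len(dirty)                # pages transferred during the pause

print("pages copied during the pause:", precopy_checkpoint(toy_dirty_sets()))
}}}

Keeping the backup nearly synchronized with the primary at every epoch is what keeps the final dirty set small; the same property also bounds the work needed to resynchronize after a failure, which is the type I downtime argument above.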