Changes between Version 45 and Version 46 of FGBI


Timestamp: 10/10/11 01:36:35
Author: lvpeng

Downtime is the primary factor for estimating the high availability of a system, since any long downtime experienced by clients may result in loss of client loyalty and thus revenue loss. Under the Primary-Backup model (Figure 1), there are two types of downtime: I) the time from when the primary host crashes until the VM resumes from the last checkpointed state on the backup host and starts to handle client requests (D,,1,, = T,,3,, - T,,1,,); and II) the time from when the VM pauses on the primary (to save the checkpoint) until it resumes (D,,2,,). From the [wiki:Publications SSS'10] paper, we observe that for memory-intensive workloads running on guest VMs (such as the highSys workload), [wiki:LLM LLM] endures a much longer type I downtime than [http://nss.cs.ubc.ca/remus/ Remus]. This is because such workloads update the guest memory at high frequency. In contrast, [wiki:LLM LLM] migrates the guest VM image updates (mostly from memory) at low frequency and uses input replay as an auxiliary. Thus, when a failure happens, a significant number of memory updates are needed to bring the primary and backup hosts back into synchronization. Therefore, [wiki:LLM LLM] needs significantly more time for the input replay process before it can resume the VM on the backup host and begin handling client requests.
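The two downtime types follow directly from the event timestamps named above. The following is a toy sketch; the timestamp values and function names are illustrative only and are not measurements from the evaluation:

{{{
#!python
# Toy illustration of the two downtime types under the Primary-Backup model.
# D1 = T3 - T1: from the primary host crashing (T1) until the VM resumes from
#               the last checkpoint on the backup and serves requests (T3).
# D2:           from pausing the VM on the primary to save a checkpoint until
#               it resumes.

def type1_downtime(t1_primary_crash, t3_backup_serving):
    """D1 = T3 - T1 (seconds)."""
    return t3_backup_serving - t1_primary_crash

def type2_downtime(t_pause_for_checkpoint, t_resume_on_primary):
    """Time the VM stays paused on the primary while the checkpoint is saved."""
    return t_resume_on_primary - t_pause_for_checkpoint

# Hypothetical timestamps: a crash at t = 10.0 s and resumption on the backup
# at t = 12.5 s give D1 = 2.5 s; a 50 ms checkpoint pause gives D2 = 0.05 s.
print(type1_downtime(10.0, 12.5))    # 2.5
print(type2_downtime(20.0, 20.05))   # ~0.05
}}}
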
There are several migration epochs between two checkpoints, and the newly updated memory data is copied to the backup host at each epoch. At the last epoch, the VM running on the primary host is suspended and the remaining memory states are transferred to the backup host. Thus, the type II downtime depends on the amount of memory that remains to be copied and transferred when pausing the VM on the primary host. If we reduce the dirty data that needs to be transferred at the last epoch, then we can reduce the type II downtime. Moreover, if we synchronize the memory state between the primary and backup hosts all the time, and reduce the data transferred at each epoch, then at the last epoch there won't be significant memory updates left to transfer. Thus, we can also reduce the type I downtime.
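To make the epoch structure concrete, here is a minimal sketch of a generic pre-copy checkpointing loop. It is an illustration under assumed interfaces (ToyVM, ToyBackup, and collect_dirty_regions are hypothetical names), not the actual [http://nss.cs.ubc.ca/remus/ Remus]/[wiki:LLM LLM]/[wiki:FGBI FGBI] code; it only shows why the pause at the last epoch (type II downtime) is governed by the dirty data still untransferred, and why smaller per-epoch transfers leave less behind:

{{{
#!python
# Generic pre-copy checkpointing loop (illustration only, not the FGBI code).
# Each epoch copies the regions dirtied since the previous epoch while the VM
# keeps running; at the last epoch the VM is paused and whatever is still
# dirty is flushed, so the pause (type II downtime) tracks that final set.

class ToyVM:
    """Stand-in for a guest VM with dirty-region tracking (hypothetical API)."""
    def __init__(self, dirty_per_epoch):
        self._dirty = iter(dirty_per_epoch)
    def collect_dirty_regions(self):
        return next(self._dirty, [])
    def pause(self):  pass    # start of type II downtime
    def resume(self): pass    # end of type II downtime

class ToyBackup:
    """Stand-in for the backup host."""
    def receive(self, regions):  pass
    def commit_checkpoint(self): pass

def checkpoint(vm, backup, epochs):
    for _ in range(epochs - 1):
        backup.receive(vm.collect_dirty_regions())   # copied while VM runs
    vm.pause()
    remaining = vm.collect_dirty_regions()           # still out of sync
    backup.receive(remaining)                        # drives type II downtime
    backup.commit_checkpoint()
    vm.resume()
    return len(remaining)

# The less each epoch has to transfer (e.g., only the sub-page blocks that
# really changed), the less remains at the last epoch, shrinking both the
# pause on the primary (type II) and the catch-up after a failover (type I).
vm = ToyVM(dirty_per_epoch=[["pageA", "pageB"], ["pageB"], ["pageC"]])
print(checkpoint(vm, ToyBackup(), epochs=3))         # 1 region left at the end
}}}
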
== [wiki:FGBI FGBI] Design ==
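The evaluation below uses fine-grained blocks of 64 to 256 bytes rather than whole memory pages. As a rough sketch of what tracking dirty data at that sub-page granularity can look like — assuming per-block hashing over 4 KB pages; the names are hypothetical and this is not the [wiki:FGBI FGBI] implementation — consider:

{{{
#!python
# Illustrative sub-page dirty tracking (not the FGBI implementation).
# A 4 KB page is split into fixed-size blocks (64 bytes here, the smallest
# size used in the experiments); keeping one hash per block lets the sender
# transfer only the blocks that actually changed instead of the whole page.

import hashlib

PAGE_SIZE  = 4096
BLOCK_SIZE = 64    # 64-, 128-, and 256-byte blocks are compared in Figure 3(a)

def block_hashes(page: bytes):
    """One digest per BLOCK_SIZE-byte block of a page."""
    return [hashlib.sha1(page[i:i + BLOCK_SIZE]).digest()
            for i in range(0, PAGE_SIZE, BLOCK_SIZE)]

def dirty_blocks(old_hashes, new_page: bytes):
    """Return (block index, block data) for every block whose hash changed."""
    changed = []
    for idx, new_hash in enumerate(block_hashes(new_page)):
        if new_hash != old_hashes[idx]:
            start = idx * BLOCK_SIZE
            changed.append((idx, new_page[start:start + BLOCK_SIZE]))
    return changed

# Example: flipping one byte dirties a single 64-byte block, so only 64 bytes
# (plus metadata) need to cross the network, rather than the full 4 KB page.
old_page = bytes(PAGE_SIZE)
new_page = bytearray(old_page)
new_page[100] = 0xFF
print(len(dirty_blocks(block_hashes(old_page), bytes(new_page))))   # 1
}}}
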
     
                Figure 2. Type I Downtime comparison under different benchmarks.

Figures 2(a), 2(b), 2(c), and 2(d) show the type I downtime comparison among [wiki:FGBI FGBI], [wiki:LLM LLM], and [http://nss.cs.ubc.ca/remus/ Remus] mechanisms under [http://httpd.apache.org/ Apache], [http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP],
[http://www.spec.org/web2005/ SPECweb], and [http://www.spec.org/sfs97r1/ SPECsys] applications, respectively. The block size used in all
experiments is 64 bytes. For [http://nss.cs.ubc.ca/remus/ Remus] and [wiki:FGBI FGBI], the checkpointing period is the
     
same value for the checkpointing frequency of [http://nss.cs.ubc.ca/remus/ Remus]/[wiki:FGBI FGBI] and the network
buffer frequency of [wiki:LLM LLM], we ensure the fairness of the comparison. We observe
that Figures 2(a) and 2(b) show a reverse relationship between [wiki:FGBI FGBI] and [wiki:LLM LLM].
Under [http://httpd.apache.org/ Apache] (Figure 2(a)), the network load is high but system updates are
rare. Therefore, [wiki:LLM LLM] performs better than [wiki:FGBI FGBI], since it uses a much higher
frequency to migrate the network service requests. On the other hand, when
running memory-intensive applications (Figures 2(b) and 2(d)), which involve high
computational loads, [wiki:LLM LLM] endures a much longer downtime than [wiki:FGBI FGBI] (even
worse than [http://nss.cs.ubc.ca/remus/ Remus]).
     
pages/second. Thus, [http://www.spec.org/web2005/ SPECweb] is not a lightweight computational workload for
these migration mechanisms. As a result, the relationship between [wiki:FGBI FGBI] and
[wiki:LLM LLM] in Figure 2(c) is more similar to that in Figure 2(b) (and also Figure 2(d))
than to that in Figure 2(a). In conclusion, compared with [wiki:LLM LLM], [wiki:FGBI FGBI] reduces the
downtime by as much as 77%. Moreover, compared with [http://nss.cs.ubc.ca/remus/ Remus], [wiki:FGBI FGBI] yields a
shorter downtime, by as much as 31% under [http://httpd.apache.org/ Apache], 45% under [http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP], 39%
     
    Figure 3. (a) Overhead under different block sizes. (b) Comparison of the proposed techniques.

Figure 3(a) shows the overhead during VM migration. The figure compares the
applications' runtime with and without migration, under [http://httpd.apache.org/ Apache], [http://www.spec.org/web2005/ SPECweb],
[http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP], and [http://www.spec.org/sfs97r1/ SPECsys], with the size of the fine-grained blocks set to 64, 128, and 256 bytes. We observe that, in all cases, the overhead is
     
than 10 times the downtime that [wiki:FGBI FGBI] incurs.

We observe from Figure 3(b) that if we just apply the [wiki:FGBI FGBI] mechanism without
integrating sharing or compression support, the downtime is reduced, compared
with that of [http://nss.cs.ubc.ca/remus/ Remus] in Figure 3(b), but it is not significant (the reduction is no more