Changes between Version 45 and Version 46 of FGBI
- Timestamp:
- 10/10/11 01:36:35
FGBI
Downtime is the primary factor in estimating the high availability of a system, since any long downtime experienced by clients may result in loss of client loyalty and thus loss of revenue. Under the Primary-Backup model (Figure 1), there are two types of downtime: I) the time from when the primary host crashes until the VM resumes from the last checkpointed state on the backup host and starts to handle client requests (D,,1,, = T,,3,, - T,,1,,); and II) the time from when the VM pauses on the primary (to save the checkpoint) until it resumes (D,,2,,). From the [wiki:Publications SSS'10] paper, we observe that for memory-intensive workloads running on guest VMs (such as the highSys workload), [wiki:LLM LLM] endures a much longer type I downtime than [http://nss.cs.ubc.ca/remus/ Remus]. This is because such workloads update the guest memory at high frequency, while [wiki:LLM LLM] migrates the guest VM image updates (mostly from memory) at low frequency and uses input replay as an auxiliary. Thus, when a failure happens, a significant number of memory updates are needed to bring the backup host back into synchronization with the primary. Therefore, [wiki:LLM LLM] needs significantly more time for the input replay process before it can resume the VM on the backup host and begin handling client requests.

There are several migration epochs between two checkpoints, and the newly updated memory data is copied to the backup host at each epoch. At the last epoch, the VM running on the primary host is suspended and the remaining memory state is transferred to the backup host. Thus, the type II downtime depends on the amount of memory that remains to be copied when the VM is paused on the primary host. If we reduce the dirty data that needs to be transferred at the last epoch, we reduce the type II downtime. Moreover, if we keep the memory state of the primary and backup hosts synchronized all the time and reduce the data transferred at each epoch, then there will be no significant memory updates left to transfer at the last epoch. Thus, we can also reduce the type I downtime (see the sketch below).

== [wiki:FGBI FGBI] Design ==

[…]

Figure 2. Type I downtime comparison under different benchmarks.
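To make the epoch-based copy loop above concrete, here is a minimal, self-contained sketch. The toy memory model and all names (`ToyGuest`, `take_dirty_pages`, `checkpoint_interval`) are assumptions chosen for illustration; this is not the [http://nss.cs.ubc.ca/remus/ Remus] or [wiki:FGBI FGBI] implementation. It shows that each pre-copy epoch sends only the memory dirtied since the previous epoch, so the final stop-and-copy round, which the paused VM (and hence the type II downtime) waits for, has little left to transfer.

{{{#!python
import random

PAGE_SIZE = 4096
NUM_PAGES = 256                          # toy guest "memory" of 1 MB


class ToyGuest:
    """Guest whose memory writes are recorded in a dirty-page set
    (a stand-in for a real dirty log / dirty bitmap)."""

    def __init__(self):
        self.memory = bytearray(NUM_PAGES * PAGE_SIZE)
        self._dirty = set()

    def run_one_epoch(self):
        """Simulate one epoch of execution: write to a few random pages."""
        for _ in range(32):
            page = random.randrange(NUM_PAGES)
            self.memory[page * PAGE_SIZE] ^= 0xFF
            self._dirty.add(page)

    def take_dirty_pages(self):
        """Return and reset the set of pages written since the last call."""
        dirty, self._dirty = self._dirty, set()
        return dirty


def copy_pages(src, dst, pages):
    """Copy only the listed pages from the primary image to the backup image."""
    for page in pages:
        off = page * PAGE_SIZE
        dst[off:off + PAGE_SIZE] = src[off:off + PAGE_SIZE]


def checkpoint_interval(guest, backup_memory, epochs=5):
    """One checkpointing period, split into several migration epochs."""
    for _ in range(epochs - 1):
        guest.run_one_epoch()                     # the VM keeps running here
        copy_pages(guest.memory, backup_memory,   # pre-copy: send only what was
                   guest.take_dirty_pages())      # dirtied during this epoch
    guest.run_one_epoch()                         # last epoch before the checkpoint

    # The VM would be paused here: start of the type II downtime.
    remaining = guest.take_dirty_pages()          # the less dirty data left over,
    copy_pages(guest.memory, backup_memory,       # the shorter the pause and the
               remaining)                         # faster the resume after a crash
    # The VM would be resumed here: end of the type II downtime.

    assert backup_memory == guest.memory          # both hosts now hold the same state


if __name__ == "__main__":
    checkpoint_interval(ToyGuest(), bytearray(NUM_PAGES * PAGE_SIZE))
}}}

[wiki:FGBI FGBI] keeps this epoch structure but tracks and transfers the dirty data at a much finer block granularity; the experiments below use 64-byte blocks.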
Figures 2(a), 2(b), 2(c), and 2(d) show the type I downtime comparison among the [wiki:FGBI FGBI], [wiki:LLM LLM], and [http://nss.cs.ubc.ca/remus/ Remus] mechanisms under the [http://httpd.apache.org/ Apache], [http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP], [http://www.spec.org/web2005/ SPECweb], and [http://www.spec.org/sfs97r1/ SPECsys] applications, respectively. The block size used in all experiments is 64 bytes. For [http://nss.cs.ubc.ca/remus/ Remus] and [wiki:FGBI FGBI], the checkpointing period is the […] same value for the checkpointing frequency of [http://nss.cs.ubc.ca/remus/ Remus]/[wiki:FGBI FGBI] and the network buffer frequency of [wiki:LLM LLM], we ensure the fairness of the comparison. We observe that Figures 2(a) and 2(b) show a reverse relationship between [wiki:FGBI FGBI] and [wiki:LLM LLM]. Under [http://httpd.apache.org/ Apache] (Figure 2(a)), the network load is high but system updates are rare. Therefore, [wiki:LLM LLM] performs better than [wiki:FGBI FGBI], since it uses a much higher frequency to migrate the network service requests. On the other hand, when running memory-intensive applications (Figures 2(b) and 2(d)), which involve high computational loads, [wiki:LLM LLM] endures a much longer downtime than [wiki:FGBI FGBI] (even worse than [http://nss.cs.ubc.ca/remus/ Remus]). […] pages/second. Thus, [http://www.spec.org/web2005/ SPECweb] is not a lightweight computational workload for these migration mechanisms. As a result, the relationship between [wiki:FGBI FGBI] and [wiki:LLM LLM] in Figure 2(c) is closer to that in Figure 2(b) (and also Figure 2(d)) than to that in Figure 2(a). In conclusion, compared with [wiki:LLM LLM], [wiki:FGBI FGBI] reduces the downtime by as much as 77%. Moreover, compared with [http://nss.cs.ubc.ca/remus/ Remus], [wiki:FGBI FGBI] yields a shorter downtime, by as much as 31% under [http://httpd.apache.org/ Apache], 45% under [http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP], 39% […]

Figure 3. (a) Overhead under different block sizes. (b) Comparison of the proposed techniques.

Figure 3(a) shows the overhead during VM migration. The figure compares the applications' runtime with and without migration, under [http://httpd.apache.org/ Apache], [http://www.spec.org/web2005/ SPECweb], [http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP], and [http://www.spec.org/sfs97r1/ SPECsys], with the size of the fine-grained blocks varying from 64 bytes to 128 bytes and 256 bytes. We observe that, in all cases, the overhead is […]

[…] than 10 times the downtime that [wiki:FGBI FGBI] incurs.
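The block-size trade-off in Figure 3(a) can be illustrated with a short sketch of fine-grained dirty-block identification. The per-block digests and all names below are assumptions chosen for illustration, not the actual [wiki:FGBI FGBI] implementation: a single-byte guest write marks a whole 4 KB page dirty, yet only the fine-grained block containing that write actually has to be transferred, so smaller blocks reduce the data sent per dirty page at the cost of computing and storing more digests per page.

{{{#!python
import hashlib

PAGE_SIZE = 4096                     # a write anywhere in a page marks it dirty


def block_digests(page, block_size):
    """One digest per fixed-size block of the page, kept from the last transfer."""
    return [hashlib.md5(page[off:off + block_size]).digest()
            for off in range(0, PAGE_SIZE, block_size)]


def changed_blocks(page, prev_digests, block_size):
    """(offset, data) pairs for every block whose digest differs from the
    previously transferred version of the page; only these need to be sent."""
    changed = []
    for idx, digest in enumerate(block_digests(page, block_size)):
        if digest != prev_digests[idx]:
            off = idx * block_size
            changed.append((off, page[off:off + block_size]))
    return changed


if __name__ == "__main__":
    old = bytes(PAGE_SIZE)           # page content at the last transfer
    new = bytearray(old)
    new[100] = 1                     # a single-byte write dirties the whole page
    for block_size in (64, 128, 256):
        sent = changed_blocks(bytes(new), block_digests(old, block_size), block_size)
        print("block size %3d B: blocks sent=%d, bytes sent=%d, digests per page=%d"
              % (block_size, len(sent),
                 sum(len(data) for _, data in sent), PAGE_SIZE // block_size))
}}}

Running the sketch prints, for each block size, how many bytes would be sent for this dirty page and how many digests must be maintained per page; under these assumptions, finer blocks send less data but require more digests to compute and store, which is where the extra tracking cost comes from.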
We observe from Figure 3(b) that if we just apply the [wiki:FGBI FGBI] mechanism without integrating sharing or compression support, the downtime is reduced compared with that of [http://nss.cs.ubc.ca/remus/ Remus] in Figure 3(b), but the reduction is not significant (no more […]