Changes between Version 45 and Version 46 of FGBI


Timestamp: 10/10/11 01:36:35
Author: lvpeng

Downtime is the primary factor for estimating the high availability of a system, since any long downtime experienced by clients may result in loss of client loyalty and thus revenue loss. Under the Primary-Backup model (Figure 1), there are two types of downtime: I) the time from when the primary host crashes until the VM resumes from the last checkpointed state on the backup host and starts to handle client requests (D,,1,, = T,,3,, - T,,1,,); and II) the time from when the VM pauses on the primary (to save the checkpoint) until it resumes (D,,2,,). From the [wiki:Publications SSS'10] paper, we observe that for memory-intensive workloads running on guest VMs (such as the highSys workload), [wiki:LLM LLM] endures a much longer type I downtime than [http://nss.cs.ubc.ca/remus/ Remus]. This is because such workloads update the guest memory at high frequency. In contrast, [wiki:LLM LLM] migrates the guest VM image updates (mostly from memory) at low frequency and uses input replay as an auxiliary. Thus, when a failure happens, a significant number of memory updates are needed to bring the primary and backup hosts back into synchronization. Therefore, [wiki:LLM LLM] needs significantly more time for the input replay process before it can resume the VM on the backup host and begin handling client requests.
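The two downtime types follow directly from the event timestamps named above. The following is a toy sketch; the timestamp values and function names are illustrative only and are not measurements from the evaluation:

{{{
#!python
# Toy illustration of the two downtime types under the Primary-Backup model.
# D1 = T3 - T1: from the primary host crashing (T1) until the VM resumes from
#               the last checkpoint on the backup and serves requests (T3).
# D2:           from pausing the VM on the primary to save a checkpoint until
#               it resumes.

def type1_downtime(t1_primary_crash, t3_backup_serving):
    """D1 = T3 - T1 (seconds)."""
    return t3_backup_serving - t1_primary_crash

def type2_downtime(t_pause_for_checkpoint, t_resume_on_primary):
    """Time the VM stays paused on the primary while the checkpoint is saved."""
    return t_resume_on_primary - t_pause_for_checkpoint

# Hypothetical timestamps: a crash at t = 10.0 s and resumption on the backup
# at t = 12.5 s give D1 = 2.5 s; a 50 ms checkpoint pause gives D2 = 0.05 s.
print(type1_downtime(10.0, 12.5))    # 2.5
print(type2_downtime(20.0, 20.05))   # ~0.05
}}}
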
There are several migration epochs between two checkpoints, and the newly updated memory data is copied to the backup host at each epoch. At the last epoch, the VM running on the primary host is suspended and the remaining memory states are transferred to the backup host. Thus, the type II downtime depends on the amount of memory that remains to be copied and transferred when pausing the VM on the primary host. If we reduce the dirty data that needs to be transferred at the last epoch, then we can reduce the type II downtime. Moreover, if we synchronize the memory state between the primary and backup hosts all the time, and reduce the data transferred at each epoch, then at the last epoch there won't be significant memory updates left to transfer. Thus, we can also reduce the type I downtime.
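To make the epoch structure concrete, here is a minimal sketch of a generic pre-copy checkpointing loop. It is an illustration under assumed interfaces (ToyVM, ToyBackup, and collect_dirty_regions are hypothetical names), not the actual [http://nss.cs.ubc.ca/remus/ Remus]/[wiki:LLM LLM]/[wiki:FGBI FGBI] code; it only shows why the pause at the last epoch (type II downtime) is governed by the dirty data still untransferred, and why smaller per-epoch transfers leave less behind:

{{{
#!python
# Generic pre-copy checkpointing loop (illustration only, not the FGBI code).
# Each epoch copies the regions dirtied since the previous epoch while the VM
# keeps running; at the last epoch the VM is paused and whatever is still
# dirty is flushed, so the pause (type II downtime) tracks that final set.

class ToyVM:
    """Stand-in for a guest VM with dirty-region tracking (hypothetical API)."""
    def __init__(self, dirty_per_epoch):
        self._dirty = iter(dirty_per_epoch)
    def collect_dirty_regions(self):
        return next(self._dirty, [])
    def pause(self):  pass    # start of type II downtime
    def resume(self): pass    # end of type II downtime

class ToyBackup:
    """Stand-in for the backup host."""
    def receive(self, regions):  pass
    def commit_checkpoint(self): pass

def checkpoint(vm, backup, epochs):
    for _ in range(epochs - 1):
        backup.receive(vm.collect_dirty_regions())   # copied while VM runs
    vm.pause()
    remaining = vm.collect_dirty_regions()           # still out of sync
    backup.receive(remaining)                        # drives type II downtime
    backup.commit_checkpoint()
    vm.resume()
    return len(remaining)

# The less each epoch has to transfer (e.g., only the sub-page blocks that
# really changed), the less remains at the last epoch, shrinking both the
# pause on the primary (type II) and the catch-up after a failover (type I).
vm = ToyVM(dirty_per_epoch=[["pageA", "pageB"], ["pageB"], ["pageC"]])
print(checkpoint(vm, ToyBackup(), epochs=3))         # 1 region left at the end
}}}
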
== [wiki:FGBI FGBI] Design ==
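The evaluation below uses fine-grained blocks of 64 to 256 bytes rather than whole memory pages. As a rough sketch of what tracking dirty data at that sub-page granularity can look like — assuming per-block hashing over 4 KB pages; the names are hypothetical and this is not the [wiki:FGBI FGBI] implementation — consider:

{{{
#!python
# Illustrative sub-page dirty tracking (not the FGBI implementation).
# A 4 KB page is split into fixed-size blocks (64 bytes here, the smallest
# size used in the experiments); keeping one hash per block lets the sender
# transfer only the blocks that actually changed instead of the whole page.

import hashlib

PAGE_SIZE  = 4096
BLOCK_SIZE = 64    # 64-, 128-, and 256-byte blocks are compared in Figure 3(a)

def block_hashes(page: bytes):
    """One digest per BLOCK_SIZE-byte block of a page."""
    return [hashlib.sha1(page[i:i + BLOCK_SIZE]).digest()
            for i in range(0, PAGE_SIZE, BLOCK_SIZE)]

def dirty_blocks(old_hashes, new_page: bytes):
    """Return (block index, block data) for every block whose hash changed."""
    changed = []
    for idx, new_hash in enumerate(block_hashes(new_page)):
        if new_hash != old_hashes[idx]:
            start = idx * BLOCK_SIZE
            changed.append((idx, new_page[start:start + BLOCK_SIZE]))
    return changed

# Example: flipping one byte dirties a single 64-byte block, so only 64 bytes
# (plus metadata) need to cross the network, rather than the full 4 KB page.
old_page = bytes(PAGE_SIZE)
new_page = bytearray(old_page)
new_page[100] = 0xFF
print(len(dirty_blocks(block_hashes(old_page), bytes(new_page))))   # 1
}}}
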
     
                Figure 2. Type I Downtime comparison under different benchmarks.

Figures 2(a), 2(b), 2(c), and 2(d) show the type I downtime comparison among [wiki:FGBI FGBI], [wiki:LLM LLM], and [http://nss.cs.ubc.ca/remus/ Remus] mechanisms under [http://httpd.apache.org/ Apache], [http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP],
[http://www.spec.org/web2005/ SPECweb], and [http://www.spec.org/sfs97r1/ SPECsys] applications, respectively. The block size used in all
experiments is 64 bytes. For [http://nss.cs.ubc.ca/remus/ Remus] and [wiki:FGBI FGBI], the checkpointing period is the
     
same value for the checkpointing frequency of [http://nss.cs.ubc.ca/remus/ Remus]/[wiki:FGBI FGBI] and the network
buffer frequency of [wiki:LLM LLM], we ensure the fairness of the comparison. We observe
that Figures 2(a) and 2(b) show a reverse relationship between [wiki:FGBI FGBI] and [wiki:LLM LLM].
Under [http://httpd.apache.org/ Apache] (Figure 2(a)), the network load is high but system updates are
rare. Therefore, [wiki:LLM LLM] performs better than [wiki:FGBI FGBI], since it uses a much higher
frequency to migrate the network service requests. On the other hand, when
running memory-intensive applications (Figures 2(b) and 2(d)), which involve high
computational loads, [wiki:LLM LLM] endures a much longer downtime than [wiki:FGBI FGBI] (even
worse than [http://nss.cs.ubc.ca/remus/ Remus]).
     
pages/second. Thus, [http://www.spec.org/web2005/ SPECweb] is not a lightweight computational workload for
these migration mechanisms. As a result, the relationship between [wiki:FGBI FGBI] and
[wiki:LLM LLM] in Figure 2(c) is more similar to that in Figure 2(b) (and also Figure 2(d))
than to that in Figure 2(a). In conclusion, compared with [wiki:LLM LLM], [wiki:FGBI FGBI] reduces the
downtime by as much as 77%. Moreover, compared with [http://nss.cs.ubc.ca/remus/ Remus], [wiki:FGBI FGBI] yields a
shorter downtime, by as much as 31% under [http://httpd.apache.org/ Apache], 45% under [http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP], 39%
     
    Figure 3. (a) Overhead under different block sizes. (b) Comparison of the proposed techniques.

Figure 3(a) shows the overhead during VM migration. The figure compares the
applications' runtime with and without migration, under [http://httpd.apache.org/ Apache], [http://www.spec.org/web2005/ SPECweb],
[http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP], and [http://www.spec.org/sfs97r1/ SPECsys], with the size of the fine-grained blocks set to 64, 128, and 256 bytes. We observe that, in all cases, the overhead is
     
than 10 times the downtime that [wiki:FGBI FGBI] incurs.

We observe from Figure 3(b) that if we just apply the [wiki:FGBI FGBI] mechanism without
integrating sharing or compression support, the downtime is reduced, compared
with that of [http://nss.cs.ubc.ca/remus/ Remus] in Figure 3(b), but it is not significant (the reduction is no more