Traditional Xen-based systems track memory updates by keeping evidence of the dirty pages at each migration epoch. In [http://nss.cs.ubc.ca/remus/ Remus] (and in our previous work, [wiki:LLM LLM]), the same page size as that of Xen (for x86, this is 4KB) is used as the granularity for detecting memory changes. However, when running computationally intensive workloads under [wiki:LLM LLM], the downtime becomes unacceptably long. [wiki:FGBI FGBI] (Fine-Grained Block Identification) is a mechanism that uses memory blocks smaller than a page as the granularity for detecting memory changes. [wiki:FGBI FGBI] calculates a hash value for each memory block at the beginning of each migration epoch. At the end of each epoch, instead of transferring whole dirty pages, [wiki:FGBI FGBI] computes a new hash value for each block and compares it with the corresponding old value. Only blocks whose hash values do not match are considered modified; [wiki:FGBI FGBI] marks such blocks as dirty and replaces their old hash values with the new ones. Afterwards, [wiki:FGBI FGBI] transfers only the dirty blocks to the backup host.
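The following is a minimal sketch of this block-level detection, assuming 4KB pages, 128-byte blocks, and MD5 as the per-block hash; the block size, hash function, and helper names are illustrative assumptions rather than the actual [wiki:FGBI FGBI] implementation.

{{{#!python
import hashlib

PAGE_SIZE = 4096                      # Xen x86 page size, as in the text
BLOCK_SIZE = 128                      # assumed fine-grained block size
BLOCKS_PER_PAGE = PAGE_SIZE // BLOCK_SIZE

def block_hashes(page):
    """Hash every block of a page at the beginning of an epoch."""
    return [hashlib.md5(page[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]).digest()
            for i in range(BLOCKS_PER_PAGE)]

def dirty_blocks(page, old_hashes):
    """At the end of the epoch, return (index, data) for blocks whose hash
    changed, and replace the old hash values with the new ones."""
    dirty = []
    for i in range(BLOCKS_PER_PAGE):
        block = page[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        new_hash = hashlib.md5(block).digest()
        if new_hash != old_hashes[i]:     # mismatch => block was modified
            dirty.append((i, bytes(block)))
            old_hashes[i] = new_hash
    return dirty                          # only these blocks go to the backup

# Example: a page in which only the second block is written during the epoch.
page = bytearray(PAGE_SIZE)
hashes = block_hashes(page)
page[BLOCK_SIZE:2 * BLOCK_SIZE] = b'\x01' * BLOCK_SIZE
print(len(dirty_blocks(page, hashes)))    # -> 1 block instead of a whole 4KB page
}}}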
Downtime is the primary factor for estimating the high availability of a system, since any long downtime experienced by clients may result in loss of client loyalty and thus revenue loss. Under the Primary-Backup model (Figure 1), there are two types of downtime: I) the time from when the primary host crashes until the VM resumes from the last checkpointed state on the backup host and starts to handle client requests (D,,1,, = T,,3,, - T,,1,,); and II) the time from when the VM pauses on the primary (to save the checkpoint) until it resumes (D,,2,,). From Jiang’s paper, we observe that for memory-intensive workloads running on guest VMs (such as the HighSys workload), [wiki:LLM LLM] endures much longer type I downtime than [http://nss.cs.ubc.ca/remus/ Remus]. This is because such workloads update the guest memory at high frequency, whereas [wiki:LLM LLM] migrates the guest VM image update (mostly from memory) at low frequency and uses input replay as an auxiliary. Thus, when a failure happens, a significant number of memory updates are needed in order to ensure synchronization between the primary and backup hosts, so [wiki:LLM LLM] needs significantly more time for the input replay process before the VM can resume on the backup host and begin handling client requests.
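As a minimal illustration of these two definitions, the short sketch below computes D,,1,, and D,,2,, from hypothetical timestamps; the values and variable names are placeholders, not measurements from the paper.

{{{#!python
# Hypothetical timestamps (seconds); only the two subtractions matter here.
T1_primary_crash  = 120.00   # primary host fails (T1)
T3_backup_resumes = 120.80   # VM resumes from the last checkpoint on the backup (T3)
D1 = T3_backup_resumes - T1_primary_crash   # type I downtime: D1 = T3 - T1

t_paused  = 60.000           # VM paused on the primary to save the checkpoint
t_resumed = 60.035           # VM resumed on the primary
D2 = t_resumed - t_paused    # type II downtime

print(f"D1 = {D1:.3f}s, D2 = {D2:.3f}s")
}}}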
Regarding the type II downtime, there are several migration epochs between two checkpoints, and the newly updated memory data is copied to the backup host at each epoch. At the last epoch, the VM running on the primary host is suspended and the remaining memory state is transferred to the backup host. Thus, the type II downtime depends on the amount of memory that remains to be copied and transferred when pausing the VM on the primary host. If we reduce the dirty data that needs to be transferred at the last epoch, then we reduce the type II downtime. Moreover, if we reduce the dirty data that needs to be transferred at each epoch, keeping the memory state of the primary and backup hosts synchronized all the time, then at the last epoch there will not be many new memory updates left to transfer, so we also reduce the type I downtime.
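The sketch below outlines this epoch structure with placeholder classes (not the actual [http://nss.cs.ubc.ca/remus/ Remus]/[wiki:LLM LLM]/[wiki:FGBI FGBI] code); it only illustrates that the type II downtime is the suspend-to-resume window, whose length is driven by the dirty data left at the last epoch.

{{{#!python
# Illustrative epoch loop for checkpoint-based migration. The classes are
# minimal stand-ins so the sketch runs on its own.
class DummyVM:
    def __init__(self): self.dirty = [b'block0', b'block1']
    def collect_dirty_blocks(self):        # e.g., via FGBI hash comparison
        d, self.dirty = self.dirty, []
        return d
    def suspend(self): print("VM suspended")   # --- type II downtime starts ---
    def resume(self):  print("VM resumed")     # --- type II downtime ends ---

class DummyBackup:
    def send(self, blocks): print(f"sent {len(blocks)} dirty blocks")
    def commit_checkpoint(self): print("checkpoint committed on backup")

def checkpoint_interval(vm, backup, epochs_per_checkpoint=3):
    # Copy updates while the VM keeps running (several migration epochs)...
    for _ in range(epochs_per_checkpoint - 1):
        backup.send(vm.collect_dirty_blocks())
    # ...then pause only for whatever is still dirty at the last epoch.
    vm.suspend()
    backup.send(vm.collect_dirty_blocks())
    backup.commit_checkpoint()
    vm.resume()

checkpoint_interval(DummyVM(), DummyBackup())
}}}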
Figures 2a, 2b, 2c, and 2d show the type I downtime comparison among the [wiki:FGBI FGBI], [wiki:LLM LLM], and [http://nss.cs.ubc.ca/remus/ Remus] mechanisms under [http://httpd.apache.org/ Apache], [http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP],
Although [http://www.spec.org/web2005/ SPECweb] is a web workload, it still has a high page modification rate of approximately 12,000 pages/second. In our experiment, the 1 Gbps migration link is capable of transferring approximately 25,000
main observations. First, the downtime results are very similar for the idle run case. This is because [http://nss.cs.ubc.ca/remus/ Remus] is a fast checkpointing mechanism and both [wiki:LLM LLM] and [wiki:FGBI FGBI] are based on it. Memory updates are rare during idle runs, so the type II downtime in all three mechanisms is short and similar. Second, when running the [http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP] application, the guest VM memory is updated at high frequency. When saving the checkpoint, [wiki:LLM LLM] takes much more time to save the large amount of dirty data accumulated due to its low memory transfer frequency. Therefore, in this case, [wiki:FGBI FGBI] achieves a much lower downtime than [http://nss.cs.ubc.ca/remus/ Remus] (a reduction of more than 70%) and [wiki:LLM LLM] (more than 90%). Finally, when running the [http://httpd.apache.org/ Apache] application, memory is not updated as heavily as when running [http://www.nas.nasa.gov/Resources/Software/npb.html NPB], but significantly more than during the idle run. The downtime results show that [wiki:FGBI FGBI] still outperforms both [http://nss.cs.ubc.ca/remus/ Remus] and
[http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP], and [http://www.spec.org/sfs97r1/ SPECsys], with the size of the fine-grained blocks set to 64, 128, and 256 bytes. We observe that, in all cases, the overhead is low, no more than 13% ([http://httpd.apache.org/ Apache] with 64-byte blocks). As discussed before,
techniques (i.e., [wiki:FGBI FGBI], sharing, and compression), Figure 3b shows the breakdown of the performance improvement among them under the [http://www.nas.nasa.gov/Resources/Software/npb.html NPB-EP] benchmark. The figure compares the downtime among integrated [wiki:FGBI FGBI] (which we use for evaluation here), [wiki:FGBI FGBI] with sharing but no compression support,