\documentclass[a4paper]{article}
\begin{document}

\title{The rsync algorithm}

\author{Andrew Tridgell \quad\quad Paul Mackerras\\
Department of Computer Science \\
Australian National University \\
Canberra, ACT 0200, Australia}

\maketitle

\begin{abstract}
This report presents an algorithm for updating a file on one machine
to be identical to a file on another machine. We assume that the
two machines are connected by a low-bandwidth high-latency
bi-directional communications link. The algorithm identifies parts
of the source file which are identical to some part of the
destination file, and only sends those parts which cannot be matched
in this way. Effectively, the algorithm computes a set of
differences without having both files on the same machine. The
algorithm works best when the files are similar, but will also
function correctly and reasonably efficiently when the files are
quite different.
\end{abstract}
27 | ||
28 | \section{The problem} | |
29 | ||
30 | Imagine you have two files, $A$ and $B$, and you wish to update $B$ to be | |
31 | the same as $A$. The obvious method is to copy $A$ onto $B$. | |
32 | ||
33 | Now imagine that the two files are on machines connected by a slow | |
34 | communications link, for example a dial up IP link. If $A$ is large, | |
35 | copying $A$ onto $B$ will be slow. To make it faster you could | |
36 | compress $A$ before sending it, but that will usually only gain a | |
37 | factor of 2 to 4. | |
38 | ||
39 | Now assume that $A$ and $B$ are quite similar, perhaps both derived | |
40 | from the same original file. To really speed things up you would need | |
41 | to take advantage of this similarity. A common method is to send just | |
42 | the differences between $A$ and $B$ down the link and then use this | |
43 | list of differences to reconstruct the file. | |
44 | ||
45 | The problem is that the normal methods for creating a set of | |
46 | differences between two files rely on being able to read both files. | |
47 | Thus they require that both files are available beforehand at one end | |
48 | of the link. If they are not both available on the same machine, | |
49 | these algorithms cannot be used (once you had copied the file over, | |
50 | you wouldn't need the differences). This is the problem that rsync | |
51 | addresses. | |
52 | ||
53 | The rsync algorithm efficiently computes which parts of a source file | |
54 | match some part of an existing destination file. These parts need not | |
55 | be sent across the link; all that is needed is a reference to the part | |
56 | of the destination file. Only parts of the source file which are not | |
57 | matched in this way need to be sent verbatim. The receiver can then | |
58 | construct a copy of the source file using the references to parts of | |
59 | the existing destination file and the verbatim material. | |
60 | ||
61 | Trivially, the data sent to the receiver can be compressed using any | |
62 | of a range of common compression algorithms, for further speed | |
63 | improvements. | |
64 | ||
65 | \section{The rsync algorithm} | |
66 | ||
67 | Suppose we have two general purpose computers $\alpha$ and $\beta$. | |
68 | Computer $\alpha$ has access to a file $A$ and $\beta$ has access to | |
69 | file $B$, where $A$ and $B$ are ``similar''. There is a slow | |
70 | communications link between $\alpha$ and $\beta$. | |
71 | ||
72 | The rsync algorithm consists of the following steps: | |
73 | ||
74 | \begin{enumerate} | |
75 | \item $\beta$ splits the file $B$ into a series of non-overlapping | |
76 | fixed-sized blocks of size S bytes\footnote{We have found that | |
77 | values of S between 500 and 1000 are quite good for most purposes}. | |
78 | The last block may be shorter than $S$ bytes. | |
79 | ||
80 | \item For each of these blocks $\beta$ calculates two checksums: | |
81 | a weak ``rolling'' 32-bit checksum (described below) and a strong | |
82 | 128-bit MD4 checksum. | |
83 | ||
84 | \item $\beta$ sends these checksums to $\alpha$. | |
85 | ||
86 | \item $\alpha$ searches through $A$ to find all blocks of length $S$ | |
87 | bytes (at any offset, not just multiples of $S$) that have the same | |
88 | weak and strong checksum as one of the blocks of $B$. This can be | |
89 | done in a single pass very quickly using a special property of the | |
90 | rolling checksum described below. | |
91 | ||
92 | \item $\alpha$ sends $\beta$ a sequence of instructions for | |
93 | constructing a copy of $A$. Each instruction is either a reference | |
94 | to a block of $B$, or literal data. Literal data is sent only for | |
95 | those sections of $A$ which did not match any of the blocks of $B$. | |
96 | \end{enumerate} | |
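
Before turning to the details, it may help to picture the data
exchanged in steps 2--5 as a pair of C records. This is an
illustrative sketch only; the field names and layout are our
assumptions, not the actual rsync wire format.

\begin{verbatim}
#include <stdint.h>

/* What beta sends to alpha for each block of B (steps 2 and 3). */
struct block_sum {
    uint32_t weak;        /* 32-bit rolling checksum */
    uint8_t  strong[16];  /* 128-bit MD4 checksum */
    uint32_t index;       /* position of the block within B */
};

/* One instruction in alpha's reply (step 5): either "copy block
 * `index' of B into the output" or "append `len' literal bytes,
 * which follow inline". */
struct instruction {
    enum { COPY_BLOCK, LITERAL } kind;
    uint32_t index;       /* valid when kind == COPY_BLOCK */
    uint32_t len;         /* valid when kind == LITERAL */
};
\end{verbatim}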
97 | ||
98 | The end result is that $\beta$ gets a copy of $A$, but only the pieces | |
99 | of $A$ that are not found in $B$ (plus a small amount of data for | |
100 | checksums and block indexes) are sent over the link. The algorithm | |
101 | also only requires one round trip, which minimises the impact of the | |
102 | link latency. | |
103 | ||
104 | The most important details of the algorithm are the rolling checksum | |
105 | and the associated multi-alternate search mechanism which allows the | |
106 | all-offsets checksum search to proceed very quickly. These will be | |
107 | discussed in greater detail below. | |
108 | ||
109 | \section{Rolling checksum} | |
110 | ||
111 | The weak rolling checksum used in the rsync algorithm needs to have | |
112 | the property that it is very cheap to calculate the checksum of a | |
113 | buffer $X_2 .. X_{n+1}$ given the checksum of buffer $X_1 .. X_n$ and | |
114 | the values of the bytes $X_1$ and $X_{n+1}$. | |
115 | ||
116 | The weak checksum algorithm we used in our implementation was inspired | |
117 | by Mark Adler's adler-32 checksum. Our checksum is defined by | |
118 | $$ a(k,l) = (\sum_{i=k}^l X_i) \bmod M $$ | |
119 | $$ b(k,l) = (\sum_{i=k}^l (l-i+1)X_i) \bmod M $$ | |
120 | $$ s(k,l) = a(k,l) + 2^{16} b(k,l) $$ | |
121 | ||
122 | where $s(k,l)$ is the rolling checksum of the bytes $X_k \ldots X_l$. | |
123 | For simplicity and speed, we use $M = 2^{16}$. | |
124 | ||
125 | The important property of this checksum is that successive values can | |
126 | be computed very efficiently using the recurrence relations | |
127 | ||
128 | $$ a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M $$ | |
129 | $$ b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M $$ | |
130 | ||
131 | Thus the checksum can be calculated for blocks of length S at all | |
132 | possible offsets within a file in a ``rolling'' fashion, with very | |
133 | little computation at each point. | |
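
For concreteness, here is a minimal C sketch of these formulas with
$M = 2^{16}$. The struct and function names are ours for illustration
and are not taken from the rsync sources.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>

#define MOD 65536u                  /* M = 2^16 */

typedef struct { uint32_t a, b; } rolling_sum;

/* Compute a(k,l) and b(k,l) directly over one block of len bytes. */
static rolling_sum weak_init(const unsigned char *buf, size_t len)
{
    rolling_sum s = { 0, 0 };
    for (size_t i = 0; i < len; i++) {
        s.a = (s.a + buf[i]) % MOD;
        s.b = (s.b + (uint32_t)(len - i) * buf[i]) % MOD;
    }
    return s;
}

/* Slide a window of n bytes one byte to the right: drop x_k and
 * take in x_{l+1}, using the recurrences above.  The additions of
 * MOD and MOD*n keep the unsigned arithmetic non-negative. */
static void weak_roll(rolling_sum *s, uint32_t n,
                      unsigned char x_k, unsigned char x_l1)
{
    s->a = (s->a + MOD - x_k + x_l1) % MOD;
    s->b = (s->b + MOD * n - n * x_k + s->a) % MOD;
}

/* Pack the two halves into s(k,l) = a + 2^16 * b. */
static uint32_t weak_digest(rolling_sum s)
{
    return s.a | (s.b << 16);
}
\end{verbatim}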
134 | ||
135 | Despite its simplicity, this checksum was found to be quite adequate as | |
136 | a first level check for a match of two file blocks. We have found in | |
137 | practice that the probability of this checksum matching when the | |
138 | blocks are not equal is quite low. This is important because the much | |
139 | more expensive strong checksum must be calculated for each block where | |
140 | the weak checksum matches. | |
141 | ||
142 | \section{Checksum searching} | |
143 | ||
144 | Once $\alpha$ has received the list of checksums of the blocks of $B$, | |
145 | it must search $A$ for any blocks at any offset that match the | |
146 | checksum of some block of $B$. The basic strategy is to compute the | |
147 | 32-bit rolling checksum for a block of length $S$ starting at each | |
148 | byte of $A$ in turn, and for each checksum, search the list for a | |
149 | match. To do this our implementation uses a | |
150 | simple 3 level searching scheme. | |
151 | ||
152 | The first level uses a 16-bit hash of the 32-bit rolling checksum and | |
153 | a $2^{16}$ entry hash table. The list of checksum values (i.e., the | |
154 | checksums from the blocks of $B$) is sorted according to the 16-bit | |
155 | hash of the 32-bit rolling checksum. Each entry in the hash table | |
156 | points to the first element of the list for that hash value, or | |
157 | contains a null value if no element of the list has that hash value. | |
158 | ||
159 | At each offset in the file the 32-bit rolling checksum and its 16-bit | |
160 | hash are calculated. If the hash table entry for that hash value is | |
161 | not a null value, the second level check is invoked. | |
162 | ||
163 | The second level check involves scanning the sorted checksum list | |
164 | starting with the entry pointed to by the hash table entry, looking | |
165 | for an entry whose 32-bit rolling checksum matches the current value. | |
166 | The scan terminates when it reaches an entry whose 16-bit hash | |
167 | differs. If this search finds a match, the third level check is | |
168 | invoked. | |
169 | ||
170 | The third level check involves calculating the strong checksum for the | |
171 | current offset in the file and comparing it with the strong checksum | |
172 | value in the current list entry. If the two strong checksums match, | |
173 | we assume that we have found a block of $A$ which matches a block of | |
174 | $B$. In fact the blocks could be different, but the probability of | |
175 | this is microscopic, and in practice this is a reasonable assumption. | |
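
The three levels can be summarised in a short C sketch, reusing the
hypothetical \verb|block_sum| record from the earlier sketch and
assuming the list \verb|sums| has been sorted by the 16-bit hash, with
\verb|table| holding the first list index for each hash value (or $-1$).
The MD4 helper is assumed, and is only called once the cheaper checks
have passed.

\begin{verbatim}
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* assumed helper: MD4 of buf[0..len) written into out[16] */
extern void md4(const unsigned char *buf, size_t len,
                unsigned char out[16]);

/* hypothetical 16-bit hash of the 32-bit rolling checksum */
static uint16_t tag_hash(uint32_t weak)
{
    return (uint16_t)(weak ^ (weak >> 16));
}

/* Return the index of the matching block of B, or -1. */
static int find_match(uint32_t weak,
                      const unsigned char *window, size_t s,
                      const struct block_sum *sums, int nsums,
                      const int32_t *table)
{
    uint16_t t = tag_hash(weak);
    int32_t i = table[t];
    if (i < 0)
        return -1;                       /* level 1: tag table miss */

    unsigned char strong[16];
    int have_strong = 0;
    for (; i < nsums && tag_hash(sums[i].weak) == t; i++) {
        if (sums[i].weak != weak)
            continue;                    /* level 2: weak mismatch */
        if (!have_strong) {              /* level 3: computed lazily */
            md4(window, s, strong);
            have_strong = 1;
        }
        if (memcmp(strong, sums[i].strong, 16) == 0)
            return (int)sums[i].index;   /* found a matching block */
    }
    return -1;
}
\end{verbatim}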
176 | ||
177 | When a match is found, $\alpha$ sends $\beta$ the data in $A$ between | |
178 | the current file offset and the end of the previous match, followed by | |
179 | the index of the block in $B$ that matched. This data is sent | |
180 | immediately a match is found, which allows us to overlap the | |
181 | communication with further computation. | |
182 | ||
183 | If no match is found at a given offset in the file, the rolling | |
184 | checksum is updated to the next offset and the search proceeds. If a | |
185 | match is found, the search is restarted at the end of the matched | |
186 | block. This strategy saves a considerable amount of computation for | |
187 | the common case where the two files are nearly identical. In | |
188 | addition, it would be a simple matter to encode the block indexes as | |
189 | runs, for the common case where a portion of $A$ matches a series of | |
190 | blocks of $B$ in order. | |
191 | ||
192 | \section{Pipelining} | |
193 | ||
194 | The above sections describe the process for constructing a copy of one | |
195 | file on a remote system. If we have a several files to copy, we can | |
196 | gain a considerable latency advantage by pipelining the process. | |
197 | ||
198 | This involves $\beta$ initiating two independent processes. One of the | |
199 | processes generates and sends the checksums to $\alpha$ while the | |
200 | other receives the difference information from $\alpha$ and | |
201 | reconstructs the files. | |
202 | ||
203 | If the communications link is buffered then these two processes can | |
204 | proceed independently and the link should be kept fully utilised in | |
205 | both directions for most of the time. | |
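
One way to realise this is sketched below, with $\beta$ forking a
child to generate checksums while the parent reconstructs files. The
two helper functions are hypothetical placeholders for the per-file
work described above, not functions from the rsync sources.

\begin{verbatim}
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* assumed helpers, one per direction of the link */
extern void generate_and_send_sums(int to_alpha);
extern void receive_and_reconstruct(int from_alpha);

void run_beta(int to_alpha, int from_alpha)
{
    pid_t pid = fork();
    if (pid == 0) {
        /* child: walks the file list, sending checksums ahead */
        generate_and_send_sums(to_alpha);
        _exit(0);
    }
    /* parent: reconstructs files as difference data arrives */
    receive_and_reconstruct(from_alpha);
    waitpid(pid, NULL, 0);
}
\end{verbatim}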
206 | ||
207 | \section{Results} | |
208 | ||
209 | To test the algorithm, tar files were created of the Linux kernel | |
210 | sources for two versions of the kernel. The two kernel versions were | |
211 | 1.99.10 and 2.0.0. These tar files are approximately 24MB in size and | |
212 | are separated by 5 released patch levels. | |
213 | ||
214 | Out of the 2441 files in the 1.99.10 release, 291 files had changed in | |
215 | the 2.0.0 release, 19 files had been removed and 25 files had been | |
216 | added. | |
217 | ||
218 | A ``diff'' of the two tar files using the standard GNU diff utility | |
219 | produced over 32 thousand lines of output totalling 2.1 MB. | |
220 | ||
221 | The following table shows the results for rsync between the two files | |
222 | with a varying block size.\footnote{All the tests in this section were | |
223 | carried out using rsync version 0.5} | |
224 | ||
225 | \vspace*{5mm} | |
226 | \begin{tabular}{|l|l|l|l|l|l|l|} \hline | |
227 | {\bf block} & {\bf matches} & {\bf tag} & {\bf false} & {\bf data} & {\bf written} & {\bf read} \\ | |
228 | {\bf size} & & {\bf hits} & {\bf alarms} & & & \\ \hline \hline | |
229 | ||
230 | 300 & 64247 & 3817434 & 948 & 5312200 & 5629158 & 1632284 \\ \hline | |
231 | 500 & 46989 & 620013 & 64 & 1091900 & 1283906 & 979384 \\ \hline | |
232 | 700 & 33255 & 571970 & 22 & 1307800 & 1444346 & 699564 \\ \hline | |
233 | 900 & 25686 & 525058 & 24 & 1469500 & 1575438 & 544124 \\ \hline | |
234 | 1100 & 20848 & 496844 & 21 & 1654500 & 1740838 & 445204 \\ \hline | |
235 | \end{tabular} | |
236 | \vspace*{5mm} | |
237 | ||
238 | In each case, the CPU time taken was less than the | |
239 | time it takes to run ``diff'' on the two files.\footnote{The wall | |
240 | clock time was approximately 2 minutes per run on a 50 MHz SPARC 10 | |
241 | running SunOS, using rsh over loopback for communication. GNU diff | |
242 | took about 4 minutes.} | |
243 | ||
244 | The columns in the table are as follows: | |
245 | ||
246 | \begin{description} | |
247 | \item [block size] The size in bytes of the checksummed blocks. | |
248 | \item [matches] The number of times a block of $B$ was found in $A$. | |
249 | \item [tag hits] The number of times the 16 bit hash of the rolling | |
250 | checksum matched a hash of one of the checksums from $B$. | |
251 | \item [false alarms] The number of times the 32 bit rolling checksum | |
252 | matched but the strong checksum didn't. | |
253 | \item [data] The amount of file data transferred verbatim, in bytes. | |
254 | \item [written] The total number of bytes written by $\alpha$ | |
255 | including protocol overheads. This is almost all file data. | |
256 | \item [read] The total number of bytes read by $\alpha$ including | |
257 | protocol overheads. This is almost all checksum information. | |
258 | \end{description} | |
259 | ||
260 | The results demonstrate that for block sizes above 300 bytes, only a | |
261 | small fraction (around 5\%) of the file was transferred. The amount | |
262 | transferred was also considerably less than the size of the diff file | |
263 | that would have been transferred if the diff/patch method of updating | |
264 | a remote file was used. | |
265 | ||
266 | The checksums themselves took up a considerable amount of space, | |
267 | although much less than the size of the data transferred in each | |
268 | case. Each pair of checksums consumes 20 bytes: 4 bytes for the | |
269 | rolling checksum plus 16 bytes for the 128-bit MD4 checksum. | |
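
As a rough check on these numbers: with a block size of 700 bytes the
24 MB tar file divides into roughly $24 \times 10^6 / 700 \approx
34000$ blocks, or about $34000 \times 20 \approx 690$~kB of checksum
data, in line with the {\bf read} column of the table above.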
270 | ||
271 | The number of false alarms was less than $1/1000$ of the number of | |
272 | true matches, indicating that the 32 bit rolling checksum is quite | |
273 | good at screening out false matches. | |
274 | ||
275 | The number of tag hits indicates that the second level of the | |
276 | checksum search algorithm was invoked about once every 50 | |
277 | characters. This is quite high because the total number of blocks in | |
278 | the file is a large fraction of the size of the tag hash table. For | |
279 | smaller files we would expect the tag hit rate to be much closer to | |
280 | the number of matches. For extremely large files, we should probably | |
281 | increase the size of the hash table. | |
282 | ||
283 | The next table shows similar results for a much smaller set of files. | |
284 | In this case the files were not packed into a tar file first. Rather, | |
285 | rsync was invoked with an option to recursively descend the directory | |
286 | tree. The files used were from two source releases of another software | |
287 | package called Samba. The total source code size is 1.7 MB and the | |
288 | diff between the two releases is 4155 lines long totalling 120 kB. | |
289 | ||
290 | \vspace*{5mm} | |
291 | \begin{tabular}{|l|l|l|l|l|l|l|} \hline | |
292 | {\bf block} & {\bf matches} & {\bf tag} & {\bf false} & {\bf data} & {\bf written} & {\bf read} \\ | |
293 | {\bf size} & & {\bf hits} & {\bf alarms} & & & \\ \hline \hline | |
294 | ||
295 | 300 & 3727 & 3899 & 0 & 129775 & 153999 & 83948 \\ \hline | |
296 | 500 & 2158 & 2325 & 0 & 171574 & 189330 & 50908 \\ \hline | |
297 | 700 & 1517 & 1649 & 0 & 195024 & 210144 & 36828 \\ \hline | |
298 | 900 & 1156 & 1281 & 0 & 222847 & 236471 & 29048 \\ \hline | |
299 | 1100 & 921 & 1049 & 0 & 250073 & 262725 & 23988 \\ \hline | |
300 | \end{tabular} | |
301 | \vspace*{5mm} | |
302 | ||
303 | ||
304 | \section{Availability} | |
305 | ||
306 | An implementation of rsync which provides a convenient interface | |
307 | similar to the common UNIX command rcp has been written and is | |
308 | available for download from ftp://samba.anu.edu.au/pub/rsync. | |
309 | ||
310 | \end{document} |