| 1 | \documentclass[a4paper]{article} |
| 2 | \begin{document} |
| 3 | |
| 4 | |
| 5 | \title{The rsync algorithm} |
| 6 | |
| 7 | \author{Andrew Tridgell \quad\quad Paul Mackerras\\ |
| 8 | Department of Computer Science \\ |
| 9 | Australian National University \\ |
| 10 | Canberra, ACT 0200, Australia} |
| 11 | |
| 12 | \maketitle |
| 13 | |
| 14 | \begin{abstract} |
| 15 | This report presents an algorithm for updating a file on one machine |
| 16 | to be identical to a file on another machine. We assume that the |
| 17 | two machines are connected by a low-bandwidth high-latency |
| 18 | bi-directional communications link. The algorithm identifies parts |
| 19 | of the source file which are identical to some part of the |
| 20 | destination file, and only sends those parts which cannot be matched |
| 21 | in this way. Effectively, the algorithm computes a set of |
| 22 | differences without having both files on the same machine. The |
| 23 | algorithm works best when the files are similar, but will also |
| 24 | function correctly and reasonably efficiently when the files are |
| 25 | quite different. |
| 26 | \end{abstract} |
| 27 | |
| 28 | \section{The problem} |
| 29 | |
| 30 | Imagine you have two files, $A$ and $B$, and you wish to update $B$ to be |
| 31 | the same as $A$. The obvious method is to copy $A$ onto $B$. |
| 32 | |
| 33 | Now imagine that the two files are on machines connected by a slow |
communications link, for example a dial-up IP link. If $A$ is large,
| 35 | copying $A$ onto $B$ will be slow. To make it faster you could |
| 36 | compress $A$ before sending it, but that will usually only gain a |
| 37 | factor of 2 to 4. |
| 38 | |
| 39 | Now assume that $A$ and $B$ are quite similar, perhaps both derived |
| 40 | from the same original file. To really speed things up you would need |
| 41 | to take advantage of this similarity. A common method is to send just |
| 42 | the differences between $A$ and $B$ down the link and then use this |
| 43 | list of differences to reconstruct the file. |
| 44 | |
| 45 | The problem is that the normal methods for creating a set of |
| 46 | differences between two files rely on being able to read both files. |
| 47 | Thus they require that both files are available beforehand at one end |
of the link. If they are not both available on the same machine,
these algorithms cannot be used (once you have copied the file over,
you no longer need the differences). This is the problem that rsync
addresses.
| 52 | |
| 53 | The rsync algorithm efficiently computes which parts of a source file |
| 54 | match some part of an existing destination file. These parts need not |
| 55 | be sent across the link; all that is needed is a reference to the part |
| 56 | of the destination file. Only parts of the source file which are not |
| 57 | matched in this way need to be sent verbatim. The receiver can then |
| 58 | construct a copy of the source file using the references to parts of |
| 59 | the existing destination file and the verbatim material. |
| 60 | |
| 61 | Trivially, the data sent to the receiver can be compressed using any |
| 62 | of a range of common compression algorithms, for further speed |
| 63 | improvements. |
| 64 | |
| 65 | \section{The rsync algorithm} |
| 66 | |
| 67 | Suppose we have two general purpose computers $\alpha$ and $\beta$. |
| 68 | Computer $\alpha$ has access to a file $A$ and $\beta$ has access to |
| 69 | file $B$, where $A$ and $B$ are ``similar''. There is a slow |
| 70 | communications link between $\alpha$ and $\beta$. |
| 71 | |
| 72 | The rsync algorithm consists of the following steps: |
| 73 | |
| 74 | \begin{enumerate} |
\item $\beta$ splits the file $B$ into a series of non-overlapping
fixed-size blocks of size $S$ bytes.\footnote{We have found that
values of $S$ between 500 and 1000 are quite good for most
purposes.} The last block may be shorter than $S$ bytes.
| 79 | |
\item For each of these blocks $\beta$ calculates two checksums:
a weak ``rolling'' 32-bit checksum (described below) and a strong
128-bit MD4 checksum; a sketch of these first two steps follows
the list.
| 83 | |
| 84 | \item $\beta$ sends these checksums to $\alpha$. |
| 85 | |
| 86 | \item $\alpha$ searches through $A$ to find all blocks of length $S$ |
| 87 | bytes (at any offset, not just multiples of $S$) that have the same |
| 88 | weak and strong checksum as one of the blocks of $B$. This can be |
| 89 | done in a single pass very quickly using a special property of the |
| 90 | rolling checksum described below. |
| 91 | |
| 92 | \item $\alpha$ sends $\beta$ a sequence of instructions for |
| 93 | constructing a copy of $A$. Each instruction is either a reference |
| 94 | to a block of $B$, or literal data. Literal data is sent only for |
| 95 | those sections of $A$ which did not match any of the blocks of $B$. |
| 96 | \end{enumerate} |
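To make the first two steps concrete, the following C sketch shows
the signature generation performed by $\beta$. The helper names are
assumptions: \verb|weak_sum| corresponds to the rolling checksum
defined below, and \verb|strong_sum| stands in for a real 128-bit
MD4 implementation.

\begin{verbatim}
#include <stdint.h>
#include <stdlib.h>

#define S 700    /* block size in bytes; 500-1000 works well */

struct block_sig {
    uint32_t weak;        /* 32-bit rolling checksum */
    uint8_t  strong[16];  /* 128-bit MD4 checksum    */
};

uint32_t weak_sum(const uint8_t *buf, size_t len);  /* defined below */
void strong_sum(const uint8_t *buf, size_t len,     /* placeholder   */
                uint8_t out[16]);                   /* for MD4       */

/* Split the file into non-overlapping S-byte blocks and checksum
   each one; the last block may be shorter than S bytes. */
struct block_sig *make_signatures(const uint8_t *file, size_t n,
                                  size_t *nblocks)
{
    *nblocks = (n + S - 1) / S;
    struct block_sig *sig = malloc(*nblocks * sizeof *sig);
    for (size_t i = 0; i < *nblocks; i++) {
        size_t off = i * S;
        size_t len = (off + S <= n) ? S : n - off;
        sig[i].weak = weak_sum(file + off, len);
        strong_sum(file + off, len, sig[i].strong);
    }
    return sig;
}
\end{verbatim}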
| 97 | |
| 98 | The end result is that $\beta$ gets a copy of $A$, but only the pieces |
| 99 | of $A$ that are not found in $B$ (plus a small amount of data for |
| 100 | checksums and block indexes) are sent over the link. The algorithm |
| 101 | also only requires one round trip, which minimises the impact of the |
| 102 | link latency. |
| 103 | |
| 104 | The most important details of the algorithm are the rolling checksum |
| 105 | and the associated multi-alternate search mechanism which allows the |
| 106 | all-offsets checksum search to proceed very quickly. These will be |
| 107 | discussed in greater detail below. |
| 108 | |
| 109 | \section{Rolling checksum} |
| 110 | |
| 111 | The weak rolling checksum used in the rsync algorithm needs to have |
the property that it is very cheap to calculate the checksum of a
buffer $X_2 \ldots X_{n+1}$ given the checksum of buffer $X_1 \ldots
X_n$ and the values of the bytes $X_1$ and $X_{n+1}$.
| 115 | |
| 116 | The weak checksum algorithm we used in our implementation was inspired |
| 117 | by Mark Adler's adler-32 checksum. Our checksum is defined by |
| 118 | $$ a(k,l) = (\sum_{i=k}^l X_i) \bmod M $$ |
| 119 | $$ b(k,l) = (\sum_{i=k}^l (l-i+1)X_i) \bmod M $$ |
| 120 | $$ s(k,l) = a(k,l) + 2^{16} b(k,l) $$ |
| 121 | |
| 122 | where $s(k,l)$ is the rolling checksum of the bytes $X_k \ldots X_l$. |
| 123 | For simplicity and speed, we use $M = 2^{16}$. |
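As a concrete illustration, the definition above translates directly
into C as follows (a sketch; masking with \verb|0xffff| at the end is
equivalent to reducing modulo $M = 2^{16}$, since unsigned 32-bit
arithmetic wraps modulo $2^{32}$, a multiple of $M$):

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* s(k,l) computed directly from the definition, with M = 2^16.
   buf[0..len-1] holds the bytes X_k .. X_l, so the weight
   (l - i + 1) of byte buf[i] is (len - i). */
uint32_t weak_sum(const uint8_t *buf, size_t len)
{
    uint32_t a = 0, b = 0;

    for (size_t i = 0; i < len; i++) {
        a += buf[i];
        b += (uint32_t)(len - i) * buf[i];
    }
    return (a & 0xffff) + ((b & 0xffff) << 16);
}
\end{verbatim}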
| 124 | |
| 125 | The important property of this checksum is that successive values can |
| 126 | be computed very efficiently using the recurrence relations |
| 127 | |
| 128 | $$ a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M $$ |
| 129 | $$ b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M $$ |
| 130 | |
Thus the checksum can be calculated for blocks of length $S$ at all
| 132 | possible offsets within a file in a ``rolling'' fashion, with very |
| 133 | little computation at each point. |
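A sketch of the corresponding constant-time update in C, keeping $a$
and $b$ as separate 16-bit quantities between steps:

\begin{verbatim}
#include <stdint.h>
#include <stddef.h>

/* Advance a(k,l) and b(k,l) to a(k+1,l+1) and b(k+1,l+1).
   x_out is the byte X_k leaving the window, x_in is the byte
   X_{l+1} entering it, and len is the block length S. */
void roll(uint32_t *a, uint32_t *b,
          uint8_t x_out, uint8_t x_in, size_t len)
{
    *a = (*a - x_out + x_in) & 0xffff;
    *b = (*b - (uint32_t)len * x_out + *a) & 0xffff;
}
\end{verbatim}

To scan a file, one computes \verb|weak_sum| over the first $S$ bytes,
splits the result into $a$ (low 16 bits) and $b$ (high 16 bits), and
then calls \verb|roll| once per subsequent offset.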
| 134 | |
| 135 | Despite its simplicity, this checksum was found to be quite adequate as |
| 136 | a first level check for a match of two file blocks. We have found in |
| 137 | practice that the probability of this checksum matching when the |
| 138 | blocks are not equal is quite low. This is important because the much |
| 139 | more expensive strong checksum must be calculated for each block where |
| 140 | the weak checksum matches. |
| 141 | |
| 142 | \section{Checksum searching} |
| 143 | |
| 144 | Once $\alpha$ has received the list of checksums of the blocks of $B$, |
| 145 | it must search $A$ for any blocks at any offset that match the |
| 146 | checksum of some block of $B$. The basic strategy is to compute the |
| 147 | 32-bit rolling checksum for a block of length $S$ starting at each |
| 148 | byte of $A$ in turn, and for each checksum, search the list for a |
match. To do this, our implementation uses a simple three-level
searching scheme.
| 151 | |
| 152 | The first level uses a 16-bit hash of the 32-bit rolling checksum and |
| 153 | a $2^{16}$ entry hash table. The list of checksum values (i.e., the |
| 154 | checksums from the blocks of $B$) is sorted according to the 16-bit |
| 155 | hash of the 32-bit rolling checksum. Each entry in the hash table |
| 156 | points to the first element of the list for that hash value, or |
| 157 | contains a null value if no element of the list has that hash value. |
| 158 | |
| 159 | At each offset in the file the 32-bit rolling checksum and its 16-bit |
| 160 | hash are calculated. If the hash table entry for that hash value is |
| 161 | not a null value, the second level check is invoked. |
| 162 | |
| 163 | The second level check involves scanning the sorted checksum list |
| 164 | starting with the entry pointed to by the hash table entry, looking |
| 165 | for an entry whose 32-bit rolling checksum matches the current value. |
| 166 | The scan terminates when it reaches an entry whose 16-bit hash |
| 167 | differs. If this search finds a match, the third level check is |
| 168 | invoked. |
| 169 | |
| 170 | The third level check involves calculating the strong checksum for the |
| 171 | current offset in the file and comparing it with the strong checksum |
| 172 | value in the current list entry. If the two strong checksums match, |
| 173 | we assume that we have found a block of $A$ which matches a block of |
| 174 | $B$. In fact the blocks could be different, but the probability of |
| 175 | this is microscopic, and in practice this is a reasonable assumption. |
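The following C sketch shows the three levels together, reusing
\verb|struct block_sig| and \verb|strong_sum| from the earlier
sketch. The particular 16-bit hash used here (the XOR of the two
halves of the 32-bit checksum) is an assumption for illustration
only:

\begin{verbatim}
#include <stdint.h>
#include <string.h>

#define TAG(sum) ((uint16_t)(((sum) >> 16) ^ ((sum) & 0xffff)))

extern struct block_sig *sig;     /* B's signatures, sorted by tag */
extern size_t nsigs;
extern int32_t tag_table[65536];  /* first list index per tag, or
                                     -1 if no entry has that tag   */
extern void strong_sum(const uint8_t *buf, size_t len,
                       uint8_t out[16]);

/* Return the index of a matching block of B, or -1 if none. */
int find_match(uint32_t weak, const uint8_t *buf, size_t len)
{
    uint8_t strong[16];
    int have_strong = 0;

    /* Level 1: look up the 16-bit tag in the hash table, then
       scan the sorted list while the tag still matches. */
    for (int32_t i = tag_table[TAG(weak)];
         i >= 0 && (size_t)i < nsigs && TAG(sig[i].weak) == TAG(weak);
         i++) {
        /* Level 2: compare the full 32-bit rolling checksum. */
        if (sig[i].weak != weak)
            continue;
        /* Level 3: strong checksum, computed at most once. */
        if (!have_strong) {
            strong_sum(buf, len, strong);
            have_strong = 1;
        }
        if (memcmp(strong, sig[i].strong, 16) == 0)
            return (int)i;
    }
    return -1;
}
\end{verbatim}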
| 176 | |
| 177 | When a match is found, $\alpha$ sends $\beta$ the data in $A$ between |
| 178 | the current file offset and the end of the previous match, followed by |
the index of the block in $B$ that matched. This data is sent as
soon as a match is found, which allows us to overlap the
communication with further computation.
| 182 | |
| 183 | If no match is found at a given offset in the file, the rolling |
| 184 | checksum is updated to the next offset and the search proceeds. If a |
| 185 | match is found, the search is restarted at the end of the matched |
| 186 | block. This strategy saves a considerable amount of computation for |
| 187 | the common case where the two files are nearly identical. In |
| 188 | addition, it would be a simple matter to encode the block indexes as |
| 189 | runs, for the common case where a portion of $A$ matches a series of |
| 190 | blocks of $B$ in order. |
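Putting the pieces together, the sender's scan over $A$ might look
like the following sketch, reusing \verb|S|, \verb|weak_sum|,
\verb|roll|, and \verb|find_match| from the sketches above.
\verb|emit_literal| and \verb|emit_block_ref| are hypothetical
output routines, and for simplicity only full-length blocks are
matched:

\begin{verbatim}
extern void emit_literal(const uint8_t *buf, size_t len);
extern void emit_block_ref(int block_index);

void generate_deltas(const uint8_t *A, size_t n)
{
    size_t last = 0;     /* end of last match or literal run */
    size_t k = 0;        /* current window offset            */
    uint32_t a = 0, b = 0;

    if (n >= S) {                        /* first full window */
        uint32_t w = weak_sum(A, S);
        a = w & 0xffff;
        b = w >> 16;
    }
    while (k + S <= n) {
        int idx = find_match(a + (b << 16), A + k, S);
        if (idx >= 0) {
            emit_literal(A + last, k - last); /* unmatched data  */
            emit_block_ref(idx);
            k += S;                  /* restart the search at    */
            last = k;                /* the end of matched block */
            if (k + S > n)
                break;
            uint32_t w = weak_sum(A + k, S);
            a = w & 0xffff;
            b = w >> 16;
        } else {
            if (k + S == n)          /* window already at the end */
                break;
            roll(&a, &b, A[k], A[k + S], S);  /* slide one byte */
            k++;
        }
    }
    emit_literal(A + last, n - last);    /* trailing literal data */
}
\end{verbatim}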
| 191 | |
| 192 | \section{Pipelining} |
| 193 | |
| 194 | The above sections describe the process for constructing a copy of one |
file on a remote system. If we have several files to copy, we can
| 196 | gain a considerable latency advantage by pipelining the process. |
| 197 | |
| 198 | This involves $\beta$ initiating two independent processes. One of the |
| 199 | processes generates and sends the checksums to $\alpha$ while the |
| 200 | other receives the difference information from $\alpha$ and |
| 201 | reconstructs the files. |
| 202 | |
| 203 | If the communications link is buffered then these two processes can |
| 204 | proceed independently and the link should be kept fully utilised in |
| 205 | both directions for most of the time. |
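A minimal sketch of this structure on $\beta$, assuming the link is
presented as a pair of file descriptors and two hypothetical helpers
that loop over the list of files:

\begin{verbatim}
#include <unistd.h>

extern void send_signatures(int fd_out);     /* hypothetical */
extern void receive_and_rebuild(int fd_in);  /* hypothetical */

void run_beta(int fd_in, int fd_out)
{
    if (fork() == 0) {
        /* Child: generate and send the checksums for each file. */
        send_signatures(fd_out);
        _exit(0);
    }
    /* Parent: receive instruction streams and rebuild the files. */
    receive_and_rebuild(fd_in);
}
\end{verbatim}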
| 206 | |
| 207 | \section{Results} |
| 208 | |
| 209 | To test the algorithm, tar files were created of the Linux kernel |
| 210 | sources for two versions of the kernel. The two kernel versions were |
1.99.10 and 2.0.0. These tar files are approximately 24 MB in size and
| 212 | are separated by 5 released patch levels. |
| 213 | |
| 214 | Out of the 2441 files in the 1.99.10 release, 291 files had changed in |
| 215 | the 2.0.0 release, 19 files had been removed and 25 files had been |
| 216 | added. |
| 217 | |
| 218 | A ``diff'' of the two tar files using the standard GNU diff utility |
| 219 | produced over 32 thousand lines of output totalling 2.1 MB. |
| 220 | |
| 221 | The following table shows the results for rsync between the two files |
with a varying block size.\footnote{All the tests in this section were
carried out using rsync version 0.5.}
| 224 | |
| 225 | \vspace*{5mm} |
| 226 | \begin{tabular}{|l|l|l|l|l|l|l|} \hline |
| 227 | {\bf block} & {\bf matches} & {\bf tag} & {\bf false} & {\bf data} & {\bf written} & {\bf read} \\ |
| 228 | {\bf size} & & {\bf hits} & {\bf alarms} & & & \\ \hline \hline |
| 229 | |
| 230 | 300 & 64247 & 3817434 & 948 & 5312200 & 5629158 & 1632284 \\ \hline |
| 231 | 500 & 46989 & 620013 & 64 & 1091900 & 1283906 & 979384 \\ \hline |
| 232 | 700 & 33255 & 571970 & 22 & 1307800 & 1444346 & 699564 \\ \hline |
| 233 | 900 & 25686 & 525058 & 24 & 1469500 & 1575438 & 544124 \\ \hline |
| 234 | 1100 & 20848 & 496844 & 21 & 1654500 & 1740838 & 445204 \\ \hline |
| 235 | \end{tabular} |
| 236 | \vspace*{5mm} |
| 237 | |
| 238 | In each case, the CPU time taken was less than the |
| 239 | time it takes to run ``diff'' on the two files.\footnote{The wall |
| 240 | clock time was approximately 2 minutes per run on a 50 MHz SPARC 10 |
| 241 | running SunOS, using rsh over loopback for communication. GNU diff |
| 242 | took about 4 minutes.} |
| 243 | |
| 244 | The columns in the table are as follows: |
| 245 | |
| 246 | \begin{description} |
| 247 | \item [block size] The size in bytes of the checksummed blocks. |
| 248 | \item [matches] The number of times a block of $B$ was found in $A$. |
\item [tag hits] The number of times the 16-bit hash of the rolling
  checksum matched a hash of one of the checksums from $B$.
\item [false alarms] The number of times the 32-bit rolling checksum
  matched but the strong checksum didn't.
| 253 | \item [data] The amount of file data transferred verbatim, in bytes. |
| 254 | \item [written] The total number of bytes written by $\alpha$ |
| 255 | including protocol overheads. This is almost all file data. |
| 256 | \item [read] The total number of bytes read by $\alpha$ including |
| 257 | protocol overheads. This is almost all checksum information. |
| 258 | \end{description} |
| 259 | |
| 260 | The results demonstrate that for block sizes above 300 bytes, only a |
| 261 | small fraction (around 5\%) of the file was transferred. The amount |
transferred was also considerably less than the size of the diff file
that would have been transferred if the diff/patch method of updating
a remote file were used.
| 265 | |
| 266 | The checksums themselves took up a considerable amount of space, |
| 267 | although much less than the size of the data transferred in each |
| 268 | case. Each pair of checksums consumes 20 bytes: 4 bytes for the |
| 269 | rolling checksum plus 16 bytes for the 128-bit MD4 checksum. |
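As a rough check, the 24 MB tar file at a block size of 700 bytes
divides into about 35,000 blocks, or roughly 700 kB of checksum data,
which is consistent with the read column of the first table.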
| 270 | |
The number of false alarms was less than $1/1000$ of the number of
true matches, indicating that the 32-bit rolling checksum is quite
good at screening out false matches.
| 274 | |
| 275 | The number of tag hits indicates that the second level of the |
| 276 | checksum search algorithm was invoked about once every 50 |
| 277 | characters. This is quite high because the total number of blocks in |
| 278 | the file is a large fraction of the size of the tag hash table. For |
| 279 | smaller files we would expect the tag hit rate to be much closer to |
| 280 | the number of matches. For extremely large files, we should probably |
| 281 | increase the size of the hash table. |
| 282 | |
| 283 | The next table shows similar results for a much smaller set of files. |
| 284 | In this case the files were not packed into a tar file first. Rather, |
| 285 | rsync was invoked with an option to recursively descend the directory |
| 286 | tree. The files used were from two source releases of another software |
| 287 | package called Samba. The total source code size is 1.7 MB and the |
| 288 | diff between the two releases is 4155 lines long totalling 120 kB. |
| 289 | |
| 290 | \vspace*{5mm} |
| 291 | \begin{tabular}{|l|l|l|l|l|l|l|} \hline |
| 292 | {\bf block} & {\bf matches} & {\bf tag} & {\bf false} & {\bf data} & {\bf written} & {\bf read} \\ |
| 293 | {\bf size} & & {\bf hits} & {\bf alarms} & & & \\ \hline \hline |
| 294 | |
| 295 | 300 & 3727 & 3899 & 0 & 129775 & 153999 & 83948 \\ \hline |
| 296 | 500 & 2158 & 2325 & 0 & 171574 & 189330 & 50908 \\ \hline |
| 297 | 700 & 1517 & 1649 & 0 & 195024 & 210144 & 36828 \\ \hline |
| 298 | 900 & 1156 & 1281 & 0 & 222847 & 236471 & 29048 \\ \hline |
| 299 | 1100 & 921 & 1049 & 0 & 250073 & 262725 & 23988 \\ \hline |
| 300 | \end{tabular} |
| 301 | \vspace*{5mm} |
| 302 | |
| 303 | |
| 304 | \section{Availability} |
| 305 | |
| 306 | An implementation of rsync which provides a convenient interface |
| 307 | similar to the common UNIX command rcp has been written and is |
| 308 | available for download from ftp://samba.anu.edu.au/pub/rsync. |
| 309 | |
| 310 | \end{document} |