\documentclass[a4paper]{article}
\begin{document}


\title{The rsync algorithm}

\author{Andrew Tridgell \quad\quad Paul Mackerras\\
Department of Computer Science \\
Australian National University \\
Canberra, ACT 0200, Australia}

\maketitle

\begin{abstract}
  This report presents an algorithm for updating a file on one machine
  to be identical to a file on another machine. We assume that the
  two machines are connected by a low-bandwidth, high-latency
  bi-directional communications link. The algorithm identifies parts
  of the source file which are identical to some part of the
  destination file, and only sends those parts which cannot be matched
  in this way. Effectively, the algorithm computes a set of
  differences without having both files on the same machine. The
  algorithm works best when the files are similar, but will also
  function correctly and reasonably efficiently when the files are
  quite different.
\end{abstract}

\section{The problem}

Imagine you have two files, $A$ and $B$, and you wish to update $B$ to be
the same as $A$. The obvious method is to copy $A$ onto $B$.

Now imagine that the two files are on machines connected by a slow
communications link, for example a dialup IP link. If $A$ is large,
copying $A$ onto $B$ will be slow. To make it faster you could
compress $A$ before sending it, but that will usually only gain a
factor of 2 to 4.

Now assume that $A$ and $B$ are quite similar, perhaps both derived
from the same original file. To really speed things up you would need
to take advantage of this similarity. A common method is to send just
the differences between $A$ and $B$ down the link and then use this
list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of
differences between two files rely on being able to read both files.
Thus they require that both files are available beforehand at one end
of the link. If they are not both available on the same machine,
these algorithms cannot be used (once you had copied the file over,
you wouldn't need the differences). This is the problem that rsync
addresses.

The rsync algorithm efficiently computes which parts of a source file
match some part of an existing destination file. These parts need not
be sent across the link; all that is needed is a reference to the part
of the destination file. Only parts of the source file which are not
matched in this way need to be sent verbatim. The receiver can then
construct a copy of the source file using the references to parts of
the existing destination file and the verbatim material.

Trivially, the data sent to the receiver can be compressed using any
of a range of common compression algorithms, for further speed
improvements.
\section{The rsync algorithm}

Suppose we have two general purpose computers $\alpha$ and $\beta$.
Computer $\alpha$ has access to a file $A$ and $\beta$ has access to
file $B$, where $A$ and $B$ are ``similar''. There is a slow
communications link between $\alpha$ and $\beta$.

The rsync algorithm consists of the following steps:

\begin{enumerate}
\item $\beta$ splits the file $B$ into a series of non-overlapping
  fixed-sized blocks of size $S$ bytes\footnote{We have found that
  values of $S$ between 500 and 1000 are quite good for most purposes.}.
  The last block may be shorter than $S$ bytes.

\item For each of these blocks $\beta$ calculates two checksums:
  a weak ``rolling'' 32-bit checksum (described below) and a strong
  128-bit MD4 checksum.

\item $\beta$ sends these checksums to $\alpha$.

\item $\alpha$ searches through $A$ to find all blocks of length $S$
  bytes (at any offset, not just multiples of $S$) that have the same
  weak and strong checksum as one of the blocks of $B$. This can be
  done in a single pass very quickly using a special property of the
  rolling checksum described below.

\item $\alpha$ sends $\beta$ a sequence of instructions for
  constructing a copy of $A$. Each instruction is either a reference
  to a block of $B$, or literal data. Literal data is sent only for
  those sections of $A$ which did not match any of the blocks of $B$.
\end{enumerate}
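As a concrete illustration, the first two steps can be sketched as follows. This is a minimal sketch, not the rsync implementation: the function names `weak32` and `block_checksums` are our own, and MD5 stands in for MD4, which is often unavailable in modern hash libraries.

```python
import hashlib

def weak32(block):
    """The 32-bit rolling checksum s(k,l) defined in the next section."""
    M = 1 << 16
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a + (b << 16)

def block_checksums(data, block_size=700):
    """Steps 1 and 2: split B into non-overlapping fixed-size blocks
    (the last one may be short) and checksum each block."""
    sums = []
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        sums.append((weak32(block), hashlib.md5(block).digest()))
    return sums
```

The list returned here is exactly what $\beta$ transmits to $\alpha$ in step 3.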

The end result is that $\beta$ gets a copy of $A$, but only the pieces
of $A$ that are not found in $B$ (plus a small amount of data for
checksums and block indexes) are sent over the link. The algorithm
also requires only one round trip, which minimises the impact of the
link latency.

The most important details of the algorithm are the rolling checksum
and the associated multi-alternate search mechanism which allows the
all-offsets checksum search to proceed very quickly. These will be
discussed in greater detail below.

\section{Rolling checksum}

The weak rolling checksum used in the rsync algorithm needs to have
the property that it is very cheap to calculate the checksum of a
buffer $X_2 \ldots X_{n+1}$ given the checksum of buffer $X_1 \ldots X_n$ and
the values of the bytes $X_1$ and $X_{n+1}$.

The weak checksum algorithm we used in our implementation was inspired
by Mark Adler's adler-32 checksum. Our checksum is defined by
$$ a(k,l) = (\sum_{i=k}^l X_i) \bmod M $$
$$ b(k,l) = (\sum_{i=k}^l (l-i+1)X_i) \bmod M $$
$$ s(k,l) = a(k,l) + 2^{16} b(k,l) $$

where $s(k,l)$ is the rolling checksum of the bytes $X_k \ldots X_l$.
For simplicity and speed, we use $M = 2^{16}$.

The important property of this checksum is that successive values can
be computed very efficiently using the recurrence relations

$$ a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M $$
$$ b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M $$

Thus the checksum can be calculated for blocks of length $S$ at all
possible offsets within a file in a ``rolling'' fashion, with very
little computation at each point.
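The definition and the recurrences can be checked directly with a short sketch (the function names are our own; $M = 2^{16}$ as above):

```python
M = 1 << 16

def weak_sums(block):
    """Compute a(k,l) and b(k,l) directly from the definition."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, s):
    """Slide the window one byte, per the recurrence relations above:
    drop X_k (out_byte), append X_{l+1} (in_byte); s = l - k + 1."""
    a = (a - out_byte + in_byte) % M
    b = (b - s * out_byte + a) % M   # uses the already-updated a(k+1,l+1)
    return a, b
```

Rolling the checksum across a buffer one byte at a time yields exactly the same values as recomputing it from scratch at every offset.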

Despite its simplicity, this checksum was found to be quite adequate as
a first-level check for a match of two file blocks. We have found in
practice that the probability of this checksum matching when the
blocks are not equal is quite low. This is important because the much
more expensive strong checksum must be calculated for each block where
the weak checksum matches.

\section{Checksum searching}

Once $\alpha$ has received the list of checksums of the blocks of $B$,
it must search $A$ for any blocks at any offset that match the
checksum of some block of $B$. The basic strategy is to compute the
32-bit rolling checksum for a block of length $S$ starting at each
byte of $A$ in turn, and for each checksum, search the list for a
match. To do this our implementation uses a simple three-level
searching scheme.

The first level uses a 16-bit hash of the 32-bit rolling checksum and
a $2^{16}$ entry hash table. The list of checksum values (i.e., the
checksums from the blocks of $B$) is sorted according to the 16-bit
hash of the 32-bit rolling checksum. Each entry in the hash table
points to the first element of the list for that hash value, or
contains a null value if no element of the list has that hash value.

At each offset in the file the 32-bit rolling checksum and its 16-bit
hash are calculated. If the hash table entry for that hash value is
not a null value, the second-level check is invoked.

The second-level check involves scanning the sorted checksum list
starting with the entry pointed to by the hash table entry, looking
for an entry whose 32-bit rolling checksum matches the current value.
The scan terminates when it reaches an entry whose 16-bit hash
differs. If this search finds a match, the third-level check is
invoked.

The third-level check involves calculating the strong checksum for the
current offset in the file and comparing it with the strong checksum
value in the current list entry. If the two strong checksums match,
we assume that we have found a block of $A$ which matches a block of
$B$. In fact the blocks could be different, but the probability of
this is microscopic, and in practice this is a reasonable assumption.
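The three levels can be sketched as a table lookup. This is an illustrative sketch, not the actual implementation: the 16-bit hash function (XOR-folding the halves of the weak checksum) is an assumption, chained lists replace the sorted-list-plus-pointer layout described above, and MD5 again stands in for MD4.

```python
import hashlib
from collections import defaultdict

def hash16(weak):
    """First level: fold the 32-bit weak checksum to a 16-bit table index
    (the folding function here is an assumption, not rsync's)."""
    return (weak ^ (weak >> 16)) & 0xFFFF

def build_table(checksums):
    """Index B's (weak, strong) checksum pairs by 16-bit hash,
    remembering each block's index in B."""
    table = defaultdict(list)
    for idx, (weak, strong) in enumerate(checksums):
        table[hash16(weak)].append((weak, strong, idx))
    return table

def lookup(table, weak, window):
    """Second level: compare 32-bit weak checksums; third level: compare
    strong checksums. Returns the matching block index of B, or None."""
    for cand_weak, cand_strong, idx in table.get(hash16(weak), ()):
        if cand_weak == weak and hashlib.md5(window).digest() == cand_strong:
            return idx
    return None
```

Only windows that survive all three levels are reported as matches, so the expensive strong checksum is computed rarely.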

When a match is found, $\alpha$ sends $\beta$ the data in $A$ between
the current file offset and the end of the previous match, followed by
the index of the block in $B$ that matched. This data is sent as soon
as a match is found, which allows us to overlap the communication with
further computation.

If no match is found at a given offset in the file, the rolling
checksum is updated to the next offset and the search proceeds. If a
match is found, the search is restarted at the end of the matched
block. This strategy saves a considerable amount of computation for
the common case where the two files are nearly identical. In
addition, it would be a simple matter to encode the block indexes as
runs, for the common case where a portion of $A$ matches a series of
blocks of $B$ in order.
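The scanning strategy, including the restart at the end of a matched block, can be sketched as follows. For brevity this sketch matches on the weak checksum alone, whereas the real algorithm confirms each hit with the strong checksum; the function and variable names are our own.

```python
M = 1 << 16

def weak_sums(block):
    """a(k,l) and b(k,l) from the rolling-checksum definition."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def find_matches(data, targets, s):
    """Slide an s-byte window over data; `targets` maps the weak checksum
    of each block of B to its block index. On a hit, record it and
    restart the search at the end of the matched block."""
    matches, off, sums = [], 0, None
    while off + s <= len(data):
        if sums is None:                      # (re)compute from scratch
            sums = weak_sums(data[off:off + s])
        if sums in targets:
            matches.append((off, targets[sums]))
            off += s                          # restart past the match
            sums = None
            continue
        if off + s == len(data):
            break
        a, b = sums                           # roll one byte forward
        a = (a - data[off] + data[off + s]) % M
        b = (b - s * data[off] + a) % M
        sums, off = (a, b), off + 1
    return matches
```

Note that a full recomputation of the checksum is needed only after a match; everywhere else the cheap rolling update suffices.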

\section{Pipelining}

The above sections describe the process for constructing a copy of one
file on a remote system. If we have several files to copy, we can
gain a considerable latency advantage by pipelining the process.

This involves $\beta$ initiating two independent processes. One of the
processes generates and sends the checksums to $\alpha$ while the
other receives the difference information from $\alpha$ and
reconstructs the files.

If the communications link is buffered then these two processes can
proceed independently and the link should be kept fully utilised in
both directions for most of the time.

\section{Results}

To test the algorithm, tar files were created of the Linux kernel
sources for two versions of the kernel. The two kernel versions were
1.99.10 and 2.0.0. These tar files are approximately 24 MB in size and
are separated by 5 released patch levels.

Out of the 2441 files in the 1.99.10 release, 291 files had changed in
the 2.0.0 release, 19 files had been removed and 25 files had been
added.

A ``diff'' of the two tar files using the standard GNU diff utility
produced over 32 thousand lines of output totalling 2.1 MB.

The following table shows the results for rsync between the two files
with a varying block size.\footnote{All the tests in this section were
  carried out using rsync version 0.5.}

\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag} & {\bf false} & {\bf data} & {\bf written} & {\bf read} \\
{\bf size} & & {\bf hits} & {\bf alarms} & & & \\ \hline \hline
300 & 64247 & 3817434 & 948 & 5312200 & 5629158 & 1632284 \\ \hline
500 & 46989 & 620013 & 64 & 1091900 & 1283906 & 979384 \\ \hline
700 & 33255 & 571970 & 22 & 1307800 & 1444346 & 699564 \\ \hline
900 & 25686 & 525058 & 24 & 1469500 & 1575438 & 544124 \\ \hline
1100 & 20848 & 496844 & 21 & 1654500 & 1740838 & 445204 \\ \hline
\end{tabular}
\vspace*{5mm}

In each case, the CPU time taken was less than the
time it takes to run ``diff'' on the two files.\footnote{The wall
  clock time was approximately 2 minutes per run on a 50 MHz SPARC 10
  running SunOS, using rsh over loopback for communication. GNU diff
  took about 4 minutes.}

The columns in the table are as follows:

\begin{description}
\item [block size] The size in bytes of the checksummed blocks.
\item [matches] The number of times a block of $B$ was found in $A$.
\item [tag hits] The number of times the 16-bit hash of the rolling
  checksum matched a hash of one of the checksums from $B$.
\item [false alarms] The number of times the 32-bit rolling checksum
  matched but the strong checksum didn't.
\item [data] The amount of file data transferred verbatim, in bytes.
\item [written] The total number of bytes written by $\alpha$,
  including protocol overheads. This is almost all file data.
\item [read] The total number of bytes read by $\alpha$, including
  protocol overheads. This is almost all checksum information.
\end{description}

The results demonstrate that for block sizes above 300 bytes, only a
small fraction (around 5\%) of the file was transferred. The amount
transferred was also considerably less than the size of the diff file
that would have been transferred if the diff/patch method of updating
a remote file had been used.

The checksums themselves took up a considerable amount of space,
although much less than the size of the data transferred in each
case. Each pair of checksums consumes 20 bytes: 4 bytes for the
rolling checksum plus 16 bytes for the 128-bit MD4 checksum.
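As a rough consistency check on these figures: at a block size of 700
bytes, a file of approximately 24 MB comprises about
$24{,}000{,}000 / 700 \approx 34{,}000$ blocks, so the checksum data
amounts to roughly $34{,}000 \times 20 \approx 680$ kB, which agrees
well with the 699,564 bytes read at that block size in the table above.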

The number of false alarms was less than $1/1000$ of the number of
true matches, indicating that the 32-bit rolling checksum is quite
good at screening out false matches.

The number of tag hits indicates that the second level of the
checksum search algorithm was invoked about once every 50
characters. This is quite high because the total number of blocks in
the file is a large fraction of the size of the tag hash table. For
smaller files we would expect the tag hit rate to be much closer to
the number of matches. For extremely large files, we should probably
increase the size of the hash table.

The next table shows similar results for a much smaller set of files.
In this case the files were not packed into a tar file first. Rather,
rsync was invoked with an option to recursively descend the directory
tree. The files used were from two source releases of another software
package called Samba. The total source code size is 1.7 MB and the
diff between the two releases is 4155 lines long totalling 120 kB.

\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag} & {\bf false} & {\bf data} & {\bf written} & {\bf read} \\
{\bf size} & & {\bf hits} & {\bf alarms} & & & \\ \hline \hline
300 & 3727 & 3899 & 0 & 129775 & 153999 & 83948 \\ \hline
500 & 2158 & 2325 & 0 & 171574 & 189330 & 50908 \\ \hline
700 & 1517 & 1649 & 0 & 195024 & 210144 & 36828 \\ \hline
900 & 1156 & 1281 & 0 & 222847 & 236471 & 29048 \\ \hline
1100 & 921 & 1049 & 0 & 250073 & 262725 & 23988 \\ \hline
\end{tabular}
\vspace*{5mm}


\section{Availability}

An implementation of rsync which provides a convenient interface
similar to the common UNIX command rcp has been written and is
available for download from http://rsync.samba.org/

\end{document}