tech_report.tex

   1 \documentclass[a4paper]{article}
   2 \begin{document}
   3
   4
   5 \title{The rsync algorithm}
   6
   7 \author{Andrew Tridgell \quad\quad Paul Mackerras\\
   8 Department of Computer Science \\
   9 Australian National University \\
  10 Canberra, ACT 0200, Australia}
  11
  12 \maketitle
  13
  14 \begin{abstract}
  15   This report presents an algorithm for updating a file on one machine
  16   to be identical to a file on another machine.  We assume that the
  17   two machines are connected by a low-bandwidth high-latency
  18   bi-directional communications link.  The algorithm identifies parts
  19   of the source file which are identical to some part of the
  20   destination file, and only sends those parts which cannot be matched
  21   in this way.  Effectively, the algorithm computes a set of
  22   differences without having both files on the same machine.  The
  23   algorithm works best when the files are similar, but will also
  24   function correctly and reasonably efficiently when the files are
  25   quite different.
  26 \end{abstract}
  27
  28 \section{The problem}
  29
  30 Imagine you have two files, $A$ and $B$, and you wish to update $B$ to be
  31 the same as $A$. The obvious method is to copy $A$ onto $B$.
  32
  33 Now imagine that the two files are on machines connected by a slow
  34 communications link, for example a dialup IP link.  If $A$ is large,
  35 copying $A$ onto $B$ will be slow.  To make it faster you could
  36 compress $A$ before sending it, but that will usually only gain a
  37 factor of 2 to 4.
  38
  39 Now assume that $A$ and $B$ are quite similar, perhaps both derived
  40 from the same original file. To really speed things up you would need
  41 to take advantage of this similarity. A common method is to send just
  42 the differences between $A$ and $B$ down the link and then use this
  43 list of differences to reconstruct the file.
  44
  45 The problem is that the normal methods for creating a set of
  46 differences between two files rely on being able to read both files.
  47 Thus they require that both files are available beforehand at one end
  48 of the link.  If they are not both available on the same machine,
  49 these algorithms cannot be used (once you had copied the file over,
  50 you wouldn't need the differences).  This is the problem that rsync
  51 addresses.
  52
  53 The rsync algorithm efficiently computes which parts of a source file
  54 match some part of an existing destination file.  These parts need not
  55 be sent across the link; all that is needed is a reference to the part
  56 of the destination file.  Only parts of the source file which are not
  57 matched in this way need to be sent verbatim.  The receiver can then
  58 construct a copy of the source file using the references to parts of
  59 the existing destination file and the verbatim material.
  60
  61 Trivially, the data sent to the receiver can be compressed using any
  62 of a range of common compression algorithms, for further speed
  63 improvements.
  64
  65 \section{The rsync algorithm}
  66
  67 Suppose we have two general purpose computers $\alpha$ and $\beta$.
  68 Computer $\alpha$ has access to a file $A$ and $\beta$ has access to
  69 file $B$, where $A$ and $B$ are ``similar''.  There is a slow
  70 communications link between $\alpha$ and $\beta$.
  71
  72 The rsync algorithm consists of the following steps:
  73
  74 \begin{enumerate}
  75 \item $\beta$ splits the file $B$ into a series of non-overlapping
  76   fixed-sized blocks of size S bytes\footnote{We have found that
  77   values of S between 500 and 1000 are quite good for most purposes}.
  78   The last block may be shorter than $S$ bytes.
  79
  80 \item For each of these blocks $\beta$ calculates two checksums:
  81   a weak ``rolling'' 32-bit checksum (described below) and a strong
  82   128-bit MD4 checksum.
  83
  84 \item $\beta$ sends these checksums to $\alpha$.
  85
  86 \item $\alpha$ searches through $A$ to find all blocks of length $S$
  87   bytes (at any offset, not just multiples of $S$) that have the same
  88   weak and strong checksum as one of the blocks of $B$. This can be
  89   done in a single pass very quickly using a special property of the
  90   rolling checksum described below.
  91
  92 \item $\alpha$ sends $\beta$ a sequence of instructions for
  93   constructing a copy of $A$.  Each instruction is either a reference
  94   to a block of $B$, or literal data.  Literal data is sent only for
  95   those sections of $A$ which did not match any of the blocks of $B$.
  96 \end{enumerate}
  97
  98 The end result is that $\beta$ gets a copy of $A$, but only the pieces
  99 of $A$ that are not found in $B$ (plus a small amount of data for
 100 checksums and block indexes) are sent over the link. The algorithm
 101 also only requires one round trip, which minimises the impact of the
 102 link latency.
 103
 104 The most important details of the algorithm are the rolling checksum
 105 and the associated multi-alternate search mechanism which allows the
 106 all-offsets checksum search to proceed very quickly. These will be
 107 discussed in greater detail below.
 108
 109 \section{Rolling checksum}
 110
 111 The weak rolling checksum used in the rsync algorithm needs to have
 112 the property that it is very cheap to calculate the checksum of a
 113 buffer $X_2 .. X_{n+1}$ given the checksum of buffer $X_1 .. X_n$ and
 114 the values of the bytes $X_1$ and $X_{n+1}$.
 115
 116 The weak checksum algorithm we used in our implementation was inspired
 117 by Mark Adler's adler-32 checksum.  Our checksum is defined by
 118 $$ a(k,l) = (\sum_{i=k}^l X_i) \bmod M $$
 119 $$ b(k,l) = (\sum_{i=k}^l (l-i+1)X_i) \bmod M $$
 120 $$ s(k,l) = a(k,l) + 2^{16} b(k,l) $$
 121
 122 where $s(k,l)$ is the rolling checksum of the bytes $X_k \ldots X_l$.
 123 For simplicity and speed, we use $M = 2^{16}$.
 124
 125 The important property of this checksum is that successive values can
 126 be computed very efficiently using the recurrence relations
 127
 128 $$ a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M $$
 129 $$ b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M $$
 130
 131 Thus the checksum can be calculated for blocks of length S at all
 132 possible offsets within a file in a ``rolling'' fashion, with very
 133 little computation at each point.
 134
 135 Despite its simplicity, this checksum was found to be quite adequate as
 136 a first-level check for a match of two file blocks.  We have found in
 137 practice that the probability of this checksum matching when the
 138 blocks are not equal is quite low.  This is important because the much
 139 more expensive strong checksum must be calculated for each block where
 140 the weak checksum matches.
 141
 142 \section{Checksum searching}
 143
 144 Once $\alpha$ has received the list of checksums of the blocks of $B$,
 145 it must search $A$ for any blocks at any offset that match the
 146 checksum of some block of $B$.  The basic strategy is to compute the
 147 32-bit rolling checksum for a block of length $S$ starting at each
 148 byte of $A$ in turn, and for each checksum, search the list for a
 149 match.  To do this our implementation uses a
 150 simple 3 level searching scheme.
 151
 152 The first level uses a 16-bit hash of the 32-bit rolling checksum and
 153 a $2^{16}$ entry hash table.  The list of checksum values (i.e., the
 154 checksums from the blocks of $B$) is sorted according to the 16-bit
 155 hash of the 32-bit rolling checksum.  Each entry in the hash table
 156 points to the first element of the list for that hash value, or
 157 contains a null value if no element of the list has that hash value.
 158
 159 At each offset in the file the 32-bit rolling checksum and its 16-bit
 160 hash are calculated.  If the hash table entry for that hash value is
 161 not a null value, the second-level check is invoked.
 162
 163 The second-level check involves scanning the sorted checksum list
 164 starting with the entry pointed to by the hash table entry, looking
 165 for an entry whose 32-bit rolling checksum matches the current value.
 166 The scan terminates when it reaches an entry whose 16-bit hash
 167 differs.  If this search finds a match, the third-level check is
 168 invoked.
 169
 170 The third-level check involves calculating the strong checksum for the
 171 current offset in the file and comparing it with the strong checksum
 172 value in the current list entry.  If the two strong checksums match,
 173 we assume that we have found a block of $A$ which matches a block of
 174 $B$.  In fact the blocks could be different, but the probability of
 175 this is microscopic, and in practice this is a reasonable assumption.
 176
 177 When a match is found, $\alpha$ sends $\beta$ the data in $A$ between
 178 the current file offset and the end of the previous match, followed by
 179 the index of the block in $B$ that matched.  This data is sent
 180 immediately a match is found, which allows us to overlap the
 181 communication with further computation.
 182
 183 If no match is found at a given offset in the file, the rolling
 184 checksum is updated to the next offset and the search proceeds.  If a
 185 match is found, the search is restarted at the end of the matched
 186 block.  This strategy saves a considerable amount of computation for
 187 the common case where the two files are nearly identical.  In
 188 addition, it would be a simple matter to encode the block indexes as
 189 runs, for the common case where a portion of $A$ matches a series of
 190 blocks of $B$ in order.
 191
 192 \section{Pipelining}
 193
 194 The above sections describe the process for constructing a copy of one
 195 file on a remote system.  If we have a several files to copy, we can
 196 gain a considerable latency advantage by pipelining the process.
 197
 198 This involves $\beta$ initiating two independent processes. One of the
 199 processes generates and sends the checksums to $\alpha$ while the
 200 other receives the difference information from $\alpha$ and
 201 reconstructs the files.
 202
 203 If the communications link is buffered then these two processes can
 204 proceed independently and the link should be kept fully utilised in
 205 both directions for most of the time.
 206
 207 \section{Results}
 208
 209 To test the algorithm, tar files were created of the Linux kernel
 210 sources for two versions of the kernel. The two kernel versions were
 211 1.99.10 and 2.0.0. These tar files are approximately 24MB in size and
 212 are separated by 5 released patch levels.
 213
 214 Out of the 2441 files in the 1.99.10 release, 291 files had changed in
 215 the 2.0.0 release, 19 files had been removed and 25 files had been
 216 added.
 217
 218 A ``diff'' of the two tar files using the standard GNU diff utility
 219 produced over 32 thousand lines of output totalling 2.1 MB.
 220
 221 The following table shows the results for rsync between the two files
 222 with a varying block size.\footnote{All the tests in this section were
 223   carried out using rsync version 0.5}
 224
 225 \vspace*{5mm}
 226 \begin{tabular}{|l|l|l|l|l|l|l|} \hline
 227 {\bf block} & {\bf matches} & {\bf tag}  & {\bf false}  & {\bf data} & {\bf written}  & {\bf read} \\
 228 {\bf size}  &               & {\bf hits} & {\bf alarms} &            &                &            \\ \hline \hline
 229
 230 300         & 64247         & 3817434    & 948          & 5312200    & 5629158        & 1632284   \\ \hline
 231 500         & 46989         & 620013     & 64           & 1091900    & 1283906        & 979384    \\ \hline
 232 700         & 33255         & 571970     & 22           & 1307800    & 1444346        & 699564    \\ \hline
 233 900         & 25686         & 525058     & 24           & 1469500    & 1575438        & 544124    \\ \hline
 234 1100        & 20848         & 496844     & 21           & 1654500    & 1740838        & 445204    \\ \hline
 235 \end{tabular}
 236 \vspace*{5mm}
 237
 238 In each case, the CPU time taken was less than the
 239 time it takes to run ``diff'' on the two files.\footnote{The wall
 240   clock time was approximately 2 minutes per run on a 50 MHz SPARC 10
 241   running SunOS, using rsh over loopback for communication.  GNU diff
 242   took about 4 minutes.}
 243
 244 The columns in the table are as follows:
 245
 246 \begin{description}
 247 \item [block size] The size in bytes of the checksummed blocks.
 248 \item [matches] The number of times a block of $B$ was found in $A$.
 249 \item [tag hits] The number of times the 16-bit hash of the rolling
 250   checksum matched a hash of one of the checksums from $B$.
 251 \item [false alarms] The number of times the 32-bit rolling checksum
 252   matched but the strong checksum didn't.
 253 \item [data] The amount of file data transferred verbatim, in bytes.
 254 \item [written] The total number of bytes written by $\alpha$,
 255   including protocol overheads. This is almost all file data.
 256 \item [read] The total number of bytes read by $\alpha$, including
 257   protocol overheads. This is almost all checksum information.
 258 \end{description}
 259
 260 The results demonstrate that for block sizes above 300 bytes, only a
 261 small fraction (around 5\%) of the file was transferred. The amount
 262 transferred was also considerably less than the size of the diff file
 263 that would have been transferred if the diff/patch method of updating
 264 a remote file was used.
 265
 266 The checksums themselves took up a considerable amount of space,
 267 although much less than the size of the data transferred in each
 268 case. Each pair of checksums consumes 20 bytes: 4 bytes for the
 269 rolling checksum plus 16 bytes for the 128-bit MD4 checksum.
 270
 271 The number of false alarms was less than $1/1000$ of the number of
 272 true matches, indicating that the 32-bit rolling checksum is quite
 273 good at screening out false matches.
 274
 275 The number of tag hits indicates that the second level of the
 276 checksum search algorithm was invoked about once every 50
 277 characters.  This is quite high because the total number of blocks in
 278 the file is a large fraction of the size of the tag hash table. For
 279 smaller files we would expect the tag hit rate to be much closer to
 280 the number of matches.  For extremely large files, we should probably
 281 increase the size of the hash table.
 282
 283 The next table shows similar results for a much smaller set of files.
 284 In this case the files were not packed into a tar file first. Rather,
 285 rsync was invoked with an option to recursively descend the directory
 286 tree. The files used were from two source releases of another software
 287 package called Samba. The total source code size is 1.7 MB and the
 288 diff between the two releases is 4155 lines long totalling 120 kB.
 289
 290 \vspace*{5mm}
 291 \begin{tabular}{|l|l|l|l|l|l|l|} \hline
 292 {\bf block} & {\bf matches} & {\bf tag}  & {\bf false}  & {\bf data} & {\bf written}  & {\bf read} \\
 293 {\bf size}  &               & {\bf hits} & {\bf alarms} &            &                &            \\ \hline \hline
 294
 295 300         & 3727          & 3899       & 0            & 129775     & 153999         & 83948       \\ \hline
 296 500         & 2158          & 2325       & 0            & 171574     & 189330         & 50908       \\ \hline
 297 700         & 1517          & 1649       & 0            & 195024     & 210144         & 36828        \\ \hline
 298 900         & 1156          & 1281       & 0            & 222847     & 236471         & 29048        \\ \hline
 299 1100        & 921           & 1049       & 0            & 250073     & 262725         & 23988        \\ \hline
 300 \end{tabular}
 301 \vspace*{5mm}
 302
 303
 304 \section{Availability}
 305
 306 An implementation of rsync which provides a convenient interface
 307 similar to the common UNIX command rcp has been written and is
 308 available for download from http://rsync.samba.org/
 309
 310 \end{document}