\documentclass[a4paper]{article}
\begin{document}


\title{The rsync algorithm}

\author{Andrew Tridgell \quad\quad Paul Mackerras\\
Department of Computer Science \\
Australian National University \\
Canberra, ACT 0200, Australia}

\maketitle

\begin{abstract}
  This report presents an algorithm for updating a file on one machine
  to be identical to a file on another machine. We assume that the
  two machines are connected by a low-bandwidth, high-latency
  bi-directional communications link. The algorithm identifies parts
  of the source file which are identical to some part of the
  destination file, and only sends those parts which cannot be matched
  in this way. Effectively, the algorithm computes a set of
  differences without having both files on the same machine. The
  algorithm works best when the files are similar, but will also
  function correctly and reasonably efficiently when the files are
  quite different.
\end{abstract}

\section{The problem}

Imagine you have two files, $A$ and $B$, and you wish to update $B$ to be
the same as $A$. The obvious method is to copy $A$ onto $B$.

Now imagine that the two files are on machines connected by a slow
communications link, for example a dialup IP link. If $A$ is large,
copying $A$ onto $B$ will be slow. To make it faster you could
compress $A$ before sending it, but that will usually only gain a
factor of 2 to 4.

Now assume that $A$ and $B$ are quite similar, perhaps both derived
from the same original file. To really speed things up you would need
to take advantage of this similarity. A common method is to send just
the differences between $A$ and $B$ down the link and then use this
list of differences to reconstruct the file.

The problem is that the normal methods for creating a set of
differences between two files rely on being able to read both files.
Thus they require that both files are available beforehand at one end
of the link. If they are not both available on the same machine,
these algorithms cannot be used (once you had copied the file over,
you wouldn't need the differences). This is the problem that rsync
addresses.

The rsync algorithm efficiently computes which parts of a source file
match some part of an existing destination file. These parts need not
be sent across the link; all that is needed is a reference to the part
of the destination file. Only parts of the source file which are not
matched in this way need to be sent verbatim. The receiver can then
construct a copy of the source file using the references to parts of
the existing destination file and the verbatim material.

Trivially, the data sent to the receiver can be compressed using any
of a range of common compression algorithms, for further speed
improvements.
\section{The rsync algorithm}

Suppose we have two general purpose computers $\alpha$ and $\beta$.
Computer $\alpha$ has access to a file $A$ and $\beta$ has access to
file $B$, where $A$ and $B$ are ``similar''. There is a slow
communications link between $\alpha$ and $\beta$.

The rsync algorithm consists of the following steps:

\begin{enumerate}
\item $\beta$ splits the file $B$ into a series of non-overlapping
  fixed-sized blocks of size $S$ bytes\footnote{We have found that
  values of $S$ between 500 and 1000 are quite good for most purposes.}.
  The last block may be shorter than $S$ bytes.

\item For each of these blocks $\beta$ calculates two checksums:
  a weak ``rolling'' 32-bit checksum (described below) and a strong
  128-bit MD4 checksum.

\item $\beta$ sends these checksums to $\alpha$.

\item $\alpha$ searches through $A$ to find all blocks of length $S$
  bytes (at any offset, not just multiples of $S$) that have the same
  weak and strong checksum as one of the blocks of $B$. This can be
  done in a single pass very quickly using a special property of the
  rolling checksum described below.

\item $\alpha$ sends $\beta$ a sequence of instructions for
  constructing a copy of $A$. Each instruction is either a reference
  to a block of $B$, or literal data. Literal data is sent only for
  those sections of $A$ which did not match any of the blocks of $B$.
\end{enumerate}
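As a concrete illustration, the first two steps can be sketched as follows. This is a minimal sketch, not the rsync implementation: the function names `weak32` and `block_checksums` are our own, and MD5 stands in for MD4, which is often unavailable in modern hash libraries.

```python
import hashlib

def weak32(block):
    """The 32-bit rolling checksum s(k,l) defined in the next section."""
    M = 1 << 16
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a + (b << 16)

def block_checksums(data, block_size=700):
    """Steps 1 and 2: split B into non-overlapping fixed-size blocks
    (the last one may be short) and checksum each block."""
    sums = []
    for off in range(0, len(data), block_size):
        block = data[off:off + block_size]
        sums.append((weak32(block), hashlib.md5(block).digest()))
    return sums
```

The list returned here is exactly what $\beta$ transmits to $\alpha$ in step 3.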

The end result is that $\beta$ gets a copy of $A$, but only the pieces
of $A$ that are not found in $B$ (plus a small amount of data for
checksums and block indexes) are sent over the link. The algorithm
also requires only one round trip, which minimises the impact of the
link latency.

The most important details of the algorithm are the rolling checksum
and the associated multi-alternate search mechanism which allows the
all-offsets checksum search to proceed very quickly. These will be
discussed in greater detail below.

\section{Rolling checksum}

The weak rolling checksum used in the rsync algorithm needs to have
the property that it is very cheap to calculate the checksum of a
buffer $X_2 \ldots X_{n+1}$ given the checksum of buffer $X_1 \ldots X_n$ and
the values of the bytes $X_1$ and $X_{n+1}$.

The weak checksum algorithm we used in our implementation was inspired
by Mark Adler's adler-32 checksum. Our checksum is defined by
$$ a(k,l) = (\sum_{i=k}^l X_i) \bmod M $$
$$ b(k,l) = (\sum_{i=k}^l (l-i+1)X_i) \bmod M $$
$$ s(k,l) = a(k,l) + 2^{16} b(k,l) $$

where $s(k,l)$ is the rolling checksum of the bytes $X_k \ldots X_l$.
For simplicity and speed, we use $M = 2^{16}$.

The important property of this checksum is that successive values can
be computed very efficiently using the recurrence relations

$$ a(k+1,l+1) = (a(k,l) - X_k + X_{l+1}) \bmod M $$
$$ b(k+1,l+1) = (b(k,l) - (l-k+1) X_k + a(k+1,l+1)) \bmod M $$

Thus the checksum can be calculated for blocks of length $S$ at all
possible offsets within a file in a ``rolling'' fashion, with very
little computation at each point.
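The definition and the recurrences can be checked directly with a short sketch (the function names are our own; $M = 2^{16}$ as above):

```python
M = 1 << 16

def weak_sums(block):
    """Compute a(k,l) and b(k,l) directly from the definition."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def roll(a, b, out_byte, in_byte, s):
    """Slide the window one byte, per the recurrence relations above:
    drop X_k (out_byte), append X_{l+1} (in_byte); s = l - k + 1."""
    a = (a - out_byte + in_byte) % M
    b = (b - s * out_byte + a) % M   # uses the already-updated a(k+1,l+1)
    return a, b
```

Rolling the checksum across a buffer one byte at a time yields exactly the same values as recomputing it from scratch at every offset.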

Despite its simplicity, this checksum was found to be quite adequate as
a first-level check for a match of two file blocks. We have found in
practice that the probability of this checksum matching when the
blocks are not equal is quite low. This is important because the much
more expensive strong checksum must be calculated for each block where
the weak checksum matches.

\section{Checksum searching}

Once $\alpha$ has received the list of checksums of the blocks of $B$,
it must search $A$ for any blocks at any offset that match the
checksum of some block of $B$. The basic strategy is to compute the
32-bit rolling checksum for a block of length $S$ starting at each
byte of $A$ in turn, and for each checksum, search the list for a
match. To do this our implementation uses a simple three-level
searching scheme.

The first level uses a 16-bit hash of the 32-bit rolling checksum and
a $2^{16}$ entry hash table. The list of checksum values (i.e., the
checksums from the blocks of $B$) is sorted according to the 16-bit
hash of the 32-bit rolling checksum. Each entry in the hash table
points to the first element of the list for that hash value, or
contains a null value if no element of the list has that hash value.

At each offset in the file the 32-bit rolling checksum and its 16-bit
hash are calculated. If the hash table entry for that hash value is
not a null value, the second-level check is invoked.

The second-level check involves scanning the sorted checksum list
starting with the entry pointed to by the hash table entry, looking
for an entry whose 32-bit rolling checksum matches the current value.
The scan terminates when it reaches an entry whose 16-bit hash
differs. If this search finds a match, the third-level check is
invoked.

The third-level check involves calculating the strong checksum for the
current offset in the file and comparing it with the strong checksum
value in the current list entry. If the two strong checksums match,
we assume that we have found a block of $A$ which matches a block of
$B$. In fact the blocks could be different, but the probability of
this is microscopic, and in practice this is a reasonable assumption.
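The three levels can be sketched as a table lookup. This is an illustrative sketch, not the actual implementation: the 16-bit hash function (XOR-folding the halves of the weak checksum) is an assumption, chained lists replace the sorted-list-plus-pointer layout described above, and MD5 again stands in for MD4.

```python
import hashlib
from collections import defaultdict

def hash16(weak):
    """First level: fold the 32-bit weak checksum to a 16-bit table index
    (the folding function here is an assumption, not rsync's)."""
    return (weak ^ (weak >> 16)) & 0xFFFF

def build_table(checksums):
    """Index B's (weak, strong) checksum pairs by 16-bit hash,
    remembering each block's index in B."""
    table = defaultdict(list)
    for idx, (weak, strong) in enumerate(checksums):
        table[hash16(weak)].append((weak, strong, idx))
    return table

def lookup(table, weak, window):
    """Second level: compare 32-bit weak checksums; third level: compare
    strong checksums. Returns the matching block index of B, or None."""
    for cand_weak, cand_strong, idx in table.get(hash16(weak), ()):
        if cand_weak == weak and hashlib.md5(window).digest() == cand_strong:
            return idx
    return None
```

Only windows that survive all three levels are reported as matches, so the expensive strong checksum is computed rarely.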

When a match is found, $\alpha$ sends $\beta$ the data in $A$ between
the current file offset and the end of the previous match, followed by
the index of the block in $B$ that matched. This data is sent as soon
as a match is found, which allows us to overlap the communication with
further computation.

If no match is found at a given offset in the file, the rolling
checksum is updated to the next offset and the search proceeds. If a
match is found, the search is restarted at the end of the matched
block. This strategy saves a considerable amount of computation for
the common case where the two files are nearly identical. In
addition, it would be a simple matter to encode the block indexes as
runs, for the common case where a portion of $A$ matches a series of
blocks of $B$ in order.
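The scanning strategy, including the restart at the end of a matched block, can be sketched as follows. For brevity this sketch matches on the weak checksum alone, whereas the real algorithm confirms each hit with the strong checksum; the function and variable names are our own.

```python
M = 1 << 16

def weak_sums(block):
    """a(k,l) and b(k,l) from the rolling-checksum definition."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a, b

def find_matches(data, targets, s):
    """Slide an s-byte window over data; `targets` maps the weak checksum
    of each block of B to its block index. On a hit, record it and
    restart the search at the end of the matched block."""
    matches, off, sums = [], 0, None
    while off + s <= len(data):
        if sums is None:                      # (re)compute from scratch
            sums = weak_sums(data[off:off + s])
        if sums in targets:
            matches.append((off, targets[sums]))
            off += s                          # restart past the match
            sums = None
            continue
        if off + s == len(data):
            break
        a, b = sums                           # roll one byte forward
        a = (a - data[off] + data[off + s]) % M
        b = (b - s * data[off] + a) % M
        sums, off = (a, b), off + 1
    return matches
```

Note that a full recomputation of the checksum is needed only after a match; everywhere else the cheap rolling update suffices.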

\section{Pipelining}

The above sections describe the process for constructing a copy of one
file on a remote system. If we have several files to copy, we can
gain a considerable latency advantage by pipelining the process.

This involves $\beta$ initiating two independent processes. One of the
processes generates and sends the checksums to $\alpha$ while the
other receives the difference information from $\alpha$ and
reconstructs the files.

If the communications link is buffered then these two processes can
proceed independently and the link should be kept fully utilised in
both directions for most of the time.

\section{Results}

To test the algorithm, tar files were created of the Linux kernel
sources for two versions of the kernel. The two kernel versions were
1.99.10 and 2.0.0. These tar files are approximately 24 MB in size and
are separated by 5 released patch levels.

Out of the 2441 files in the 1.99.10 release, 291 files had changed in
the 2.0.0 release, 19 files had been removed and 25 files had been
added.

A ``diff'' of the two tar files using the standard GNU diff utility
produced over 32 thousand lines of output totalling 2.1 MB.

The following table shows the results for rsync between the two files
with a varying block size.\footnote{All the tests in this section were
  carried out using rsync version 0.5.}

\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag} & {\bf false} & {\bf data} & {\bf written} & {\bf read} \\
{\bf size} & & {\bf hits} & {\bf alarms} & & & \\ \hline \hline
300 & 64247 & 3817434 & 948 & 5312200 & 5629158 & 1632284 \\ \hline
500 & 46989 & 620013 & 64 & 1091900 & 1283906 & 979384 \\ \hline
700 & 33255 & 571970 & 22 & 1307800 & 1444346 & 699564 \\ \hline
900 & 25686 & 525058 & 24 & 1469500 & 1575438 & 544124 \\ \hline
1100 & 20848 & 496844 & 21 & 1654500 & 1740838 & 445204 \\ \hline
\end{tabular}
\vspace*{5mm}

In each case, the CPU time taken was less than the
time it takes to run ``diff'' on the two files.\footnote{The wall
  clock time was approximately 2 minutes per run on a 50 MHz SPARC 10
  running SunOS, using rsh over loopback for communication. GNU diff
  took about 4 minutes.}

The columns in the table are as follows:

\begin{description}
\item [block size] The size in bytes of the checksummed blocks.
\item [matches] The number of times a block of $B$ was found in $A$.
\item [tag hits] The number of times the 16-bit hash of the rolling
  checksum matched a hash of one of the checksums from $B$.
\item [false alarms] The number of times the 32-bit rolling checksum
  matched but the strong checksum didn't.
\item [data] The amount of file data transferred verbatim, in bytes.
\item [written] The total number of bytes written by $\alpha$,
  including protocol overheads. This is almost all file data.
\item [read] The total number of bytes read by $\alpha$, including
  protocol overheads. This is almost all checksum information.
\end{description}

The results demonstrate that for block sizes above 300 bytes, only a
small fraction (around 5\%) of the file was transferred. The amount
transferred was also considerably less than the size of the diff file
that would have been transferred if the diff/patch method of updating
a remote file had been used.

The checksums themselves took up a considerable amount of space,
although much less than the size of the data transferred in each
case. Each pair of checksums consumes 20 bytes: 4 bytes for the
rolling checksum plus 16 bytes for the 128-bit MD4 checksum.
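As a rough consistency check on these figures: at a block size of 700
bytes, a file of approximately 24 MB comprises about
$24{,}000{,}000 / 700 \approx 34{,}000$ blocks, so the checksum data
amounts to roughly $34{,}000 \times 20 \approx 680$ kB, which agrees
well with the 699,564 bytes read at that block size in the table above.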

The number of false alarms was less than $1/1000$ of the number of
true matches, indicating that the 32-bit rolling checksum is quite
good at screening out false matches.

The number of tag hits indicates that the second level of the
checksum search algorithm was invoked about once every 50
characters. This is quite high because the total number of blocks in
the file is a large fraction of the size of the tag hash table. For
smaller files we would expect the tag hit rate to be much closer to
the number of matches. For extremely large files, we should probably
increase the size of the hash table.

The next table shows similar results for a much smaller set of files.
In this case the files were not packed into a tar file first. Rather,
rsync was invoked with an option to recursively descend the directory
tree. The files used were from two source releases of another software
package called Samba. The total source code size is 1.7 MB and the
diff between the two releases is 4155 lines long totalling 120 kB.

\vspace*{5mm}
\begin{tabular}{|l|l|l|l|l|l|l|} \hline
{\bf block} & {\bf matches} & {\bf tag} & {\bf false} & {\bf data} & {\bf written} & {\bf read} \\
{\bf size} & & {\bf hits} & {\bf alarms} & & & \\ \hline \hline
300 & 3727 & 3899 & 0 & 129775 & 153999 & 83948 \\ \hline
500 & 2158 & 2325 & 0 & 171574 & 189330 & 50908 \\ \hline
700 & 1517 & 1649 & 0 & 195024 & 210144 & 36828 \\ \hline
900 & 1156 & 1281 & 0 & 222847 & 236471 & 29048 \\ \hline
1100 & 921 & 1049 & 0 & 250073 & 262725 & 23988 \\ \hline
\end{tabular}
\vspace*{5mm}


\section{Availability}

An implementation of rsync which provides a convenient interface
similar to the common UNIX command rcp has been written and is
available for download from http://rsync.samba.org/

\end{document}