X-Git-Url: https://mattmccutchen.net/rsync/rsync.git/blobdiff_plain/50f2f002d90d6fc0209cd02fb1c1d52ec6a7ca45..43a4dc1053bd3bfec67f3cc6a6fa4edc1f394a82:/TODO

diff --git a/TODO b/TODO
index 0c6afa8e..42439806 100644
--- a/TODO
+++ b/TODO
@@ -32,14 +32,145 @@ use chroot
   for people who want to generate the file list using a find(1)
   command or a script.
 
+File list structure in memory
+
+  Rather than one big array, perhaps have a tree in memory mirroring
+  the directory tree.
+
+  This might make sorting much faster!  (I'm not sure it's a big CPU
+  problem, mind you.)
+
+  It might also reduce memory use in storing repeated directory names
+  -- again, I'm not sure this is a problem.
+
 Performance
 
   Traverse just one directory at a time.  Tridge says it's possible.
-
-  Can possibly also be smarter about memory use while looking for hard
-  links by reducing the refcount as we find alternative names.  In
-  fact at the moment the code seems to make a whole second copy of the
-  file list, which seems unnecessary.
+
+  At the moment rsync reads the whole file list into memory at the
+  start, which both uses a lot of memory and prevents us from
+  pipelining network access as much as we could.
+
+
+Handling duplicate names
+
+  We need to be careful of duplicate names getting into the file list.
+  See clean_flist().  This could happen if multiple arguments include
+  the same file.  Bad.
+
+  I think duplicates are only a problem if both copies are flowing
+  through the pipeline at the same time.  For example, we might update
+  the first occurrence after we have already read the checksums for
+  the second.  So possibly we just need to make sure that we never
+  have both in the pipeline at the same time.
+
+  Possibly doing one directory at a time would be sufficient.
+
+  Alternatively, we could pre-process the arguments to make sure no
+  duplicates will ever be inserted.  There could be some bad cases
+  when we're collapsing symlinks.
+
+  We could have a hash table.
+
+  The root of the problem is that we do not want more than one file
+  list entry referring to the same file.  At first glance there are
+  several ways this could happen: symlinks, hardlinks, and repeated
+  names on the command line.
+
+  If names are repeated on the command line, they may appear in
+  different forms, perhaps because directory paths were traversed in
+  different ways or through symlinks.  We also need to allow for
+  rsync's own expansion of globs.
+
+  At the moment, clean_flist() requires having the entire file list in
+  memory.  Duplicate names are detected just by a string comparison.
+
+  We don't need to worry about hard links causing duplicates, because
+  files are never updated in place.  Similarly for symlinks.
+
+  I think even if we're using a different symlink mode we don't need
+  to worry.
+
+  Unless we're really clever, changing how duplicates are handled will
+  introduce a protocol incompatibility, so we need to be able to
+  accept the old format as well.
+
+
+Memory accounting
+
+  At exit, show how much memory was used for the file list, etc.
+
+  Also, we do a weird exponential-growth allocation in flist.c.  I'm
+  not sure this makes sense with modern mallocs.  At any rate it will
+  make us allocate a huge amount of memory for large file lists.
+
+
+Hard-link handling
+
+  At the moment hardlink handling is very expensive, so it's off by
+  default.  It does not need to be so expensive.
+
+  Since most of the solutions are rather intertwined with the file
+  list, it is probably better to fix that first, although fixing
+  hardlinks is possibly simpler.
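+
+  As a rough illustration of the kind of detection table this might
+  involve -- a hypothetical sketch with made-up names, not the
+  structures rsync actually uses -- sorting candidate names by
+  (device, inode) groups all the names for one file together:
+
+    /* Hypothetical sketch, not rsync's real structures: one entry per
+     * candidate name (st_nlink > 1 && !S_ISDIR).  After sorting, runs
+     * of equal (dev, ino) keys are names for the same file; only the
+     * first name in a run needs real transfer work, the rest just
+     * become links to it. */
+
+    #include <stdlib.h>
+    #include <sys/types.h>
+
+    struct link_candidate {
+        dev_t dev;
+        ino_t ino;
+        int   flist_index;  /* position of this name in the file list */
+    };
+
+    static int cmp_candidate(const void *a, const void *b)
+    {
+        const struct link_candidate *x = a, *y = b;
+
+        if (x->dev != y->dev)
+            return x->dev < y->dev ? -1 : 1;
+        if (x->ino != y->ino)
+            return x->ino < y->ino ? -1 : 1;
+        return x->flist_index - y->flist_index;
+    }
+
+    /* After qsort(cands, n, sizeof cands[0], cmp_candidate), cands[i]
+     * is a "master" name iff i == 0 or cands[i-1] differs in dev or
+     * ino; once a run has been fully processed its memory could be
+     * released, as suggested below. */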
+
+  We can rule out hardlinked directories, since they would probably
+  screw us up in all kinds of ways.  They simply should not be used.
+
+  At the moment rsync only cares about hardlinks to regular files.  I
+  guess you could also use them for sockets, devices and other
+  beasts, but I have not seen them used.
+
+  When trying to reproduce hard links, we only need to worry about
+  files that have more than one name (nlinks>1 && !S_ISDIR).
+
+  The basic point of this is to discover alternate names that refer
+  to the same file.  All operations, including creating the file and
+  writing modifications to it, need only be done for the first name.
+  For all later names, we just create the link and then leave it
+  alone.
+
+  If hard links are to be preserved:
+
+    Before the generator/receiver fork, the list of files is received
+    from the sender (recv_file_list), and a table for detecting hard
+    links is built.
+
+    The generator looks for hard links within the file list and does
+    not send checksums for them, though it does send other metadata.
+
+    The sender sends the device number and inode with file entries, so
+    that files are uniquely identified.
+
+    The receiver goes through and creates hard links (do_hard_links)
+    after all data has been written, but before directory permissions
+    are set.
+
+  At the moment device and inode numbers are sent as 4-byte integers,
+  which will probably cause problems on large filesystems.  On Linux
+  the kernel uses 64-bit ino_t's internally, and people will soon
+  have filesystems big enough to use them.  We ought to follow NFSv4
+  in using 64-bit device and inode identification, perhaps with a
+  protocol version bump.
+
+  Once we've seen all the names for a particular file, we no longer
+  need to think about it, and we can deallocate the memory.
+
+  We can also have the case where there are links to a file that are
+  not in the tree being transferred.  There's nothing we can do about
+  that.  Because we rename the destination into place after writing,
+  any hardlinks to the old file are always going to be orphaned.  In
+  fact that is almost necessary, because otherwise we'd get really
+  confused if we were generating checksums for one name of a file
+  while modifying another.
+
+  At the moment the code seems to make a whole second copy of the
+  file list, which seems unnecessary.
+
+  We should have a test case that exercises hard links.  Since it
+  might be hard to compare ./tls output where the inodes change, we
+  might need a little program to check whether several names refer to
+  the same file; a sketch of such a helper appears at the end of this
+  file.
 
 IPv6
 
@@ -89,10 +220,30 @@ Empty directories
   can end up with many empty directories.  We might avoid this by
   lazily creating such directories.
 
+
 zlib
 
-  Perhaps don't use our own zlib.  Will we actually be incompatible,
-  or just be slightly less efficient?
+  Perhaps don't use our own zlib.
+
+  Advantages:
+
+    - will automatically be up to date with bugfixes in zlib
+
+    - can leave it out for small rsync on e.g. recovery disks
+
+    - can use a shared library
+
+    - avoids people breaking rsync by trying to do this themselves and
+      messing up
+
+  Should we ship zlib for systems that don't have it, or require
+  people to install it separately?
+
+  Apparently this will make us incompatible with versions of rsync
+  that use the patched version of zlib.  Probably the simplest way to
+  handle this is just to disable gzip (with a warning) when talking
+  to old versions, as sketched below.
+
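+  A minimal sketch of that fallback -- the cutoff constant is made up
+  for illustration, and the warning call stands in for rsync's real
+  logging -- might look like this:
+
+    /* Hypothetical sketch of disabling compression when the peer
+     * still carries the patched zlib.  FIRST_STOCK_ZLIB_PROTOCOL is
+     * an invented cutoff, and fprintf() stands in for rprintf(). */
+
+    #include <stdio.h>
+
+    #define FIRST_STOCK_ZLIB_PROTOCOL 27    /* made-up version number */
+
+    extern int remote_protocol;     /* negotiated protocol version */
+    extern int do_compression;      /* set by -z */
+
+    static void check_zlib_compat(void)
+    {
+        if (do_compression
+            && remote_protocol < FIRST_STOCK_ZLIB_PROTOCOL) {
+            fprintf(stderr, "warning: peer predates stock zlib; "
+                            "disabling compression\n");
+            do_compression = 0;
+        }
+    }
+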
 
 logging
 
@@ -100,10 +251,52 @@ logging
   monitor progress in a log file can do so more easily.
   See http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=48108
 
+  At the moment, connections that just get a list of modules are not
+  logged, but they should be.
+
 rsyncd over ssh
 
   There are already some patches to do this.
 
+proxy authentication
+
+  Allow RSYNC_PROXY to be http://user:pass@proxy.foo:3128/, and do
+  HTTP Basic Proxy-Authentication.
+
+  Multiple schemes are possible, up to and including the insanity
+  that is NTLM, but Basic probably covers most cases.
+
+SOCKS
+
+  Add --with-socks, and then perhaps a command-line option to turn it
+  on or off.  This might be more reliable than LD_PRELOAD hacks.
+
+Better statistics:
+
+  mbp: hey, how about an rsync option that just gives you the
+    summary without the list of files?  And perhaps gives more
+    information like the number of new files, number of changed,
+    deleted, etc. ?
+  Rasmus: nice idea
+    there is --stats
+    but at the moment it's very tridge-oriented
+    rather than user-friendly
+    it would be nice to improve it
+    that would also work well with --dryrun
+
+TDB:
+
+  Rather than storing the file list in memory, store it in a TDB.
+
+  This *might* make memory usage lower while building the file list.
+
+  Hashtable lookup will mean files are not transmitted in order,
+  though... hm.
+
+  This would neatly eliminate one of the major post-fork shared data
+  structures.
+
 
 PLATFORMS ------------------------------------------------------------
 
 Win32
@@ -119,6 +312,31 @@ Win32
   we are correct to call close(), because shutdown() discards
   untransmitted data.
 
+DEVELOPMENT ----------------------------------------------------------
+
+Splint
+
+  Build rsync with Splint to try to find security holes.  Add
+  annotations as necessary.  Keep track of the number of warnings
+  found initially, and see how many of them are real bugs or real
+  security bugs.  Knowing the percentage of likely hits would be
+  really interesting for other projects.
+
+Torture test
+
+  Something that just keeps running rsync continuously over a data
+  set likely to generate problems.
+
+Cross-testing
+
+  Run current rsync versions against significant past releases.
+
+Memory debugger
+
+  jra recommends:
+
+  http://devel-home.kde.org/~sewardj/
+
 DOCUMENTATION --------------------------------------------------------
 
 Update README
@@ -137,10 +355,6 @@ Add machines
 
 NICE -----------------------------------------------------------------
 
-SIGHUP
-
-  Re-read config file (just exec() ourselves) rather than exiting.
-
 --no-detach and --no-fork options
 
   Very useful for debugging.  Also good when running under a
@@ -149,12 +363,13 @@ SIGHUP
 
 hang/timeout friendliness
 
-  On 
-
 verbose output
 
   Indicate whether files are new, updated, or deleted.
 
+  At the end of a transfer, show how many files were or were not
+  transferred correctly.
+
 internationalization
 
   Change to using gettext().  Probably need to ship this for platforms
 
@@ -171,4 +386,3 @@ rsyncsh
   fairly directly into rsync commands: it just needs to remember the
   current host, directory and so on.  We can probably even do
   completion of remote filenames.
-
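+
+Hard-link test helper
+
+  As a starting point for the test helper suggested under hard-link
+  handling above, here is a minimal sketch (hypothetical; nothing
+  like this ships with rsync yet).  It exits 0 only if every name
+  given refers to the same file:
+
+    /* samefile.c -- sketch of a helper for the hard-link test case:
+     * exit 0 if all names refer to one file (same device and inode),
+     * 1 if they differ, 2 on usage or stat errors.  Uses lstat() so
+     * a symlink counts as itself, not as its target. */
+
+    #include <stdio.h>
+    #include <sys/stat.h>
+
+    int main(int argc, char **argv)
+    {
+        struct stat first, st;
+        int i;
+
+        if (argc < 3) {
+            fprintf(stderr, "usage: %s FILE FILE...\n", argv[0]);
+            return 2;
+        }
+        if (lstat(argv[1], &first) != 0) {
+            perror(argv[1]);
+            return 2;
+        }
+        for (i = 2; i < argc; i++) {
+            if (lstat(argv[i], &st) != 0) {
+                perror(argv[i]);
+                return 2;
+            }
+            if (st.st_dev != first.st_dev || st.st_ino != first.st_ino)
+                return 1;
+        }
+        return 0;
+    }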