We study the problem of maintaining large replicated collections of files or documents in a distributed environment with limited bandwidth. This problem arises in a number of important applications, such as synchronization of data between accounts or devices, content distribution and web caching networks, web site mirroring, storage networks, and large-scale web search and mining. At the core of the problem lies the following challenge, called the file synchronization problem: given two versions of a file on different machines, say an outdated and a current one, how can we update the outdated version with minimum communication cost by exploiting the significant similarity between the versions? Although rsync, a popular open-source tool for this problem, is used in hundreds of thousands of installations, there have been very few attempts to improve upon it in practice.
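To make the file synchronization problem concrete, here is a minimal sketch of block-based synchronization in the general spirit of rsync. It is an illustration under stated assumptions, not the method proposed in this paper and not rsync's actual implementation: the function names, the tiny block size, and the delta format are all hypothetical. For simplicity it recomputes a SHA-256 hash at every position instead of using a cheap rolling checksum as a first filter. The machine holding the outdated file sends one hash per block, and the machine holding the current file replies with a delta that copies matching blocks and sends only the remaining bytes literally.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, tiny for illustration; real tools use far larger blocks


def block_signatures(old: bytes, block_size: int = BLOCK_SIZE) -> dict:
    """Outdated side: map the hash of each full fixed-size block to its offset."""
    sigs = {}
    for offset in range(0, len(old) - block_size + 1, block_size):
        digest = hashlib.sha256(old[offset:offset + block_size]).digest()
        sigs.setdefault(digest, offset)  # keep the first offset for duplicate blocks
    return sigs


def make_delta(new: bytes, sigs: dict, block_size: int = BLOCK_SIZE) -> list:
    """Current side: emit ("copy", offset) for known blocks, ("data", bytes) otherwise."""
    delta = []
    literal = bytearray()
    i = 0
    while i < len(new):
        window = new[i:i + block_size]
        offset = None
        if len(window) == block_size:
            offset = sigs.get(hashlib.sha256(window).digest())
        if offset is not None:
            if literal:
                # Flush pending unmatched bytes before the copy instruction.
                delta.append(("data", bytes(literal)))
                literal = bytearray()
            delta.append(("copy", offset))
            i += block_size
        else:
            literal.append(new[i])
            i += 1
    if literal:
        delta.append(("data", bytes(literal)))
    return delta


def apply_delta(old: bytes, delta: list, block_size: int = BLOCK_SIZE) -> bytes:
    """Outdated side: rebuild the current file from local blocks plus literal data."""
    out = bytearray()
    for kind, value in delta:
        if kind == "copy":
            out += old[value:value + block_size]
        else:
            out += value
    return bytes(out)
```

In this sketch the communication cost is the size of the signatures plus the size of the delta. When the two versions are similar, most of the current file is expressed as short copy instructions rather than literal bytes, which is the saving the problem statement refers to.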