The Content and Access Dynamics of a Busy Web Site: Findings and Implications
2000
Cited by 104 (9 self)
In this paper, we study the dynamics of the MSNBC news site, one of the busiest Web sites on the Internet today. Unlike many other efforts that have analyzed client accesses as seen by proxies, we focus on the server end. We analyze the dynamics of both the server content and client accesses made to the server. The former considers the content creation and modification process while the latter considers page popularity and locality in client accesses. Some of our key results are: (a) files tend to change little when they are modified, (b) a small set of files tends to get modified repeatedly, (c) file popularity follows a Zipf-like distribution with a parameter α that is much larger than reported in previous, proxy-based studies, and (d) there is significant temporal stability in file popularity but not much stability in the domains from which clients access the popular content. We discuss the implications of these findings for techniques such as Web caching (including cache consisten...
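As an illustration of result (c), the sketch below estimates the exponent of a Zipf-like popularity curve, assuming request counts fall off roughly as 1/rank^α. The function name `zipf_exponent` and the log-log least-squares fit are assumptions made for this example; the abstract does not say how the authors estimated the parameter.

```python
import math


def zipf_exponent(counts):
    """Estimate alpha in count ~ C / rank**alpha by a least-squares fit
    of log(count) against log(rank). Hypothetical helper, not from the paper."""
    ranked = sorted((c for c in counts if c > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two positive counts")
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is -alpha, so flip the sign.
    return -sxy / sxx
```

A larger fitted value means requests are concentrated on fewer files, which is the situation in which a small cache can absorb a large share of the traffic.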
File System Support for Delta Compression
2000
Cited by 53 (0 self)
Delta compression, which consists of compactly encoding one file version as the result of changes to another, can improve efficiency in the use of network and disk resources. Delta compression techniques are readily available and can result in compression factors of five to ten on typical data. Managing delta-compressed storage, however, is a difficult task. I will present a system that attempts to isolate the complexity of delta-compressed storage management by separating the task of version labeling from performance issues. I will show how the system integrates delta-compressed transport with delta-compressed storage. Existing tools for managing delta-compressed storage suffer from weak file system support. Lack of transaction support is responsible for inefficient application behavior. The only atomic operation in the traditional file system forces unnecessary disk activity due to copying costs. I will demonstrate that transaction support can improve application performance and extensibility wit...
Compactly Encoding Unstructured Inputs with Differential Compression
Journal of the ACM, 2002
Cited by 35 (8 self)
The subject of this article is differential compression, the algorithmic task of finding common strings between versions of data and using them to encode one version compactly by describing it as a set of changes from its companion. A main goal of this work is to present new differencing algorithms that (i) operate at a fine granularity (the atomic unit of change), (ii) make no assumptions about the format or alignment of input data, and (iii) in practice use linear time, use constant space, and give good compression. We present new algorithms, which do not always compress optimally but use considerably less time or space than existing algorithms. One new algorithm runs in O(n) time and O(1) space in the worst case (where each unit of space contains ⌈log n⌉ bits), as compared to
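The idea of describing one version as a set of changes from another can be sketched with copy and add commands. The code below is a simple greedy encoder that indexes fixed-length seeds of the reference in a hash table; it is not one of the article's algorithms (it uses space linear in the reference, not constant space), and the names `delta_encode`, `delta_decode`, and `seed_len` are assumptions for this example.

```python
def delta_encode(reference, version, seed_len=4):
    """Greedy copy/add delta of `version` against `reference` (illustrative only)."""
    # Map each seed_len-byte substring of the reference to its first offset.
    index = {}
    for i in range(len(reference) - seed_len + 1):
        index.setdefault(reference[i:i + seed_len], i)

    commands = []
    literal = bytearray()
    pos = 0
    while pos < len(version):
        start = index.get(version[pos:pos + seed_len])
        if start is None:
            # No common string starts here: emit the byte literally.
            literal.append(version[pos])
            pos += 1
            continue
        # Extend the match forward as far as both inputs agree.
        length = 0
        while (start + length < len(reference)
               and pos + length < len(version)
               and reference[start + length] == version[pos + length]):
            length += 1
        if literal:
            commands.append(("add", bytes(literal)))
            literal.clear()
        commands.append(("copy", start, length))
        pos += length
    if literal:
        commands.append(("add", bytes(literal)))
    return commands


def delta_decode(reference, commands):
    """Rebuild the version from the reference and the command list."""
    out = bytearray()
    for command in commands:
        if command[0] == "copy":
            _, start, length = command
            out += reference[start:start + length]
        else:
            out += command[1]
    return bytes(out)
```

The delta is compact when most of the version is covered by a few long copy commands, since each copy costs only an offset and a length regardless of how many bytes it reproduces.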
Cache-based Compaction: A New Technique for Optimizing Web Transfer
1999
Cited by 30 (0 self)
In this paper, we propose and study a new technique, which we call cache-based compaction, for reducing the latency of Web browsing over a slow link. Our compaction technique trades computation for bandwidth. The key observation is that an object can be coded in a highly compact form for transfer if similar objects that have been transferred earlier can be used as references. The contributions of this paper are: (1) an efficient selection algorithm for selecting similar objects as references, and (2) an encoding/decoding algorithm that reduces the size of a Web object by exploiting its similarities with the reference objects. We verify the efficacy of our proposal through detailed experimental evaluations. Our compaction technique significantly generalizes previous work on optimizing Web transfer using compression or differencing, and provides a systematic foundation that ties together caching, compression and prefetching.
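The key observation can be illustrated with a deflate preset dictionary, which lets a compressor refer back to bytes that both ends already hold. This is a stand-in technique, not the paper's selection or encoding algorithm; the function names `compact` and `expand` are assumptions, and both sides are assumed to hold identical reference objects in the same order.

```python
import zlib

MAX_DICT = 32768  # deflate can only look back 32 KiB into the dictionary


def _dictionary(references):
    """Concatenate cached reference objects and keep the last 32 KiB."""
    zdict = b"".join(references)[-MAX_DICT:]
    if not zdict:
        raise ValueError("at least one non-empty reference object is required")
    return zdict


def compact(obj, references):
    """Compress `obj` using earlier-transferred objects as a preset dictionary."""
    compressor = zlib.compressobj(level=9, zdict=_dictionary(references))
    return compressor.compress(obj) + compressor.flush()


def expand(payload, references):
    """Invert `compact`; the receiver must supply the same references."""
    decompressor = zlib.decompressobj(zdict=_dictionary(references))
    return decompressor.decompress(payload) + decompressor.flush()
```

When the new object closely resembles a cached one, most of it is encoded as back-references into the dictionary, so the payload is much smaller than compressing the object on its own.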
Engineering a differencing and compression data format
In Proceedings of the USENIX Annual Technical Conference, 2002
Improved File Synchronization Techniques for Maintaining Large Replicated Collections over Slow Networks
In Proc. of the Int. Conf. on Data Engineering, 2004
Cited by 14 (5 self)
We study the problem of maintaining large replicated collections of files or documents in a distributed environment with limited bandwidth. This problem arises in a number of important applications, such as synchronization of data between accounts or devices, content distribution and web caching networks, web site mirroring, storage networks, and large scale web search and mining. At the core of the problem lies the following challenge, called the file synchronization problem: given two versions of a file on different machines, say an outdated and a current one, how can we update the outdated version with minimum communication cost, by exploiting the significant similarity between the versions? While a popular open source tool for this problem called rsync is used in hundreds of thousands of installations, there have been very few attempts to improve upon this tool in practice. In this paper,
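The file synchronization problem can be sketched with block signatures in the general style of rsync: the holder of the outdated file sends one hash per fixed-size block, and the holder of the current file replies with block references plus literal bytes. This is a simplified illustration, not the paper's improved method; real rsync pairs a rolling weak checksum with a strong hash, whereas this sketch rehashes every window with SHA-1 for brevity, and all function names are assumptions.

```python
import hashlib


def block_signatures(old, block_size):
    """Outdated side: hash each full block and map digest -> block number."""
    return {
        hashlib.sha1(old[i:i + block_size]).digest(): i // block_size
        for i in range(0, len(old) - block_size + 1, block_size)
    }


def make_patch(new, signatures, block_size):
    """Current side: emit block numbers (int) where a window of `new`
    matches a block of the old file, and literal bytes elsewhere."""
    patch = []
    literal = bytearray()
    pos = 0
    while pos < len(new):
        window = new[pos:pos + block_size]
        block = None
        if len(window) == block_size:
            block = signatures.get(hashlib.sha1(window).digest())
        if block is None:
            literal.append(new[pos])
            pos += 1
        else:
            if literal:
                patch.append(bytes(literal))
                literal.clear()
            patch.append(block)
            pos += block_size
    if literal:
        patch.append(bytes(literal))
    return patch


def apply_patch(old, patch, block_size):
    """Outdated side: rebuild the current file from old blocks and literals."""
    out = bytearray()
    for item in patch:
        if isinstance(item, int):
            out += old[item * block_size:(item + 1) * block_size]
        else:
            out += item
    return bytes(out)
```

Only the signatures and the patch cross the network, so the communication cost depends on how much of the current file cannot be matched against blocks of the outdated one.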
A Testbed for Configuration Management Policy Programming
IEEE Transactions on Software Engineering, 2002
Cited by 13 (4 self)
Even though the number and variety of available configuration management systems have grown rapidly in the past few years, the need for new configuration management systems still remains. Driving this need are the emergence of situations requiring highly specialized solutions, the demand for management of artifacts other than traditional source code, and the exploration of entirely new research questions in configuration management. Complicating the picture is the trend toward organizational structures that involve personnel working at physically separate sites. We have developed a testbed to support the rapid development of configuration management systems. The testbed separates configuration management repositories (i.e., the stores for versions of artifacts) from configuration management policies (i.e., the procedures according to which the versions are manipulated) by providing a generic model of a distributed repository and an associated programmatic interface. Specific configuration management policies are programmed as unique extensions to the generic interface, while the underlying distributed repository is reused across different policies. In this paper, we describe the repository model and its interface and present our experience in using a prototype of the testbed, called NUCM, to implement a variety of configuration management systems.
Cluster-Based Delta Compression of a Collection of Files
In Third Int. Conf. on Web Information Systems Engineering, 2002
Cited by 13 (5 self)
Delta compression techniques are commonly used to succinctly represent an updated version of a file with respect to an earlier one. In this paper, we study the use of delta compression in a somewhat different scenario, where we wish to compress a large collection of (more or less) related files by performing a sequence of pairwise delta compressions. The problem of finding an optimal delta encoding for a collection of files by taking pairwise deltas can be reduced to the problem of computing a branching of maximum weight in a weighted directed graph, but this solution is inefficient and thus does not scale to larger file collections. This motivates us to propose a framework for cluster-based delta compression that uses text clustering techniques to prune the graph of possible pairwise delta encodings. To demonstrate the efficacy of our approach, we present experimental results on collections of web pages. Our experiments show that cluster-based delta compression of collections provides significant improvements in compression ratio as compared to individually compressing each file or using tar+gzip, at a moderate cost in efficiency.
Algorithms for Delta Compression and Remote File Synchronization
In Khalid Sayood, editor, Lossless Compression Handbook, 2002
Cited by 13 (8 self)
Delta compression and remote file synchronization techniques are concerned with efficient file transfer over a slow communication link in the case where the receiving party already has a similar file (or files). This problem arises naturally, e.g., when distributing updated versions of software over a network or synchronizing personal files between different accounts and devices. More generally, the problem is becoming increasingly common in many network-based applications where files and content are widely replicated, frequently modified, and cut and reassembled in different contexts and packagings.
Renaming Detection
Autom. Softw. Eng., 2000
Cited by 11 (1 self)
Finding changed identifiers in programs is important for program comparison and merging. Comparing two versions of a program is complicated if renaming has occurred. Textual merging is highly unreliable if, in one version, identifiers were renamed, while in the other version, code using the old identifiers was added or modified. A tool that

