Results 1 - 10
of
35
Recognizing Nepotistic Links on the Web
- In AAAI-2000 Workshop on Artificial Intelligence for Web Search
, 2000
"... The use of link analysis and page popularity in search engines has grown recently to improve query result rankings. Since the number of such links contributes to the value of the document in such calculations, we wish to recognize and eliminate nepotistic links --- links between pages that are p ..."
Abstract
-
Cited by 58 (8 self)
- Add to MetaCart
The use of link analysis and page popularity in search engines has grown recently to improve query result rankings. Since the number of such links contributes to the value of the document in such calculations, we wish to recognize and eliminate nepotistic links --- links between pages that are present for reasons other than merit. This paper explores some of the issues surrounding the question of what links to keep, and we report high accuracy in initial experiments to show the potential for using a machine learning tool to automatically recognize such links. Introduction Recently there has been growing interest in the research community in using analysis of link information of the Web (Bharat & Henzinger 1998; Brin & Page 1998; Chakrabarti et al. 1998b; 1998a; Chakrabarti, Dom, & Indyk 1998; Gibson, Kleinberg, & Raghavan 1998a; 1998b; Kleinberg 1998; Page et al. 1998; Chakrabarti et al. 1999a; Chakrabarti, van den Berg, & Dom 1999; Chakrabarti et al. 1999b; Henzinger et al...
Information retrieval on the Web
- ACM Computing Surveys
, 2000
"... In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical ..."
Abstract
-
Cited by 58 (0 self)
- Add to MetaCart
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited
Collection Statistics for Fast Duplicate Document Detection
- ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 2002
"... ..."
A Comparison of Techniques to Find Mirrored Hosts on the WWW
- Journal of the American Society of Information Science
, 1999
"... We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information easily available from web proxies and crawlers. Identification of mirrored hosts can improve web-based information retrieva ..."
Abstract
-
Cited by 51 (5 self)
- Add to MetaCart
We compare several algorithms for identifying mirrored hosts on the World Wide Web. The algorithms operate on the basis of URL strings and linkage data: the type of information easily available from web proxies and crawlers. Identification of mirrored hosts can improve web-based information retrieval in several ways: First, by identifying mirrored hosts, search engines can avoid storing and returning duplicate documents. Second, several new information retrieval techniques for the Web make inferences based on the explicit links among hypertext documents -- mirroring perturbs their graph model and degrades performance. Third, mirroring information can be used to redirect users to alternate mirror sites to compensate for various failures, and can thus improve the performance of web browsers and proxies. # This work was presented at the Workshop on Organizing Web Space at the Fourth ACM Conference on Digital Libraries 1999. We evaluated 4 classes of "top-down" algorithms for detecting ...
What can you do with a web in your pocket
- Data Engineering Bulletin
, 1998
"... The amount of information available online has grown enormously over the past decade. Fortunately, computing power, disk capacity, and network bandwidth have also increased dramatically. It is currently possible for a university research project to store and process the entire World Wide Web. Since ..."
Abstract
-
Cited by 50 (1 self)
- Add to MetaCart
The amount of information available online has grown enormously over the past decade. Fortunately, computing power, disk capacity, and network bandwidth have also increased dramatically. It is currently possible for a university research project to store and process the entire World Wide Web. Since there is a limit on how much text humans can generate, it is plausible that within a few decades one will be able to store and process all the human-generated text on the Web in a shirt pocket. The Web is a very rich and interesting data source. In this paper, we describe the Stanford WebBase, a local repository of a significant portion of the Web. Furthermore, we describe a number of recent experiments that leverage the size and the diversity of the WebBase. First, we have largely automated the process of extracting a sizable relation of books (title, author pairs) from hundreds of data sources spread across the World Wide Web using a technique we call Dual Iterative Pattern Relation Extraction. Second, we have developed a global ranking of Web pages called PageRank based on the link structure of the Web that has properties that are useful for search and navigation. Third, we have used PageRank to develop a novel search engine called Google, which also makes heavy use of anchor text. All of these experiments rely significantly on the size and diversity of the WebBase. 1
Aliasing on the World Wide Web: Prevalence and Performance Implications
, 2002
"... Aliasing occurs in Web transactions when requests containing different URLs elicit replies containing identical data payloads. Aliasing can cause cache misses, and there is reason to suspect that offthe -shelf Web authoring tools might increase aliasing on the Web. Existing research literature, howe ..."
Abstract
-
Cited by 34 (3 self)
- Add to MetaCart
Aliasing occurs in Web transactions when requests containing different URLs elicit replies containing identical data payloads. Aliasing can cause cache misses, and there is reason to suspect that offthe -shelf Web authoring tools might increase aliasing on the Web. Existing research literature, however, says little about the prevalence of aliasing in user-initiated transactions or its impact on endto -end performance in large multi-level cache hierarchies.
Efficient Video Similarity Measurement with Video Signature
- IEEE Transactions on Circuits and Systems for Video Technology
, 2003
"... The proliferation of video content on the web makes similarity detection an indispensable tool in web data management, searching, and navigation. In this paper, we propose a number of algorithms to efficiently measure video similarity. We define video as a set of frames, which are represented as hig ..."
Abstract
-
Cited by 32 (5 self)
- Add to MetaCart
The proliferation of video content on the web makes similarity detection an indispensable tool in web data management, searching, and navigation. In this paper, we propose a number of algorithms to efficiently measure video similarity. We define video as a set of frames, which are represented as high dimensional vectors in a feature space. Our goal is to measure Ideal Video Similarity (IVS), defined as the percentage of clusters of similar frames shared between two video sequences. Since IVS is too complex to be deployed in large database applications, we approximate it with Voronoi Video Similarity (VVS), defined as the volume of the intersection between Voronoi Cells of similar clusters. We propose a class of randomized algorithms to estimate VVS by first summarizing each video with a small set of its sampled frames, called the Video Signature (ViSig), and then calculating the distances between corresponding frames from the two ViSig's. By generating samples with a probability distribution that describes the video statistics, and ranking them based upon their likelihood of making an error in the estimation, we show analytically that ViSig can provide an unbiased estimate of IVS. Experimental results on a large dataset of web video and a set of MPEG-7 test sequences with artificially generated similar versions are provided to demonstrate the retrieval performance of our proposed techniques.
Online Duplicate Document Detection: Signature Reliability in a Dynamic Retrieval Environment
- Proceedings of the twelfth international conference on Information and knowledge management, Pages: 443 - 452
, 2003
"... As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical. Few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and determine one or more approaches that minimize its impact on search results. Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a ‘fingerprint ’ of the rarest or richest features in a document using collection statistics as criteria for feature selection. One of the challenges of this approach, however, arises from the fact that in production environments, collections of documents are always changing, with
Methods for Identifying Versioned and Plagiarised Documents
- Journal of the American Society for Information Science and Technology
, 2003
"... The widespread use of online publishing of text promotes storage of multiple versions of documents and mirroring of documents in multiple locations, and greatly simplifies the task of plagiarising the work of others. We evaluate two families of methods for searching a collection to find documents th ..."
Abstract
-
Cited by 20 (4 self)
- Add to MetaCart
The widespread use of online publishing of text promotes storage of multiple versions of documents and mirroring of documents in multiple locations, and greatly simplifies the task of plagiarising the work of others. We evaluate two families of methods for searching a collection to find documents that are co-derivative, that is, are versions or plagiarisms of each other. The first, the ranking family, uses information retrieval techniques; extending this family, we propose the identity measure, which is specifically designed for identification of co-derivative documents. The second, the fingerprinting family, uses hashing to generate a compact document description, which can then be compared to the fingerprints of the documents in the collection. We introduce a new method for evaluating the e#ectiveness of these techniques, and demonstrate it in practice. Using experiments on two collections, we demonstrate that the identity measure and the best fingerprinting technique are both able to accurately identify co-derivative documents. However, for fingerprinting parameters must be carefully chosen, and even so the identity measure is clearly superior.
Mirror, mirror on the web: a study of host pairs with replicated content
- In Proceedings of the Eighth International World Wide Web Conference
, 1999
"... Two previous studies, one done at Stanford in 1997 based on data collected by the Google search engine, and one done at Digital in 1996 based on AltaVista data, revealed that almost a third of the Web consists of duplicate pages. Both studies identified mirroring, that is, the systematic replication ..."
Abstract
-
Cited by 18 (1 self)
- Add to MetaCart
Two previous studies, one done at Stanford in 1997 based on data collected by the Google search engine, and one done at Digital in 1996 based on AltaVista data, revealed that almost a third of the Web consists of duplicate pages. Both studies identified mirroring, that is, the systematic replication of content over a pair of hosts, as the principal cause of duplication, but did not further investigate this phenomenon. The main aim of this paper is to present a clearer picture of mirroring on the Web. As input we used a set of 179 million URLs found during a Web crawl done in the summer of 1998. We looked at all hosts with more than 100 URLs in our input (about 238,000), and discovered that about 10 % were mirrored to varying degrees. The paper presents data about the prevalence of mirroring based on a mirroring classification scheme that we define. There are numerous reasons for mirroring: technical (e.g., to improve access time), commercial (e.g., different intermediaries offering the same products), cultural (e.g., same content in two languages), social (e.g., sharing of research

