
SCAM: A Copy Detection Mechanism for Digital Documents, D-Lib Magazine (1995)

by N Shivakumar, H Garcia-Molina
Results 1 - 10 of 65

On the Resemblance and Containment of Documents

by Andrei Z. Broder - In Compression and Complexity of Sequences (SEQUENCES’97), 1997
Cited by 254 (5 self)
Given two documents A and B we define two mathematical notions: their resemblance r(A, B) and their containment c(A, B) that seem to capture well the informal notions of "roughly the same" and "roughly contained." The basic idea is to reduce these issues to set intersection problems that can be easily evaluated by a process of random sampling that can be done independently for each document. Furthermore, the resemblance can be evaluated using a fixed size sample for each document.
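
The set-intersection idea in this abstract can be illustrated with a minimal Python sketch. It assumes documents are reduced to sets of word-level shingles of width w, that resemblance is the intersection size over the union size, and that containment is the intersection size over the size of A's set. The function names, the shingle width, the SHA-1 based hash, and the sample size s are illustrative assumptions, not details given in the listing above.

```python
import hashlib


def shingles(text, w=3):
    """Return the set of contiguous w-word sequences (shingles) in text."""
    words = text.lower().split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def resemblance(a, b, w=3):
    """Exact r(A, B): shared shingles divided by all distinct shingles."""
    sa, sb = shingles(a, w), shingles(b, w)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def containment(a, b, w=3):
    """Exact c(A, B): fraction of A's shingles that also occur in B."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa) if sa else 0.0


def sketch(text, s=100, w=3):
    """Fixed-size sample: the s smallest 64-bit hashes of the shingles."""
    hashes = {
        int.from_bytes(hashlib.sha1(" ".join(sh).encode("utf-8")).digest()[:8], "big")
        for sh in shingles(text, w)
    }
    return sorted(hashes)[:s]


def estimated_resemblance(a, b, s=100, w=3):
    """Estimate r(A, B) from the two sketches alone (computed per document)."""
    ma, mb = set(sketch(a, s, w)), set(sketch(b, s, w))
    smallest = set(sorted(ma | mb)[:s])
    return len(smallest & ma & mb) / len(smallest) if smallest else 0.0


doc_a = "a rose is a rose is a rose"
doc_b = "a rose is a flower which is a rose"
print(resemblance(doc_a, doc_b), containment(doc_a, doc_b))
```

When s is at least as large as both shingle sets, the estimate coincides with the exact value; smaller s trades accuracy for a fixed storage cost per document.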

Winnowing: Local Algorithms for Document Fingerprinting

by Saul Schleimer - Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, 2003
Cited by 129 (2 self)
Digital content is for copying: quotation, revision, plagiarism, and file sharing all create copies. Document fingerprinting is concerned with accurately identifying copying, including small partial copies, within large sets of documents. We introduce the class of local document fingerprinting algorithms, which seems to capture an essential property of any fingerprinting technique guaranteed to detect copies. We prove a novel lower bound on the performance of any local algorithm. We also develop winnowing, an efficient local fingerprinting algorithm, and show that winnowing’s performance is within 33% of the lower bound. Finally, we also give experimental results on Web data, and report experience with MOSS, a widely-used plagiarism detection service.
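
A local fingerprinting scheme of the kind this abstract names can be sketched as follows. The sketch assumes the usual description of winnowing: hash every k-gram of the normalized text, slide a window of w consecutive hashes, and keep the minimum hash of each window (the rightmost one on ties), recording each selected position once. The values of k and w, the CRC-32 hash, and the function names are illustrative assumptions rather than details from the listing.

```python
import zlib


def kgram_hashes(text, k=5):
    """Hash every k-character substring of the text after dropping
    non-alphanumeric characters and lowercasing."""
    cleaned = "".join(ch.lower() for ch in text if ch.isalnum())
    return [
        zlib.crc32(cleaned[i:i + k].encode("utf-8"))
        for i in range(len(cleaned) - k + 1)
    ]


def winnow(hashes, w=4):
    """Select (hash, position) fingerprints: the minimum of each window of
    w hashes, taking the rightmost minimum on ties."""
    fingerprints = []
    last_pos = -1
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Offset of the rightmost occurrence of the minimum in this window.
        offset = w - 1 - window[::-1].index(smallest)
        pos = start + offset
        if pos != last_pos:  # selected positions never move left
            fingerprints.append((smallest, pos))
            last_pos = pos
    return fingerprints


def fingerprint(text, k=5, w=4):
    """Convenience wrapper: the set of selected hash values for a text."""
    return {h for h, _ in winnow(kgram_hashes(text, k), w)}


sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
print(winnow(sample, w=4))
```

Under these assumptions, any shared passage of at least w + k - 1 normalized characters produces at least one common fingerprint, so comparing the sets returned by `fingerprint` for two texts flags copied passages of that length or longer.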

Indexing and retrieval of scientific literature

by Steve Lawrence, Kurt Bollacker, C. Lee Giles - Proceedings of the 8th International Conference on Information and Knowledge Management, 1999
Cited by 68 (14 self)
The web has greatly improved access to scientific literature. However, scientific articles on the web are largely disorganized, with research articles being spread across archive sites, institution sites, journal sites, and researcher homepages. No index covers all of the available literature, and the major web search engines typically do not index the content of Postscript/PDF documents at all. This paper discusses the creation of digital libraries of scientific literature on the web, including the efficient location of articles, full-text indexing of the articles, autonomous citation indexing, information extraction, display of query-sensitive summaries and citation context, hubs and authorities computation, similar document detection, user profiling, distributed error correction, graph analysis, and detection of overlapping documents. The software for the system is available at no cost for non-commercial use.

Building a Scalable and Accurate Copy Detection Mechanism

by Narayanan Shivakumar, Hector Garcia-Molina - In Proceedings of 1st ACM Conference on Digital Libraries (DL'96), 1996
Cited by 59 (7 self)
Often, publishers are reluctant to offer valuable digital documents on the Internet for fear that they will be re-transmitted or copied widely. A Copy Detection Mechanism can help identify such copying. For example, publishers may register their documents with a copy detection server, and the server can then automatically check public sources such as UseNet articles and Web sites for potential illegal copies. The server can search for exact copies, and also for cases where significant portions of documents have been copied. In this paper we study, for the first time, the performance of various copy detection mechanisms, including the disk storage requirements, main memory requirements, response times for registration, and response time for querying. We also contrast performance to the accuracy of the mechanisms (how well they detect partial copies). The results are obtained using SCAM, an experimental server we have implemented, and a collection of 50,000 netnews articles.

Scalable Document Fingerprinting

by Nevin Heintze - In Proc. USENIX Workshop on Electronic Commerce, 1996
Cited by 59 (0 self)
As more information becomes available electronically, document search based on textual similarity is becoming increasingly important, not only for locating documents online, but also for addressing internet variants of old problems such as plagiarism and copyright violation. This paper presents an online system that provides reliable search results using modest resources and scales up to data sets of the order of a million documents. Our system provides a practical compromise between storage requirements, immunity to noise introduced by document conversion and security needs for plagiarism applications. We present both quantitative analysis and empirical results to argue that our design is feasible and effective. A web-based prototype system is accessible via the URL http://www.cs.cmu.edu/afs/cs/user/nch/www/koala.html.

Finding near-replicas of documents on the Web

by Narayanan Shivakumar, Hector Garcia-Molina - In International Workshop on the World Wide Web and Databases (WebDB’98), 1998
Cited by 54 (0 self)
We consider how to efficiently compute the overlap between all pairs of web documents. This information can be used to improve web crawlers, web archivers and in the presentation of search results, among others. We report statistics on how common replication is on the web, and on the cost of computing the above information for a relatively large subset of the web: about 24 million web pages, which corresponds to about 150 Gigabytes of textual information.

Exploiting Hierarchical Domain Structure to Compute Similarity

by Prasanna Ganesan, Hector Garcia-Molina, Jennifer Widom - ACM Transactions on Information Systems, 2003
Cited by 50 (0 self)
Abstract not found

Efficient Snapshot Differential Algorithms for Data Warehousing

by Wilburt Juan Labio, Hector Garcia-Molina - In Proceedings of the International Conference on Very Large Data Bases, 1996
Cited by 39 (7 self)
Detecting and extracting modifications from information sources is an integral part of data warehousing. For unsophisticated sources, in practice it is often necessary to infer modifications by periodically comparing snapshots of data from the source. Although this snapshot differential problem is closely related to traditional joins and outerjoins, there are significant differences, which lead to simple new algorithms. In particular, we present algorithms that perform (possibly lossy) compression of records. We also present a window algorithm that works very well if the snapshots are not "very different." The algorithms are studied via analysis and an implementation of two of them; the results illustrate the potential gains achievable with the new algorithms.
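
To make the snapshot differential problem concrete, here is a naive Python sketch; it is not one of the paper's algorithms. It assumes each snapshot is a mapping from a record key to a record, and it stands in for "lossy compression" by comparing short hash signatures instead of full records. The function names, the SHA-1 signature, and its length are hypothetical choices for illustration.

```python
import hashlib


def signature(record, nbytes=8):
    """Compress a record to a short hash. This is lossy: two different
    records can, rarely, share a signature, and that change is then missed."""
    return hashlib.sha1(repr(record).encode("utf-8")).digest()[:nbytes]


def snapshot_diff(old, new):
    """Infer inserts, updates, and deletes between two snapshots,
    each given as a dict mapping key -> record."""
    old_sigs = {key: signature(rec) for key, rec in old.items()}
    changes = []
    for key, rec in new.items():
        if key not in old_sigs:
            changes.append(("insert", key))
        elif old_sigs[key] != signature(rec):
            changes.append(("update", key))
    changes.extend(("delete", key) for key in old if key not in new)
    return changes


old_snapshot = {1: ("alice", 10), 2: ("bob", 20), 3: ("carol", 30)}
new_snapshot = {1: ("alice", 10), 2: ("bob", 25), 4: ("dave", 40)}
print(snapshot_diff(old_snapshot, new_snapshot))
```

Keeping only signatures of the old snapshot reduces memory at the cost of occasionally missing an update, which is the trade-off the phrase "possibly lossy" points at.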

Identifying and Merging Related Bibliographic Records

by Jeremy A. Hylton - MIT LCS Masters Thesis, 1996
Cited by 35 (0 self)
Bibliographic records freely available on the Internet can be used to construct a high-quality, digital finding aid that provides the ability to discover paper and electronic documents. The key challenge to providing such a service is integrating mixed-quality bibliographic records, coming from multiple sources and in multiple formats. This thesis describes an algorithm that automatically identifies records that refer to the same work and clusters them together; the algorithm clusters records for which both author and title match. It tolerates errors and cataloging variations within the records by using a full-text search engine and an n-gram-based approximate string matching algorithm to build the clusters. The algorithm identifies more than 90 percent of the related records and includes incorrect records in less than 1 percent of the clusters. It has been used to construct a 250,000-record collection of the computer science literature. This thesis also presents preliminary work on aut...
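
One common form of n-gram-based approximate string matching can be sketched as below; the thesis's actual matching rule is not given in this listing, so the character trigrams, the Dice coefficient, and the function names here are assumptions chosen for illustration.

```python
def ngrams(s, n=3):
    """Set of character n-grams of s after lowercasing and collapsing spaces."""
    s = " ".join(s.lower().split())
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def ngram_similarity(a, b, n=3):
    """Dice coefficient over character n-grams, in the range [0.0, 1.0]."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# A title with a typo still scores high; an unrelated title scores low.
print(ngram_similarity("Copy Detection Mechanisms", "Copy Detecton Mechanisms"))
print(ngram_similarity("Copy Detection Mechanisms", "Scalable Document Fingerprinting"))
```

A clustering step could then treat two records as candidates for the same work when both the author and title similarities exceed a chosen threshold; the threshold would need tuning on real records.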

Making Trust Explicit in Distributed Commerce Transactions

by Steven P. Ketchpel, Hector Garcia-Molina - In Proceedings of the International Conference on Distributed Computing Systems, 1995
Cited by 28 (2 self)
In a distributed environment where nodes are independently motivated, many transactions or commercial exchanges may be stymied due to a lack of trust between the participants. The addition of trusted intermediaries may facilitate some exchanges, but others are still problematic. We introduce a language for specifying these commercial exchange problems, and sequencing graphs, a formalism for determining whether a given exchange may occur. We also present an algorithm for generating a feasible execution sequence of pairwise exchanges between parties (when it exists). Indemnities may be offered to facilitate previously infeasible transactions. We show when and how this approach facilitates commercial transactions.
1 Introduction
Electronic commerce is a rapidly expanding field, with modern and vast computer networks bringing together many potential customers and producers, increasing the number and range of contacts, and increasing the likelihood of matching the producers and consumers. ...