Finding Replicated Web Collections (1999)

by Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
In ACM SIGMOD

Abstract:

Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.
