IRLbot: Scaling to 6 Billion Pages and Beyond
Cited by 15 (0 self)
This paper shares our experience in designing a web crawler that can download billions of pages using a single-server implementation and models its performance. We show that with the quadratically increasing complexity of verifying URL uniqueness, BFS crawl order, and fixed per-host rate limiting, current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multi-million-page blog sites, and infinite loops created by server-side scripts. We offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 Mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the web graph with 41 billion unique nodes.
XPath lookup queries in P2P networks
- In WIDM’04: Proceedings of the 6th annual ACM international workshop on Web information and data management, 2004
Cited by 14 (0 self)
We address the problem of querying XML data over a P2P network. In P2P networks, the allowed kinds of queries are usually exact-match queries over file names. We discuss the extensions needed to deal with XML data and XPath queries. A single peer can hold a whole document or a partial or complete fragment of one. Each XML fragment or document is identified by a distinct path expression, which is encoded in a distributed hash table. Our framework differs from content-based routing mechanisms, which are biased towards finding the most relevant peers holding the data. We place fragments and enable fragment lookup by exploiting only a few path expressions stored on each peer. By taking advantage of quasi-zero replication of global catalogs, our system supports fast full and partial XPath querying. To this end, we have extended the Chord simulator and performed an experimental evaluation of our approach.
Approximately detecting duplicates for streaming data using stable bloom filters
- In SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD international conference on Management of data, 2006
Cited by 12 (1 self)
Traditional duplicate elimination techniques are not applicable to many data stream applications. In general, precisely eliminating duplicates in an unbounded data stream is not feasible in many streaming scenarios. Therefore, we aim to approximately eliminate duplicates in streaming environments given limited space. Based on a well-known bitmap sketch, we introduce a data structure, the Stable Bloom Filter (SBF), and a novel and simple algorithm. The basic idea is as follows: since there is no way to store the whole history of the stream, the SBF continuously evicts stale information to make room for more recent elements. After deriving some properties of the SBF analytically, we show that a tight upper bound on the false positive rate is guaranteed. In our empirical study, we compare the SBF to alternative methods. The results show that our method is superior in terms of both accuracy and time efficiency when a fixed small space and an acceptable false positive rate are given.
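The eviction idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `StableBloomFilter`, the method `seen_then_add`, every parameter value, the salted SHA-256 hashing, and the choice to decrement independently chosen random cells are all assumptions made for illustration. Each arriving element is first checked against its hashed cells, then a few random cells are decremented to evict stale information, and finally the element's own cells are reset to the maximum value.

```python
import hashlib
import random


class StableBloomFilter:
    """Illustrative sketch of a stable-Bloom-filter-style duplicate detector."""

    def __init__(self, num_cells=1024, num_hashes=3, max_value=3,
                 decrements_per_insert=10, seed=0):
        # Fixed-size array of small counters: memory never grows with the stream.
        self.cells = [0] * num_cells
        self.num_hashes = num_hashes
        self.max_value = max_value
        self.decrements_per_insert = decrements_per_insert
        self.rng = random.Random(seed)

    def _indexes(self, item):
        # Assumption: salted SHA-256 digests stand in for independent hash functions.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % len(self.cells)

    def seen_then_add(self, item):
        """Report whether item looks like a duplicate, then record it."""
        indexes = list(self._indexes(item))
        # An element is reported as a duplicate only if all of its cells are non-zero.
        seen = all(self.cells[i] > 0 for i in indexes)
        # Evict stale information: decrement a few randomly chosen cells.
        for _ in range(self.decrements_per_insert):
            j = self.rng.randrange(len(self.cells))
            if self.cells[j] > 0:
                self.cells[j] -= 1
        # Refresh this element's cells so that recent items are remembered.
        for i in indexes:
            self.cells[i] = self.max_value
        return seen
```

With a fresh filter, `seen_then_add("a")` returns `False` on the first call and `True` on an immediately repeated call. Because cells decay over time, an element that has not appeared recently may later be reported as new, which is the trade-off that buys bounded memory.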
Spam, Damn Spam, and Statistics: Using statistical analysis to locate spam web pages
- In Proceedings of WebDB, 2004
Cited by 9 (0 self)
The increasing importance of search engines to commercial web sites has given rise to a phenomenon we call "web spam", that is, web pages that exist only to mislead search engines into (mis)leading users to certain web sites. Web spam is a nuisance to users as well as search engines: users have a harder time finding the information they need, and search engines have to cope with an inflated corpus, which in turn causes their cost per query to increase. Therefore, search engines have a strong incentive to weed out spam web pages from their index.
Finding duplicates in a data stream
- In Proc. 20th Annual Symposium on Discrete Algorithms (SODA), 2009
Cited by 3 (3 self)
Given a data stream of length n over an alphabet [m] where n > m, we consider the problem of finding a duplicate in a single pass. We give a randomized algorithm for this problem that uses O((log m)^3) space. This answers a question of Muthukrishnan [Mut05] and Tarui [Tar07], who asked if this problem could be solved using sub-linear space and one pass over the input. Our algorithm solves the more general problem of finding a positive frequency element in a stream given by frequency updates where the sum of all frequencies is positive. Our main tool is an Isolation Lemma that reduces this problem to the task of detecting and identifying a Dictatorial variable in a Boolean halfspace. We present various relaxations of the condition n > m, under which one can find duplicates efficiently.
Nursing services
- British Medical Journal, 1984
Cited by 2 (0 self)
We have developed a web-repository crawler that is used for reconstructing websites when backups are unavailable. Our crawler retrieves web resources from the Internet Archive, Google, Yahoo and MSN. We examine the challenges of crawling web repositories, and we discuss strategies for overcoming some of these obstacles. We propose three crawling policies which can be used to reconstruct websites. We evaluate the effectiveness of the policies by reconstructing 24 websites and comparing the results with live versions of the websites. We conclude with our experiences reconstructing lost websites on behalf of others and discuss plans for improving our web-repository crawler.
Automated Gathering of Web Information: An In-depth Examination of Agents Interacting with Search Engines
Cited by 1 (0 self)
In this paper, we refer to spiders, softbots, meta-search applications and other automated information gathering processes all as agents.
Trust Based Knowledge Acquisition for Conversation Agents
This paper explores the study of conversation agent (CA) knowledge bases, particularly its relationship to the related study of information trustworthiness and knowledge acquisition. Some important methods for knowledge extraction from online documents are discussed. This is done in relation to the design and development of domain-specific information for CAs’ knowledge bases. This paper focuses on a novel approach based on the proposed Web Knowledge Trust Model (WKTM) and the Automated Knowledge Extraction Agent (AKEA). The results indicate that WKTM is useful for evaluating the trustworthiness of web sites and for developing key criteria for knowledge acquisition for conversation agents.
OPTIMIZATION ISSUES IN WEB SEARCH ENGINES
Crawlers are deployed by a Web search engine to collect information from different Web servers in order to maintain the currency of its database of Web pages. We present studies on the optimization of Web search engines from different perspectives. We first investigate the number of crawlers to be used by a search engine so as to maximize the currency of the database without putting an unnecessary load on the network. We address both the static setting, where crawlers are always active, and the dynamic setting, where crawlers may be activated or deactivated as a function of the state of the system. We then consider the optimal scheduling of the visits of these crawlers to the Web pages, assuming these pages are modified at different rates. Finally, we briefly discuss some other optimization issues of Web search engines, including page ranking and system optimization.

