Results 1 - 10 of 47
Detecting Near-Duplicates for Web Crawling
WWW 2007
"... Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a nea ..."
Abstract
-
Cited by 92 (0 self)
Near-duplicate web documents are abundant. Two such documents differ from each other in a very small portion that displays advertisements, for example. Such differences are irrelevant for web search. So the quality of a web crawler increases if it can assess whether a newly crawled web page is a near-duplicate of a previously crawled web page or not. In the course of developing a near-duplicate detection system for a multi-billion page repository, we make two research contributions. First, we demonstrate that Charikar’s fingerprinting technique is appropriate for this goal. Second, we present an algorithmic technique for identifying existing f-bit fingerprints that differ from a given fingerprint in at most k bit-positions, for small k. Our technique is useful for both online queries (single fingerprints) and batch queries (multiple fingerprints). Experimental evaluation over real data confirms the practicality of our design.
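To make the fingerprinting idea concrete, here is a minimal Python sketch of a Charikar-style (simhash) fingerprint plus a Hamming-distance test. It illustrates only the "differ in at most k bits" criterion; the paper's table-based structure for searching a multi-billion-fingerprint repository is omitted. The token choice, the 64-bit width, and k=3 are assumptions for illustration.

```python
import hashlib

def simhash(tokens, bits=64):
    """Charikar-style fingerprint: each token votes on every bit position."""
    v = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    fp = 0
    for i in range(bits):
        if v[i] > 0:
            fp |= 1 << i
    return fp

def near_duplicate(fp_a, fp_b, k=3):
    """Two pages are near-duplicates if their fingerprints differ in at most k bits."""
    return bin(fp_a ^ fp_b).count("1") <= k

# Usage: in practice the tokens would be word shingles extracted from the page body.
a = simhash("near duplicate web documents are abundant".split())
b = simhash("near duplicate web documents are plentiful".split())
print(near_duplicate(a, b, k=3))
```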
Beyond PageRank: Machine learning for static ranking
In WWW ’06: Proceedings of the 15th International Conference on World Wide Web, 2006
"... Since the publication of Brin and Page’s paper on PageRank, many in the Web community have depended on PageRank for the static (query-independent) ordering of Web pages. We show that we can significantly outperform PageRank using features that are independent of the link structure of the Web. We gai ..."
Abstract
-
Cited by 64 (2 self)
Since the publication of Brin and Page’s paper on PageRank, many in the Web community have depended on PageRank for the static (query-independent) ordering of Web pages. We show that we can significantly outperform PageRank using features that are independent of the link structure of the Web. We gain a further boost in accuracy by using data on the frequency at which users visit Web pages. We use RankNet, a ranking machine learning algorithm, to combine these and other static features based on anchor text and domain characteristics. The resulting model achieves a static ranking pairwise accuracy of 67.3% (vs. 56.7% for PageRank or 50% for random).
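The reported 67.3% is a pairwise accuracy: the fraction of judged page pairs whose relative order the static ranking gets right. A small sketch of that metric, with hypothetical page ids and scores (the half-credit tie rule is an assumption, not from the paper):

```python
def pairwise_accuracy(scores, preference_pairs):
    """Fraction of judged (better, worse) pairs that the static scores order correctly.
    `scores` maps page id -> static-rank score; ties get half credit (an assumption)."""
    correct = 0.0
    for better, worse in preference_pairs:
        if scores[better] > scores[worse]:
            correct += 1
        elif scores[better] == scores[worse]:
            correct += 0.5
    return correct / len(preference_pairs)

# Hypothetical example: three pages and two judged preferences.
scores = {"a": 2.1, "b": 0.4, "c": 1.3}
pairs = [("a", "b"), ("c", "b")]
print(pairwise_accuracy(scores, pairs))  # 1.0
```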
The discoverability of the web
In WWW ’07: Proceedings of the 16th International Conference on World Wide Web, 2007
"... ABSTRACT Previous studies have highlighted the rapidity with which new content arrives on the web. We study the extent to which this new content can be efficiently discovered in the crawling model. Our study has two parts. First, we employ a maximum cover formulation to study the inherent difficult ..."
Abstract
-
Cited by 25 (2 self)
Previous studies have highlighted the rapidity with which new content arrives on the web. We study the extent to which this new content can be efficiently discovered in the crawling model. Our study has two parts. First, we employ a maximum cover formulation to study the inherent difficulty of the problem in a setting in which we have perfect estimates of likely sources of links to new content. Second, we relax the requirement of perfect estimates into a more realistic setting in which algorithms must discover new content using historical statistics to estimate which pages are most likely to yield links to new content. We measure the overhead of discovering new content, defined as the average number of fetches required to discover one new page. We show first that with perfect foreknowledge of where to explore for links to new content, it is possible to discover 50% of all new content with under 3% overhead, and 100% of new content with 28% overhead. But actual algorithms, which do not have access to perfect foreknowledge, face a more difficult task: 26% of new content is accessible only by recrawling a constant fraction of the entire web. Of the remaining 74%, 80% of this content may be discovered within one week at discovery cost equal to 1.3X the cost of gathering the new content, in a model with full monthly recrawls.
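The first part of the study is a maximum-cover formulation: choose a budget of old pages to recrawl so that the links they contain reveal as many new pages as possible. Below is a hedged sketch of the standard greedy cover heuristic under the perfect-foreknowledge assumption; the mapping from sources to new pages is hypothetical input, and the paper's own algorithms differ in detail.

```python
def greedy_cover(sources, budget):
    """Pick up to `budget` sources (old pages) greedily, each step taking the source
    that reveals the most not-yet-discovered new pages.
    `sources` maps source id -> set of new-page ids it links to (assumed known, as in
    the perfect-foreknowledge setting)."""
    discovered, chosen = set(), []
    for _ in range(budget):
        best = max(sources, key=lambda s: len(sources[s] - discovered), default=None)
        if best is None or not (sources[best] - discovered):
            break
        chosen.append(best)
        discovered |= sources[best]
    return chosen, discovered

# Hypothetical link data: which new pages each old page points to.
sources = {"p1": {"n1", "n2", "n3"}, "p2": {"n3", "n4"}, "p3": {"n5"}}
print(greedy_cover(sources, budget=2))
```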
Efficient Monitoring Algorithm for Fast News Alert
2005
"... use of XML data to deliver information over the Web. Personal weblogs, news Web sites, and discussion forums are now publishing RSS feeds for their subscribers to retrieve new postings. While the subscribers rely on news feeders to regularly pull articles from the Web sites, the aggregated effect by ..."
Abstract
-
Cited by 24 (1 self)
… use of XML data to deliver information over the Web. Personal weblogs, news Web sites, and discussion forums are now publishing RSS feeds for their subscribers to retrieve new postings. While the subscribers rely on news feeders to regularly pull articles from the Web sites, the aggregated effect by all news feeders puts an enormous load on many sites. In this paper, we propose a blog aggregator approach where a central aggregator monitors and retrieves new postings from different data sources and subsequently disseminates them to the subscribers to alleviate such a problem.
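As one way to picture the aggregator's resource problem, the sketch below splits a fixed daily polling budget across feeds in proportion to their estimated posting rates. This proportional rule is only an illustrative heuristic, not the allocation policy derived in the paper; the feed names and rates are hypothetical.

```python
def allocate_polls(posting_rates, total_polls_per_day):
    """Split a fixed polling budget across feeds, proportionally to each feed's
    estimated posting rate (posts/day). Illustrative heuristic only."""
    total_rate = sum(posting_rates.values())
    return {feed: max(1, round(total_polls_per_day * rate / total_rate))
            for feed, rate in posting_rates.items()}

# Hypothetical feeds with estimated posts per day.
rates = {"news-site": 120.0, "team-blog": 3.0, "forum-rss": 25.0}
print(allocate_polls(rates, total_polls_per_day=1000))
```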
iRobot: An intelligent crawler for Web forums
In WWW, 2008
"... We study in this paper the Web forum crawling problem, which is a very fundamental step in many Web applications, such as search engine and Web data mining. As a typical user-created content (UCC), Web forum has become an important resource on the Web due to its rich information contributed by milli ..."
Abstract
-
Cited by 22 (6 self)
We study in this paper the Web forum crawling problem, which is a very fundamental step in many Web applications, such as search engine and Web data mining. As a typical user-created content (UCC), Web forum has become an important resource on the Web due to its rich information contributed by millions of Internet users every day. However, Web forum crawling is not a trivial problem due to the in-depth link structures, the large amount of duplicate pages, as well as many invalid pages caused by login failure issues. In this paper, we propose and build a prototype of an intelligent forum crawler, iRobot, which has intelligence to understand the content and the structure of a forum site, and then decide how to choose traversal paths among different kinds of pages. To do this, we first randomly sample (download) a few pages from the target forum site, and introduce the page content layout as the characteristics to group those pre-sampled pages and re-construct the forum's sitemap. After that, we select an optimal crawling path which only traverses informative pages and skips invalid and duplicate ones. The extensive experimental results on several forums show the performance of our system in the following aspects: 1) Effectiveness – Compared to a generic crawler, iRobot significantly decreases the duplicate and invalid pages; 2) Efficiency – With a small cost of pre-sampling a few pages for learning the necessary knowledge, iRobot saves substantial network bandwidth and storage as it only fetches informative pages from a forum site; and 3) Long threads that are divided into multiple pages can be re-concatenated and archived as a whole thread, which is of great help for further indexing and data mining.
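The pre-sampling step groups pages with similar content layout. Below is a hedged sketch of that idea using a deliberately crude layout signature (the most frequent HTML tag names); iRobot's actual layout features and grouping are richer, and the URLs and markup here are hypothetical.

```python
import re
from collections import Counter, defaultdict

def layout_signature(html, top_n=10):
    """Crude page-layout signature: the most frequent HTML tag names.
    A stand-in for the richer layout features a forum crawler could use."""
    tags = re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)
    return tuple(sorted(t for t, _ in Counter(t.lower() for t in tags).most_common(top_n)))

def group_sampled_pages(pages):
    """Group pre-sampled pages whose signatures match, approximating the
    kinds of pages (board list, thread list, post page, ...) in a forum."""
    groups = defaultdict(list)
    for url, html in pages.items():
        groups[layout_signature(html)].append(url)
    return groups

# Hypothetical sampled pages.
pages = {"forum/board?id=1": "<table><tr><td><a>t1</a></td></tr></table>",
         "forum/board?id=2": "<table><tr><td><a>t2</a></td></tr></table>",
         "forum/post?id=9":  "<div><p>hello</p><p>world</p></div>"}
for sig, urls in group_sampled_pages(pages).items():
    print(sig, urls)
```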
RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee
2007
"... Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the unbounded number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downl ..."
Abstract
-
Cited by 18 (1 self)
Crawling algorithms have been the subject of extensive research and optimizations, but some important questions remain open. In particular, given the unbounded number of pages available on the Web, search-engine operators constantly struggle with the following vexing questions: When can I stop downloading the Web? How many pages should I download to cover “most” of the Web? How can I know I am not missing an important part when I stop? In this paper we provide an answer to these questions by developing, in the context of a system that is given a set of trusted pages, a family of crawling algorithms that (1) provide a theoretical guarantee on how much of the “important” part of the Web it will download after crawling a certain number of pages and (2) give a high priority to important pages during a crawl, so that the search engine can index the most important part of the Web first. We prove the correctness of our algorithms by theoretical analysis and evaluate their performance experimentally based on 141 million URLs obtained from the Web. Our experiments demonstrate that even our simple algorithm is effective in downloading important pages early on and provides high “coverage” of the Web with a relatively small number of pages.
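For intuition, the sketch below computes personalized PageRank from a set of trusted seed pages by power iteration and then reports how much of that mass falls on the pages downloaded so far. This is a plain coverage computation on a known toy graph, not the paper's crawling algorithm or its lower-bound guarantee over an unseen Web; the graph and seed are hypothetical.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    """Power iteration for personalized PageRank, restarting only at trusted seeds.
    `graph` maps page -> list of out-links; dangling mass is returned to the seeds."""
    pages = list(graph)
    teleport = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    rank = dict(teleport)
    for _ in range(iters):
        nxt = {p: (1 - damping) * teleport[p] for p in pages}
        for p, links in graph.items():
            if links:
                share = damping * rank[p] / len(links)
                for q in links:
                    nxt[q] += share
            else:
                for s in seeds:
                    nxt[s] += damping * rank[p] / len(seeds)
        rank = nxt
    return rank

def coverage(rank, downloaded):
    """Fraction of personalized PageRank mass sitting on already-downloaded pages."""
    return sum(r for p, r in rank.items() if p in downloaded)

# Hypothetical toy graph with one trusted seed.
graph = {"seed": ["a", "b"], "a": ["b"], "b": ["a"], "c": ["seed"]}
rank = personalized_pagerank(graph, seeds={"seed"})
print(coverage(rank, downloaded={"seed", "a"}))
```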
Modeling and Managing Content Changes in Text Databases
2007
"... Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases f ..."
Abstract
-
Cited by 18 (4 self)
Large amounts of (often valuable) information are stored in web-accessible text databases. “Metasearchers” provide unified interfaces to query multiple such databases at once. For efficiency, metasearchers rely on succinct statistical summaries of the database contents to select the best databases for each query. So far, database selection research has largely assumed that databases are static, so the associated statistical summaries do not evolve over time. However, databases are rarely static and the statistical summaries that describe their contents need to be updated periodically to reflect content changes. In this article, we first report the results of a study showing how the content summaries of 152 real web databases evolved over a period of 52 weeks. Then, we show how to use “survival analysis” techniques in general, and Cox’s proportional hazards regression in particular, to model database changes over time and predict when we should update each content summary. Finally, we exploit our change model to devise update schedules that keep the summaries up to date by contacting databases only when needed, and then we evaluate the…
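As a rough illustration of the survival-analysis step, the sketch below fits a Cox proportional hazards model with the lifelines library (one available implementation, not necessarily the authors' tooling) and uses the predicted median time-to-change to decide when a summary should next be refreshed. The column names and toy data are assumptions; the real study works from the 52-week histories of 152 databases.

```python
import pandas as pd
from lifelines import CoxPHFitter  # one available Cox-regression implementation

# Hypothetical training data: one row per observed database.
# weeks_until_change = weeks until the content summary became stale,
# changed = 1 if a change was observed, 0 if the observation was censored.
df = pd.DataFrame({
    "weeks_until_change": [4, 12, 26, 8, 52, 30],
    "changed":            [1,  1,  0, 1,  0,  1],
    "db_size_log":        [3.2, 5.1, 6.0, 2.8, 6.4, 4.9],
    "past_change_rate":   [0.7, 0.3, 0.1, 0.4, 0.05, 0.5],
})

cph = CoxPHFitter()
cph.fit(df, duration_col="weeks_until_change", event_col="changed")

# Predict a median time-to-change for a new database and schedule the next
# summary refresh around that horizon.
new_db = pd.DataFrame({"db_size_log": [4.0], "past_change_rate": [0.6]})
print(cph.predict_median(new_db))
```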
Archiving the Web using Page Changes Pattern: A Case Study
In ACM/IEEE Joint Conference on Digital Libraries (JCDL ’11), 2011
"... A pattern is a model or a template used to summarize and describe the behavior (or the trend) of a data having generally some recurrent events. Patterns have received a considerable attention in recent years and were widely studied in the data mining field. Various pattern mining approaches have bee ..."
Abstract
-
Cited by 15 (3 self)
A pattern is a model or template used to summarize and describe the behavior (or trend) of data that generally exhibits some recurrent events. Patterns have received considerable attention in recent years and have been widely studied in the data mining field. Various pattern mining approaches have been proposed and used for different applications such as network monitoring, moving object tracking, financial or medical data analysis, and scientific data processing. In these different contexts, discovered patterns have proved useful for detecting anomalies, predicting data behavior (or trends), or, more generally, for simplifying data processing and improving system performance. However, to the best of our knowledge, patterns have never been used in the context of web archiving. Web archiving is the process of continuously collecting and preserving portions of the World Wide Web for future generations. In this paper, we show how patterns of page changes can be useful tools for efficiently archiving web sites. We first define our pattern model, which describes the changes of pages. Then, we present the strategy used to (i) extract the temporal evolution of page changes, (ii) discover patterns, and (iii) exploit them to improve web archives. We choose the archive of the French public TV channels France Télévisions as a case study in order to validate our approach. Our experimental evaluation based on real web pages shows the utility of patterns for improving archive quality and optimizing indexing and storage.
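To give a flavor of how a change pattern might drive an archiving schedule, here is a hedged sketch that reduces a page's change history to a weekday/hour frequency profile and picks the slots where changes were most often observed. This is a much simpler notion of "pattern" than the model defined in the paper, and the change history is hypothetical.

```python
from collections import Counter
from datetime import datetime

def hourly_change_pattern(change_times):
    """Summarize a page's change history as a (weekday, hour) frequency profile,
    a simplified stand-in for a page-change pattern."""
    return Counter((t.weekday(), t.hour) for t in change_times)

def best_capture_slots(change_times, slots=3):
    """Pick the weekday/hour slots in which this page changed most often;
    the archive crawler would schedule its visits there."""
    return [slot for slot, _ in hourly_change_pattern(change_times).most_common(slots)]

# Hypothetical change history (e.g. from repeated polling of a news page).
history = [datetime(2011, 3, d, h) for d in (7, 8, 9, 14, 15) for h in (8, 13, 8)]
print(best_capture_slots(history))
```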
Board Forum Crawling: A Web Crawling Method for Web Forum
Proc. IEEE/WIC/ACM Int’l Conf. Web Intelligence, 2006
"... We present a new method of Board Forum Crawling to crawl Web forum. This method exploits the organized characteristics of the Web forum sites and simulates human behavior of visiting Web Forums. The method starts crawling from the homepage, and then enters each board of the site, and then crawls all ..."
Abstract
-
Cited by 14 (0 self)
We present a new method, Board Forum Crawling, to crawl Web forums. This method exploits the organized characteristics of Web forum sites and simulates the human behavior of visiting Web forums. The method starts crawling from the homepage, then enters each board of the site, and then crawls all the posts of the site directly. Board Forum Crawling can crawl most of the meaningful information of a Web forum site efficiently and simply. We experimentally evaluated the effectiveness of the method on real Web forum sites by comparing it with traditional breadth-first crawling. We also used this method in a real project, in which 12,000 Web forum sites have been crawled successfully. These results show the effectiveness of our method.
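A minimal sketch of the board-first traversal: fetch the homepage, follow links that look like board pages, then collect links that look like post pages. The phpBB-style link patterns ("viewforum", "viewtopic") and the example URL are assumptions; a real forum would need its own patterns plus politeness controls.

```python
import requests
from bs4 import BeautifulSoup

def board_forum_crawl(homepage_url, board_pattern="viewforum", post_pattern="viewtopic"):
    """Board-first traversal: homepage -> board pages -> post pages.
    Link patterns are hypothetical and differ per forum engine."""
    def links(url, pattern):
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return {requests.compat.urljoin(url, a["href"])
                for a in soup.find_all("a", href=True) if pattern in a["href"]}

    post_urls = set()
    for board_url in links(homepage_url, board_pattern):
        post_urls |= links(board_url, post_pattern)
    return post_urls

# Usage (hypothetical forum URL):
# print(len(board_forum_crawl("http://forum.example.com/index.php")))
```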
Looking at both the present and the past to efficiently update replicas of web content
In Proc. of WIDM 2005
"... Since Web sites are autonomous and independently updated, applications that keep replicas of Web data, such as Web warehouses and search engines, must periodically poll the sites and check for changes. Since this is a resource-intensive task, in order to keep the copies up-to-date, it is important t ..."
Abstract
-
Cited by 6 (0 self)
Since Web sites are autonomous and independently updated, applications that keep replicas of Web data, such as Web warehouses and search engines, must periodically poll the sites and check for changes. Because this is a resource-intensive task, keeping the copies up-to-date requires efficient update schedules that adapt to the change rate of the pages and avoid visiting pages that have not been modified since the last visit. In this paper, we propose a new approach that learns to predict the change behavior of Web pages based both on the static features and the change history of pages, and refreshes the copies accordingly. Experiments using real-world data show that our technique leads to substantial performance improvements compared to previously proposed approaches.
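One way to realize "static features plus change history" is an off-the-shelf classifier that scores each page's probability of having changed, with the crawler polling the highest-scoring pages first. The sketch below uses scikit-learn logistic regression on hypothetical features and labels; the paper's actual feature set and learner may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical per-page features: [url_depth, is_news_section, changes_in_last_5_visits],
# mixing static features with recent change history, plus a label saying whether the
# page had changed by the next visit.
X = np.array([[1, 1, 5], [3, 0, 0], [2, 1, 3], [4, 0, 1], [1, 0, 4], [5, 0, 0]])
y = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression().fit(X, y)

# Refresh order for the next crawl cycle: poll the pages most likely to have changed.
candidates = np.array([[2, 1, 4], [5, 0, 0], [3, 0, 2]])
change_prob = model.predict_proba(candidates)[:, 1]
print(np.argsort(-change_prob))  # candidate indices, most likely changed first
```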