Results 1 - 10
of
15
Topical web crawlers: Evaluating adaptive algorithms
- ACM Transactions on Internet Technology
, 2004
"... Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficien ..."
Abstract
-
Cited by 35 (11 self)
- Add to MetaCart
Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. Such a framework is employed here to evaluate different algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We find that the best performance is achieved by a novel combination of explorative and exploitative bias, and introduce an evolutionary crawler that surpasses the performance of the best non-adaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.
A General Evaluation Framework for Topical Crawlers
- INFORMATION RETRIEVAL
, 2005
"... Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks ..."
Abstract
-
Cited by 28 (10 self)
- Add to MetaCart
Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through welldefined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of di#erent nature and di#culty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions including generalized notions of precision, recall, and e#ciency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing di#erent relevance criteria. Finally we introduce a set of topic characterizations to analyze the variability in crawling e#ectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies in the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found that the proposed framework is e#ective at evaluating, comparing, di#erentiating and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
Generalizing pagerank: Damping functions for linkbased ranking algorithms
- In Proceedings of ACM SIGIR
"... This paper introduces a family of link-based ranking algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with distance, so a direct link implies more endorsement than a link through a long path. PageRank is the most widely known ran ..."
Abstract
-
Cited by 21 (8 self)
- Add to MetaCart
This paper introduces a family of link-based ranking algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with distance, so a direct link implies more endorsement than a link through a long path. PageRank is the most widely known ranking function of this family. The main objective of this paper is to determine whether this family of ranking techniques has some interest per se, and how different choices for the damping function impact on rank quality and on convergence speed. Even though our results suggest that Page-Rank can be approximated with other simpler forms of rankings that may be computed more efficiently, our focus is of more speculative nature, in that it aims at separating the kernel of PageRank, that is, link-based importance propagation, from the way propagation decays over paths. We focus on three damping functions, having linear, exponential, and hyperbolic decay on the lengths of the paths. The exponential decay corresponds to PageRank, and the other functions are new. Our presentation includes algorithms, analysis, comparisons and experiments that study their behavior under different parameters in real Web graph data. Among other results, we show how to calculate a linear approximation that induces a page ordering that is almost identical to Page-Rank’s using a fixed small number of iterations; comparisons were performed using Kendall’s τ on large domain datasets.
Adding Geographic Scopes to Web Resources
- CEUS - Computers, Environment and Urban Systems
, 2006
"... Many Web pages are rich in geographic information and primarily relevant to geographically limited communities. However, existing IR systems only recently began to offer local services and largely ignore geo-spatial information. This paper presents our work on automatically identifying the geographi ..."
Abstract
-
Cited by 16 (6 self)
- Add to MetaCart
Many Web pages are rich in geographic information and primarily relevant to geographically limited communities. However, existing IR systems only recently began to offer local services and largely ignore geo-spatial information. This paper presents our work on automatically identifying the geographical scope of Web documents, which provides the means to develop retrieval tools that take the geographical context into consideration. Our approach makes extensive use of an ontology of geographical concepts, and includes a system architecture for extracting geographic information from large collections of Web documents. The proposed method involves recognising geographical references over the documents and assigning geographical scopes through a graph ranking algorithm. Initial evaluation results are encouraging, indicating the viability of this approach.
Folks in Folksonomies: Social Link Prediction from Shared Metadata
"... Web 2.0 applications have attracted a considerable amount of attention because their open-ended nature allows users to create lightweight semantic scaffolding to organize and share content. To date, the interplay of the social and semantic components of social media has been only partially explored. ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
Web 2.0 applications have attracted a considerable amount of attention because their open-ended nature allows users to create lightweight semantic scaffolding to organize and share content. To date, the interplay of the social and semantic components of social media has been only partially explored. Here we focus on Flickr and Last.fm, two social media systems in which we can relate the tagging activity of the users with an explicit representation of their social network. We show that a substantial level of local lexical and topical alignment is observable among users who lie close to each other in the social network. We introduce a null model that preserves user activity while removing local correlations, allowing us to disentangle the actual local alignment between users from statistical effects due to the assortative mixing of user activity and centrality in the social network. This analysis suggests that users with
A spatial web graph model with local influence regions
- Internet Mathematics
"... Abstract. We present a new stochastic model for complex networks, based on a spatial embedding of the nodes, called the Spatial Preferred Attachment (SPA) model. In the SPA model, nodes have influence regions of varying size, and new nodes may only link to a node if they fall within its influence re ..."
Abstract
-
Cited by 5 (4 self)
- Add to MetaCart
Abstract. We present a new stochastic model for complex networks, based on a spatial embedding of the nodes, called the Spatial Preferred Attachment (SPA) model. In the SPA model, nodes have influence regions of varying size, and new nodes may only link to a node if they fall within its influence region. The spatial embedding of the nodes models the background knowledge or identity of the node, which will influence its link environment. In our model, nodes can determine their link environment based only on local knowledge of the network. We prove that our model gives a power law in-degree distribution, with exponent in [2, ∞) depending on the parameters, and with concentration for a wide range of in-degree values. We show that the model allows for edges that span a large distance in the underlying space, modelling a feature often observed in real-world complex networks. 1.
On Compressing the Textual Web
"... Nowadays we know how to effectively compress most basic components of any modern search engine, such as, the graphs arising from the Web structure and/or its usage, the posting lists, and the dictionary of terms. But we are not aware of any study which has deeply addressed the issue of compressing t ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
Nowadays we know how to effectively compress most basic components of any modern search engine, such as, the graphs arising from the Web structure and/or its usage, the posting lists, and the dictionary of terms. But we are not aware of any study which has deeply addressed the issue of compressing the raw Web pages. Many Web applications use simple compression algorithms — e.g. gzip, or word-based Move-to-Front or Huffman coders — and conclude that, even compressed, raw data take more space than Inverted Lists. In this paper we investigate two typical scenarios of use of data compression for large Web collections. In the first scenario, the compressed pages are stored on disk and we only need to support the fast scanning of large parts of the compressed collection (such as for map-reduce paradigms). In the second scenario, we consider the fast access to individual pages of the compressed collection that is distributed among the RAMs of many PCs (such as for search engines and miners). For the first scenario, we provide a thorough experimental comparison among state-of-the-art compressors thus indicating pros and cons of the available solutions. For the second scenario, we compare compressed-storage solutions with the new technology of compressed self-indexes [45]. Our results show that Web pages are more compressible than expected and, consequently, that some common beliefs in this area should be reconsidered. Our results are novel for the large spectrum of tested approaches and the size of datasets, and provide a threefold contribution: a nontrivial baseline for designing new compressed-storage solutions, a guide for software developers faced with Web-page storage, and a natural complement to the recent figures on InvertedList-compression achieved by [57, 58].
Semi-Supervised Evaluation of Search Engines via Semantic Mapping
, 2003
"... Content and link information is used by virtually all search engines to crawl, index, retrieve, and rank Web pages. The correlations between similarity measures based on these cues and on semantic associations between pages is crucial in determining the performance of any search tool. A great deal o ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Content and link information is used by virtually all search engines to crawl, index, retrieve, and rank Web pages. The correlations between similarity measures based on these cues and on semantic associations between pages is crucial in determining the performance of any search tool. A great deal of research is under way to understand how to effectively extract semantic information from Web pages by mining their text and links. A brute force approach has been used to build semantic maps, that given coordinates based on text and link similarity measures between two pages estimate how the two pages are related in meaning. In this paper we describe a semi-supervised methodology to evaluate the performance of search engines. The relevance of entire hit lists can be estimated by identifying a single relevant page for a given query. We illustrate the evaluation methodology by graphing precision-recall plots for three commercial search engines based on TREC queries. Preliminary results are in line with our intuition of how the different search engines use content and link information to rank hits.
A Comparative Evaluation of Different Link Types on Enhancing Document Clustering
"... With a growing number of works utilizing link information in enhancing document clustering, it becomes necessary to make a comparative evaluation of the impacts of different link types on document clustering. Various types of links between text documents, including explicit links such as citation li ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
With a growing number of works utilizing link information in enhancing document clustering, it becomes necessary to make a comparative evaluation of the impacts of different link types on document clustering. Various types of links between text documents, including explicit links such as citation links and hyperlinks, implicit links such as co-authorship links, and pseudo links such as content similarity links, convey topic similarity or topic transferring patterns, which is very useful for document clustering. In this study, we adopt a Relaxation Labeling (RL)based clustering algorithm, which employs both content and linkage information, to evaluate the effectiveness of the aforementioned types of links for document clustering on eight datasets. The experimental results show that linkage is quite effective in improving content-based document clustering. Furthermore, a series of interesting findings regarding the impacts of different link types on document clustering are discovered through our experiments.
Finding semantic needles in haystacks of Web text and links
"... Content and links are used to search, rank, cluster and classify Web pages. Here I analyze and visualize similarity relationships in massive Web datasets to identify how content and link analysis should be integrated for relevance approximation. Human-generated metadata from Web directories is us ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Content and links are used to search, rank, cluster and classify Web pages. Here I analyze and visualize similarity relationships in massive Web datasets to identify how content and link analysis should be integrated for relevance approximation. Human-generated metadata from Web directories is used to estimate semantic similarity. Highly heterogeneous topical maps point to a critical dependence on search context.

