Deeper Inside PageRank
- Internet Mathematics, 2004
- Cited by 107 (4 self)
This paper serves as a companion or extension to the “Inside PageRank” paper by Bianchini et al. [Bianchini et al. 03]. It is a comprehensive survey of all issues associated with PageRank, covering the basic PageRank model, available and recommended solution methods, storage issues, existence, uniqueness, and convergence properties, possible alterations to the basic model, suggested alternatives to the traditional solution methods, sensitivity and conditioning, and finally the updating problem. We introduce a few new results, provide an extensive reference list, and speculate about exciting areas of future research.
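As a rough illustration of the basic PageRank model and the power-method solution that the survey covers, the sketch below runs power iteration on a toy graph. The function name, the damping value 0.85, the uniform treatment of dangling pages, and the toy graph are all assumptions made for this example; they are not taken from the paper.

```python
def pagerank(links, alpha=0.85, tol=1e-10, max_iter=200):
    """Power iteration over a dict {page: [pages it links to]} (illustrative only)."""
    pages = sorted(set(links) | {q for outs in links.values() for q in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Assumption: rank held by dangling pages (no out-links) is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new = {p: (1.0 - alpha) / n + alpha * dangling / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            for q in outs:
                # Each page splits its rank evenly among its out-links.
                new[q] += alpha * rank[p] / len(outs)
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < tol:
            break
    return rank


# Hypothetical four-page graph used only for demonstration.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_graph)
```

In this toy graph, page "d" has no in-links, so its score reduces to the teleportation share (1 - alpha) / n, while "c" collects rank from three pages.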
Blocking Blog Spam with Language Model Disagreement
- In Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), 2005
- Cited by 54 (1 self)
We present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. In contrast to other link spam filtering approaches, our method requires no training, no hard-coded rule sets, and no knowledge of complete-web connectivity. Preliminary experiments with identification of typical blog spam show promising results.
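One way to picture "language model disagreement" is to build a unigram model for each text and measure how far apart the two distributions are. The sketch below uses Kullback-Leibler divergence with additive smoothing; the smoothing scheme, the function names, and the sample texts are assumptions for illustration and may differ from what the paper actually does. It also compares only a post and a comment, leaving out the linked pages.

```python
import math
from collections import Counter


def unigram_model(text, vocab, mu=0.1):
    """Additively smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + mu * len(vocab)
    return {w: (counts[w] + mu) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same keys."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def disagreement(post, comment):
    """Higher values mean the comment's language diverges more from the post's."""
    vocab = set(post.lower().split()) | set(comment.lower().split())
    return kl_divergence(unigram_model(post, vocab), unigram_model(comment, vocab))


# Hypothetical texts used only for demonstration.
post = "pagerank ranks web pages using the link structure of the web"
on_topic = "the link structure of the web helps pagerank rank pages"
spam = "cheap pills online casino buy now best price"

score_on_topic = disagreement(post, on_topic)
score_spam = disagreement(post, spam)
```

A comment that shares vocabulary with the post yields a lower divergence than one with unrelated vocabulary, which is the signal a filter of this kind would act on.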
Finding and Ranking Knowledge on the Semantic Web
- In Proceedings of the 4th International Semantic Web Conference, 2005
- Cited by 40 (1 self)
Swoogle helps software agents and knowledge engineers find Semantic Web knowledge encoded in RDF and OWL documents on the Web. Navigating such a Semantic Web on the Web is difficult due to the paucity of explicit hyperlinks beyond the namespaces in URIrefs and the few inter-document links like rdfs:seeAlso and owl:imports. To address this issue, this paper proposes a novel Semantic Web navigation model providing additional navigation paths through Swoogle’s search services such as the Ontology Dictionary. Using this model, we have developed algorithms for ranking the importance of Semantic Web objects at three levels of granularity: documents, terms and RDF graphs. Experiments show that Swoogle outperforms conventional web search engines and other ontology libraries in finding more ontologies, ranking their importance, and thus promoting the use and emergence of consensus ontologies.
Generalizing PageRank: Damping functions for link-based ranking algorithms
- In Proceedings of ACM SIGIR
- Cited by 21 (8 self)
This paper introduces a family of link-based ranking algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with distance, so a direct link implies more endorsement than a link through a long path. PageRank is the most widely known ranking function of this family. The main objective of this paper is to determine whether this family of ranking techniques has some interest per se, and how different choices for the damping function impact rank quality and convergence speed. Even though our results suggest that PageRank can be approximated with other simpler forms of rankings that may be computed more efficiently, our focus is of a more speculative nature, in that it aims at separating the kernel of PageRank, that is, link-based importance propagation, from the way propagation decays over paths. We focus on three damping functions, having linear, exponential, and hyperbolic decay on the lengths of the paths. The exponential decay corresponds to PageRank, and the other functions are new. Our presentation includes algorithms, analysis, comparisons and experiments that study their behavior under different parameters in real Web graph data. Among other results, we show how to calculate a linear approximation that induces a page ordering that is almost identical to PageRank’s using a fixed small number of iterations; comparisons were performed using Kendall’s τ on large domain datasets.
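The idea of a damping function can be sketched as a weighted sum over path lengths: the score of a page is the sum, over t, of damping(t) times the probability of reaching that page in t steps from a uniformly chosen start. The code below is a minimal illustration under stated assumptions: every page has at least one out-link, paths are truncated at max_len, and the particular linear form (weights falling to zero at a cutoff and normalised to sum to 1) is one plausible choice rather than necessarily the paper's exact definition.

```python
def functional_rank(links, damping, max_len):
    """Sum of damping(t) * (mass arriving via paths of length t), for t = 0..max_len.

    Assumes every page appears as a key and has at least one out-link.
    """
    pages = sorted(links)
    n = len(pages)
    mass = {p: 1.0 / n for p in pages}  # paths of length 0 from a uniform start
    rank = {p: 0.0 for p in pages}
    for t in range(max_len + 1):
        for p in pages:
            rank[p] += damping(t) * mass[p]
        nxt = {p: 0.0 for p in pages}
        for p in pages:
            for q in links[p]:
                nxt[q] += mass[p] / len(links[p])
        mass = nxt  # mass now describes paths one step longer
    return rank


def exponential(alpha):
    """Exponential decay; with long enough paths this reproduces PageRank."""
    return lambda t: (1.0 - alpha) * alpha ** t


def linear(cutoff):
    """Illustrative linear decay: weight falls to zero at `cutoff`, weights sum to 1."""
    norm = cutoff * (cutoff + 1) / 2.0
    return lambda t: max(cutoff - t, 0) / norm


# Hypothetical four-page graph used only for demonstration.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
exp_rank = functional_rank(toy_graph, exponential(0.85), max_len=200)
lin_rank = functional_rank(toy_graph, linear(10), max_len=10)
```

Separating the propagation loop from the damping callable mirrors the abstract's distinction between link-based importance propagation and the way that propagation decays over paths.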
Effective Web Crawling
- 2004
- Cited by 17 (2 self)
The key factors for the success of the World Wide Web are its large size and the lack of centralized control over its contents. Both issues are also the most important sources of problems for locating information. The Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Moreover, the distribution of quality is very skewed, and interesting pages are scarce in comparison with the rest of the content. Web crawling is the process used by search engines to collect pages from the Web. This thesis studies Web crawling at several different levels, ranging from the long-term goal of crawling important pages first, to the short-term goal of using the network connectivity efficiently, including implementation issues that are essential for crawling in practice. We start by designing a new model and architecture for a Web crawler that tightly integrates the crawler with the rest of the search engine, providing access to the metadata and links of the documents that can be used to guide the crawling process effectively. We implement this design in the WIRE project as an efficient Web crawler that provides an experimental framework for this research. In fact, we have used our crawler to
A Geographic Knowledge Base for Semantic Web Applications
- In Proceedings of SBBD-05, the 20th Brazilian Symposium on Databases, 2005
- Cited by 10 (7 self)
This paper introduces GKB, a repository based on a domain-independent meta-model for integrating geographic knowledge collected from multiple sources. We present the architecture, the repository design and the data cleaning and knowledge integration processes. We also describe the rules developed to add new knowledge to GKB. GKB includes tools for generating ontologies, which are being used by multiple semantic web applications. To illustrate how it is being used, we present some of the applications that interact with the repository or load ontologies created with GKB.
Hierarchical Link Analysis for Ranking Web Data
- Cited by 4 (0 self)
On the Web of Data, entities are often interconnected in a way similar to web documents. Previous works have shown how PageRank can be adapted to achieve entity ranking. In this paper, we propose to exploit locality on the Web of Data by taking a layered approach, similar to hierarchical PageRank approaches. We provide justifications for a two-layer model of the Web of Data, and introduce DING (Dataset Ranking), a novel ranking methodology based on this two-layer model. DING uses links between datasets to compute dataset ranks and combines the resulting values with semantic-dependent entity ranking strategies. We compare the effectiveness of the approach against other link-based algorithms on large datasets coming from the Sindice search engine. The evaluation, which includes a user study, indicates that the resulting rank is better than the other approaches. Also, the resulting algorithm is shown to have desirable computational properties such as parallelisation.
Chapter 4 Crawler Implementation
ur programs are run in cycles during the crawler's execution.
4.2.1 Manager: long-term scheduling
The "manager" program generates the list of K URLs to be downloaded in this cycle (we used K = 100,000). The procedure for generating this list is outlined below.
Figure 4.1: Operation of the manager program with K = 2. The current value of a page is IntrinsicQuality(p) × P(Freshness(p) = 1) × RepresentationalQuality(p), where RepresentationalQuality(p) equals 1 if the page has been visited and 0 if not. The value of the downloaded page is IntrinsicQuality(p) × 1 × 1. In the figure, the manager should select pages P1 and P3 for this cycle.
1. Filter out pages that were downloaded too recently. In the configuration file, a criterion for the maximum frequency of re-visits to pages can be stated (e.g., no more than once a day or once a week). This criterion is used to avoid accessing only a few elements of the collection, and is based on the observations by Cho and Garcia-Molina [CGM03].
2. Estimate the
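The value formula and the recency filter described in this excerpt can be sketched as follows. This is a guess at the overall shape only: the excerpt is truncated before the remaining steps, so ranking eligible pages by the gain in value from downloading them is an assumption, as are the class, the field names, and the sample numbers.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    intrinsic_quality: float  # estimated quality of the page
    p_fresh: float            # estimated P(Freshness(p) = 1) for the stored copy
    visited: bool             # whether a copy has been downloaded before
    age_days: float           # days since the last download (ignored if unvisited)


def current_value(page):
    """IntrinsicQuality(p) x P(Freshness(p) = 1) x RepresentationalQuality(p)."""
    representational_quality = 1.0 if page.visited else 0.0
    return page.intrinsic_quality * page.p_fresh * representational_quality


def select_batch(pages, k, min_revisit_days=1.0):
    """Drop pages downloaded too recently, then keep the k largest value gains."""
    eligible = [p for p in pages if not p.visited or p.age_days >= min_revisit_days]

    def gain(p):
        # Assumption: after a download the value becomes IntrinsicQuality(p) x 1 x 1.
        return p.intrinsic_quality - current_value(p)

    return sorted(eligible, key=gain, reverse=True)[:k]


# Hypothetical pages; the numbers are invented for illustration.
pages = [
    Page("P1", 0.90, 0.0, False, 0.0),
    Page("P2", 0.50, 0.9, True, 3.0),
    Page("P3", 0.80, 0.2, True, 5.0),
    Page("P4", 0.95, 0.1, True, 0.2),
]
batch = [p.url for p in select_batch(pages, k=2)]
```

With these invented numbers, P4 is excluded by the revisit limit even though its copy is probably stale, and the batch consists of the unvisited P1 and the stale P3.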
WIRE: an Open Source Web Information Retrieval Environment
- Workshop on Open Source Web Information Retrieval (OSWIR), 2005
In this paper, we describe the WIRE (Web Information Retrieval Environment) project and focus on some details of its crawler component. The WIRE crawler is a scalable, highly configurable, high performance, open-source Web crawler which we have used to study the characteristics of large Web collections.
The Choice of a Damping Function for Propagating Importance in Link-Based Ranking
This paper studies a family of link-based algorithms that propagate page importance through links. In these algorithms there is a damping function that decreases with the distance, so a direct link implies more endorsement than a link through a long path. PageRank is the most widely known ranking function of this family. We focus on three damping functions, having linear, exponential, and hyperbolic decay on the lengths of the paths. The exponential decay corresponds to PageRank, and the other functions are new. Our analysis includes a comparison among them and experiments for studying their behavior under different parameters.

