Results 1  10
of
52
Using PageRank to Characterize Web Structure
"... Recent work on modeling the web graph has dwelt on capturing the degree distributions observed on the web. Pointing out that this represents a heavy reliance on “local” properties of the web graph, we study the distribution of PageRank values on the web. Our measurements suggest that PageRank value ..."
Abstract

Cited by 94 (0 self)
 Add to MetaCart
Recent work on modeling the web graph has dwelt on capturing the degree distributions observed on the web. Pointing out that this represents a heavy reliance on “local” properties of the web graph, we study the distribution of PageRank values on the web. Our measurements suggest that PageRank values on the web follow a power law. We then develop generative models for the web graph that explain this observation and moreover remain faithful to previously studied degree distributions. We analyze these models and compare the analysis to both snapshots from the web and to graphs generated by simulations on the new models. To our knowledge this represents the first modeling of the web that goes beyond fitting degree distributions on the web.
A survey on pagerank computing
 Internet Mathematics
, 2005
"... Abstract. This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages independent of their textual content, solely based on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. T ..."
Abstract

Cited by 64 (0 self)
 Add to MetaCart
Abstract. This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages independent of their textual content, solely based on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. This defines the importance of the model and the data structures that underly PageRank processing. Computing even a single PageRank is a difficult computational task. Computing many PageRanks is a much more complex challenge. Recently, significant effort has been invested in building sets of personalized PageRank vectors. PageRank is also used in many diverse applications other than ranking. We are interested in the theoretical foundations of the PageRank formulation, in the acceleration of PageRank computing, in the effects of particular aspects of web graph structure on the optimal organization of computations, and in PageRank stability. We also review alternative models that lead to authority indices similar to PageRank and the role of such indices in applications other than web search. We also discuss linkbased search personalization and outline some aspects of PageRank infrastructure from associated measures of convergence to link preprocessing. 1.
On the Eigenvalue Power Law
, 2002
"... We show that the largest eigenvalues of graphs whose highest degrees are Zipflike distributed with slope are distributed according to a power law with slope =2. This follows as a direct and almost certain corollary of the degree power law. Our result has implications for the singular value deco ..."
Abstract

Cited by 50 (0 self)
 Add to MetaCart
We show that the largest eigenvalues of graphs whose highest degrees are Zipflike distributed with slope are distributed according to a power law with slope =2. This follows as a direct and almost certain corollary of the degree power law. Our result has implications for the singular value decomposition method in information retrieval.
The connectivity sonar: detecting site functionality by structural patterns
 In Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia
, 2003
"... Web sites today serve many different functions, such as corporate sites, search engines, estores, and so forth. As sites are created for different purposes, their structure and connectivity characteristics vary. However, this research argues that sites of similar role exhibit similar structural pat ..."
Abstract

Cited by 49 (1 self)
 Add to MetaCart
Web sites today serve many different functions, such as corporate sites, search engines, estores, and so forth. As sites are created for different purposes, their structure and connectivity characteristics vary. However, this research argues that sites of similar role exhibit similar structural patterns, as the functionality of a site naturally induces a typical hyperlinked structure and typical connectivity patterns to and from the rest of the Web. Thus, the functionality of Web sites is reflected in a set of structural and connectivitybased features that form a typical signature. In this paper, we automatically categorize sites into eight distinct functional classes, and highlight several searchengine related applications that could make immediate use of such technology. We purposely limit our categorization algorithms by tapping connectivity and structural data alone, making no use of any content analysis whatsoever. When applying two classification algorithms to a set of 202 sites of the eight defined functional categories, the algorithms correctly classified between 54.5 % and 59 % of the sites. On some categories, the precision of the classification exceeded 85%. An additional result of this work indicates that the structural signature can be used to detect spam rings and mirror sites, by clustering sites with almost identical signatures.
Kronecker Graphs: An Approach to Modeling Networks
 JOURNAL OF MACHINE LEARNING RESEARCH 11 (2010) 9851042
, 2010
"... How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in and outdegree distribution, heavy tails for the ei ..."
Abstract

Cited by 48 (2 self)
 Add to MetaCart
How can we generate realistic networks? In addition, how can we do so with a mathematically tractable model that allows for rigorous analysis of network properties? Real networks exhibit a long list of surprising properties: Heavy tails for the in and outdegree distribution, heavy tails for the eigenvalues and eigenvectors, small diameters, and densification and shrinking diameters over time. Current network models and generators either fail to match several of the above properties, are complicated to analyze mathematically, or both. Here we propose a generative model for networks that is both mathematically tractable and can generate networks that have all the above mentioned structural properties. Our main idea here is to use a nonstandard matrix operation, the Kronecker product, to generate graphs which we refer to as “Kronecker graphs”. First, we show that Kronecker graphs naturally obey common network properties. In fact, we rigorously prove that they do so. We also provide empirical evidence showing that Kronecker graphs can effectively model the structure of real networks. We then present KRONFIT, a fast and scalable algorithm for fitting the Kronecker graph generation model to large real networks. A naive approach to fitting would take superexponential
Large scale properties of the webgraph
 Eur. Phys. J. B
, 2004
"... In this paper we present an experimental study of the properties of web graphs. We study a large crawl from 2001 of 200M pages and about 1.4 billion edges made available by the WebBase project at Stanford [17]. We report our experimental findings on the topological properties of such graphs, such as ..."
Abstract

Cited by 29 (3 self)
 Add to MetaCart
In this paper we present an experimental study of the properties of web graphs. We study a large crawl from 2001 of 200M pages and about 1.4 billion edges made available by the WebBase project at Stanford [17]. We report our experimental findings on the topological properties of such graphs, such as the number of bipartite cores and the distribution of degree, PageRank values and strongly connected components.
Characterization of national Web domains
 ACM Transactions on Internet Technology
, 2005
"... During the last few years, several studies on the characterization of the public Web space of various national domains have been published. The pages of a country are an interesting set for studying the characteristics of the Web, because at the same time these are diverse (as they are written by se ..."
Abstract

Cited by 28 (9 self)
 Add to MetaCart
During the last few years, several studies on the characterization of the public Web space of various national domains have been published. The pages of a country are an interesting set for studying the characteristics of the Web, because at the same time these are diverse (as they are written by several authors) and yet rather similar (as they share a common geographical, historical and cultural context). This paper discusses the methodologies used for presenting the results of Web characterization studies, including the granularity at which different aspects are presented, and a separation of concerns between contents, links, and technologies. Based on this, we present a sidebyside comparison of the results of 12 Web characterization studies comprising over 120 million pages from 24 countries. The comparison unveils similarities and differences between the collections, and sheds light on how certain results of a single Web characterization study on a sample may be valid in the context of the full Web.
Effective Web Crawling
, 2004
"... The key factors for the success of the World Wide Web are its large size and the lack of a centralized control over its contents. Both issues are also the most important source of problems for locating information. The Web is a context in which traditional Information Retrieval methods are challenge ..."
Abstract

Cited by 23 (2 self)
 Add to MetaCart
The key factors for the success of the World Wide Web are its large size and the lack of a centralized control over its contents. Both issues are also the most important source of problems for locating information. The Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Moreover, the distribution of quality is very skewed, and interesting pages are scarce in comparison with the rest of the content. Web crawling is the process used by search engines to collect pages from the Web. This thesis studies Web crawling at several different levels, ranging from the longterm goal of crawling important pages first, to the shortterm goal of using the network connectivity efficiently, including implementation issues that are essential for crawling in practice. We start by designing a new model and architecture for a Web crawler that tightly integrates the crawler with the rest of the search engine, providing access to the metadata and links of the documents that can be used to guide the crawling process effectively. We implement this design in the WIRE project as an efficient Web crawler that provides an experimental framework for this research. In fact, we have used our crawler to
Dynamics of Large Networks
, 2008
"... A basic premise behind the study of large networks is that interaction leads to complex collective behavior. In our work we found very interesting and counterintuitive patterns for time evolving networks, which change some of the basic assumptions that were made in the past. We then develop models ..."
Abstract

Cited by 18 (0 self)
 Add to MetaCart
A basic premise behind the study of large networks is that interaction leads to complex collective behavior. In our work we found very interesting and counterintuitive patterns for time evolving networks, which change some of the basic assumptions that were made in the past. We then develop models that explain processes which govern the network evolution, fit such models to real networks, and use them to generate realistic graphs or give formal explanations about their properties. In addition, our work has a wide range of applications: it can help us spot anomalous graphs and outliers, forecast future graph structure and run simulations of network evolution. Another important aspect of our research is the study of “local ” patterns and structures of propagation in networks. We aim to identify building blocks of the networks and find the patterns of influence that these blocks have on information or virus propagation over the network. Our recent work included the study of the spread of influence in a large persontoperson
Ranking web sites with real user traffic
 INTERNATIONAL CONFERENCE ON WEB SEARCH AND WEB DATA MINING
, 2008
"... We analyze the trafficweighted Web host graph obtained from a large sample of real Web users over about seven months. A number of interesting structural properties are revealed by this complex dynamic network, some in line with the wellstudied boolean link host graph and others pointing to importa ..."
Abstract

Cited by 16 (1 self)
 Add to MetaCart
We analyze the trafficweighted Web host graph obtained from a large sample of real Web users over about seven months. A number of interesting structural properties are revealed by this complex dynamic network, some in line with the wellstudied boolean link host graph and others pointing to important differences. We find that while search is directly involved in a surprisingly small fraction of user clicks, it leads to a much larger fraction of all sites visited. The temporal traffic patterns display strong regularities, with a large portion of future requests being statistically predictable by past ones. Given the importance of topological measures such as PageRank in modeling user navigation, as well as their role in ranking sites for Web search, we use the traffic data to validate the PageRank random surfing model. The ranking obtained by the actual frequency with which a site is visited by users differs significantly from that approximated by the uniform surfing/teleportation behavior modeled by PageRank, especially for the most important sites. To interpret this finding, we consider each of the fundamental assumptions underlying PageRank and show how each is violated by actual user behavior.