Results 1 - 10 of 34
Learning from labeled and unlabeled data on a directed graph
- in: Proceedings of the 22nd International Conference on Machine Learning (ICML
Cited by 75 (8 self)
We propose a general framework for learning from labeled and unlabeled data on a directed graph in which the structure of the graph including the directionality of the edges is considered. The time complexity of the algorithm derived from this framework is nearly linear due to recently developed numerical techniques. In the absence of labeled instances, this framework can be utilized as a spectral clustering method for directed graphs, which generalizes the spectral clustering approach for undirected graphs. We have applied our framework to real-world web classification problems and obtained encouraging results.
Information retrieval on the Web
- ACM Computing Surveys
, 2000
Cited by 58 (0 self)
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited
Effective Web Crawling
, 2004
Cited by 17 (2 self)
The key factors for the success of the World Wide Web are its large size and the lack of a centralized control over its contents. Both issues are also the most important source of problems for locating information. The Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Moreover, the distribution of quality is very skewed, and interesting pages are scarce in comparison with the rest of the content. Web crawling is the process used by search engines to collect pages from the Web. This thesis studies Web crawling at several different levels, ranging from the long-term goal of crawling important pages first, to the short-term goal of using the network connectivity efficiently, including implementation issues that are essential for crawling in practice. We start by designing a new model and architecture for a Web crawler that tightly integrates the crawler with the rest of the search engine, providing access to the metadata and links of the documents that can be used to guide the crawling process effectively. We implement this design in the WIRE project as an efficient Web crawler that provides an experimental framework for this research. In fact, we have used our crawler to
Hyperlink network analysis: a new method for the study of social structures on the web
- Connections
, 2003
Cited by 16 (0 self)
This paper identifies hyperlink network analysis (HNA) as a newly emerging methodology. It suggests that social (or communication) structures on the web may be analyzed based on the hyperlinks among websites. Hyperlink network analysis has advantages in describing emerging structures among social actors on the web. In order to examine what constitutes hyperlink network analysis, this paper reviews prior research on the topic. Further, it describes the data-gathering techniques for those interested in hyperlink network analysis.
Crawling the infinite Web: five levels are enough
- In Proceedings of the third Workshop on Web Graphs (WAW
, 2004
Cited by 16 (4 self)
A large number of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites which can create arbitrarily many pages. In this article, several probabilistic models for browsing “infinite” Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 “clicks” away from the start page, to reach 90% of the pages that users actually visit.
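The depth estimate described in this abstract can be illustrated with a toy calculation. The sketch below is not one of the paper's models: it assumes, purely for illustration, that a visitor at any level follows a link one level deeper with a fixed probability `q` and otherwise stops, so page views at depth d are proportional to q^d. The function name `depth_needed` and both parameters are hypothetical.

```python
def depth_needed(q, target):
    """Smallest depth k such that levels 0..k hold at least `target`
    of all page views, under the assumed geometric browsing model.

    q      -- assumed probability of going one level deeper (0 <= q < 1)
    target -- fraction of page views to cover (0 < target < 1)
    """
    if not 0 <= q < 1:
        raise ValueError("q must be in [0, 1)")
    if not 0 < target < 1:
        raise ValueError("target must be in (0, 1)")
    k = 0
    # Views at depth d are proportional to q**d, so the share of views
    # at depth <= k is 1 - q**(k + 1).
    while 1 - q ** (k + 1) < target:
        k += 1
    return k


# Example: if half of the visitors go one level deeper at each step,
# how deep must a crawler go to cover 90% of page views?
levels = depth_needed(0.5, 0.9)
```

Under this assumed model a higher continuation probability `q` pushes the required depth up quickly, which is why an empirical estimate of browsing behavior matters for choosing a crawl depth.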
Web Spam, Propaganda and Trust
, 2005
Cited by 14 (2 self)
Web spamming, the practice of introducing artificial text and links into web pages to affect the results of searches, has been recognized as a major problem for search engines. It is also a serious problem for users because they are not aware of it and they tend to confuse trusting the search engine with trusting the results of a search. In this paper, we first analyze the influence that web spam has on the evolution of the search engines and we identify the strong relationship of spamming methods to propagandistic techniques in society. Our analysis provides a foundation for understanding why spamming works and offers new insight on how to address it. In particular, it suggests that one could use anti-propagandistic techniques in the web to recognize spam. The second part of the paper demonstrates such a technique, called backwards propagation of distrust. In society, recognition of an untrustworthy message (in the opinion of a particular person or other social entity) is a reason for questioning the entities that recommend the message. Entities that are found to strongly support untrustworthy messages become untrustworthy themselves. So, social distrust is propagated backwards for a number of steps. Our algorithm simulates this social behavior on the web graph. In our algorithm, starting from an untrustworthy (according to the end user) site s, we examine its trust neighborhood, that is, the neighborhood of sites that link to s in a few steps. Evaluating the member sites of the neighborhood, we identify biconnected components (BCCs) with a high percentage of untrustworthy sites. BCCs are formed when there are multiple paths to reach s, thus indicating a concerted effort to promote s. This is not the case when starting from a trustworthy site. Our tool explores thousands o...
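The "trust neighborhood" step in this abstract (the sites that link to s within a few steps) can be sketched as a breadth-first search over reversed links. This is a minimal illustration, not the authors' tool: the function name `trust_neighborhood`, the dictionary representation of the web graph, and the example sites are all assumptions, and the biconnected-component analysis is omitted.

```python
from collections import deque


def trust_neighborhood(links, start, max_steps):
    """Return {site: steps} for every site that reaches `start`
    by following at most `max_steps` links.

    links -- assumed graph format: {source_site: [target_site, ...]}
    """
    # Reverse the graph: for each target, record the sites linking to it.
    backlinks = {}
    for source, targets in links.items():
        for target in targets:
            backlinks.setdefault(target, set()).add(source)

    # Breadth-first search backwards from the starting site.
    distance = {start: 0}
    queue = deque([start])
    while queue:
        site = queue.popleft()
        if distance[site] == max_steps:
            continue  # do not expand beyond the step limit
        for referrer in backlinks.get(site, ()):
            if referrer not in distance:
                distance[referrer] = distance[site] + 1
                queue.append(referrer)

    del distance[start]  # the neighborhood excludes the start site itself
    return distance


# Hypothetical example graph: "s" is the site flagged as untrustworthy.
links = {
    "a": ["s"],
    "b": ["a", "s"],
    "c": ["b"],
    "d": ["c"],
    "s": ["x"],
}
neighborhood = trust_neighborhood(links, "s", 2)
```

Here "d" is three links away from "s" and falls outside the two-step neighborhood, while "x" is only linked from "s" and is never reached when walking links backwards.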
A Stochastic Model for the Evolution of the Web
- Computer Networks
, 2002
Cited by 12 (3 self)
Recently several authors have proposed stochastic models of the growth of the Web graph that give rise to power-law distributions. These models are based on the notion of preferential attachment leading to the "rich get richer" phenomenon. However, these models fail to explain several distributions arising from empirical results, due to the fact that the predicted exponent is not consistent with the data. To address this problem, we extend the evolutionary model of the Web graph by including a non-preferential component, and we view the stochastic process in terms of an urn transfer model. By making this extension, we can now explain a wider variety of empirically discovered power-law distributions provided the exponent is greater than two. These include: the distribution of incoming links, the distribution of outgoing links, the distribution of pages in a Web site and the distribution of visitors to a Web site. A by-product of our results is a formal proof of the convergence of the standard stochastic model (first proposed by Simon).
Web Dynamics
- Software Focus
, 2001
Cited by 10 (1 self)
The global usage and continuing exponential growth of the World-Wide-Web pose a host of challenges to the research community. In particular, there is an urgent need to understand and manage the dynamics of the Web, in order to develop new techniques which will make the Web tractable. We provide an overview of recent statistics relating to the size of the Web graph and its growth. We then briefly review some of the key areas relating to Web dynamics with reference to the recent literature. Finally, we summarise the talks given in a recent workshop devoted to Web dynamics which was held in the beginning of January 2001 at the University of London. Keywords. Web dynamics, Web graph, information retrieval, collaborative filtering, Web navigation, Web site design, data-intensive Web applications, workflow management, e-commerce, mobile computation.
Web-crawling reliability
- Journal of the American Society for Information Science and Technology
, 2004
Cited by 10 (1 self)
In this article, I investigate the reliability, in the social science sense, of collecting informetric data about the World Wide Web by Web crawling. The investigation includes a critical examination of the practice of Web crawling and contrasts the results of content crawling with the results of link crawling. It is shown that Web crawling by search engines is intentionally biased and selective. I also report the results of a large-scale experimental simulation of Web crawling that illustrates the effects of different crawling policies on data collection. It is concluded that the reliability of Web crawling as a data collection technique is improved by fuller reporting of relevant crawling policies.
Interpreting social science link analysis research: A theoretical framework
- Journal of the American Society for Information Science and Technology
, 2006
Cited by 7 (1 self)
Link analysis in various forms is now an established technique in many different subjects, reflecting the perceived importance of links and that of the web. A critical but very difficult issue is how to interpret the results of social science link analyses. It is argued that the dynamic nature of the web, its lack of quality control and the online proliferation of copying and imitation mean that methodologies operating within a highly positivist, quantitative framework are ineffective. Conversely, the sheer variety of the web makes qualitative methodologies and pure reason very problematic to apply to large-scale studies. Methodology triangulation is consequently advocated, in combination with a warning that the web is incapable of giving definitive answers to large-scale link analysis research questions concerning social factors underlying link creation. Finally, it is claimed that whilst theoretical frameworks with which to guide research are appropriate, a Theory of Link Analysis is not possible.

