Authoritative Sources in a Hyperlinked Environment
- JOURNAL OF THE ACM
, 1999
Cited by 3632 (12 self)
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of “authoritative” information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of “hub pages” that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
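The abstract does not spell out the update rule, so the following is only a minimal sketch of one common reading of the hub/authority relationship: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of the pages it links to, repeated with normalization. The function name `hits`, the `links` dictionary layout, and the iteration count are assumptions made for illustration.

```python
def hits(links, iterations=50):
    """links maps each page to the pages it points to (hypothetical layout)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: add up the hub scores of every page linking in.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub update: add up the authority scores of every page linked to.
        hub = {p: sum(auth[q] for q in links.get(p, ())) for p in pages}
        # Rescale both vectors to unit Euclidean length so scores stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


# Toy graph (assumed data): two hub-like pages pointing at two targets.
toy_links = {"h1": ["a", "b"], "h2": ["a"], "a": [], "b": []}
toy_hub, toy_auth = hits(toy_links)
```

Repeating the two updates is a power iteration, which is where the connection to eigenvectors mentioned in the abstract comes from.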
SimRank: A Measure of Structural-Context Similarity
- In KDD
, 2002
Cited by 387 (3 self)
The problem of measuring "similarity" of objects arises in many applications, and many domain-specific measures have been developed, e.g., matching text across documents or computing overlap among item-sets. We propose a complementary approach, applicable in any domain with object-to-object relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects. Effectively, we compute a measure that says "two objects are similar if they are related to similar objects." This general similarity measure, called SimRank, is based on a simple and intuitive graph-theoretic model. For a given domain, SimRank can be combined with other domain-specific similarity measures. We suggest techniques for efficient computation of SimRank scores, and provide experimental results on two application domains showing the computational feasibility and effectiveness of our approach.
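The recursive intuition "two objects are similar if they are related to similar objects" can be sketched as a fixed-point iteration. The abstract gives no formula, so the version below uses the commonly cited basic form: an object has similarity 1 with itself, and two distinct objects get a decayed average of the similarities between their in-neighbors. The `decay` value, the `in_neighbors` layout, and the node names are assumptions.

```python
def simrank(in_neighbors, decay=0.8, iterations=10):
    """in_neighbors maps every node to the list of nodes pointing at it."""
    nodes = list(in_neighbors)
    sim = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        new_sim = {}
        for a in nodes:
            for b in nodes:
                if a == b:
                    new_sim[(a, b)] = 1.0  # every object is fully similar to itself
                    continue
                ia, ib = in_neighbors[a], in_neighbors[b]
                if not ia or not ib:
                    new_sim[(a, b)] = 0.0  # no incoming context to compare
                    continue
                # Average similarity over all pairs of in-neighbors, then decay.
                total = sum(sim[(i, j)] for i in ia for j in ib)
                new_sim[(a, b)] = decay * total / (len(ia) * len(ib))
        sim = new_sim
    return sim


# Toy graph (assumed data): two pages referenced by the same parent.
toy_in = {"univ": [], "profA": ["univ"], "profB": ["univ"], "other": []}
toy_sim = simrank(toy_in)
```

This all-pairs loop is quadratic in the number of nodes and is meant only to show the recurrence, not the efficient computation techniques the abstract refers to.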
Efficient Identification of Web Communities
- IN SIXTH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING
, 2000
Cited by 293 (13 self)
We define a community on the web as a set of sites that have more links (in either direction) to members of the community than to non-members. Members of such a community can be efficiently identified in a maximum flow / minimum cut framework, where the source is composed of known members, and the sink consists of well-known non-members. A focused crawler that crawls to a fixed depth can approximate community membership by augmenting the graph induced by the crawl with links to a virtual sink node. The effectiveness of the approximation algorithm is demonstrated with several crawl results that identify hubs, authorities, web rings, and other link topologies that are useful but not easily categorized. Applications of our approach include focused crawlers and search engines, automatic population of portal categories, and improved filtering.
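The maximum flow / minimum cut framing can be illustrated on a small undirected graph. The sketch below is an assumption-laden toy, not the authors' procedure: every link gets unit capacity, a virtual source is wired to the known members, a virtual sink is wired to the known non-members, and the community is taken to be whatever remains reachable from the source once a shortest-augmenting-path max flow has saturated the cut. The names `community_by_min_cut`, `seeds`, and `outsiders` are hypothetical, and the two seed sets are assumed to be disjoint.

```python
from collections import defaultdict, deque


def community_by_min_cut(edges, seeds, outsiders):
    """Return the source side of a minimum cut separating seeds from outsiders."""
    cap = defaultdict(lambda: defaultdict(float))
    for u, v in edges:  # treat each link as undirected with unit capacity
        cap[u][v] += 1.0
        cap[v][u] += 1.0
    source, sink = "_source", "_sink"
    big = 2.0 * len(edges) + 1.0  # larger than any possible cut of unit edges
    for s in seeds:
        cap[source][s] = big
    for t in outsiders:
        cap[t][sink] = big

    def residual_search():
        # Breadth-first search over edges that still have spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = residual_search()
        if sink not in parent:
            break  # no augmenting path left, so the cut is saturated
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= flow
            cap[w][u] += flow
    return set(parent) - {source}


# Toy graph (assumed data): two triangles joined by the single link c-x.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "x"),
             ("x", "y"), ("y", "z"), ("x", "z")]
toy_community = community_by_min_cut(toy_edges, seeds={"a"}, outsiders={"z"})
```

On the toy graph the only link crossing between the two triangles is the cheapest cut, so the seed's triangle comes back as the community.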
Finding related pages in the World Wide Web
- IN INTERNATIONAL WORLD WIDE WEB CONFERENCE
, 1999
Cited by 178 (1 self)
When using traditional search engines, users have to formulate queries to describe their information need. This paper discusses a different approach to web searching where the input to the search process is not a set of query terms, but instead is the URL of a page, and the output is a set of related web pages. A related web page is one that addresses the same topic as the original page. For example, www.washingtonpost.com is a page related to www.nytimes.com, since both are online newspapers. We describe two algorithms to identify related web pages. These algorithms use only the connectivity information in the web (i.e., the links between pages) and not the content of pages or usage information. We have implemented both algorithms and measured their runtime performance. To evaluate the effectiveness of our algorithms, we performed a user study comparing our algorithms with Netscape's "What's Related" service [12]. Our study showed that the precision at 10 for our two algorithms are 73% better and 51% better than that of Netscape, despite the fact that Netscape uses both content and usage pattern information in addition to connectivity information.
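The abstract does not describe the two algorithms themselves, so the sketch below is not a reconstruction of either one. It only illustrates what a connectivity-only approach can look like, using a plain co-citation count: pages that are frequently linked from the same parent pages as the input URL are ranked as related. The function name, the `links` layout, and the parent page names are hypothetical.

```python
from collections import Counter


def related_by_cocitation(links, url, top_k=10):
    """links maps each page to the pages it points to; content is never read."""
    # Parents are the pages that link to the input URL.
    parents = [page for page, targets in links.items() if url in targets]
    counts = Counter()
    for parent in parents:
        # Each sibling gets one vote per shared parent.
        for sibling in set(links[parent]):
            if sibling != url:
                counts[sibling] += 1
    return [page for page, _ in counts.most_common(top_k)]


# Toy link data (assumed): two directory pages and one unrelated blog.
toy_web = {
    "news-directory": ["www.nytimes.com", "www.washingtonpost.com", "www.cnn.com"],
    "papers-list": ["www.nytimes.com", "www.washingtonpost.com"],
    "some-blog": ["www.cnn.com", "www.example.org"],
}
toy_related = related_by_cocitation(toy_web, "www.nytimes.com")
```

Here www.washingtonpost.com shares two parents with the input page and so ranks ahead of www.cnn.com, which shares one.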
An Interactive System for Finding Complementary Literatures: a Stimulus to Scientific Discovery
- Artificial Intelligence
, 1997
Cited by 130 (8 self)
An unintended consequence of specialization in science is poor communication across specialties. Information developed in one area of research may be of value in another without anyone becoming aware of the fact. We describe and evaluate interactive software and database search strategies that facilitate the discovery of previously unknown cross-specialty information of scientific interest. The user begins by searching MEDLINE for article titles that identify a problem or topic of interest. From downloaded titles the software constructs input for additional database searches and produces a series of heuristic aids that help the user select a second set of articles complementary to the first set and from a different area of research. The two sets are complementary if together they can reveal new useful information that cannot be inferred from either set alone. The software output further helps the user identify the new information and derive from it a novel testable hypothesis. We report several successful tests and applications of the system.
COMBINING APPROACHES TO INFORMATION RETRIEVAL
Cited by 114 (3 self)
The combination of different text representations and search strategies has become a standard technique for improving the effectiveness of information retrieval. Combination, for example, has been studied extensively in the TREC evaluations and is the basis of the “meta-search” engines used on the Web. This paper examines the development of this technique, including both experimental results and the retrieval models that have been proposed as formal frameworks for combination. We show that combining approaches for information retrieval can be modeled as combining the outputs of multiple classifiers based on one or more representations, and that this simple model can provide explanations for many of the experimental results. We also show that this view of combination is very similar to the inference net model, and that a new approach to retrieval based on language models supports combination and can be integrated with the inference net model.
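The abstract names no specific combination formula. As one simple illustration of merging the outputs of several retrieval systems, the sketch below uses the well-known CombSUM idea: rescale each system's scores to the range 0 to 1, then add them per document. This is a swapped-in technique chosen for brevity, not the model the paper proposes; the function name `comb_sum` and the toy scores are assumptions.

```python
def comb_sum(runs):
    """runs is a list of {document: score} dicts, one per retrieval system."""
    combined = {}
    for scores in runs:
        if not scores:
            continue  # an empty run contributes nothing
        low, high = min(scores.values()), max(scores.values())
        span = (high - low) or 1.0  # avoid dividing by zero when all scores tie
        for doc, score in scores.items():
            # Min-max normalize so systems with different scales are comparable.
            combined[doc] = combined.get(doc, 0.0) + (score - low) / span
    # Rank documents by their summed normalized score, best first.
    return sorted(combined, key=combined.get, reverse=True)


# Toy scores (assumed data) from two systems with different score scales.
toy_runs = [
    {"d1": 3.0, "d2": 1.0, "d3": 2.0},
    {"d2": 0.9, "d3": 0.8, "d4": 0.1},
]
toy_ranking = comb_sum(toy_runs)
```

A document that scores moderately well in both systems (d3 here) can outrank documents that only one system favors, which is the usual motivation for combination.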
Information retrieval on the Web
- ACM Computing Surveys
, 2000
Cited by 95 (0 self)
In this paper we review studies of the growth of the Internet and technologies that are useful for information search and retrieval on the Web. We present data on the Internet from several different sources, e.g., current as well as projected number of users, hosts, and Web sites. Although numerical figures vary, overall trends cited
Mapping of science by combined cocitation and word analysis
- II. dynamical aspects, Journal American Society Information Science
, 1991
Cited by 72 (2 self)
The claim that co-citation analysis is a useful tool to map subject-matter specialties of scientific research in a given period, is examined. A method has been developed using quantitative analysis of content-words related to publications in order to: (1) study coherence of research topics within sets of publications citing clusters, i.e., (part of) the “current work” of a specialty; (2) to study differences in research topics between sets of publications citing different clusters; and (3) to evaluate recall of “current work” publications concerning the specialties identified by co-citation analysis. Empirical support is found for the claim that co-citation analysis identifies indeed subject-matter specialties. However, different clusters may identify the same specialty, and results are far from complete concerning the identified “current work.” These results are in accordance with the opinion of some experts in the fields. Low recall of co-citation analysis concerning the “current work” of specialties is shown to be related to the way in which researchers build their work on earlier publications: the “missed” publications equally build on very recent earlier work, but are less “consensual” and/or less “attentive” in their referencing practice. Evaluation of national research performance using co-citation analysis appears to be biased by this “incompleteness.”
The connectivity sonar: detecting site functionality by structural patterns
- In Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia
, 2003
Cited by 62 (1 self)
Web sites today serve many different functions, such as corporate sites, search engines, e-stores, and so forth. As sites are created for different purposes, their structure and connectivity characteristics vary. However, this research argues that sites of similar role exhibit similar structural patterns, as the functionality of a site naturally induces a typical hyperlinked structure and typical connectivity patterns to and from the rest of the Web. Thus, the functionality of Web sites is reflected in a set of structural and connectivity-based features that form a typical signature. In this paper, we automatically categorize sites into eight distinct functional classes, and highlight several search-engine related applications that could make immediate use of such technology. We purposely limit our categorization algorithms by tapping connectivity and structural data alone, making no use of any content analysis whatsoever. When applying two classification algorithms to a set of 202 sites of the eight defined functional categories, the algorithms correctly classified between 54.5% and 59% of the sites. On some categories, the precision of the classification exceeded 85%. An additional result of this work indicates that the structural signature can be used to detect spam rings and mirror sites, by clustering sites with almost identical signatures.
Natural Communities in Large Linked Networks
, 2003
Cited by 61 (1 self)
We are interested in finding natural communities in large-scale linked networks. Our ultimate goal is to track changes over time in such communities. For such temporal tracking, we require a clustering algorithm that is relatively stable under small perturbations of the input data. We have developed an efficient, scalable agglomerative strategy and applied it to the citation graph of the NEC CiteSeer database (250,000 papers; 4.5 million citations). Agglomerative clustering techniques are known to be unstable on data in which the community structure is not strong. We find that some communities are essentially random and thus unstable while others are natural and will appear in most clusterings. These natural communities will enable us to track the evolution of communities over time.