Authoritative Sources in a Hyperlinked Environment
- JOURNAL OF THE ACM
, 1999
Cited by 2222 (9 self)
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of “authoritative” information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of “hub pages” that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
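The abstract above does not spell out the update rule, so the following is only a minimal sketch of the commonly described hub/authority mutual-reinforcement iteration: a page's authority score is the sum of the hub scores of pages linking to it, and a page's hub score is the sum of the authority scores of the pages it links to. The function names `hits` and `_normalize`, the edge-list input format, and the fixed iteration count are assumptions made for illustration, not details taken from the paper.

```python
from math import sqrt


def _normalize(scores):
    """Scale scores to unit L2 norm (hypothetical helper; no-op if all zero)."""
    norm = sqrt(sum(v * v for v in scores.values()))
    if norm == 0.0:
        return dict(scores)
    return {k: v / norm for k, v in scores.items()}


def hits(edges, iterations=50):
    """Sketch of hub/authority scoring on a directed link graph.

    edges: iterable of (source, target) pairs, meaning source links to target.
    Returns (hubs, authorities) as dicts of L2-normalized scores.
    """
    edges = list(edges)
    nodes = sorted({n for edge in edges for n in edge})
    hubs = {n: 1.0 for n in nodes}
    auths = {n: 1.0 for n in nodes}

    for _ in range(iterations):
        # Authority update: add up the hub scores of pages linking in.
        new_auths = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auths[dst] += hubs[src]

        # Hub update: add up the fresh authority scores of pages linked to.
        new_hubs = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hubs[src] += new_auths[dst]

        # Normalizing keeps the scores bounded across iterations.
        auths = _normalize(new_auths)
        hubs = _normalize(new_hubs)

    return hubs, auths


# Toy graph: two hub-like pages pointing at three candidate authorities.
example_edges = [
    ("h1", "a1"), ("h1", "a2"), ("h1", "a3"),
    ("h2", "a1"), ("h2", "a2"),
]
example_hubs, example_auths = hits(example_edges)
```

In this toy graph, `a1` and `a2` are linked from both hub pages and so end up with higher authority scores than `a3`, which only `h1` links to. The repeated multiply-and-normalize step behaves like power iteration, which is one way to see the eigenvector connection the abstract mentions.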
Enhanced Hypertext Categorization Using Hyperlinks
, 1998
Cited by 326 (8 self)
A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost upon a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented ...
Inferring Web Communities from Link Topology
, 1998
Cited by 298 (4 self)
The World Wide Web grows through a decentralized, almost anarchic process, and this has resulted in a large hyperlinked corpus without the kind of logical organization that can be built into more traditionally-created hypermedia. To extract meaningful structure under such circumstances, we develop a notion of hyperlinked communities on the WWW through an analysis of the link topology. By invoking a simple, mathematically clean method for defining and exposing the structure of these communities, we are able to derive a number of themes: The communities can be viewed as containing a core of central, "authoritative" pages linked together by "hub pages"; and they exhibit a natural type of hierarchical topic generalization that can be inferred directly from the pattern of linkage. Our investigation shows that although the process by which users of the Web create pages and links is very difficult to understand at a "local" level, it results in a much greater degree of orderly high-level structure than has typically been assumed.
Web mining: Information and pattern discovery on the world wide web
, 1997
Cited by 207 (18 self)
Application of data mining techniques to the World Wide Web, referred to as Web mining, has been the focus of several recent research projects and papers. However, there is no established vocabulary, leading to confusion when comparing research efforts. The term Web mining has been used in two distinct ways. The first, called Web content mining in this paper, is the process of information discovery from sources across the World Wide Web. The second, called Web usage mining, is the process of mining for user browsing and access patterns. In this paper we define Web mining and present an overview of the various research issues, techniques, and development efforts. We briefly describe WEBMINER, a system for Web usage mining, and conclude this paper by listing research issues.
Keyword searching and browsing in databases using BANKS
- In ICDE
, 2002
Cited by 181 (11 self)
With the growth of the Web, there has been a rapid increase in the number of users who need to access online databases without having a detailed knowledge of the schema or of query languages; even relatively simple query languages designed for non-experts are too complicated for them. We describe BANKS, a system which enables keyword-based search on relational databases, together with data and schema browsing. BANKS enables users to extract information in a simple manner without any knowledge of the schema or any need for writing complex queries. A user can get information by typing a few keywords, following hyperlinks, and interacting with controls on the displayed results. BANKS models tuples as nodes in a graph, connected by links induced by foreign key and other relationships. Answers to a query are modeled as rooted trees connecting tuples that match individual keywords in the query. Answers are ranked using a notion of proximity coupled with a notion of prestige of nodes based on inlinks, similar to techniques developed for Web search. We present an efficient heuristic algorithm for finding and ranking query results.
Query-Based Sampling of Text Databases
- ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 1999
Cited by 134 (13 self)
... This paper presents query-based sampling, a new technique for acquiring accurate resource descriptions. Query-based sampling does not require the cooperation of resource providers nor does it require that resource providers use a particular search engine or representation technique. An extensive set of experimental results demonstrates that accurate resource descriptions are created, that computation and communication costs are reasonable, and that the resource descriptions do in fact enable accurate automatic database selection.
SETS: Search Enhanced by Topic Segmentation
, 2003
Cited by 61 (4 self)
We present SETS, an architecture for building topic-segmented networks for efficient search. The key idea is to arrange participants in a topic-segmented topology where most of the links are short-distance links joining pairs of sites with similar content. The resulting topically focused regions are joined together into a single network by long-distance links. Queries are then matched and routed to only the topically closest regions. We draw on ideas from machine learning and social network theory to build an efficient search network. We discuss a variety of design issues and tradeoffs that an implementor of SETS would face. We show that SETS is efficient in network traffic and query processing load.
WebACE: A Web Agent for Document Categorization and Exploration
, 1998
Cited by 57 (16 self)
We propose an agent for exploring and categorizing documents on the World Wide Web. The heart of the agent is an automatic categorization of a set of documents, combined with a process for generating new queries used to search for new related documents and filtering the resulting documents to extract the set of documents most closely related to the starting set. The document categories are not given a priori. We present the overall architecture and describe two novel algorithms which provide significant improvement over traditional clustering algorithms and form the basis for the query generation and search component of the agent.
1 Introduction
The World Wide Web is a vast resource of information and services that continues to grow rapidly. Powerful search engines have been developed to aid in locating unfamiliar documents by category, contents, or subject. Relying on large indexes to documents located on the Web, search engines determine the URLs of those documents satisfying a use...
Partitioning-based clustering for web document categorization
- Decision Support Systems
, 1999
Cited by 56 (12 self)
Clustering techniques have been used by many intelligent software agents in order to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful in extracting salient features of related web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to define a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classification. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can effectively cluster documents, even in the presence of a very high dimensional feature space. These clustering techniques, which are based on generalizations of graph partitioning, do not require pre-specified ad hoc distance functions, and are capable of automatically discovering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such as hierarchical agglomeration clustering, and Bayesian classification methods, such as AutoClass.
Improving Browsing in Digital Libraries with Keyphrase Indexes
, 1998
Cited by 49 (9 self)
Browsing accounts for much of people's interaction with digital libraries, but it is poorly supported by standard search engines. Conventional systems often operate at the wrong level, indexing words when people think in terms of topics, and returning documents when people want a broader view. As a result, users cannot easily determine what is in a collection, how well a particular topic is covered, or what kinds of queries will provide useful results. We have built

