Web mining: Information and pattern discovery on the world wide web
1997
Cited by 207 (18 self)
Application of data mining techniques to the World Wide Web, referred to as Web mining, has been the focus of several recent research projects and papers. However, there is no established vocabulary, leading to confusion when comparing research efforts. The term Web mining has been used in two distinct ways. The first, called Web content mining in this paper, is the process of information discovery from sources across the World Wide Web. The second, called Web usage mining, is the process of mining for user browsing and access patterns. In this paper we define Web mining and present an overview of the various research issues, techniques, and development efforts. We briefly describe WEBMINER, a system for Web usage mining, and conclude this paper by listing research issues.
Building efficient and effective metasearch engines
ACM Computing Surveys, 2002
Cited by 107 (9 self)
Frequently a user's information needs are stored in the databases of multiple search engines. It is inconvenient and inefficient for an ordinary user to invoke multiple search engines and identify useful documents from the returned results. To support unified access to multiple search engines, a metasearch engine can be constructed. When a metasearch engine receives a query from a user, it invokes the underlying search engines to retrieve useful information for the user. Metasearch engines have other benefits as a search tool such as increasing the search coverage of the Web and improving the scalability of the search. In this article, we survey techniques that have been proposed to tackle several underlying challenges for building a good metasearch engine. Among the main challenges, the database selection problem is to identify search engines that are likely to return useful documents to a given query. The document selection problem is to determine what documents to retrieve from each identified search engine. The result merging problem is to combine the documents returned from multiple search engines. We will also point out some problems that need to be further researched.
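The three challenges named in this abstract (database selection, document selection, result merging) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not a technique taken from the survey: the `Engine` type, the term-set `summaries`, the term-overlap selection rule, and the top-score normalisation used for merging are all hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical type: an engine maps a query to (doc_id, local_score) pairs.
Engine = Callable[[str], List[Tuple[str, float]]]


def select_databases(query: str, summaries: Dict[str, Set[str]], k: int) -> List[str]:
    """Database selection: rank engines by how many query terms appear in
    their term summary (an assumed, deliberately crude usefulness proxy)."""
    terms = set(query.lower().split())
    ranked = sorted(summaries, key=lambda name: (-len(terms & summaries[name]), name))
    return ranked[:k]


def metasearch(
    query: str,
    engines: Dict[str, Engine],
    summaries: Dict[str, Set[str]],
    k: int = 2,
    per_engine: int = 3,
) -> List[Tuple[str, float]]:
    merged: Dict[str, float] = {}
    for name in select_databases(query, summaries, k):
        # Document selection: keep only the top `per_engine` local results.
        results = sorted(engines[name](query), key=lambda r: -r[1])[:per_engine]
        if not results:
            continue
        top = max(score for _, score in results)
        for doc_id, score in results:
            # Result merging: local scores are not comparable across engines,
            # so scale each by its engine's top score and keep the best value
            # seen for each document.
            norm = score / top if top > 0 else 0.0
            merged[doc_id] = max(merged.get(doc_id, 0.0), norm)
    # Highest merged score first; ties broken by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))
```

Scaling by each engine's top score is only one simple way to make local scores comparable; it stands in for the merging techniques the survey covers.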
HyPursuit: A hierarchical network search engine that exploits content-link hypertext clustering
Proceedings of the Seventh ACM Conference on Hypertext, 1996
Cited by 88 (2 self)
HyPursuit is a new hierarchical network search engine that clusters hypertext documents to structure a given information space for browsing and search activities. Our content-link clustering algorithm is based on the semantic information embedded in hyperlink structures and document contents. HyPursuit admits multiple, coexisting cluster hierarchies based on different principles for grouping documents, such as the Library of Congress catalog scheme and automatically created hypertext clusters. HyPursuit's abstraction functions summarize cluster contents to support scalable query processing. The abstraction functions satisfy system resource limitations with controlled information loss. The result of query processing operations on a cluster summary approximates the result of performing the operations on the entire information space. We constructed a prototype system comprising 100 leaf World Wide Web sites and a hierarchy of 42 servers that route queries to the leaf sites. Experience with our system suggests that abstraction functions based on hypertext clustering can be used to construct meaningful and scalable cluster hierarchies. We are also encouraged by preliminary results on clustering based on both document contents and hyperlink structures.
Server Ranking for Distributed Text Retrieval Systems on the Internet
In Proceedings of the Fifth International Conference on Database Systems for Advanced Applications, 1997
Cited by 68 (4 self)
Keyword-based search services have become necessary tools for finding information resources on the Internet today. The rapid growth of information on the Internet renders centralized keyword index services incapable of collecting comprehensive resource meta-data in a timely manner. We argue that delegating the task of meta-data collection to local index servers is a more scalable approach. We propose a mechanism for integrating distributed autonomous index servers into a cooperative resource discovery system. Focusing on the retrieval effectiveness of the system, we propose a set of methods, called CVV-based methods, for ranking and selecting index servers with respect to a query, and merging the results returned by the index servers. Through experiments, we evaluate the effectiveness of the CVV-based methods, and compare our server ranking method with methods proposed by other researchers. Keywords: information retrieval, internet databases. 1 Introduction With the rapid growth ...
Determining Text Databases to Search in the Internet
1998
Cited by 38 (5 self)
Text data in the Internet can be partitioned into many databases naturally. Efficient retrieval of desired data can be achieved if we can accurately predict the usefulness of each database, because with such information, we only need to retrieve potentially useful documents from useful databases. In this paper, we propose two new methods for estimating the usefulness of text databases. For a given query, the usefulness of a text database in this paper is defined to be the number of documents in the database that are sufficiently similar to the query. Such a usefulness measure enables naive users to make informed decisions about which databases to search. We also consider the collection fusion problem. Because local databases may employ similarity functions that are different from that used by the global database, the threshold used by a local database to determine whether a document is potentially useful may be different from that used by the global database. We provide techniques that ...
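The usefulness definition in this abstract (the number of documents sufficiently similar to the query) can be computed directly when the whole database is available. The sketch below does exactly that by brute force; it is not one of the paper's two estimation methods, which are not described in the abstract. The cosine similarity over raw term counts, the whitespace tokenisation, and the strict `>` threshold test are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import List


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def usefulness(query: str, documents: List[str], threshold: float) -> int:
    """Count documents whose similarity to the query exceeds `threshold`
    (the assumed reading of "sufficiently similar")."""
    q = Counter(query.lower().split())
    return sum(
        1 for doc in documents if cosine(q, Counter(doc.lower().split())) > threshold
    )
```

An estimator of the kind the abstract proposes would approximate this count from summary statistics, without scanning every document.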
Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's Web
First International Conference on the World Wide Web, 1994
Cited by 36 (5 self)
Most documents made available on the World-Wide Web can be considered part of an infostructure --- an information resource database with a specifically designed structure. Infostructures often contain a wide variety of information sources, in the form of interlinked documents at distributed sites, which are maintained by a number of different document owners (usually, but not necessarily, the original document authors). Individual documents may also be shared by multiple infostructures. Since it is rarely static, the content of an infostructure is likely to change over time and may vary from the intended structure. Documents may be moved or deleted, referenced information may change, and hypertext links may be broken. As it grows, an infostructure becomes complex and difficult to maintain. Such maintenance currently relies upon the error logs of each server (often never relayed to the document owners), the complaints of users (often not seen by the actual document maintainers), and pe...
Ethical Web Agents
1994
Cited by 36 (1 self)
As the Web continues to evolve, the programs that are employed in interacting with it will also increase in sophistication. Web agents, programs acting autonomously on some task, are already present in the form of spiders. Agents offer substantial benefits and hazards, and because of this, their development must involve not only attention to technical details, but also the ethical concerns relating to their resulting impact. These ethical concerns will differ for agents employed in the creation of a service and agents acting on behalf of a specific individual. An ethic is proposed that addresses both of these perspectives. The proposal is predicated on the assumption that agents are a reality on the Web, and that there are no reasonable means of preventing their proliferation. 1 -- Introduction The ease of construction and potential Internet-wide impact of autonomous software agents on the World Wide Web [1] has spawned a great deal of discussion and occasional c...
Concept Hierarchy Based Text Database Categorization
2000
Cited by 35 (6 self)
Document categorization as a technique to improve the retrieval of useful documents has been extensively investigated. One important issue in a large-scale metasearch engine is to select text databases that are likely to contain useful documents for a given query. We believe that database categorization can be a potentially effective technique for good database selection, especially in the Internet environment where short queries are usually submitted. In this paper, we propose and evaluate several database categorization algorithms. This study indicates that while some document categorization algorithms could be adopted for database categorization, algorithms that take into consideration the special characteristics of databases may be more effective. Preliminary experimental results are provided to compare the proposed database categorization algorithms. A prototype database categorization system based on one of the proposed algorithms has been developed.
Search and Ranking Algorithms for Locating Resources on the World Wide Web
In Proceedings of the 12th International Conference on Data Engineering, 1996
Cited by 34 (2 self)
Applying information retrieval techniques to the World Wide Web (WWW) environment is a unique challenge, mostly because of its hypertext/hypermedia nature and the richness of the meta-information it provides. We present four keyword-based search and ranking algorithms for locating relevant WWW pages with respect to user queries. The first algorithm, Boolean Spread Activation, extends the notion of word occurrence in the Boolean retrieval model by propagating the occurrence of a query word in a page to other pages linked to it. The second algorithm, Most-cited, is based on the number of citing hyperlinks between potentially relevant WWW pages to increase the relevance scores of the referenced pages over the referencing pages. The third algorithm, TFxIDF or vector space model, is based on word distribution statistics. The last algorithm, Vector Spread Activation, combines the vector space model and the spread activation model. We conducted an experiment to evaluate the retrieval effectiveness of the...
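The abstract names the third algorithm only as TFxIDF, based on word distribution statistics. A minimal sketch of one common textbook formulation follows, in which a page's score is the sum over query terms of term frequency times log(N / document frequency). The exact weighting used in the paper is not given in the abstract, so the formula, the whitespace tokenisation, and the function name `tfidf_rank` are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_rank(query: str, pages: Dict[str, str]) -> List[str]:
    """Return page ids ordered by an assumed TFxIDF score for `query`."""
    counts = {url: Counter(text.lower().split()) for url, text in pages.items()}
    n = len(counts)
    # Document frequency: the number of pages containing each term.
    df = Counter(term for page_terms in counts.values() for term in page_terms)
    query_terms = set(query.lower().split())
    scores: Dict[str, float] = {}
    for url, page_terms in counts.items():
        score = 0.0
        for term in query_terms:
            if page_terms[term]:
                # Term frequency weighted by inverse document frequency.
                score += page_terms[term] * math.log(n / df[term])
        scores[url] = score
    # Highest score first; ties broken by page id for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))
```

Under this weighting, a term that occurs in every page contributes log(1) = 0, so such terms do not affect the ranking.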
Estimating the Usefulness of Search Engines
1999
Cited by 32 (14 self)
In this paper, we present a statistical method to estimate the usefulness of a search engine for any given query. The estimates can be used by a metasearch engine to choose local search engines to invoke. For a given query, the usefulness of a search engine in this paper is defined to be a combination of the number of documents in the search engine that are sufficiently similar to the query and the average similarity of these documents. Experimental results indicate that the proposed estimation method is quite accurate. 1 Introduction Many search engines have been created on the Internet to help ordinary users find desired data. Each search engine has a corresponding database that defines the set of documents that can be searched by the search engine. Usually, an index for all documents in the database is created and stored in the search engine to speed up query processing. The amount of data in the Internet is huge (it is believed that by the end of 1997, there were more than 300 mil...

