Results 11 - 20 of 120
Performance and Scalability of a Large-Scale N-gram Based Information Retrieval System
- JOURNAL OF DIGITAL INFORMATION, 2000
Cited by 27 (0 self)
Information retrieval has become increasingly important due to the rapid growth of all kinds of information. However, there are few suitable systems available. This paper presents several approaches that enable large-scale information retrieval for the TELLTALE system. TELLTALE is a dynamic hypertext information retrieval environment. Using several techniques based on n-grams (n-character sequences of text), it provides full-text search for text corpora that may be garbled by OCR (Optical Character Recognition) or transmission errors and that may contain multiple languages. It can find documents similar to a 1 KB query in 1 GB of text data in 45 seconds. This performance is achieved by integrating new data structures and gamma compression into the TELLTALE framework. This paper also compares several query methods, such as TF/IDF and incremental similarity, with the original technique of centroid subtraction. The new similarity techniques give better performance but lower accuracy.
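To illustrate why character n-grams tolerate OCR-style garbling, here is a minimal sketch of n-gram matching with cosine similarity. It is not the TELLTALE implementation: the function names (`ngram_profile`, `cosine_similarity`, `rank`), the choice of n = 5, and the plain cosine score are all assumptions made for illustration, and centroid subtraction and gamma compression are not shown.

```python
from collections import Counter
from math import sqrt


def ngram_profile(text: str, n: int = 5) -> Counter:
    """Count overlapping character n-grams (lowercased, whitespace collapsed)."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query: str, documents: list, n: int = 5) -> list:
    """Return (score, document index) pairs, best match first."""
    q = ngram_profile(query, n)
    scores = [
        (cosine_similarity(q, ngram_profile(doc, n)), i)
        for i, doc in enumerate(documents)
    ]
    # Sort by descending score; ties fall back to the lower document index.
    return sorted(scores, key=lambda s: (-s[0], s[1]))


if __name__ == "__main__":
    docs = [
        "information retrieval with n-grams",
        "infornation retrievol with n-grans",  # hypothetical OCR-garbled copy
        "peer-to-peer file sharing simulation",
    ]
    for score, index in rank("information retrieval", docs):
        print(f"{score:.3f}  {docs[index]}")
```

Because a single corrupted character only disturbs the few n-grams that overlap it, the garbled copy still shares most of its n-grams with the query and ranks well above the unrelated document.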
Coverage, Relevance, and Ranking: The Impact of Query Operators on . . .
- ACM TRANSACTIONS ON INFORMATION SYSTEMS, 2003
Simulating A File-Sharing P2P Network
Cited by 25 (3 self)
Assessing the performance of peer-to-peer algorithms such as topology construction protocols, distributed trust or search algorithms is impossible without simulations since testing new algorithms by deploying them in an existing P2P network is prohibitively expensive. However, some P2P algorithms are sensitive to the network and traffic models that are used in the simulations. In order to produce realistic results, we therefore require models that resemble real-world P2P networks as closely as possible. In this paper, we describe a model for P2P file-sharing networks, link it to measurements on existing P2P networks and discuss open issues in modeling these networks.
OntoMiner: Bootstrapping and Populating Ontologies from Domain Specific Websites
- Proceedings of the First International Workshop on Semantic Web and Databases (SWDB 2003), 2003
Cited by 19 (2 self)
RDF/XML has been widely recognized as the standard for annotating online Web documents and for transforming the HTML Web to the so-called Semantic Web. In order to enable widespread usability for the Semantic Web there is a need to bootstrap large, rich and up-to-date domain ontologies that organize most relevant concepts, their relationships and instances. In this paper, we present automated techniques for bootstrapping and populating specialized domain ontologies by organizing and mining a set of relevant Web sites provided by the user. We develop algorithms that detect and utilize HTML regularities in the Web documents to turn them into hierarchical semantic structures encoded as XML. Next, we present tree-mining algorithms that identify key domain concepts and their taxonomical relationships. We also extract semi-structured concept instances annotated with their labels whenever they are available. Experimental evaluation for the News and Hotels domains indicates that our algorithms can bootstrap and populate domain-specific ontologies with high precision and recall.
Machine Learning in Automated Text Categorisation
- 1999
Cited by 16 (1 self)
The automated categorisation (or classification) of texts into topical categories has a long history, dating back at least to the early ’60s. Until the late ’80s, the most effective approach to the problem seemed to be that of manually building automatic classifiers by means of knowledge-engineering techniques, i.e. manually defining a set of rules encoding expert knowledge on how to classify documents under a given set of categories. In the ’90s, with the booming production and availability of on-line documents, automated text categorisation has witnessed an increased and renewed interest, prompted by which the machine learning paradigm to automatic classifier construction has emerged and definitely superseded the knowledge-engineering approach. Within the machine learning paradigm, a general inductive process (called the learner) automatically builds a classifier (also called the rule, or the hypothesis) by “learning”, from a set of previously classified documents, the characteristics of one or more categories. The advantages of this approach are a very good effectiveness, a considerable savings in terms of expert manpower, and domain independence. In this survey we look at the main approaches that have been taken towards automatic text categorisation within the general machine learning paradigm. Issues pertaining to document indexing, classifier construction, and classifier evaluation, will be discussed in detail. A final section will be devoted to the techniques that have specifically been devised for an emerging application such as the automatic classification of Web pages into “Yahoo!-like” hierarchically structured sets of categories.
Search log analysis: What it is, what's been done, how to do it
- Library & Information Science Research, 2006
Cited by 15 (0 self)
The use of data stored in transaction logs of Web search engines, Intranets, and Web sites can provide valuable insight into understanding the information-searching process of online searchers. This understanding can enlighten information system design, interface development, and devising the information architecture for content collections. This article presents a review and foundation for conducting Web search transaction log analysis. A methodology is outlined consisting of three stages, which are collection, preparation, and analysis. The three stages of the methodology are presented in detail with discussions of goals, metrics, and processes at each stage. Critical terms in transaction log analysis for Web searching are defined. The strengths and limitations of transaction log analysis as a research method are presented. An application to log client-side interactions that supplements transaction logs is reported on, and the application is made available for use by the research community. Suggestions are provided on ways to leverage the strengths of, while addressing the limitations of, transaction log analysis for Web-searching research. Finally, a complete flat text transaction log from a commercial search engine is available as supplementary material with this
Algorithmic computation and approximation of semantic similarity
- WWW Journal, 2006
Cited by 15 (1 self)
Automatic extraction of semantic information from text and links in Web pages is key to improving the quality of search results. However, the assessment of automatic semantic measures is limited by the coverage of user studies, which do not scale with the size, heterogeneity, and growth of the Web. Here we propose to leverage human-generated metadata, namely topical directories, to measure semantic relationships among massive numbers of pairs of Web pages or topics. The Open Directory Project classifies millions of URLs in a topical ontology, providing a rich source from which semantic relationships between Web pages can be derived. While semantic similarity measures based on taxonomies (trees) are well studied, the design of well-founded similarity measures for objects stored in the nodes of arbitrary ontologies (graphs) is an open problem. This paper defines an information-theoretic measure of semantic similarity that exploits both the hierarchical and non-hierarchical structure of an ontology. An experimental study shows that this measure improves significantly on the traditional taxonomy-based approach. This novel measure allows us to address the general question of how text and link analyses can be combined to derive measures of relevance that are in good agreement with semantic similarity. Surprisingly, the traditional use of text similarity turns out to be ineffective for relevance ranking.
Information Retrieval: A Survey
- 2000
Cited by 14 (0 self)
Information Retrieval (IR) is the discipline that deals with retrieval of unstructured data, especially textual documents, in response to a query or topic statement, which may itself be unstructured, e.g., a sentence or even another document, or which may be structured, e.g., a boolean expression. The need for effective methods of automated IR has grown in importance because of the tremendous explosion in the amount of unstructured data, both internal, corporate document collections, and the immense and growing number of document sources on the Internet. This report is a tutorial and survey of the state of the art, both research and commercial, in this dynamic field. The topics covered include: formulation of structured and unstructured queries and topic statements, indexing (including term weighting) of document collections, methods for computing the similarity of queries and documents, classification and routing of documents in an incoming stream to users on the basis of topic or nee...
A Fast Algorithm for Hierarchical Text Classification
- 2000
Cited by 14 (4 self)
Text classification is becoming more important with the proliferation of the Internet and the huge amount of data it transfers. We present an efficient algorithm for text classification using hierarchical classifiers based on a concept hierarchy. The simple TFIDF classifier is chosen to train sample data and to classify other new data. Despite its simplicity, results of experiments on Web pages and TV closed captions demonstrate high classification accuracy. Application of feature subset selection techniques improves the performance. Our algorithm is computationally efficient being bounded by O(n log n) for n samples.
1 Introduction
As the amount of on-line data increases by leaps and bounds, the design of an efficient algorithm or an approach to accessing the data (e.g. through classification, clustering, filtering, etc.) has become of great interest. Two important aspects motivate such design. First, the data needs to be arranged efficiently. For example, instead of placi...
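As a rough illustration of the "simple TFIDF classifier" mentioned in the abstract above, the following sketch builds one TF-IDF prototype vector per category and assigns a new text to the category whose prototype it most resembles. This is an assumption-laden simplification, not the paper's algorithm: it is flat rather than hierarchical, it omits feature subset selection, and the class name `TfidfCentroidClassifier`, the whitespace tokenizer, and the smoothed IDF formula are hypothetical choices.

```python
from collections import Counter, defaultdict
from math import log, sqrt


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


class TfidfCentroidClassifier:
    """Flat TF-IDF prototype classifier (no concept hierarchy)."""

    def fit(self, texts, labels):
        n = len(texts)
        # Document frequency: number of training texts containing each term.
        df = Counter(term for text in texts for term in set(tokenize(text)))
        # Smoothed IDF, so a term present in every text keeps a positive weight.
        self.idf = {t: log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
        # One prototype per label: the sum of its normalized document vectors.
        self.prototypes = defaultdict(lambda: defaultdict(float))
        for text, label in zip(texts, labels):
            for term, weight in self._vector(text).items():
                self.prototypes[label][term] += weight
        return self

    def _vector(self, text):
        """Unit-length TF-IDF vector; terms unseen in training are ignored."""
        counts = Counter(tokenize(text))
        vec = {t: tf * self.idf[t] for t, tf in counts.items() if t in self.idf}
        norm = sqrt(sum(w * w for w in vec.values()))
        return {t: w / norm for t, w in vec.items()} if norm else {}

    def predict(self, text):
        vec = self._vector(text)

        def score(label):
            proto = self.prototypes[label]
            norm = sqrt(sum(w * w for w in proto.values()))
            if norm == 0:
                return 0.0
            return sum(w * proto.get(t, 0.0) for t, w in vec.items()) / norm

        # Sorting the labels makes tie-breaking deterministic.
        return max(sorted(self.prototypes), key=score)


if __name__ == "__main__":
    clf = TfidfCentroidClassifier().fit(
        ["stock market shares fall", "bank raises interest rates",
         "team wins football match", "player scores goal in match"],
        ["finance", "finance", "sport", "sport"],
    )
    print(clf.predict("football player scores"))
```

A hierarchical variant would presumably apply a classifier of this kind at each node of the concept hierarchy and descend into the best-scoring child; that step is not shown here.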

