Advances in Domain Independent Linear Text Segmentation
, 2000
Cited by 100 (1 self)
This paper describes a method for linear text segmentation which is twice as accurate and over seven times as fast as the state-of-the-art (Reynar, 1998). Inter-sentence similarity is replaced by rank in the local context. Boundary locations are discovered by divisive clustering.
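The phrase "rank in the local context" can be illustrated with a small sketch. It assumes a square inter-sentence similarity matrix and replaces each value with the fraction of neighbouring cells that hold a strictly lower value. The function name `rank_transform`, the `radius` parameter, and the normalisation by the number of neighbours examined are illustrative assumptions, not details taken from the abstract, and the divisive clustering step is not shown.

```python
def rank_transform(sim, radius=1):
    """Replace each similarity value by its rank among nearby cells.

    sim    -- square matrix (list of lists) of inter-sentence similarities
    radius -- half-width of the square neighbourhood (hypothetical parameter)
    """
    n = len(sim)
    ranked = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            lower = 0      # neighbours with a strictly lower similarity
            examined = 0   # neighbours that fall inside the matrix
            for a in range(max(0, i - radius), min(n, i + radius + 1)):
                for b in range(max(0, j - radius), min(n, j + radius + 1)):
                    if (a, b) == (i, j):
                        continue  # skip the cell being ranked
                    examined += 1
                    if sim[a][b] < sim[i][j]:
                        lower += 1
            # Assumed normalisation: divide by the neighbours actually examined,
            # so cells near the matrix edge remain comparable to interior cells.
            ranked[i][j] = lower / examined if examined else 0.0
    return ranked
```

Ranking in this way discards absolute similarity values and keeps only their local ordering, which is the property the abstract highlights.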
Evaluating Topic-Driven Web Crawlers
, 2001
Cited by 72 (19 self)
Due to limited bandwidth, storage, and computational resources, and to the dynamic nature of the Web, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. Therefore it is essential to develop effective crawling strategies to prioritize the pages to be indexed. The issue is even more important for topic-specific search engines, where crawlers must make additional decisions based on the relevance of visited pages. However, it is difficult to evaluate alternative crawling strategies because relevant sets are unknown and the search space is changing. We propose three different methods to evaluate crawling strategies. We apply the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.
Recognizing Structure in Web Pages using Similarity Queries
- In AAAI-99
, 1999
Cited by 35 (5 self)
We present general-purpose methods for recognizing certain types of structure in HTML documents. The methods are implemented using WHIRL, a "soft" logic that incorporates a notion of textual similarity developed in the information retrieval community. In an experimental evaluation on 82 Web pages, the structure ranked first by our method is "meaningful" (i.e., a structure that was used in a hand-coded "wrapper", or extraction program, for the page) nearly 70% of the time. This improves on a value of 50% obtained by an earlier method. With appropriate background information, the structure-recognition methods we describe can also be used to learn a wrapper from examples, or for maintaining a wrapper as a Web page changes format. In these settings, the top-ranked structure is meaningful nearly 85% of the time. Introduction Web-based information integration systems allow a user to query structured information that has been extracted from the Web (Levy, Rajaraman, & Ordille 1996; Garcia...
Linguistically Motivated Information Retrieval
, 2000
Cited by 17 (2 self)
Information retrieval (IR) has been developed to provide practical solutions to people's need to find the desired information...
A Repetition Based Measure for Verification of Text Collections and for Text Categorization
, 2003
Cited by 17 (0 self)
We suggest a way for locating duplicates and plagiarisms in a text collection using an R-measure, which is the normalized sum of the lengths of all suffixes of the text repeated in other documents of the collection. The R-measure can be effectively computed using the suffix array data structure. Additionally, the computation procedure can be improved to locate the sets of duplicate or plagiarised documents. We applied the technique to several standard text collections and found that they contained a significant number of duplicate and plagiarised documents. Another reformulation of the method leads to an algorithm that can be applied to supervised multi-class categorization. We illustrate the approach using the recently available Reuters Corpus Volume 1 (RCV1). The results show that the method outperforms SVM at multi-class categorization, and interestingly, that results correlate strongly with compression-based methods.
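The R-measure described above can be sketched as follows. For each suffix of the text, the sketch finds the longest prefix of that suffix that also occurs in some other document, sums those lengths, and normalises the total. Two points are assumptions rather than details from the abstract: the normalising constant n(n+1)/2, chosen so that a text wholly contained in another document scores 1.0, and the naive substring search, which stands in for the suffix array the abstract mentions. The function names are hypothetical.

```python
def longest_repeated_prefix(suffix, others):
    """Length of the longest prefix of `suffix` found in any document in `others`."""
    best = 0
    for doc in others:
        k = best
        # Extend the match one character at a time while it still occurs in doc.
        while k < len(suffix) and suffix[:k + 1] in doc:
            k += 1
        best = max(best, k)
    return best


def r_measure(text, others):
    """Normalised sum of repeated-prefix lengths over all suffixes of `text`.

    Assumption: normalise by n(n+1)/2, the largest possible sum, so the
    result lies in [0, 1]. Naive search; a suffix array would be far faster.
    """
    n = len(text)
    if n == 0:
        return 0.0
    total = sum(longest_repeated_prefix(text[i:], others) for i in range(n))
    return total / (n * (n + 1) / 2)


if __name__ == "__main__":
    print(r_measure("abcd", ["ab"]))       # partial overlap
    print(r_measure("abc", ["xxabcxx"]))   # text fully contained elsewhere
```

Under this assumed normalisation, values close to 1 suggest that the text is largely duplicated elsewhere in the collection.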
Information Retrieval Based on Context Distance and Morphology
- SIGIR '99
, 1999
Cited by 13 (1 self)
We present an approach to information retrieval based on context distance and morphology. Context distance is a measure we use to assess the closeness of word meanings. This context distance model measures semantic distances between words using the local contexts of words within a single document as well as the lexical co-occurrence information in the set of documents to be retrieved. We also propose to integrate the context distance model with morphological analysis in determining word similarity so that the two can enhance each other. Using the standard vector-space model, we evaluated the proposed method on a subset of TREC-4 corpus (AP88 and AP90 collection, 158,240 documents, 49 queries). Results show that this method improves the 11-point average precision by 8.6%.
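The 11-point average precision cited in this abstract is a standard retrieval metric. A minimal sketch of the usual interpolated form follows, assuming a single query whose ranked results are given as a list of relevance flags. The function name and input format are illustrative and are not taken from the paper.

```python
def eleven_point_average_precision(ranked_relevance, total_relevant):
    """Interpolated precision averaged over recall levels 0.0, 0.1, ..., 1.0.

    ranked_relevance -- list of booleans, one per retrieved document, in rank order
    total_relevant   -- number of relevant documents that exist for the query
    """
    if total_relevant == 0:
        return 0.0
    hits = 0
    points = []  # (recall, precision) measured at each relevant document
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolated precision at a level is the best precision at any recall >= level.
    interpolated = [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]
    return sum(interpolated) / len(levels)
```

To score a whole system, this per-query value would typically be averaged over every query in the test set.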
Semantic Similarity in Content-based Filtering
, 2002
Cited by 5 (3 self)
In content-based filtering systems, content of items is used to recommend new items to the users. It is usually represented by words in natural language where meanings of words are often ambiguous. We studied clustering of words based on their semantic similarity. Then we used word clusters to represent items for recommending new items by content-based filtering. In the paper we present our empirical results.
Query and Data Mapping across Heterogeneous Information Sources
, 2001
Cited by 2 (1 self)
The Internet has brought together information sources worldwide. Integrating such heterogeneous and autonomous sources is challenging because of their non-uniform query languages and data representations. To help users uniformly query over different sources, we have developed an integration system or a mediator for optimally mapping queries and data across disparate contexts. Such a translation technique is essential for many important applications that require querying sources and analyzing data on the web, such as meta-searching, e-commerce, and web mining. This thesis presents our solutions...
Linear Text Segmentation: Approaches, Advances, and Applications
- Proceedings of CLUK3
, 2000
Cited by 2 (0 self)
This paper presents a new algorithm for domain independent linear text segmentation which is twice as accurate and over seven times as fast as the state-of-the-art [22]. The algorithm and statistical summarisation techniques were applied to a practical problem, improving document navigation for the visually disabled.

