Results 1 - 10
of
20
Concept Based Query Expansion
, 1993
"... Query expansion methods have been studied for a long time - with debatable success in many instances. In this paper we present a probabilistic query expansion model based on a similarity thesaurus which was constructed automatically. A similarity thesaurus reflects domain knowledge about the particu ..."
Abstract
-
Cited by 147 (2 self)
- Add to MetaCart
Query expansion methods have been studied for a long time - with debatable success in many instances. In this paper we present a probabilistic query expansion model based on a similarity thesaurus which was constructed automatically. A similarity thesaurus reflects domain knowledge about the particular collection from which it is constructed. We address the two important issues with query expansion: the selection and the weighting of additional search terms. In contrast to earlier methods, our queries are expanded by adding those terms that are most similar to the concept of the query, rather than selecting terms that are similar to the query terms. Our experiments show that this kind of query expansion results in a notable improvement in the retrieval effectiveness when measured using both recall-precision and usefulness.
A probabilistic framework for vague queries and imprecise information in databases
- PROCEEDINGS OF THE 16TH INTERNATIONAL CONFERENCE ON VERY LARGE DATABASES
, 1990
"... A probabilistic learning model for vague queries and missing or imprecise information in databases is described. Instead of retrieving only a set of answers, our approach yields a ranking of objects from the database in response to a query. By using relevance judgements from the user about the objec ..."
Abstract
-
Cited by 51 (11 self)
- Add to MetaCart
A probabilistic learning model for vague queries and missing or imprecise information in databases is described. Instead of retrieving only a set of answers, our approach yields a ranking of objects from the database in response to a query. By using relevance judgements from the user about the objects retrieved, the ranking for the actual query as well as the overall retrieval quality of the system can be further improved. For specifying different kinds of conditions in vague queries, the notion of vague pred-icates is introduced. Based on the underlying probabilistic model, also imprecise or missing attribute values can be treated easily. In addition, the corresponding formulas can be applied in combination with standard predicates (from two-valued logic), thus extending standard database systems for coping with missing or imprecise data.
AIR/X - a Rule-Based Multistage Indexing System for Large Subject Fields
- Proceedings of RIAO'91
, 1991
"... AIR/X is a rule-based system for indexing with terms (descriptors) from a prescribed vocabulary. For this task, an indexing dictionary with rules for mapping terms from the text onto descriptors is required, which can be derived automatically from a set of manually indexed documents. Based on the ..."
Abstract
-
Cited by 46 (5 self)
- Add to MetaCart
AIR/X is a rule-based system for indexing with terms (descriptors) from a prescribed vocabulary. For this task, an indexing dictionary with rules for mapping terms from the text onto descriptors is required, which can be derived automatically from a set of manually indexed documents. Based on the Darmstadt Indexing Approach, the indexing task is divided into a description step and a decision step. First, terms (single words or phrases) are identified in the document text. With term-descriptor rules from the dictionary, descriptor indications are formed. The set of all indications from a document leading to the same descriptor is called a relevance description. A probabilistic classification procedure computes indexing weights for each relevance description. Since the whole system is rule-based, it can be adapted to different subject fields by appropriate modifications of the rule bases. A major application of AIR/X is the AIR/PHYS system developed for a large physics database. This application is described in more detail along with experimental results.
Translingual Information Retrieval: Learning from Bilingual Corpora
- Artificial Intelligence
, 1997
"... Translingual information retrieval (TLIR) consists of providing a query in one language and searching document collections in one or more different languages. This paper introduces new TLIR methods and reports on comparative TLIR experiments with these new methods and with previously reported ones i ..."
Abstract
-
Cited by 34 (5 self)
- Add to MetaCart
Translingual information retrieval (TLIR) consists of providing a query in one language and searching document collections in one or more different languages. This paper introduces new TLIR methods and reports on comparative TLIR experiments with these new methods and with previously reported ones in a realistic setting. Methods fall into two categories: query translation and statistical-IR approaches establishing translingual associations. The results show that using bilingual corpora for automated extraction of term equivalences in context outperforms dictionary-based methods. Translingual versions of the Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) also perform well, as does translingual pseudo relevance feedback (PRF) and Example-Based Term-in-context Translation (EBT). All showed relatively small performance loss between monolingual and translingual versions, ranging between 87% to 101% of monolingual IR performance. Query translation based on a general...
Solving The Word Mismatch Problem Through Automatic Text Analysis
, 1997
"... Information Retrieval (IR) is concerned with locating documents that are relevant for a user's information need or query from a large collection of documents. A fundamental problem for information retrieval is word mismatch. A query is usually a short and incomplete description of the underlying inf ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
Information Retrieval (IR) is concerned with locating documents that are relevant for a user's information need or query from a large collection of documents. A fundamental problem for information retrieval is word mismatch. A query is usually a short and incomplete description of the underlying information need. The users of IR systems and the authors of the documents often use different words to refer to the same concepts. This thesis addresses the word mismatch problem through automatic text analysis. We investigate two text analysis techniques, corpus analysis and local context analysis, and apply them in two domains of word mismatch, stemming and general query expansion. Experimental results show that these techniques ca...
Efficient Semantic-Based Content Search in P2P Network
- IEEE Transactions on Knowledge and Data Engineering
, 2004
"... Abstract — Most existing Peer-to-Peer (P2P) systems support only title-based searches and are limited in functionality when compared to today’s search engines. In this paper, we present the design of a distributed P2P information sharing system that supports semantic-based content searches of releva ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
Abstract — Most existing Peer-to-Peer (P2P) systems support only title-based searches and are limited in functionality when compared to today’s search engines. In this paper, we present the design of a distributed P2P information sharing system that supports semantic-based content searches of relevant documents. First, we propose a general and extensible framework for searching similar documents in P2P network. The framework is based on the novel concept of Hierarchical Summary Structure. Second, based on the framework, we develop our efficient document searching system, by effectively summarizing and maintaining all documents within the network with different granularity. Finally, an experimental study is conducted on a real P2P prototype, and a large-scale network is further simulated. The results show the effectiveness, efficiency and scalability of the proposed system. Index Terms — content-based, similarity search, peer-to-peer, hierarchical summary, indexing
Order-Theoretical Ranking
- JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCES (JASIS
, 2000
"... Current best-match ranking (BMR) systems perform well but cannot handle word mismatch between a query and a document. The best known alternative ranking method, hierarchical clustering-based ranking (HCR), seems to be more robust than BMR with respect to this problem, but it is hampered by theoretic ..."
Abstract
-
Cited by 15 (3 self)
- Add to MetaCart
Current best-match ranking (BMR) systems perform well but cannot handle word mismatch between a query and a document. The best known alternative ranking method, hierarchical clustering-based ranking (HCR), seems to be more robust than BMR with respect to this problem, but it is hampered by theoretical and practical limitations. We present an approach to document ranking that explicitly addresses the word mismatch problem by exploiting interdocument similarity information in a novel way. Document ranking is seen as a querydocument transformation driven by a conceptual representation of the whole document collection, into which the query is merged. Our approach is based on the theory of concept (or Galois) lattices, which, we argue, provides a powerful, well-founded, and computationallytractable framework to model the space in which documents and query are represented and to compute such a transformation. We compared information retrieval using concept lattice-based ranking (CLR) to BMR and HCR. The results showed that HCR was outperformed by CLR as well as by BMR, and suggested that, of the two best methods, BMR achieved better performance than CLR on the whole document set while CLR compared more favorably when only the first retrieved documents were used for evaluation. We also evaluated the three methods' specific ability to rank documents that did not match the query, in which case the superiority of CLR over BMR and HCR (and that of HCR over BMR) was apparent.
Set-based vector model: An efficient approach for correlation-based ranking
- ACM Transactions on Information Systems
, 2005
"... This work presents a new approach for ranking documents in the vector space model. The novelty lies in two fronts. First, patterns of term co-occurrence are taken into account and are processed efficiently. Second, term weights are generated using a data mining technique called association rules. Th ..."
Abstract
-
Cited by 10 (2 self)
- Add to MetaCart
This work presents a new approach for ranking documents in the vector space model. The novelty lies in two fronts. First, patterns of term co-occurrence are taken into account and are processed efficiently. Second, term weights are generated using a data mining technique called association rules. This leads to a new ranking mechanism called the set-based vector model. The components of our model are no longer index terms but index termsets, where a termset is a set of index terms. Termsets capture the intuition that semantically related terms appear close to each other in a document. They can be efficiently obtained by limiting the computation to small passages of text. Once termsets have been computed, the ranking is calculated as a function of the termset frequency in the document and its scarcity in the document collection. Experimental results show that the set-based vector model improves average precision for all collections and query types evaluated, while keeping computational costs small. For the 2 gigabyte TREC-8 collection, the set-based vector model leads to a gain in average precision figures of 14.7 % and 16.4 % for disjunctive and conjunctive queries, respectively, with respect to the standard vector space model. These gains increase to 24.9 % and 30.0%, respectively, when proximity information is taken into account. Query processing times are larger but, on average, still comparable to those obtained
A Graph Method for Keyword-based Selection of the top-K Databases
"... While database management systems offer a comprehensive solution to data storage, they require deep knowledge of the schema, as well as the data manipulation language, in order to perform effective retrieval. Since these requirements pose a problem to lay or occasional users, several methods incorpo ..."
Abstract
-
Cited by 7 (2 self)
- Add to MetaCart
While database management systems offer a comprehensive solution to data storage, they require deep knowledge of the schema, as well as the data manipulation language, in order to perform effective retrieval. Since these requirements pose a problem to lay or occasional users, several methods incorporate keyword search (KS) into relational databases. However, most of the existing techniques focus on querying a single DBMS. On the other hand, the proliferation of distributed databases in several conventional and emerging applications necessitates the support for keyword-based data sharing and querying over multiple DMBSs. In order to avoid the high cost of searching in numerous, potentially irrelevant, databases in such systems, we propose G-KS, a novel method for selecting the top-K candidates based on their potential to contain results for a given query. G-KS summarizes each database by a keyword relationship graph, where nodes represent terms and edges describe relationships between them. Keyword relationship graphs are utilized for computing the similarity between each database and a KS query, so that, during query processing, only the most promising databases are searched. An extensive experimental evaluation demonstrates that G-KS outperforms the current state-of-the-art technique on all aspects, including precision, recall, efficiency, space overhead and flexibility of accommodating different semantics. 1.
Combination of semantic and phonetic term similarity for spoken document retrieval and spoken query processing
- In Proceedings of the 8th Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems (IPMU
, 2000
"... Abstract In classical Information Retrieval systems a relevant document will not be retrieved in response to a query if the document and query representations do not share at least one term. This problem is known as “term mismatch”. A similar problem can be found in spoken document retrieval and spo ..."
Abstract
-
Cited by 5 (4 self)
- Add to MetaCart
Abstract In classical Information Retrieval systems a relevant document will not be retrieved in response to a query if the document and query representations do not share at least one term. This problem is known as “term mismatch”. A similar problem can be found in spoken document retrieval and spoken query processing, where terms misrecognized by the speech recognition process can hinder the retrieval of potentially relevant documents. I will call this problem “term misrecognition”, by analogy to the term mismatch problem. This paper presents two classes of retrieval models that attempt to tackle both the term mismatch and the term misrecognition problems at retrieval time using term similarity information. The models make effective use of complete or partial knowledge of semantic and phonetic term similarity evaluated using statistical methods for the corpus. 1

