Berkeley's TREC-8 Interactive Track Entry: Cheshire and Zprise
- In
, 2000
Cited by 4 (1 self)
This paper briefly discusses the UC Berkeley entry in the TREC8 Interactive Track. In this year’s study twelve searchers conducted six searches each, half on the Cheshire II system and the other half on the Zprise system, for a total of 72 searches. Questionnaires were administered to each participant to gather information about basic demographic and searching experience, about each search, about each of the systems, and finally, about the user’s perceptions of the systems. In this paper I will briefly describe the systems used in the study and how they differ in design goals and implementation. The results of the interactive track evaluations and the information derived from the questionnaires are then discussed and future improvements to the Cheshire II system are considered.
Automatic Indexing in Operation: The Rule-Based System AIR/X for Large Subject Fields
, 1993
Cited by 3 (0 self)
AIR/X is a rule-based system for automatic indexing with a controlled vocabulary. The indexing process consists of several stages, with specific rule bases involved in each stage. Most of these rule bases are constructed automatically, especially the large number of term-descriptor rules. We describe the different stages and the overall architecture of the system. Then we present a specific application, the AIR/PHYS system developed for a large physics database. We illustrate the system by giving a detailed example and present experimental results for different system parameter settings.
1 Introduction
The AIR/X system described in this paper performs an automatic indexing with index terms (called descriptors here) from a controlled vocabulary. The texts to be indexed are abstracts written in English. The indexing process consists of several stages, with specific rule bases involved in each stage. In order to cope with large subject fields, appropriate rule bases have to be developed....
EVA: Extraction, Visualization and Analysis of the Telecommunications
Cited by 3 (1 self)
We present EVA, a prototype system for extracting, visualizing, and analyzing corporate ownership information as a social network. Using probabilistic information retrieval and extraction techniques, we automatically extract ownership relationships from heterogeneous sources of online text, including corporate annual reports (10-Ks) filed with the U.S. Securities and Exchange Commission (SEC). A browser-based visualization interface allows users to query the relationship database and explore large networks of companies. Applying the system and methodology to the telecommunications and media industries, we construct an ownership network with 6,726 relationships among 8,343 companies. Analysis reveals a highly clustered network, with over 50% of all companies connected to one another in a single component. Furthermore, ownership activity is highly skewed: 90% of companies are involved in no more than one relationship, but the top ten companies are parents for over 24% of all relationships. We are also able to identify the most influential companies in the network using social network analysis metrics such as degree, betweenness, cutpoints, and cliques. We believe this methodology and tool can aid government regulators, policy researchers, and the general public to interpret complex corporate ownership structures, thereby bringing greater transparency to the public disclosure of corporate inter-relationships.
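Two of the statistics quoted in the abstract above (the share of companies in the largest connected component, and how many relationships each parent company accounts for) can be illustrated with a minimal sketch. The edge list, company names, and function names here are hypothetical and are not taken from EVA; the code only assumes that ownership is stored as (parent, child) pairs.

```python
from collections import Counter, defaultdict, deque

def parent_counts(edges):
    """Count how many ownership relationships each company is the parent of."""
    return Counter(parent for parent, _child in edges)

def largest_component(edges):
    """Return the largest connected component, ignoring edge direction."""
    adjacency = defaultdict(set)
    for parent, child in edges:
        adjacency[parent].add(child)
        adjacency[child].add(parent)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search from an unvisited company.
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best

# Hypothetical toy data: (parent, child) ownership pairs.
edges = [("A", "B"), ("A", "C"), ("D", "E")]
companies = {name for pair in edges for name in pair}
share_in_largest = len(largest_component(edges)) / len(companies)
top_parents = parent_counts(edges).most_common(1)
```

On real data the same two functions would give the component share and the parent concentration figures; betweenness, cutpoints, and cliques need more machinery than this sketch covers.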
Linked Relevance Feedback for the ImageCLEF Photo Task
- In Working
Cited by 2 (1 self)
In this paper we will describe Berkeley’s approach to the ImageCLEFphoto task for CLEF 2007. Once again (as in ImageCLEFphoto for CLEF 2006) we used entirely text-based methods for retrieval. For some runs this year, however, we exploited the basic similarity of the topics and database from 2006 to acquire the metadata descriptions of the “example images” in the 2007 queries, and used that metadata to expand the query content for each topic. The results speak for themselves: use of what amounts to relevance feedback based on image metadata is much more effective than use of unexpanded queries, and even provides a method of cross-language retrieval for unknown languages when parallel topics and example images can be established. We submitted 19 runs for ImageCLEFphoto this year, of which 8 were monolingual English, German and Spanish, and the remaining 11 were bilingual from various languages to English, German and Spanish.
hsql database engine home page. http://hsqldb.sourceforge.net
- In: Working Notes of the 6 th Workshop of the Cross-Language Evaluation Forum, CLEF. Sep. 2005
, 2001
Cited by 2 (0 self)
In this paper I will describe the Berkeley (group 1) approach to the GeoCLEF task for CLEF 2005. The main technique we are testing is the fusion of multiple probabilistic searches against different XML components using both Logistic Regression (LR) algorithms and a version of the Okapi BM-25 algorithm. We also combine multiple translations of queries in cross-language searching. Since this is the first time that the Cheshire system has been used for CLEF this approach can, at best, be considered a very preliminary base testing of some retrieval algorithms and approaches. The primary geographically based approaches taken for GeoCLEF were to georeference proper nouns in the text using a gazetteer derived from the World Gazetteer with both English and German names for each place, and to expand place names for regions or countries in the queries by the names of the countries or cities in those regions or countries.
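The abstract above does not say how the Logistic Regression and BM-25 result lists were combined, so the following is only a sketch of one common way to fuse two ranked runs: min-max normalise each run's scores, then take a weighted sum. The function names, the `weight_a` parameter, and the toy scores are all assumptions made for illustration, not the Cheshire implementation.

```python
def min_max(scores):
    """Rescale a {doc_id: score} run to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def fuse(run_a, run_b, weight_a=0.5):
    """Weighted sum of two normalised runs; returns doc ids, best first."""
    a, b = min_max(run_a), min_max(run_b)
    fused = {
        doc: weight_a * a.get(doc, 0.0) + (1 - weight_a) * b.get(doc, 0.0)
        for doc in set(a) | set(b)
    }
    # Sort by descending fused score, breaking ties by doc id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))

# Hypothetical scores from two searches over the same topic.
run_lr = {"d1": 0.9, "d2": 0.4, "d3": 0.1}    # probability-like scores
run_bm25 = {"d1": 2.0, "d2": 8.0, "d4": 5.0}  # unbounded scores
ranking = fuse(run_lr, run_bm25)
```

Normalising first matters because logistic regression scores lie in [0, 1] while BM25 scores are unbounded; a document missing from one run simply contributes zero from that run.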
Score Distributions in Information Retrieval
Cited by 2 (2 self)
We review the history of modeling score distributions, focusing on the mixture of normal-exponential by investigating the theoretical as well as the empirical evidence supporting its use. We discuss previously suggested conditions which valid binary mixture models should satisfy, such as the Recall-Fallout Convexity Hypothesis, and formulate two new hypotheses considering the component distributions under some limiting conditions of parameter values. From all the mixtures suggested in the past, the current theoretical argument points to the two gamma as the most-likely universal model, with the normal-exponential being a usable approximation. Beyond the theoretical contribution, we provide new experimental evidence showing vector space or geometric models, and BM25, as being “friendly” to the normal-exponential, and that the non-convexity problem that the mixture possesses is practically not severe.
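As a rough illustration of the normal-exponential mixture named in the abstract above, the sketch below models scores of relevant documents with a normal density and scores of non-relevant documents with an exponential density, then applies Bayes' rule to get the probability of relevance at a given score. The parameter names and values are hypothetical, not fitted to any collection.

```python
import math

def normal_pdf(s, mu, sigma):
    """Density of a normal distribution with mean mu and std. dev. sigma."""
    z = (s - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def exponential_pdf(s, rate):
    """Density of an exponential distribution (zero for negative scores)."""
    return rate * math.exp(-rate * s) if s >= 0 else 0.0

def prob_relevant(s, mix, mu, sigma, rate):
    """P(relevant | score s), where mix is the prior fraction of relevant docs."""
    relevant = mix * normal_pdf(s, mu, sigma)
    non_relevant = (1 - mix) * exponential_pdf(s, rate)
    total = relevant + non_relevant
    return relevant / total if total > 0 else 0.0

# Hypothetical parameters: 10% relevant, relevant scores centred on 5.
params = dict(mix=0.1, mu=5.0, sigma=1.0, rate=1.0)
low, mid, high = (prob_relevant(s, **params) for s in (1.0, 5.0, 20.0))
```

With these toy parameters the probability of relevance rises from score 1 to score 5 but falls again by score 20, because the exponential tail decays more slowly than the normal tail. That non-monotonic behaviour is the kind of anomaly the convexity discussion is concerned with.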
Empirical Studies of Query/Document Characteristics as Evidence in Favor of Relevance
- Proceedings of ACM SIGIR
, 1998
Cited by 2 (0 self)
Query/document characteristics known to be useful for information retrieval are analyzed for a specific collection/query-set pair. These features are analyzed in terms of the weight of evidence in favor of relevance provided by values assumed by the feature variables. Weight of evidence, a measure of how much more likely a hypothesis is believed to hold after evidence is considered than before it is observed, is formally defined; and a technique for the analysis of weight of evidence as a function of features of interest is presented. The method is exemplified by showing how it has been applied to analyze evidence in the form of: the coordination level, and the inverse document frequencies and term frequencies for all of the query terms. The result of data analysis is a model of weight of evidence that can be used as the foundation of a retrieval ranking formula. Results of preliminary evaluation of the derived formula are presented and discussed.
1 Introduction
This paper presents ...
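The abstract above describes weight of evidence only in words. One standard formalisation, due to I. J. Good and assumed here rather than quoted from the paper, is the log of the ratio of posterior odds to prior odds. The symbols $H$ (the hypothesis, e.g. "the document is relevant"), $e$ (the observed evidence), and $O(\cdot)$ (odds) are notation chosen for this sketch.

```latex
% Odds of a hypothesis H
O(H) = \frac{P(H)}{1 - P(H)}

% Weight of evidence in favour of H provided by e:
% log of posterior odds over prior odds
\mathrm{woe}(H : e) = \log \frac{O(H \mid e)}{O(H)}

% By Bayes' rule the prior cancels, leaving a log-likelihood ratio
\mathrm{woe}(H : e) = \log \frac{P(e \mid H)}{P(e \mid \overline{H})}
```

Under this definition a positive value means the evidence makes relevance more likely than it was before, and independent pieces of evidence contribute additively, which is what makes it convenient as a basis for a ranking formula.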
PIRE: An extensible IR engine based on probabilistic
Cited by 2 (0 self)
This paper introduces PIRE, a probabilistic IR engine. For both document indexing and retrieval, PIRE makes heavy use of probabilistic Datalog, a probabilistic extension of predicate Horn logics. Using such a logical framework together with probability theory allows for defining and using data types (e.g. text, names, numbers), different weighting schemes (e.g. normalised tf, tf.idf or BM25) and retrieval functions (e.g. uncertain inference, language models). Extending the system is thus reduced to adding new rules. Furthermore, this logical framework provides a powerful tool for including additional background knowledge in the retrieval process.
Knowledgescapes: A Probabilistic Model for Mining Tacit Knowledge for Information Retrieval
, 2000
Cited by 1 (0 self)
Most existing information retrieval systems attempt to analyze the content and structural properties of documents, without explicitly considering the actual information needs of users. However, a vast amount of task-specific knowledge is implicitly encoded in the behavior of users accessing an online information collection. In this paper we present Knowledgescapes, a novel probabilistic framework for supporting general-purpose information retrieval by mining this tacit knowledge from web server access logs. We formulate a Bayesian probabilistic model for reasoning about the short-term information needs of users and use this model to support the dynamic reranking of query results based on the user's recent browsing history. We discuss our experiences with a realistic prototype search engine based on this model that we developed for users of the Berkeley Digital Library document collection. We analyze the capabilities and limitations of the Knowledgescapes model and identify several avenues for future research on the problem of mining implicit knowledge to support information retrieval applications.
Modeling and Predicting Term Mismatch for Full-Text Retrieval
, 2011
Cited by 1 (1 self)
The probability that a term appears in a relevant document is a fundamental quantity in the theory of probabilistic information retrieval; however, prior research provided few clues about how to estimate it reliably. Since this probability measures how likely it is that a term has to appear in a document in order for the document to be relevant, in this thesis it is called term necessity. Equivalently, it is the proportion of relevant documents that contain the term, and thus measures term recall, or the complement of term mismatch. This thesis uses exploratory data analysis to identify common reasons that user-specified query terms fail to match relevant documents, develops features correlated with each reason, and integrates them into a model that can be trained from data. The resulting term necessity predictions can be used as term weights in state-of-the-art retrieval models to improve retrieval accuracy substantially. Feature-based necessity prediction also supports diagnosis and improvement of query components. The thesis research will develop several forms of diagnosis and intervention. The simplest form is interactive feedback in which potential problems with query components are identified for a person to fix. More nuanced approaches to automatic formulations
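The abstract above defines term necessity as the proportion of relevant documents that contain a term, with term mismatch as its complement. When relevance judgments are available this can be computed directly, as in the minimal sketch below. The function name and the toy documents are hypothetical; the thesis itself is about predicting this quantity when judgments are not available, which this sketch does not attempt.

```python
from typing import Iterable, Set

def term_recall(term: str, relevant_docs: Iterable[Set[str]]) -> float:
    """Fraction of relevant documents whose term set contains `term`."""
    docs = list(relevant_docs)
    if not docs:
        raise ValueError("need at least one relevant document")
    return sum(term in doc for doc in docs) / len(docs)

# Hypothetical relevant documents for one query, each as a set of terms.
relevant = [
    {"probabilistic", "retrieval", "model"},
    {"retrieval", "term", "weight"},
    {"query", "term", "mismatch"},
    {"retrieval", "query"},
]
recall = term_recall("retrieval", relevant)  # term occurs in 3 of 4 documents
mismatch = 1.0 - recall                      # complement of term recall
```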

