Information extraction: Identifying protein names from biological papers
- In Proceedings of the Pacific Symposium on Biocomputing '98 (PSB'98)
, 1998
Cited by 195 (5 self)
To solve the mystery of the life phenomenon, we must clarify when genes are expressed and how their products interact with each other. But since the amount of continuously updated knowledge on these interactions is massive and is only available in the form of published articles, an intelligent information extraction (IE) system is needed. To extract this information directly from articles, the system must first identify the material names. However, medical and biological documents often include proper nouns newly coined by the authors, and conventional methods based on domain-specific dictionaries cannot detect such unknown words or coinages. In this study, we propose a new method of extracting material names, PROPER, using surface clues on character strings. It extracts material names in the sentence with 94.70% precision and 98.84% recall, regardless of whether they are already known or newly defined.
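As a reading aid for the precision and recall figures quoted above, here is a minimal sketch of how the two measures are conventionally computed for a name extractor. The function name and the toy name sets are hypothetical and are not taken from the PROPER paper or its evaluation data.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for extracted names against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    # Precision: fraction of extracted names that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: fraction of gold-standard names that were found.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: three of the four extracted strings are real names.
gold_names = {"p53", "Ras", "Src", "NF-kappaB"}
extracted_names = {"p53", "Ras", "Src", "kinase"}

p, r = precision_recall(extracted_names, gold_names)
print(f"precision={p:.2f} recall={r:.2f}")
```

High recall with slightly lower precision, as reported in the abstract, would mean that almost every true name is found while a small share of the extracted strings are not names.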
Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages
- Data & Knowledge Engineering
, 1999
Cited by 101 (43 self)
Electronically available data on the Web is exploding at an ever-increasing pace. Much of this data is unstructured, which makes searching hard and traditional database querying impossible. Many Web documents, however, contain an abundance of recognizable constants that together describe the essence of a document's content. For these kinds of data-rich, multiple-record documents (e.g. advertisements, movie reviews, weather reports, travel information, sports summaries, financial statements, obituaries, and many others) we can apply a conceptual-modeling approach to extract and structure data automatically. The approach is based on an ontology (a conceptual model instance) that describes the data of interest, including relationships, lexical appearance, and context keywords. By parsing the ontology, we can automatically produce a database scheme and recognizers for constants and keywords, and then invoke routines to recognize and extract data from unstructured documents an...
Inferring descriptions and similarity for music from community metadata
- In Proceedings of the 2002 International Computer Music Conference
, 2002
Cited by 71 (4 self)
We propose methods for unsupervised learning of text profiles for music from unstructured text obtained from the web. The profiles can be used for classification, recommendation, and understanding, and may be used in conjunction with existing methods such as audio analysis and collaborative filtering to improve performance. A formal method for analyzing the quality of the learned profiles is given, and results indicate that they perform well when used to find similar artists.
Automatic Discovery of Non-Compositional Compounds in Parallel Data
, 1997
Cited by 58 (1 self)
Automatic segmentation of text into minimal content-bearing units is an unsolved problem even for languages like English. Spaces between words offer an easy first approximation, but this approximation is not good enough for machine translation (MT), where many word sequences are not translated word-for-word. This paper presents an efficient automatic method for discovering sequences of words that are translated as a unit. The method proceeds by comparing pairs of statistical translation models induced from parallel texts in two languages. It can discover hundreds of non-compositional compounds on each iteration, and constructs longer compounds out of shorter ones. Objective evaluation on a simple machine translation task has shown the method's potential to improve the quality of MT output. The method makes few assumptions about the data, so it can be applied to parallel data other than parallel texts, such as word spellings and pronunciations.
A Risk Minimization Framework for Information Retrieval
- In Proceedings of the ACM SIGIR 2003 Workshop on Mathematical/Formal Methods in IR. ACM
, 2003
Cited by 36 (1 self)
This paper presents a novel probabilistic information retrieval framework in which the retrieval problem is formally treated as a statistical decision problem. In this framework, queries and documents are modeled using statistical language models (i.e., probabilistic models of text), user preferences are modeled through loss functions, and retrieval is cast as a risk minimization problem. We discuss how this framework can unify existing retrieval models and accommodate the systematic development of new retrieval models. As an example of using the framework to model non-traditional retrieval problems, we derive new retrieval models for subtopic retrieval, which is concerned with retrieving documents to cover many different subtopics of a general query topic. These new models differ from traditional retrieval models in that they go beyond independent topical relevance.
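As a hedged illustration of the decision-theoretic framing described above, the generic Bayes-risk rule selects the action with the smallest expected loss. The notation is assumed here for illustration and is not taken from the paper: a is a candidate retrieval action (for example, a ranking), θ stands for the unknown query and document language-model parameters, L is the loss function encoding user preferences, and "obs" is the observed query and document collection.

```latex
a^{*} = \arg\min_{a} \int_{\Theta} L(a, \theta)\, p(\theta \mid \text{obs})\, d\theta
```

Different choices of L then yield different retrieval models. A loss that depends on previously selected documents is one way to go beyond independent topical relevance.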
Fast Statistical Parsing of Noun Phrases for Document Indexing
, 1997
Cited by 31 (7 self)
Information Retrieval (IR) is an important application area of Natural Language Processing (NLP) where one encounters the genuine challenge of processing large quantities of unrestricted natural language text. While much effort has been made to apply NLP techniques to IR, very few NLP techniques have been evaluated on a document collection larger than several megabytes. Many NLP techniques are simply not efficient enough, and not robust enough, to handle a large amount of text. This paper proposes a new probabilistic model for noun phrase parsing, and reports on the application of such a parsing technique to enhance document indexing. The effectiveness of using syntactic phrases provided by the parser to supplement single words for indexing is evaluated with a 250-megabyte document collection. The experiment's results show that supplementing single words with syntactic phrases for indexing consistently and significantly improves retrieval performance.
Term Extraction and Automatic Indexing
, 2003
Cited by 22 (0 self)
This chapter presents a new domain of research and development in Natural Language Processing (NLP) that is concerned with the representation, acquisition, and recognition of terms. Terms are pervasive in scientific and technical documents; their identification is a crucial issue for any application dealing with the analysis, understanding, generation, or translation of such documents. In particular, the ever-growing mass of specialized documentation available on-line, in industrial and governmental archives or in digital libraries, calls for advances in terminology processing for such purposes as information retrieval, cross-language querying, indexing of multimedia documents, translation aids, document routing and summarization, etc. This chapter introduces the basic linguistic characteristics of terms. It presents the main methods in NLP for recognizing or discovering terms and their interrelationships in large corpora. It is divided into three sections: an introduction to the bas...
A Layered Approach to NLP-Based Information Retrieval
, 1998
Cited by 17 (0 self)
A layered approach to information retrieval permits the inclusion of multiple search engines as well as multiple databases, with a natural language layer to convert English queries for use by the various search engines. The NLP layer incorporates morphological analysis, noun phrase syntax, and semantic expansion based on WordNet.
1 Introduction. This paper describes a layered approach to information retrieval, and the natural language component that is a major element in that approach. The layered approach, packaged as Intermezzo™, was deployed in a pre-product form at a government site. The NLP component has been installed, with a proprietary IR engine, PhotoFile (Flank, Martin, Balogh and Rothey, 1995; Flank, Garfield, and Norkin, 1995), at several commercial sites, including Picture Network International (PNI), Simon and Schuster, and John Deere. Intermezzo employs an abstraction layer to permit simultaneous querying of multiple databases. A user enters a query into a clien...

