Mining the Biomedical Literature in the Genomic Era: An Overview
- Journal of Computational Biology, 2003
Cited by 72 (2 self)
The past decade has seen a tremendous growth in the amount of experimental and computational biomedical data, specifically in the areas of Genomics and Proteomics. This growth is accompanied by an accelerated increase in the number of biomedical publications discussing the findings. In the last few years there has been a lot of interest within the scientific community in literature-mining tools that help sort through this abundance of literature and find the nuggets of information most relevant and useful for specific analysis tasks. This paper ...
The SOMLib Digital Library System
- In Proc. Europ. Conf. on Research and Advanced Technology for Digital Libraries (ECDL99), 1999
Cited by 35 (16 self)
Digital Libraries have gained tremendous interest with several research projects addressing the wealth of challenges in this field. While computational intelligence systems are being used for specific tasks in this arena, the majority of projects rely on conventional techniques for the basic structure of the library itself. With the SOMLib project we created a digital library system that uses a neural network-based core for the representation of the library. The self-organizing map, a popular unsupervised neural network model, is used to topically structure a document collection similar to the organization of real-world libraries. Based on this core, additional modules provide information retrieval features, integrate distributed libraries, and automatically label the various topical sections in the document collection. A metaphor graphics based interface further assists the user in intuitively understanding the library, providing an instant overview. Keywords: Self-Organizing Map ...
Restructuring Sparse High Dimensional Data for Effective Retrieval
- In Advances in Neural Information Processing Systems, 1998
Cited by 21 (0 self)
The task in text retrieval is to find the subset of a collection of documents relevant to a user's information request, usually expressed as a set of words. Classically, documents and queries are represented as vectors of word counts. In its simplest form, relevance is defined to be the dot product between a document and a query vector--a measure of the number of common terms. A central difficulty in text retrieval is that the presence or absence of a word is not sufficient to determine relevance to a query. Linear dimensionality reduction has been proposed as a technique for extracting underlying structure from the document collection. In some domains (such as vision) dimensionality reduction reduces computational complexity. In text retrieval it is more often used to improve retrieval performance. We propose an alternative and novel technique that produces sparse representations constructed from sets of highly-related words. Documents and queries are represented by their distance to these sets, and relevance is measured by the number of common clusters. This technique significantly improves retrieval performance, is efficient to compute and shares properties with the optimal linear projection operator and the independent components of documents.
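The classical baseline described in this abstract (word-count vectors scored by a dot product) can be illustrated with a minimal sketch. This is not the paper's proposed sparse-cluster technique; the whitespace tokenizer, the function names `count_vector` and `dot_relevance`, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def count_vector(text):
    """Bag-of-words vector: lowercase whitespace tokens mapped to counts."""
    return Counter(text.lower().split())


def dot_relevance(doc_vec, query_vec):
    """Dot product of two sparse count vectors (missing terms count as 0)."""
    return sum(count * doc_vec[term] for term, count in query_vec.items())


# Toy collection (hypothetical data, for illustration only).
docs = [
    "neural networks for text retrieval",
    "sparse representations of text documents",
    "protein folding simulation",
]
query = count_vector("text retrieval")
scores = [dot_relevance(count_vector(d), query) for d in docs]
print(scores)
```

With binary counts, as here, the score is simply the number of query terms a document shares, which is why the abstract notes that word presence alone is a weak signal of relevance.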
Model-based overlapping clustering
- In KDD, 2005
Cited by 18 (4 self)
While the vast majority of clustering algorithms are partitional, many real world datasets have inherently overlapping clusters. Several approaches to finding overlapping clusters have come from work on analysis of biological datasets. In this paper, we interpret an overlapping clustering model proposed by Segal et al. [23] as a generalization of Gaussian mixture models, and we extend it to an overlapping clustering model based on mixtures of any regular exponential family distribution and the corresponding Bregman divergence. We provide the necessary algorithm modifications for this extension, and present results on synthetic data as well as subsets of 20-Newsgroups and EachMovie datasets.
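As background for the Bregman divergences mentioned in this abstract, the following sketch evaluates the general definition d_phi(x, y) = phi(x) - phi(y) - <grad phi(y), x - y> for two standard choices of phi. It is not the paper's clustering algorithm; the helper names and the example vectors are assumptions for illustration. The squared norm yields squared Euclidean distance (the divergence paired with Gaussians), and negative entropy yields the KL divergence when both vectors sum to one.

```python
import math


def bregman(phi, grad_phi, x, y):
    """Bregman divergence d_phi(x, y) for a convex function phi."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_phi(y), x, y))
    return phi(x) - phi(y) - inner


def sq_norm(v):
    return sum(t * t for t in v)


def sq_norm_grad(v):
    return [2 * t for t in v]


def neg_entropy(v):
    # Requires strictly positive entries.
    return sum(t * math.log(t) for t in v)


def neg_entropy_grad(v):
    return [math.log(t) + 1 for t in v]


# Two hypothetical probability vectors.
x = [0.5, 0.5]
y = [0.9, 0.1]
d_euclid = bregman(sq_norm, sq_norm_grad, x, y)      # squared Euclidean distance
d_kl = bregman(neg_entropy, neg_entropy_grad, x, y)  # KL(x || y) for normalized x, y
print(d_euclid, d_kl)
```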
Exploiting Thesaurus Knowledge in Rule Induction for Text Classification
- In: RANLP’97 - Recent Advances in NLP, 1997
Cited by 13 (3 self)
Systems for learning text classifiers have recently gained considerable interest. One technique to implement such systems is rule induction. While most other approaches rely on a relatively simple document representation and do not make use of any background knowledge, rule induction algorithms offer good potential for improvements in both of these areas. In this paper, we show how an operator-based view of rule induction enables the easy integration of a thesaurus as background knowledge. Results with an algorithm extended by thesaurus knowledge are presented and interpreted. The interpretation shows the strengths and weaknesses of using thesaurus knowledge and gives hints for future research.
1 Introduction
Text classification deals with the task of assigning a label out of a set of predefined classes to a given text document. Example applications include classifying technical reports according to their subject research area for archiving, or analyzing incoming newswire articles wrt. ...
Applications of Machine Learning in Information Retrieval
- 1997
Cited by 8 (0 self)
Information retrieval systems provide access to collections of thousands, or millions, of documents, from which, by providing an appropriate description, users can recover any one. Typically, users iteratively refine the descriptions they provide to satisfy their needs, and retrieval systems can utilize user feedback on selected documents to indicate the accuracy of
Finding Themes in Medline Documents - Probabilistic Similarity Search
- 2000
Cited by 8 (4 self)
Large on-line document databases, such as Medline, pose a major challenge of retrieving the few documents most relevant to the user’s needs, while minimizing the return rate of nonrelevant documents. Retrieval of documents similar to a user-provided example document is a promising query paradigm towards meeting this goal. We present a new theme-based probabilistic approach for finding documents relevant to a given query document, and summarizing their contents. Preliminary experiments conducted over a subset of Medline documents related to AIDS demonstrate the effectiveness of our approach.
Automatic Labeling of Document Clusters
- 2000
Cited by 5 (0 self)
Automatically labeling document clusters with words which indicate their topics is difficult to do well. The most commonly used method, labeling with the most frequent words in the clusters, ends up using many words that are virtually void of descriptive power even after traditional stop words are removed. Another method, labeling with the most predictive words, often includes rather obscure words. We present two methods of labeling document clusters motivated by the model that words are generated by a hierarchy of mixture components of varying generality. The first method assumes existence of a document hierarchy (manually constructed or resulting from a hierarchical clustering algorithm) and uses a χ² test of significance to detect different word usage across categories in the hierarchy. The second method selects words which both occur frequently in a cluster and effectively discriminate the given cluster from the other clusters. We compare these methods on abstracts of documents sel...
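To make the significance-testing idea in this abstract concrete, here is a minimal sketch of a chi-squared statistic for whether a word's usage differs between one cluster and the rest of the collection. It simplifies the paper's hierarchical setting to a single 2x2 contingency table; the function name and the document counts are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 table.

    a: documents in the cluster containing the word
    b: documents in the cluster without the word
    c: documents outside the cluster containing the word
    d: documents outside the cluster without the word
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a degenerate table carries no evidence either way
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts: the word appears in 30 of 40 cluster documents
# but in only 5 of 60 documents elsewhere.
skewed = chi_square_2x2(30, 10, 5, 55)

# Same usage rate (25%) inside and outside the cluster.
uniform = chi_square_2x2(10, 30, 15, 45)

CRITICAL_05 = 3.841  # chi-squared critical value, 1 degree of freedom, p = 0.05
print(skewed > CRITICAL_05, uniform > CRITICAL_05)
```

A word whose statistic exceeds the critical value is used at a significantly different rate in the cluster, which makes it a candidate label; a word with uniform usage is better attributed to a more general level.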
A Thematic Analysis Of The Aids Literature
- PSB, 2002
Cited by 3 (0 self)
ive the results of applying the method to a database of over fifty thousand PubMed documents dealing with the subject of AIDS. How themes may improve access to a document collection is also discussed.
1 Introduction
There are at least two reasons for interest in clustering a set of documents. One is to improve retrieval efficiency and the other is to improve human understanding of the data in the collection. The first of these goals proved elusive historically because the quality of the retrieval degraded due to the clustering [1, 2]. With the much greater speed and memory of current computers the interest in clustering for efficiency has waned. However, the need for improved human understanding of large data sets has reached critical proportions with the advent of the Internet as well as the many large databases of documents that are now becoming available in different specialty areas. Improved human understanding of data through clustering may consist of graphical aids in visualiz
Topic Extraction from Text Documents Using Multiple-Cause Networks
Cited by 2 (0 self)
Abstract. This paper presents an approach to topic extraction from text documents using probabilistic graphical models. Multiple-cause networks with latent variables are used, and Helmholtz machines are utilized to ease the learning and inference. The learning in this model is conducted in a purely data-driven way and does not require prespecified categories of the given documents. Topic word extraction experiments on the TDT-2 collection are presented. In particular, document clustering results on a subset of TREC-8 ad-hoc task data show a substantial reduction of the inference time without significant deterioration of performance.

