Results 1 - 10 of 25
Conceptual language models for domain-specific retrieval.
- Information Processing and Management, 2010
Cited by 9 (3 self)
Over the years, various meta-languages have been used to manually enrich documents with conceptual knowledge of some kind. Examples include keyword assignment to citations or, more recently, tags to websites. In this paper we propose generative concept models as an extension to query modeling within the language modeling framework, which leverages these conceptual annotations to improve retrieval. By means of relevance feedback the original query is translated into a conceptual representation, which is subsequently used to update the query model. Extensive experimental work on five test collections in two domains shows that our approach gives significant improvements in terms of recall, initial precision and mean average precision with respect to a baseline without relevance feedback. On one test collection, it is also able to outperform a text-based pseudo-relevance feedback approach based on relevance models. On the other test collections it performs similarly to relevance models. Overall, conceptual language models have the added advantage of offering query and browsing suggestions in the form of conceptual annotations. In addition, the internal structure of the meta-language can be exploited to add related terms. Our contributions are threefold. First, an extensive study is conducted on how to effectively translate a textual query into a conceptual representation. Second, we propose a method for updating a textual query model using the concepts in the conceptual representation. Finally, we provide an extensive analysis of when and how this conceptual feedback improves retrieval.
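The query-model update described above can be sketched as a simple linear interpolation between the original query's term distribution and a term distribution derived from the conceptual annotations. The mixing weight `lam` and the toy distributions below are illustrative assumptions, not the paper's estimated parameters.

```python
# Minimal sketch of updating a query model with a concept-derived
# term distribution: P'(t|q) = (1 - lam) * P(t|q) + lam * P(t|concepts).
# `lam` and both toy distributions are illustrative assumptions.

def update_query_model(p_query, p_concepts, lam=0.5):
    """Interpolate the original query model with the concept model."""
    terms = set(p_query) | set(p_concepts)
    return {t: (1 - lam) * p_query.get(t, 0.0) + lam * p_concepts.get(t, 0.0)
            for t in terms}

p_query = {"gene": 0.6, "expression": 0.4}        # original query model
p_concepts = {"gene": 0.3, "regulation": 0.7}     # from conceptual annotations
updated = update_query_model(p_query, p_concepts, lam=0.5)
```

Because both inputs are probability distributions and the weights sum to one, the updated model remains a valid distribution over the union of their vocabularies.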
MeSH: a window into full text for document summarization
- Bioinformatics, 2011
Cited by 6 (3 self)
Motivation: Previous research in the biomedical text-mining domain has historically been limited to titles, abstracts and metadata available in MEDLINE records. Recent research initiatives such as TREC Genomics and BioCreAtIvE strongly point to the merits of moving beyond abstracts and into the realm of full texts. Full texts are, however, more expensive to process not only in terms of resources needed but also in terms of accuracy. Since full texts contain embellishments that elaborate, contextualize, contrast, supplement, etc., there is greater risk for false positives. Motivated by this, we explore an approach that offers a compromise between the extremes of abstracts and full texts. Specifically, we create reduced versions of full text documents that contain only important portions. In the long-term, our goal is to explore the use of such summaries for functions such as document retrieval and information extraction. Here, we focus on designing summarization strategies. In particular, we explore the use of MeSH terms, manually assigned to documents by trained annotators, as clues to select important text segments from the full text documents. Results: Our experiments confirm the ability of our approach to pick the important text portions. Using the ROUGE measures for evaluation, we were able to achieve maximum ROUGE-1, ROUGE-2 and ROUGE-SU4 F-scores of 0.4150, 0.1435 and 0.1782, respectively, for our MeSH term-based method versus the maximum baseline scores of 0.3815, 0.1353 and 0.1428, respectively. Using a MeSH profile-based strategy, we were able to achieve maximum ROUGE F-scores of 0.4320, 0.1497 and 0.1887, respectively. Human evaluation of the baselines and our proposed strategies further corroborates the ability of our method to select important sentences from the full texts.
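The ROUGE-1 F-scores reported above are unigram-overlap measures between a candidate summary and a reference. A minimal sketch of that computation, assuming a single reference and no stemming or stopword handling (real ROUGE evaluations handle multiple references and preprocessing):

```python
# Simplified ROUGE-1 F-score: unigram overlap between a candidate
# summary and one reference. Single reference, whitespace tokenization,
# no stemming -- all simplifying assumptions for illustration.
from collections import Counter

def rouge_1_f(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())   # clipped unigram matches
    if not cand or not ref:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

score = rouge_1_f("mesh terms select important sentences",
                  "mesh terms pick important text portions")
```

Here 3 of the candidate's 5 unigrams match the 6-token reference, giving precision 0.6 and recall 0.5.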
A Cross-lingual Framework for Monolingual Biomedical Information Retrieval
Cited by 5 (1 self)
An important challenge for biomedical information retrieval (IR) is dealing with the complex, inconsistent and ambiguous biomedical terminology. Frequently, a concept-based representation defined in terms of a domain-specific terminological resource is employed to deal with this challenge. In this paper, we approach the incorporation of a concept-based representation in monolingual biomedical IR from a cross-lingual perspective. In the proposed framework, this is realized by translating and matching between text and concept-based representations. The approach allows for deployment of a rich set of techniques proposed and evaluated in traditional cross-lingual IR. We compare six translation models and measure their effectiveness in the biomedical domain. We demonstrate that the approach can result in significant improvements in retrieval effectiveness over word-based retrieval. Moreover, we demonstrate increased effectiveness of a CLIR framework for monolingual biomedical IR if basic translation models are combined.
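The translate-and-match idea can be sketched with a dictionary-style translation table mapping words to weighted concept identifiers, as in basic dictionary-based cross-lingual retrieval. The table, the UMLS-style concept IDs and the additive scoring below are illustrative assumptions, not the paper's actual translation models.

```python
# Sketch of translating a text query into a weighted concept
# representation and matching it against concept-indexed documents.
# Translation table, concept IDs and scoring are illustrative.

def translate(query_terms, table):
    """Accumulate translation probability mass per candidate concept."""
    weights = {}
    for term in query_terms:
        for concept, p in table.get(term, {}).items():
            weights[concept] = weights.get(concept, 0.0) + p
    return weights

def score(doc_concepts, concept_weights):
    """Score a concept-indexed document by summed translation weight."""
    return sum(concept_weights.get(c, 0.0) for c in doc_concepts)

table = {"heart": {"C0018787": 0.9}, "attack": {"C0027051": 0.7}}
weights = translate(["heart", "attack"], table)
s = score(["C0018787", "C0027051"], weights)
```

Richer translation models would distribute each term's mass over several concepts and normalize; this sketch only shows the matching mechanics.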
MEDLINE MeSH indexing: lessons learned from machine learning and future directions
- In: Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium, ACM, 2012
Cited by 2 (0 self)
Due to the large yearly growth of MEDLINE, MeSH indexing is becoming a more difficult task for a relatively small group of highly qualified indexing staff at the US National Library of Medicine (NLM). The Medical Text Indexer (MTI) is a support tool for assisting indexers; this tool relies on MetaMap and a k-NN approach called PubMed Related Citations (PRC). Our motivation is to improve the quality of MTI based on machine learning. Typical machine learning approaches fit this indexing task into text categorization. In this work, we have studied some Medical Subject Headings (MeSH) recommended by MTI and analyzed the issues when using standard machine learning algorithms. We show that in some cases machine learning can improve the annotations already recommended by MTI, that machine [...]
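The k-NN component mentioned above can be sketched as neighbor voting: similar citations retrieved for a new document vote for their own MeSH headings, weighted by similarity. The neighbors, scores and cutoff below are illustrative assumptions, standing in for the actual PubMed Related Citations algorithm.

```python
# Sketch of k-NN style MeSH recommendation by similarity-weighted
# voting, loosely in the spirit of PRC. Neighbor similarities, heading
# sets and the acceptance threshold are illustrative assumptions.

def recommend_headings(neighbors, threshold=0.5):
    """neighbors: list of (similarity, [MeSH headings]) pairs.
    Returns headings whose normalized vote mass reaches the threshold."""
    votes = {}
    total = sum(sim for sim, _ in neighbors) or 1.0
    for sim, headings in neighbors:
        for h in headings:
            votes[h] = votes.get(h, 0.0) + sim / total
    return sorted(h for h, v in votes.items() if v >= threshold)

neighbors = [(0.9, ["Humans", "Neoplasms"]),
             (0.6, ["Humans", "Mice"]),
             (0.5, ["Humans"])]
recommended = recommend_headings(neighbors, threshold=0.5)
```

A heading carried by all strong neighbors ("Humans") survives the cutoff, while headings carried by a single neighbor do not.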
Explicit Extraction of Topical Context
Cited by 1 (0 self)
This article studies one of the main bottlenecks in providing more effective information access: the poverty on the query end. We explore whether users can classify keyword queries into categories from the DMOZ directory on different levels and whether this topical context can help retrieval performance. We have conducted a user study to let participants classify queries into DMOZ categories, either by freely searching the directory or by selection from a list of suggestions. Results of the study show that DMOZ categories are suitable for topic categorization. Both free search and list selection can be used to elicit topical context. Free search leads to more specific categories than the list selections. Participants in our study show moderate agreement on the categories they select, but broad agreement on the higher levels of chosen categories. The free search categories significantly improve retrieval effectiveness. The more general list selection categories and the top-level categories do not lead to significant improvements. Combining topical context with blind relevance feedback leads to better results than applying either of them separately. We conclude that DMOZ is a suitable resource for interacting with users on topical categories applicable to their query, and can lead to better search results.
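The finding that topical context and blind relevance feedback combine well can be sketched as a linear interpolation of two document score lists. The scores, weights and document IDs below are illustrative assumptions, not the study's actual combination method.

```python
# Sketch of combining a topical-context score (e.g. from a DMOZ
# category match) with a blind-relevance-feedback score by linear
# interpolation. Scores, alpha and document IDs are illustrative.

def combine(ctx_scores, prf_scores, alpha=0.5):
    """Interpolate two score dictionaries over the union of documents."""
    docs = set(ctx_scores) | set(prf_scores)
    return {d: alpha * ctx_scores.get(d, 0.0)
               + (1 - alpha) * prf_scores.get(d, 0.0) for d in docs}

ctx = {"d1": 0.8, "d2": 0.2}   # topical-context evidence
prf = {"d1": 0.4, "d3": 0.6}   # blind relevance feedback evidence
combined = combine(ctx, prf, alpha=0.5)
ranking = sorted(combined, key=combined.get, reverse=True)
```

Documents supported by both evidence sources ("d1") rise above those supported by only one, which mirrors why the combination can beat either method alone.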
Mining the Transcriptomic Landscape of Human Tissue and Disease
- 2012
Although there are a variety of high-throughput technologies used to perform biological experiments, DNA microarrays have become a standard tool in the modern biologist's arsenal. Microarray experiments provide measurements of thousands of genes simultaneously, and offer a snapshot view of transcriptomic activity. With the rapid growth of public availability of transcriptomic data, there is increasing recognition that large sets of such data can be mined to better understand disease states and mechanisms. Unfortunately, several challenges arise when attempting to perform such large-scale analyses. For instance, the public repositories to which the data is submitted were designed around the simple task of storage rather than that of data mining. As such, the seemingly simple task of obtaining all data relating to a particular disease becomes an arduous one. Furthermore, prior gene expression analyses, both large and small, have been dichotomous in nature, in which phenotypes are compared using clearly defined controls. Such approaches may require arbitrary [...]
Using binary classification to prioritize and curate articles for the Comparative Toxicogenomics Database
We report on the original integration of an automatic text categorization pipeline, ToxiCat (Toxicogenomic Categorizer), that we developed to perform biomedical document classification and prioritization in order to speed up the curation of the Comparative Toxicogenomics Database (CTD). The task can basically be described as a binary classification task, where a scoring function is used to rank a selected set of articles. Then components of a question-answering system are used to extract CTD-specific annotations from the ranked list of articles. The ranking function is generated using a Support Vector Machine, which combines three main modules: an information retrieval engine for MEDLINE (EAGLi), a gene normalization service (NormaGene) developed for a previous BioCreative campaign and, finally, a set of answering components and entity recognizers for diseases and chemicals. The main components of the pipeline are publicly available both as a web application and as web services. The specific integration performed for the BioCreative competition is available via a web user interface at [...]
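The ranking step described above can be sketched as scoring each article with a linear decision function f(x) = w · x + b, the form a linear SVM learns. The feature names, weights and article IDs below are illustrative assumptions standing in for the trained ToxiCat model.

```python
# Sketch of ranking articles for curation with a linear decision
# function f(x) = w . x + b, the shape of a trained linear SVM's
# scorer. Features, weights and PMIDs are illustrative assumptions.

def decision(features, w, b=0.0):
    """Dot product of a sparse feature dict with a weight dict, plus bias."""
    return sum(w.get(f, 0.0) * v for f, v in features.items()) + b

w = {"chemical_mention": 1.5, "gene_mention": 1.0, "review_article": -0.5}
articles = {
    "pmid1": {"chemical_mention": 2, "gene_mention": 1},
    "pmid2": {"review_article": 1},
}
ranked = sorted(articles, key=lambda a: decision(articles[a], w), reverse=True)
```

Curators would then work down the ranked list, so articles rich in curatable chemical and gene mentions are triaged first.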
The Role of Hubs in Cross-lingual Supervised Document Retrieval
Information retrieval in multi-lingual document repositories is of high importance in modern text mining applications. Analyzing textual data is, however, not without associated difficulties. Regardless of the particular choice of feature representation, textual data is high-dimensional in its nature and all inference is bound to be somewhat affected by the well-known curse of dimensionality. In this paper, we have focused on one particular aspect of the dimensionality curse, known as hubness. Hubs emerge as influential points in the k-nearest neighbor (kNN) topology of the data. They have been shown to affect similarity-based methods in severely negative ways in high-dimensional data, interfering with both retrieval and classification. The issue of hubness in textual data has already been briefly addressed, but not in the context that we are presenting here, namely the multi-lingual retrieval setting. Our goal was to gain some insights into the cross-lingual hub structure and exploit it for improving the retrieval and classification performance. Our initial analysis has allowed us to devise a hubness-aware instance weighting scheme for the canonical correlation analysis procedure that is used to construct the common semantic space enabling cross-lingual document retrieval and classification. The experimental evaluation indicates that the proposed approach outperforms the baseline. This shows that the hubs can indeed be exploited for improving the robustness of textual feature representations.
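Hubness is usually quantified by the k-occurrence N_k(x): how often a point appears among the k nearest neighbors of the other points. A minimal sketch, assuming toy 2-D vectors and brute-force squared-Euclidean distances (real text data would be high-dimensional and use cosine similarity):

```python
# Sketch of measuring hubness via k-occurrence counts N_k(x): the
# number of times each point appears in other points' k-NN lists.
# Toy 2-D points and k are illustrative; the paper builds on such
# counts to weight instances in canonical correlation analysis.

def k_occurrences(points, k):
    counts = {i: 0 for i in range(len(points))}
    for i, p in enumerate(points):
        # brute-force squared-Euclidean distances to all other points
        dists = sorted((sum((a - b) ** 2 for a, b in zip(p, q)), j)
                       for j, q in enumerate(points) if j != i)
        for _, j in dists[:k]:
            counts[j] += 1
    return counts

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
n_k = k_occurrences(points, k=1)
```

Points with unusually large N_k are the hubs; the outlier at (5, 5) appears in no one's neighbor list, while central points absorb the counts.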
DeepMeSH: deep semantic representation for improving large-scale MeSH indexing
Motivation: Medical Subject Headings (MeSH) indexing, which assigns a set of MeSH main headings to citations, is crucial for many important tasks in biomedical text mining and information retrieval. Large-scale MeSH indexing has two challenging aspects: the citation side and the MeSH side. For the citation side, all existing methods, including the Medical Text Indexer (MTI) from the National Library of Medicine and the state-of-the-art method, MeSHLabeler, represent text as bag-of-words, which cannot capture semantic and context-dependent information well. Methods: We propose DeepMeSH, which incorporates deep semantic information for large-scale MeSH indexing. It addresses the challenges on both the citation and MeSH sides. The citation side challenge is solved by a new deep semantic representation, D2V-TFIDF, which concatenates both sparse and dense semantic representations. The MeSH side challenge is solved by using the 'learning to rank' framework of MeSHLabeler, which integrates various types of evidence generated from the new semantic representation. Results: DeepMeSH achieved a Micro F-measure of 0.6323, 2% higher than 0.6218 of MeSHLabeler and 12% higher than 0.5637 of MTI, for BioASQ3 challenge data with 6000 citations.
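The Micro F-measure reported above pools true positives, false positives and false negatives over all citations before computing precision and recall. A minimal sketch over per-citation MeSH label sets, with toy headings as illustrative data:

```python
# Micro-averaged F-measure over multi-label MeSH predictions:
# counts are pooled across citations before computing P, R and F.
# The toy gold/predicted heading sets are illustrative.

def micro_f(gold, pred):
    """gold, pred: parallel lists of per-citation heading sets."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

gold = [{"Humans", "Mice"}, {"Neoplasms"}]
pred = [{"Humans"}, {"Neoplasms", "Rats"}]
score = micro_f(gold, pred)
```

Unlike macro averaging, frequent headings dominate the pooled counts, which matches how BioASQ-style MeSH indexing is evaluated.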
Grading the Quality of Medical Evidence
Evidence Based Medicine (EBM) is the practice of using the knowledge gained from the best medical evidence to make decisions in the effective care of patients. This medical evidence is extracted from medical documents such as research papers. The increasing number of available medical documents has imposed a challenge to identify the appropriate evidence and to assess the quality of the evidence. In this paper, we present an approach for the automatic grading of evidence using the dataset provided by the 2011 Australasian Language Technology Association (ALTA) shared task competition. With the feature sets extracted from publication types, Medical Subject Headings (MeSH), title, and body of the abstracts, we obtain a 73.77% grading accuracy with a stacking-based approach, a considerable improvement over previous work.
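The stacking idea behind the result above can be sketched as base classifiers, each trained on one feature set, emitting per-grade scores that a meta-level combiner merges. The classifier names, weights and grades below are illustrative assumptions; weighted voting here stands in for a trained meta-classifier.

```python
# Sketch of stacking for evidence grading: base classifiers (one per
# feature set) emit per-grade scores; a meta-combiner merges them.
# Weighted voting stands in for a trained meta-classifier here, and
# the classifier names, weights and grades are illustrative.

def stack_predict(base_outputs, weights):
    """base_outputs: {classifier: {grade: score}}. Returns best grade."""
    totals = {}
    for name, scores in base_outputs.items():
        for grade, s in scores.items():
            totals[grade] = totals.get(grade, 0.0) + weights[name] * s
    return max(sorted(totals), key=totals.get)   # deterministic tie-break

base_outputs = {
    "pubtype_clf": {"A": 0.7, "B": 0.3},
    "mesh_clf":    {"A": 0.4, "B": 0.6},
    "text_clf":    {"B": 0.8, "C": 0.2},
}
weights = {"pubtype_clf": 1.0, "mesh_clf": 1.0, "text_clf": 1.0}
grade = stack_predict(base_outputs, weights)
```

In a full stacking setup, the meta-level weights would themselves be learned from the base classifiers' held-out predictions rather than fixed.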