Results 1 -
5 of
5
Automatic document indexing in large medical collections
- In HIKM
, 2006
"... Term extraction relates to extracting the most characteristic or important terms (words or phrases) in a document. This information is commonly used for improving the accuracy of document indexing and retrieval in large text collections. It also allows for faster and better understanding of the cont ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
Term extraction relates to extracting the most characteristic or important terms (words or phrases) in a document. This information is commonly used for improving the accuracy of document indexing and retrieval in large text collections. It also allows for faster and better understanding of the contents of a document collection without first browsing through the contents of its documents. This paper presents AMTEX, an automatic term extraction method, specifically designed for the automatic indexing of documents in large medical collections such as MEDLINE, the premier bibliographic database of the U.S. National Library of Medicine (NLM). AMTEX combines MeSH, the terminological thesaurus resource of NLM, with a well-established method for extraction of domain terms, the C/NC-value method. The performance evaluation of various AMTEX configurations in the indexing task is measured against the current state-of-the-art, the MMTx method. The experimental results on a subset of MEDLINE documents demonstrate that AMTEX achieves better precision and recall than MMTx.
Medical Document Indexing and Retrieval: AMTEx vs. NLM MMTx
"... AMTEx is a medical document indexing method, specifically designed for the automatic indexing of documents in large medical collections, such as MEDLINE, the premier bibliographic database of the U.S. National Library of Medicine (NLM). AMTEx combines MeSH, the terminological thesaurus resource of N ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
AMTEx is a medical document indexing method, specifically designed for the automatic indexing of documents in large medical collections, such as MEDLINE, the premier bibliographic database of the U.S. National Library of Medicine (NLM). AMTEx combines MeSH, the terminological thesaurus resource of NLM, with a wellestablished method for term extraction, the C/NC-value method. The performance evaluation of two AMTEx configurations is measured against the current state-of-theart, the MMTx method in indexing and retrieval tasks in three experiments. In the first, a subset of MEDLINE (PMC) full document corpus was used for the indexing task. In the second and third, a subset of MEDLINE (OHSUMED) abstracts was used for indexing and retrieval respectively. The experimental results demonstrate that AMTEx achieves better precision in all tasks, in 50-20 % of the processing time compared to MMTx.
Language-independent Pre-processing of Large Documentbases for Text Classification
"... A thesis submitted in accordance with the requirements of the ..."
The AMTEx Approach in the Medical Document Indexing and Retrieval Application
"... AMTEx is a medical document indexing method, specifically designed for the automatic indexing of documents in large medical collections, such as MEDLINE, the premier bibliographic database of the U.S. National Library of Medicine (NLM). AMTEx combines MeSH, the terminological thesaurus resource of N ..."
Abstract
- Add to MetaCart
AMTEx is a medical document indexing method, specifically designed for the automatic indexing of documents in large medical collections, such as MEDLINE, the premier bibliographic database of the U.S. National Library of Medicine (NLM). AMTEx combines MeSH, the terminological thesaurus resource of NLM, with a wellestablished method for extraction of terminology, the C/NC-value method. The performance evaluation of two AMTEx configurations is measured against the current state-of-the-art, the MetaMap Transfer (MMTx) method in four experiments, using two types of corpora: a subset of MEDLINE (PMC) full document corpus and a subset of MEDLINE (OHSUMED) abstracts, for each of the indexing and retrieval tasks respectively. The experimental results demonstrate that AMTEx performs better in indexing in 20-50 % of the processing time compared to MMTx, while for the retrieval task, AMTEx performs better in the full text (PMC) corpus.
A Framework for Summarization of Multi-topic Web Sites
, 2008
"... Web site summarization, which identifies the essential content covered in a given Web site, plays an important role in Web information management. However, straightforward summarization of an entire Web site with diverse content may lead to a summary heavily biased to the dominant topics covered in ..."
Abstract
- Add to MetaCart
Web site summarization, which identifies the essential content covered in a given Web site, plays an important role in Web information management. However, straightforward summarization of an entire Web site with diverse content may lead to a summary heavily biased to the dominant topics covered in the target Web site. In this paper, we propose a two-stage framework for effective summarization of multi-topic Web sites. The first stage identifies the main topics covered in a Web site and the second stage summarizes each topic separately. In order to identify the different topics covered in a Web site, we perform coupled text- and link-based clustering. In text-based clustering, we investigate the impact of document representation and feature selection on the clustering quality. In link-based clustering, we study co-citation and bibliographic coupling. We demonstrate that text-based clustering based on the selection of features with high variance over Web pages is reliable and that outgoing links can be used to improve the clustering quality if a rich set of cross links is available. Each individual cluster computed above is summarized using an extraction-based summarization system, which extracts key phrases and key sentences from source documents to generate a summary. We design and develop a classification approach in the cluster summarization stage. The classifier uses statistical and linguistic features to determine the topical significance of each sentence. Finally, we evaluate the proposed system via a user study. We demonstrate that the proposed clustering summarization approach significantly outperforms the single-topic summarization approach. 1

