Results 1 - 9 of 9
Narrative Text Classification and Automatic Key Phrase Extraction in Web Document Corpora
- In 7th ACM Intern. Workshop on Web Information and Data Management (WIDM 2005)
- Cited by 9 (3 self)
Automatic key phrase extraction is a useful tool in many text-related applications such as clustering and summarization. State-of-the-art methods aim to extract key phrases from traditional text such as technical papers. Applying these methods to Web documents, which often contain diverse and heterogeneous content, is both of particular interest and a challenge in the information age. In this work, we investigate the significance of narrative text classification in the task of automatic key phrase extraction in Web document corpora. We benchmark three methods, TFIDF, KEA, and Keyterm, used to extract key phrases from all the plain text and from only the narrative text of Web pages. ANOVA tests are used to analyze the ranking data collected in a user study using quantitative measures of acceptable percentage and quality value. The evaluation shows that key phrases extracted from the narrative text only are significantly better than those obtained from all plain text of Web pages. This demonstrates that narrative text classification is indispensable for effective key phrase extraction in Web document corpora.
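As a rough illustration of the TFIDF baseline named in this abstract, the sketch below ranks single words by a common textbook TF-IDF weighting. It is not the authors' implementation: the whitespace tokenization, the exact weighting formula, the function name `tfidf_top_terms`, and the sample pages are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_top_terms(docs, k=3):
    """Rank each document's words by TF-IDF and return the top k per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents in which each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    ranked = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Relative term frequency times log inverse document frequency.
        scores = {
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        }
        # Sort by descending score, then alphabetically so ties are stable.
        ranked.append(sorted(scores, key=lambda w: (-scores[w], w))[:k])
    return ranked

# Hypothetical pages: the first mimics navigational (non-narrative) text.
pages = [
    "home contact login search home",
    "key phrase extraction from narrative text",
    "narrative text classification home",
]
top_terms = tfidf_top_terms(pages, k=2)
```

On pages like the first one, the highest-weighted words are navigation labels, which hints at why restricting extraction to narrative text could help.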
Term-Based Clustering and Summarization of Web Page Collections
- In Advances in Artificial Intelligence, Proceedings of the Seventeenth Conference of the Canadian Society for Computational Studies of Intelligence, 2004
- Cited by 6 (4 self)
Effectively summarizing Web page collections becomes more and more critical as the amount of information on the World Wide Web continues to grow. A concise and meaningful summary of a Web page collection, generated automatically, can help Web users quickly understand the essential topics and main contents covered in the collection without spending much time browsing. However, automatically generating coherent summaries as good as human-authored summaries is a challenging task, since Web page collections often contain diverse topics and contents. This research aims to cluster Web page collections using automatically extracted topical terms and to summarize the resulting clusters automatically. We experiment with word- and term-based representations of Web documents and demonstrate that term-based clustering significantly outperforms word-based clustering with much lower dimensionality. The summaries of computed clusters are informative and meaningful, which indicates that clustering and summarization of large Web page collections is promising for alleviating the information overload problem.
Automatic document indexing in large medical collections
- In HIKM, 2006
- Cited by 5 (2 self)
Term extraction relates to extracting the most characteristic or important terms (words or phrases) in a document. This information is commonly used for improving the accuracy of document indexing and retrieval in large text collections. It also allows for faster and better understanding of the contents of a document collection without first browsing through the contents of its documents. This paper presents AMTEX, an automatic term extraction method, specifically designed for the automatic indexing of documents in large medical collections such as MEDLINE, the premier bibliographic database of the U.S. National Library of Medicine (NLM). AMTEX combines MeSH, the terminological thesaurus resource of NLM, with a well-established method for extraction of domain terms, the C/NC-value method. The performance evaluation of various AMTEX configurations in the indexing task is measured against the current state-of-the-art, the MMTx method. The experimental results on a subset of MEDLINE documents demonstrate that AMTEX achieves better precision and recall than MMTx.
A Comparison of Keyword- and Keyterm-Based Methods for Automatic Web Site Summarization
- In Technical Report WS-04-01, Papers from the AAAI’04 Workshop on Adaptive Text Extraction and Mining, 2004
A Comparative Study on Key Phrase Extraction Methods in Automatic Web Site Summarization
- Journal of Digital Information Management, Special Issue on Web Information Retrieval, 2007
- Cited by 2 (2 self)
Web site summarization is the process of automatically generating a concise and informative summary for a given Web site. It has gained increasing attention in recent years, as effective summarization could lead to enhanced Web information retrieval systems such as searching for Web sites. Extraction-based approaches to Web site summarization rely on the extraction of the most significant sentences from the target Web site based on the density of a list of key phrases that best describe the entire Web site. In this work, we benchmark five alternative key phrase extraction methods, TFIDF, KEA, Keyword, Keyterm, and Mixture, in an automatic Web site summarization framework we previously developed. We investigate the performance of these underlying methods via a formal user study and demonstrate that Keyterm is the best choice for key phrase extraction while Mixture should be used to obtain key sentences. We also discuss why one method performs better than another and what could be done to further improve the summarization system.
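The abstract describes selecting sentences by the density of key phrases they contain. The sketch below is one plausible reading of that idea, not the framework evaluated in the paper: the density definition (key-phrase words per sentence word), the helper names `key_phrase_density` and `top_sentences`, and the sample data are assumptions.

```python
import re

def key_phrase_density(sentence, key_phrases):
    """Key-phrase words per sentence word (overlapping phrases count separately)."""
    words = re.findall(r"[a-z0-9]+", sentence.lower())
    if not words:
        return 0.0
    covered = 0
    for phrase in key_phrases:
        phrase_words = phrase.lower().split()
        n = len(phrase_words)
        # Count every position where the phrase occurs as a word sequence.
        hits = sum(
            1 for i in range(len(words) - n + 1) if words[i:i + n] == phrase_words
        )
        covered += hits * n
    return covered / len(words)

def top_sentences(sentences, key_phrases, k=1):
    """Return the k sentences with the highest key-phrase density."""
    return sorted(
        sentences,
        key=lambda s: key_phrase_density(s, key_phrases),
        reverse=True,
    )[:k]

# Hypothetical sentences and key phrases for illustration only.
sentences = [
    "Web site summarization generates a concise summary.",
    "Contact us for more information.",
    "Key phrase extraction supports Web site summarization.",
]
key_phrases = ["web site summarization", "key phrase extraction"]
best = top_sentences(sentences, key_phrases, k=2)
```

Normalizing by sentence length keeps long sentences from winning merely because they have more room for matches.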
Knowledge Management in Pediatric Pain: Mapping On-Line Expert Discussions to
- 2004
availability of the right medical knowledge at the right time. This concept paper presents a knowledge management research program to (a) identify, capture and organize the tacit knowledge inherent within on-line problem-solving discussions between pediatric pain practitioners; (b) establish linkages between topic-specific pediatric pain discussions and corresponding published medical literature on children's pain available at PubMed (i.e., linking tacit expert knowledge to explicit medical literature); and (c) make these knowledge resources available to pediatric pain practitioners via the WWW for timely access to various modalities of clinical knowledge.
Mysore
Text classification is an important research issue in the field of text mining, in which documents are classified using supervised knowledge. The literature offers many text representation schemes and classifiers/learning algorithms for classifying text documents into predefined categories. In this paper, we present various text representation schemes and compare different classifiers used to classify text documents into the predefined classes. The existing methods are compared and contrasted based on qualitative parameters, viz., criteria used for classification, algorithms adopted, and classification time complexities.
The AMTEx Approach in the Medical Document Indexing and Retrieval Application
AMTEx is a medical document indexing method, specifically designed for the automatic indexing of documents in large medical collections, such as MEDLINE, the premier bibliographic database of the U.S. National Library of Medicine (NLM). AMTEx combines MeSH, the terminological thesaurus resource of NLM, with a well-established method for extraction of terminology, the C/NC-value method. The performance of two AMTEx configurations is measured against the current state-of-the-art, the MetaMap Transfer (MMTx) method, in four experiments using two types of corpora: a subset of the MEDLINE (PMC) full document corpus and a subset of MEDLINE (OHSUMED) abstracts, for the indexing and retrieval tasks respectively. The experimental results demonstrate that AMTEx performs better in indexing in 20-50% of the processing time compared to MMTx, while for the retrieval task, AMTEx performs better in the full text (PMC) corpus.
A Framework for Summarization of Multi-topic Web Sites
- 2008
Web site summarization, which identifies the essential content covered in a given Web site, plays an important role in Web information management. However, straightforward summarization of an entire Web site with diverse content may lead to a summary heavily biased to the dominant topics covered in the target Web site. In this paper, we propose a two-stage framework for effective summarization of multi-topic Web sites. The first stage identifies the main topics covered in a Web site and the second stage summarizes each topic separately. In order to identify the different topics covered in a Web site, we perform coupled text- and link-based clustering. In text-based clustering, we investigate the impact of document representation and feature selection on the clustering quality. In link-based clustering, we study co-citation and bibliographic coupling. We demonstrate that text-based clustering based on the selection of features with high variance over Web pages is reliable and that outgoing links can be used to improve the clustering quality if a rich set of cross links is available. Each individual cluster computed above is summarized using an extraction-based summarization system, which extracts key phrases and key sentences from source documents to generate a summary. We design and develop a classification approach in the cluster summarization stage. The classifier uses statistical and linguistic features to determine the topical significance of each sentence. Finally, we evaluate the proposed system via a user study. We demonstrate that the proposed clustering-summarization approach significantly outperforms the single-topic summarization approach.
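The two link-based measures mentioned in this abstract have standard definitions: bibliographic coupling counts the link targets two pages share, and co-citation counts the pages that link to both. The sketch below computes both on a toy link graph; the dictionary representation, the function names, and the page identifiers are assumptions for illustration and say nothing about how the paper turns these counts into clusters.

```python
def bibliographic_coupling(out_links, a, b):
    """Number of link targets that pages a and b have in common."""
    return len(out_links.get(a, set()) & out_links.get(b, set()))

def co_citation(out_links, a, b):
    """Number of pages that link to both a and b."""
    return sum(
        1 for targets in out_links.values() if a in targets and b in targets
    )

# Hypothetical site graph: each page maps to the set of pages it links to.
out_links = {
    "p1": {"p3", "p4"},
    "p2": {"p3", "p4", "p5"},
    "p3": {"p5"},
    "p4": {"p5"},
    "p5": set(),
}

coupling_p1_p2 = bibliographic_coupling(out_links, "p1", "p2")
cocited_p3_p4 = co_citation(out_links, "p3", "p4")
```

Bibliographic coupling depends only on outgoing links, which fits the abstract's remark that outgoing links help when a rich set of cross links is available.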

