Part-of-Speech Tagging and Partial Parsing
Corpus-Based Methods in Language and Speech, 1996
Cited by 85 (0 self)
m we can carve off next. 'Partial parsing' is a cover term for a range of different techniques for recovering some but not all of the information contained in a traditional syntactic analysis. Partial parsing techniques, like tagging techniques, aim for reliability and robustness in the face of the vagaries of natural text, by sacrificing completeness of analysis and accepting a low but non-zero error rate.
1 Tagging
The earliest taggers [35, 51] had large sets of hand-constructed rules for assigning tags on the basis of words' character patterns and on the basis of the tags assigned to preceding or following words, but they had only small lexica, primarily for exceptions to the rules. TAGGIT [35] was used to generate an initial tagging of the Brown corpus, which was then hand-edited. (Thus it provided the data that has since been used to train other taggers [20].) The tagger described by Garside [56, 34], CLAWS, was a probabilistic version of TAGGIT, and the DeRose tagger improved on
Dependency-based construction of semantic space models
Computational Linguistics, 2007
Cited by 79 (6 self)
Traditionally, vector-based semantic space models use word co-occurrence counts from large corpora to represent lexical meaning. In this article we present a novel framework for constructing semantic spaces that take syntactic relations into account. We introduce a formalization for this class of models which allows linguistic knowledge to guide the construction process. We evaluate our framework on a range of tasks relevant for cognitive science and natural language processing: semantic priming, synonymy detection and word sense disambiguation. In all cases, our framework obtains results that are comparable or superior to the state of the art.
Term Clustering of Syntactic Phrases
Proceedings of ACM SIGIR-90, 1990
Cited by 56 (5 self)
Term clustering and syntactic phrase formation are methods for transforming natural language text. Both have had only mixed success as strategies for improving the quality of text representations for document retrieval. Since the strengths of these methods are complementary, we have explored combining them to produce superior representations. In this paper we discuss our implementation of a syntactic phrase generator, as well as our preliminary experiments with producing phrase clusters. These experiments show small improvements in retrieval effectiveness resulting from the use of phrase clusters, but it is clear that corpora much larger than standard information retrieval test collections will be required to thoroughly evaluate the use of this technique.
An Investigation of Linguistic Features and Clustering Algorithms for Topical Document Clustering
In Proceedings of the 23rd ACM SIGIR Conference on Research and Development in Information Retrieval, 2000
Cited by 33 (4 self)
We investigate four hierarchical clustering methods (single-link, complete-link, groupwise-average, and single-pass) and two linguistically motivated text features (noun phrase heads and proper names) in the context of document clustering. A statistical model for combining similarity information from multiple sources is described and applied to DARPA's Topic Detection and Tracking phase 2 (TDT2) data. This model, based on log-linear regression, alleviates the need for extensive search in order to determine optimal weights for combining input features. Through an extensive series of experiments with more than 40,000 documents from multiple news sources and modalities, we establish that both the choice of clustering algorithm and the introduction of the additional features have an impact on clustering performance. We apply our optimal combination of features to the TDT2 test data, obtaining partitions of the documents that compare favorably with the results obtained by participants in th...
Automatically Generating Hypertext By Computing Semantic Similarity
1997
Cited by 26 (3 self)
We describe a novel method for automatically generating hypertext links within and between newspaper articles. The method is based on lexical chaining, a technique for extracting the sets of related words that occur in texts. Links between the paragraphs of a single article are built by considering the distribution of the lexical chains in that article. Links between articles are built by considering how the chains in the two articles are related. By using lexical chaining we mitigate the problems of synonymy and polysemy that plague traditional information retrieval approaches to automatic hypertext generation. In order to motivate our research, we discuss the results of a study that shows that humans are inconsistent when assigning hypertext links within newspaper articles. Even if humans were consistent, the time needed to build a large hypertext and the costs associated with the production of such a hypertext make relying on human linkers an untenable decision. Thus we are left to ...
Using English to Retrieve Software
The Journal of Systems and Software, 1995
Cited by 20 (5 self)
This paper describes ROSA, a software reuse system based on the processing of the natural language descriptions of software artifacts. Lexical, syntactic and semantic analysis of software descriptions is performed to automatically extract both verbal and nominal phrases from descriptions and use this information to create frame-based indexing units for software components. Retrieval similarity measures provide good retrieval effectiveness by supporting semantic matching and processing of lexical relationships between terms. Some results from an experiment evaluating retrieval effectiveness are discussed.
1 Introduction
This paper describes ROSA (Reuse Of Software Artifacts), a software reuse system based on the processing of the natural language descriptions of software artifacts [9][10][11][12]. The system aims at being cost-effective and domain independent, and at providing good retrieval effectiveness. Automatic indexing is required to make software retrieval systems cost-effective. Reuse ...
Relevance and retrieval evaluation: Perspectives from medicine
Journal of the American Society for Information Science, 1994
Cited by 17 (8 self)
The traditional notion of topical relevance has allowed much useful work to be done in the evaluation of retrieval systems, but has limitations for complete assessment of retrieval systems. While topical relevance can be effective in evaluating various indexing and retrieval approaches, it is ineffective for measuring the impact that systems have on users. An alternative is to use a more situational definition of relevance, which takes account of the impact of the system on the user. Both types of relevance are examined from the standpoint of the medical domain, concluding that each has its appropriate use. But in medicine there is increasing emphasis on outcomes-oriented research which, when applied to information science, requires that the impact of an information system on the activities which prompt its use be assessed. An iterative model of retrieval evaluation is proposed, starting with the use of topical relevance to ensure documents on the subject can be retrieved. This is followed by the use of situational relevance to show the user can interact positively with the system. The final step is to study how the system impacts the user in the purpose for which the system was consulted, which can be done by methods such as protocol analysis and simulation. These diverse types of studies are necessary to increase our understanding of the nature of retrieval systems.
Information Access Tools for Software Reuse
Journal of Systems and Software, 1995
Cited by 14 (6 self)
Software reuse has long been touted as an effective means to develop software products. But reuse technologies for software have not lived up to expectations. Among the barriers are high costs of building software repositories and the need for effective tools to help designers locate reusable software. While many design-for-reuse and software classification efforts have been proposed, these methods are cost-intensive and cannot effectively take advantage of large stores of design artifacts that many development organizations have accumulated. Methods are needed that take advantage of these valuable resources in a cost-effective manner. This paper describes an approach to the design of tools to help software designers build repositories of software components and locate potentially reusable software in those repositories. The approach is investigated with a retrieval tool, named CodeFinder, which supports the process of retrieving software components when information needs are ill-defi...
Comparing the Effect of Syntactic vs. Statistical Phrase Indexing Strategies for Dutch
Cited by 13 (0 self)
In this paper we describe the results of experiments contrasting syntactic phrase indexing with statistical phrase indexing for Dutch texts. Our results showed that we at least need a compound splitting algorithm for good quality retrieval for Dutch texts. If we then add either syntactic or statistical phrases, performance generally improves, but this effect is never statistically significant. If we compare syntactic vs. statistical phrase indexing, syntactic phrases are slightly superior to statistical phrases, particularly at high precision. At higher recall levels syntactic and statistical phrases are equally effective. However, since a compound splitting algorithm requires a dictionary and knowledge about constraints on compound formation, a purely non-linguistic indexing strategy, with or without phrases, does not seem to be very effective for Dutch.
1 Introduction
It is common practice in Information Retrieval (IR) to use phrases as indexing terms in order to enhance precision...
Text-Based Approaches for the Categorization of Images
In Proceedings of the European Conference of Digital Libraries (ECDL), published as Research and Advanced Technology for Digital Libraries, Lecture Notes in Computer Science 1696, 1999
Cited by 11 (3 self)
The rapid expansion of multimedia digital collections brings to the fore the need for classifying not only text documents but their embedded non-textual parts as well. We propose a model for basing classification of multimedia on broad, non-topical features, and show how information on targeted nearby pieces of text can be used to effectively classify photographs on a first such feature, distinguishing between indoor and outdoor images. We examine several variations to a TF*IDF-based approach for this task, empirically analyze their effects, and evaluate our system on a large collection of images from current news newsgroups. In addition, we investigate alternative classification and evaluation methods, and the effect that a secondary feature can have on indoor/outdoor classification. We obtain a classification accuracy of 82%, a number that clearly outperforms baseline estimates and competing image-based approaches and nears the accuracy of humans who perform the same task with access to comparable information.

