Class-Based n-gram Models of Natural Language
- Computational Linguistics
, 1992
Cited by 540 (4 self)
We address the problem of predicting a word from previous words in a sample of text. In particular we discuss n-gram models based on classes of words. We also discuss several statistical algorithms for assigning words to classes based on the frequency of their co-occurrence with other words. We find that we are able to extract classes that have the flavor of either syntactically based groupings or semantically based groupings, depending on the nature of the underlying statistics.
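As a rough illustration of what a class-based bigram model computes, the sketch below factors the word probability as P(w | w_prev) = P(w | class(w)) * P(class(w) | class(w_prev)). The toy corpus, the hand-assigned `CLASS_OF` map, and the function name are all hypothetical; the paper learns the classes statistically, which is not shown here.

```python
from collections import Counter

# Hypothetical hand-assigned classes; the paper induces classes automatically.
CLASS_OF = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN",
    "runs": "VERB", "sleeps": "VERB",
}
# Hypothetical toy corpus, used only to collect counts.
CORPUS = "the cat runs a dog sleeps the dog runs".split()

classes = [CLASS_OF[w] for w in CORPUS]
word_count = Counter(CORPUS)                       # count(w)
class_count = Counter(classes)                     # count(c)
class_bigram = Counter(zip(classes, classes[1:]))  # count(c_prev, c)
class_context = Counter(classes[:-1])              # count(c_prev) as a left context


def class_bigram_prob(prev_word: str, word: str) -> float:
    """Maximum-likelihood estimate of P(word | prev_word) through word classes."""
    c_prev, c = CLASS_OF[prev_word], CLASS_OF[word]
    p_word_given_class = word_count[word] / class_count[c]
    p_class_given_prev = class_bigram[(c_prev, c)] / class_context[c_prev]
    return p_word_given_class * p_class_given_prev
```

In this toy setup the pair "a cat" never occurs in the corpus, yet `class_bigram_prob("a", "cat")` is nonzero because the class sequence DET NOUN does occur; sharing statistics across words in a class is the point of the factorization.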
An Improved Error Model for Noisy Channel Spelling Correction
, 2000
Cited by 65 (1 self)
The noisy channel model has been applied to a wide range of problems, including spelling correction. These models consist of two components: a source model and a channel model. Very little research has gone into improving the channel model for spelling correction. This paper describes a new channel model for spelling correction, based on generic string to string edits. Using this model gives significant performance improvements compared to previously proposed models.
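The two-component structure described above can be sketched as picking the candidate word w that maximizes P(w) * P(observed | w). In the sketch below, the `SOURCE` vocabulary and its probabilities are hypothetical, and the channel model is a deliberately crude stand-in that charges a fixed penalty per Levenshtein edit; it is not the string-to-string edit model the paper proposes.

```python
import math

# Hypothetical toy source model: unigram probabilities P(w).
SOURCE = {"the": 0.6, "then": 0.25, "than": 0.15}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def channel_log_prob(observed: str, intended: str, per_edit: float = 0.05) -> float:
    """Stand-in channel model (an assumption): each edit multiplies by per_edit."""
    return edit_distance(observed, intended) * math.log(per_edit)


def correct(observed: str) -> str:
    """Return the vocabulary word with the highest source-times-channel score."""
    return max(
        SOURCE,
        key=lambda w: math.log(SOURCE[w]) + channel_log_prob(observed, w),
    )
```

When several candidates are equally close to the typed string, the source model breaks the tie; when one candidate is much closer, the channel term dominates.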
Universal Discrete Denoising: Known Channel
- IEEE Trans. Inform. Theory
, 2003
Cited by 55 (23 self)
A discrete denoising algorithm estimates the input sequence to a discrete memoryless channel (DMC) based on the observation of the entire output sequence. For the case in which the DMC is known and the quality of the reconstruction is evaluated with a given single-letter fidelity criterion, we propose a discrete denoising algorithm that does not assume knowledge of statistical properties of the input sequence. Yet, the algorithm is universal in the sense of asymptotically performing as well as the optimum denoiser that knows the input sequence distribution, which is only assumed to be stationary and ergodic. Moreover, the algorithm is universal also in a semi-stochastic setting, in which the input is an individual sequence, and the randomness is due solely to the channel noise.
Correcting Real-Word Spelling Errors by Restoring Lexical Cohesion
, 2001
Cited by 33 (2 self)
Spelling errors that happen to result in a real word in the lexicon cannot be detected by a conventional spelling checker. We present a method for detecting and correcting many such errors by identifying tokens that are semantically unrelated to their context and are spelling variations of words that would be related to the context. Relatedness to context is determined by a measure of semantic distance initially proposed by Jiang and Conrath (1997). We tested the method on an artificial corpus of errors; it achieved recall of up to 50% and precision of 18 to 25% -- levels that approach practical usability.
Compressing Trigram Language Models With Golomb Coding
Cited by 13 (3 self)
Trigram language models are compressed using a Golomb coding method inspired by the original Unix spell program. Compression methods trade off space, time and accuracy (loss). The proposed HashTBO method optimizes space at the expense of time and accuracy. Trigram language models are normally considered memory hogs, but with HashTBO, it is possible to squeeze a trigram language model into a few megabytes or less. HashTBO made it possible to ship a trigram contextual speller in Microsoft Office 2007.
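For readers unfamiliar with Golomb coding, here is a minimal sketch of its power-of-two special case (often called Rice coding), applied to the gaps between sorted integers. It only illustrates the generic coding idea; the function names and the parameter `k` are assumptions, and nothing here reproduces HashTBO itself.

```python
def rice_encode(n: int, k: int) -> str:
    """Golomb code with M = 2**k: quotient in unary, then a k-bit remainder."""
    q, r = n >> k, n & ((1 << k) - 1)
    remainder_bits = format(r, f"0{k}b") if k > 0 else ""
    return "1" * q + "0" + remainder_bits


def encode_sorted(values, k: int) -> str:
    """Encode non-negative integers as Rice-coded gaps between sorted values."""
    out, prev = [], 0
    for v in sorted(values):
        out.append(rice_encode(v - prev, k))
        prev = v
    return "".join(out)


def decode_sorted(bits: str, k: int) -> list:
    """Invert encode_sorted: read each gap and accumulate a running sum."""
    values, pos, prev = [], 0, 0
    while pos < len(bits):
        q = bits.index("0", pos) - pos   # length of the unary run of 1s
        start = pos + q + 1              # first bit of the remainder
        r = int(bits[start:start + k], 2) if k > 0 else 0
        prev += (q << k) | r
        values.append(prev)
        pos = start + k
    return values
```

Small gaps get short codes, so a dense sorted set of integers needs only a few bits per element when `k` is chosen near the base-2 logarithm of the typical gap.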
Approximate personal name-matching through finite-state graphs
- Journal of the American Society for Information Science and Technology
, 2006
Cited by 5 (1 self)
This article shows how finite-state methods can be employed in a new and different task: the conflation of personal name variants in standard forms. In bibliographic databases and citation index systems, variant forms create problems of inaccuracy that affect information retrieval, the quality of information from databases, and the citation statistics used for the evaluation of scientists’ work. A number of approximate string matching techniques have been developed to validate variant forms, based on similarity and equivalence relations. We classify the personal name variants as nonvalid and valid forms. In establishing an equivalence relation between valid variants and the standard form of its equivalence class, we defend the application of finite-state transducers. The process of variant identification requires the elaboration of: (a) binary matrices and (b) finite-state graphs. This procedure was tested on samples of author names from bibliographic records, selected from the Library and Information Science Abstracts and Science Citation Index Expanded databases. The evaluation involved calculating the measures of precision and recall, based on completeness and accuracy. The results demonstrate the usefulness of this approach, although it should be complemented with methods based on similarity relations for the recognition of spelling variants and misspellings.
Clearshot: Eavesdropping on keyboard input from video
- In IEEE Symposium on Security and Privacy (2008)
Cited by 4 (0 self)
Eavesdropping on electronic communication is usually prevented by using cryptography-based mechanisms. However, these mechanisms do not prevent one from obtaining private information through side channels, such as the electromagnetic emissions of monitors or the sound produced by keyboards. While extracting the same information by watching somebody typing on a keyboard might seem to be an easy task, it becomes extremely challenging if it has to be automated. However, an automated tool is needed in the case of long-lasting surveillance procedures or long user activity, as a human being is able to reconstruct only a few characters per minute. This paper presents a novel approach to automatically recovering the text being typed on a keyboard, based solely on a video of the user typing. As part of the approach, we developed a number of novel techniques for motion tracking, sentence reconstruction, and error correction. The approach has been implemented in a tool, called ClearShot, which has been tested in a number of realistic settings where it was able to reconstruct a substantial part of the typed information.
Introduction to Corpus-based Statistics-oriented (CBSO) Techniques
- Pre-Conference Workshop on Corpus-based NLP, ROCLING VII, National Tsing-Hua Univ
, 1994
Cited by 3 (2 self)
A Corpus-Based Statistics-Oriented (CBSO) methodology, which is an attempt to avoid the drawbacks of traditional rule-based approaches and purely statistical approaches, is introduced in this paper. Rule-based approaches, with rules induced by human experts, had been the dominant paradigm in the natural language processing community. Such approaches, however, suffer from serious difficulties in knowledge acquisition in terms of cost and consistency. Therefore, it is very difficult for such systems to be scaled-up. Statistical methods, with the capability of automatically acquiring knowledge from corpora, are becoming more and more popular, in part, to amend the shortcomings of rule-based approaches. However, most simple statistical models, which adopt almost nothing from existing linguistic knowledge, often result in a large parameter space and, thus, require an unaffordably large training corpus for even well-justified linguistic phenomena. The corpus-based statistics-oriented (CBSO) approach is a compromise between the two extremes of the spectrum for knowledge acquisition. CBSO approach
Automated Email Answering by Text Pattern Matching
- Proc. 7th International Conference on Natural Language Processing (IceTAL 2010)
, 2010
Cited by 2 (2 self)
Answering email by standard answers is a common practice at contact centers. Our research assists this process by creating reply messages that contain one or several standard answers. Our standard answers are linked to representative text patterns that match incoming messages. The system works in three languages. The performance was evaluated on two email sets; the main advantage of our email answering technique is good correctness of the delivered replies.

