Applications of Finite Automata Representing Large Vocabularies
, 1992
Abstract

Cited by 21 (2 self)
The construction of minimal acyclic deterministic partial finite automata to represent large natural language vocabularies is described. Applications of such automata include: spelling checkers and advisers, multilanguage dictionaries, thesauri, minimal perfect hashing and text compression. Part of this research was supported by a grant awarded by the Brazilian National Council for Scientific and Technological Development (CNPq) to the second author. Authors' Address: Cláudio L. Lucchesi and Tomasz Kowaltowski, Department of Computer Science, University of Campinas, Caixa Postal 6065, 13081 Campinas, SP, Brazil. Email: lucchesi@dcc.unicamp.br and tomasz@dcc.unicamp.br. 1 Introduction The use of finite automata (see for instance [5]) to represent sets of words is a well-established technique. Perhaps the most traditional application is found in compiler construction, where such automata can be used to model and implement efficient lexical analyzers (see [1]). Applications of finit...
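The core idea, a trie whose equivalent suffix subtrees are merged so that the result is a minimal acyclic automaton, can be sketched as follows. This is a generic bottom-up minimization over a toy vocabulary, not the authors' construction; the `Node` class and registry keys are illustrative choices:

```python
# Minimal sketch: build a trie, then merge equivalent subtrees bottom-up
# via a registry keyed by each node's "right language" (finality + outgoing
# edges to canonical children). Not the paper's algorithm, just the idea.

class Node:
    __slots__ = ("final", "edges")
    def __init__(self):
        self.final = False
        self.edges = {}          # char -> Node (partial transition function)

def build_trie(words):
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root

def minimize(node, registry):
    # Replace each subtree by a canonical representative, children first.
    for ch, child in node.edges.items():
        node.edges[ch] = minimize(child, registry)
    key = (node.final, tuple(sorted((c, id(n)) for c, n in node.edges.items())))
    return registry.setdefault(key, node)

def accepts(root, word):
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:         # partial automaton: a missing edge rejects
            return False
    return node.final

words = ["tap", "taps", "top", "tops"]
dawg = minimize(build_trie(words), {})
print(accepts(dawg, "tops"))   # True
print(accepts(dawg, "to"))     # False
```

After minimization the "-ap(s)" and "-op(s)" branches share one suffix subtree, which is where the space savings over a plain trie come from.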
Bonsai: A Compact Representation of Trees
, 1993
Abstract

Cited by 10 (0 self)
This paper shows how trees can be stored in a very compact form, called 'Bonsai', using hash tables. A method is described that is suitable for large trees that grow monotonically within a predefined maximum size limit. Using it, pointers in any tree can be represented within 6 + log₂ n bits per node, where n is the maximum number of children a node can have. We first describe a general way of storing trees in hash tables, and then introduce the idea of compact hashing which underlies the Bonsai structure. These two techniques are combined to give a compact representation of trees, and a practical methodology is set out to permit the design of these structures. The new representation is compared with two conventional tree implementations in terms of the storage required per node. Examples of programs that must store large trees within a strict maximum size include those that operate on trie structures derived from natural language text. We describe how the Bonsai technique has been applied to the trees that arise in text compression and adaptive prediction, and include a discussion of the design parameters that work well in practice.
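The first of the two ingredients, storing a tree in a hash table so that a node needs no explicit child pointers, can be sketched as below. A node is identified by its table slot, and a child is located by probing on the key (parent slot, edge symbol). The table size, sentinel, and probing scheme are illustrative; the paper's compact hashing additionally replaces the stored key with a few bits per slot, which this sketch omits:

```python
# Minimal sketch of storing a tree in a hash table (the pointer-free part of
# Bonsai; the bit-level "compact hashing" is not shown). A node's identity is
# its slot index; children are found by probing on (parent_slot, symbol).

SIZE = 101                       # fixed table capacity, chosen up front
table = [None] * SIZE            # slot -> (parent_slot, symbol) or None
ROOT = 0
table[ROOT] = ("(root)", None)   # occupy slot 0 so no child ever claims it

def probe(parent, symbol):
    h = hash((parent, symbol)) % SIZE
    while table[h] is not None and table[h] != (parent, symbol):
        h = (h + 1) % SIZE       # linear probing past occupied slots
    return h

def add_child(parent, symbol):
    slot = probe(parent, symbol)
    table[slot] = (parent, symbol)
    return slot                  # the child's node id is just its slot

def find_child(parent, symbol):
    slot = probe(parent, symbol)
    return slot if table[slot] == (parent, symbol) else None

t = add_child(ROOT, "t")
a = add_child(t, "a")
print(find_child(ROOT, "t") == t)   # True
print(find_child(t, "x"))           # None
```

Because a node is named by its slot rather than by a machine pointer, the per-node cost is bounded by the slot contents, which compact hashing then squeezes down to the 6 + log₂ n bits quoted in the abstract.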
A broad-coverage normalization system for social media language
, 2012
Abstract

Cited by 10 (1 self)
Social media language contains a huge amount and wide variety of non-standard tokens, created both intentionally and unintentionally by the users. It is of crucial importance to normalize the noisy non-standard tokens before applying other NLP techniques. A major challenge facing this task is system coverage, i.e., for any user-created non-standard term, the system should be able to restore the correct word within its top n output candidates. In this paper, we propose a cognitively driven normalization system that integrates different human perspectives in normalizing the non-standard tokens, including enhanced letter transformation, visual priming, and string/phonetic similarity. The system was evaluated at both the word and message level using four SMS and Twitter data sets. Results show that our system achieves over 90% word coverage across all data sets (a 10% absolute increase compared to the state of the art); the broad word coverage also successfully translates into message-level performance gains, yielding a 6% absolute increase compared to the best prior approach.
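One of the signals the paper combines, string similarity between a non-standard token and lexicon words, can be sketched as a candidate ranker. The lexicon, the similarity measure (Python's `difflib.SequenceMatcher` as a stand-in), and the top-n cutoff are all illustrative assumptions, not the paper's components:

```python
# Minimal sketch of string-similarity candidate generation for normalization.
# The paper also uses letter transformations, visual priming, and phonetic
# similarity; only the string-similarity signal is shown, with a toy lexicon.

from difflib import SequenceMatcher

LEXICON = ["tomorrow", "tonight", "together", "borrow"]  # toy, not the paper's

def top_n(token, n=2):
    # score every lexicon word against the noisy token, keep the n best
    scored = [(SequenceMatcher(None, token, w).ratio(), w) for w in LEXICON]
    return [w for _, w in sorted(scored, reverse=True)[:n]]

print(top_n("2morrow"))
```

In a full system this ranked list would be merged with the other signals before the top n candidates are emitted, which is exactly the word-coverage quantity the abstract reports.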
Compressed Storage of Sparse Finite-State Transducers
 Workshop on Implementing Automata (WIA'99), Pre-Proceedings
, 1999
Abstract

Cited by 9 (0 self)
This paper presents an eclectic approach for compressing weighted finite-state automata and transducers, with minimal impact on performance. The approach is eclectic in the sense that various complementary methods have been employed: row-indexed storage of sparse matrices, dictionary compression, bit manipulation, and lossless omission of data. The compression rate is over 83% with respect to the current Bell Labs FSM library.
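The first method listed, row-indexed storage of a sparse transition matrix, can be sketched as follows. This is the generic CSR-style layout with a hypothetical three-state automaton, not the paper's exact encoding (which adds dictionary compression and bit packing on top):

```python
# Minimal sketch of row-indexed (CSR-style) storage of sparse transitions.
# Instead of a full states x symbols matrix, only present (symbol, target)
# pairs are stored contiguously, plus one offset per state.

# Toy automaton: state 0 -a-> 1, state 0 -b-> 2, state 1 -a-> 2; state 2: none
symbols   = ["a", "b", "a"]      # non-empty entries, concatenated row by row
targets   = [1, 2, 2]
row_start = [0, 2, 3, 3]         # state s owns entries row_start[s]:row_start[s+1]

def step(state, symbol):
    # scan only this state's own slice of the arrays
    for i in range(row_start[state], row_start[state + 1]):
        if symbols[i] == symbol:
            return targets[i]
    return None                  # no such transition: the automaton blocks

print(step(0, "b"))  # 2
print(step(2, "a"))  # None
```

The space cost is one offset per state plus one entry per actual transition, which is why the layout pays off precisely when the transition matrix is sparse.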
E-TeX: Guidelines for Future TeX Extensions
, 1993
Abstract

Cited by 5 (2 self)
With the announcement of TeX 3.0, Don Knuth acknowledged the need of the (ever growing) TeX community for an even better system. But at the same time, he made it clear that he will not get involved in any further enhancements that would change The TeXbook. TeX started out originally as a system designed to typeset its author's own publications. In the meantime it serves hundreds of thousands of users. Now it is time, after ten years' experience, to step back and consider whether or not TeX 3.0 is an adequate answer to the typesetting requirements of the nineties. Output produced by TeX has higher standards than output generated automatically by most other typesetting systems. Therefore, in this paper we will focus on the quality standards set by typographers for hand-typeset documents and ask to what extent they are achieved by TeX. Limitations of TeX's algorithms are analyzed, and missing features as well as new concepts are outlined.
Hyphenation in TeX – Quo Vadis?
 Proceedings of the 9th European TeX Conference, Gdańsk, 1994, edited by W. Bzyl and T. Przechlewski
, 1994
Abstract

Cited by 2 (1 self)
Significant progress has been made in the hyphenation ability of TeX since its first version in 1978. However, in practice, we still face problems in many languages, such as Czech, German and Swedish, when trying to adopt local typesetting industry standards. In this paper we discuss problems of hyphenation in multilingual documents in general, we show how we made the Czech and Slovak hyphenation patterns, and we describe our results achieved using the program PATGEN for hyphenation pattern generation. We show that hyphenation of compound words may be partially solved even within the scope of TeX82. We discuss possible enhancements of the process of hyphenation pattern generation and describe features that might reasonably be incorporated in Ω or another successor to TeX82.
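The pattern mechanism that PATGEN targets (Liang's scheme used by TeX's \patterns) can be sketched as below. The three patterns are toy examples, and TeX's additional rules (such as minimum letters before and after a break) are omitted:

```python
# Minimal sketch of Liang-style hyphenation pattern application, the
# mechanism behind TeX's \patterns and PATGEN's output. Digits in a pattern
# sit between letters; odd maxima mark allowed break points.

PATTERNS = ["hy3ph", "he2n", "hena4"]   # toy patterns, not a real pattern set

def parse(pat):
    letters, vals = "", [0]
    for ch in pat:
        if ch.isdigit():
            vals[-1] = int(ch)           # value at the current inter-letter gap
        else:
            letters += ch
            vals.append(0)
    return letters, vals                 # len(vals) == len(letters) + 1

def hyphenate(word):
    w = "." + word + "."                 # boundary markers, as in TeX
    points = [0] * (len(w) + 1)
    for pat in PATTERNS:
        letters, vals = parse(pat)
        for i in range(len(w) - len(letters) + 1):
            if w[i:i + len(letters)] == letters:
                for j, v in enumerate(vals):
                    points[i + j] = max(points[i + j], v)
    out = ""
    for k, ch in enumerate(word):
        if k > 0 and points[k + 1] % 2 == 1:   # odd value => break allowed
            out += "-"
        out += ch
    return out

print(hyphenate("hyphen"))  # hy-phen
```

PATGEN's job is to learn a pattern set like `PATTERNS` from a word list with marked break points, which is why pattern generation rather than pattern application is where the language-specific effort goes.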
Learning Hierarchical Rule Sets
 In Proc. of the 5th Annual ACM Workshop on Computational Learning Theory
, 1993
Abstract

Cited by 2 (1 self)
We present an algorithm for learning sets of rules that are organized into up to k levels. Each level can contain an arbitrary number of rules "if c then l", where l is the class associated with the level and c is a concept from a given class of basic concepts. The rules of higher levels have precedence over the rules of lower levels and can be used to represent exceptions. As basic concepts we can use Boolean attributes in the infinite attribute space model, or certain concepts defined in terms of substrings. Given a sample of m examples, the algorithm runs in polynomial time and produces a consistent concept representation of size O((log m)^k n^k), where n is the size of the smallest consistent representation with k levels of rules. This implies that the algorithm learns in the PAC model. The algorithm repeatedly applies the greedy heuristic for weighted set cover. The weights are obtained from approximate solutions to previous set cover problems. Key words: computational learni...
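The greedy heuristic for weighted set cover that the algorithm applies repeatedly can be sketched as follows; the instance is a made-up example, and feasibility (some set always covers a remaining element) is assumed:

```python
# Minimal sketch of the greedy heuristic for weighted set cover: always pick
# the set with the lowest weight per newly covered element.

def greedy_cover(universe, sets, weights):
    uncovered, chosen = set(universe), []
    while uncovered:
        best = min(
            (s for s in sets if sets[s] & uncovered),    # sets that still help
            key=lambda s: weights[s] / len(sets[s] & uncovered),
        )
        chosen.append(best)
        uncovered -= sets[best]
    return chosen

sets = {"A": {1, 2, 3}, "B": {3, 4}, "C": {4, 5}, "D": {1, 2, 3, 4, 5}}
weights = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 4.0}
print(greedy_cover({1, 2, 3, 4, 5}, sets, weights))  # ['A', 'C']
```

The heuristic carries the standard logarithmic approximation guarantee, which is what feeds into the O((log m)^k n^k) size bound when it is applied level by level.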
Finite State Methods for Hyphenation
 Natural Language Engineering
, 2002
Abstract

Cited by 2 (1 self)
Hyphenation is the task of identifying potential hyphenation points in words. In this paper, three finite-state hyphenation methods for Dutch are presented and compared in terms of accuracy and size of the resulting automata.
HFST—Framework for Compiling and Applying Morphologies
Abstract

Cited by 2 (1 self)
Abstract. HFST (Helsinki Finite-State Technology, hfst.sf.net) is a framework for compiling and applying linguistic descriptions with finite-state methods. HFST currently connects some of the most important finite-state tools for creating morphologies and spellers into one open-source platform and supports extending and improving the descriptions with weights to accommodate the modeling of statistical information. HFST offers a path from language descriptions to efficient language applications in key environments and operating systems. HFST also provides an opportunity to exchange transducers between different software providers in order to get the best out of each finite-state library.
Word Hyphenation by Neural Networks
, 1996
Abstract

Cited by 2 (1 self)
We discuss our experiments in training feed-forward neural networks for the task of finding valid hyphenation points in all words of a given language. Multilayer neural networks were successfully used to solve this difficult problem. The structure of the network used is given, together with a discussion of training sets, the influence of input coding, and the results of experiments done for the Czech language. We end with the pros and cons of the tested approach, a hybrid architecture suitable for a multilingual system. Keywords: neural networks, hyphenation, back propagation, generalisation, typesetting. 1 Introduction "The invention of the alphabet was one of the greatest advances in the history of civilisation. However, the ancient Phoenicians probably did not anticipate the fact that, centuries later, the problem of word hyphenation would become a major headache for computer typesetters all over the world." (Liang 1983, page 39) The problem of finding all va...
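One design question the abstract raises, input coding, can be sketched as a sliding window of one-hot encoded letters around each candidate break point, which a feed-forward network could then consume. The window size, padding symbol, and alphabet below are illustrative choices, not the paper's:

```python
# Minimal sketch of window-based input coding for hyphenation: each candidate
# break position becomes a fixed-length vector of one-hot encoded letters.

ALPHABET = "abcdefghijklmnopqrstuvwxyz_"   # "_" pads beyond word boundaries

def one_hot(ch):
    v = [0] * len(ALPHABET)
    v[ALPHABET.index(ch)] = 1
    return v

def window_features(word, pos, half=2):
    # `half` letters before and after the candidate break (before word[pos])
    padded = "_" * half + word + "_" * half
    window = padded[pos : pos + 2 * half]
    return [x for ch in window for x in one_hot(ch)]

feats = window_features("hyphen", 2)   # candidate break "hy-phen"
print(len(feats))                      # 4 letters x 27-way one-hot = 108
```

The network would output one probability per such vector, and a word's hyphenation is read off by thresholding the outputs over all positions; the coding choice fixes the input layer size and strongly affects generalisation, which is what the paper's experiments vary.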