Results 1 - 2 of 2
Textual Data Mining through the Synergistic Combination of Classifiers and Linguistic Processors
, 1999
Cited by 1 (1 self)
Numerical data mining tools are generally quite robust but only provide coarse-granularity results; such tools can handle very large inputs. Computational linguistic tools are able to provide fine-granularity results but are less robust; such tools, often semi-automatic, usually handle relatively short inputs. A synergistic combination of both types of tools is the basis of our hybrid approach. First, a connectionist classifier is used to locate potentially interesting documents, or segments thereof. Second, the user selects segments that will be forwarded to the linguistic processor in order to semi-automatically analyse their textual data and extract relevant information or knowledge elements. We present the main characteristics of our hybrid approach to textual data mining, plus a methodology by which it can be put to use. We also report on the results of a first evaluation involving a corpus made up of two texts pertaining to two different domains. Keywords: Data mining, connectio...
Estimating Sparse Events using Probabilistic Logic: Application to Word n-Grams
Sparse events arise in many tasks across different fields. To assign probabilities to such events, researchers commonly perform maximum likelihood (ML) estimation. However, it is well known that the ML estimator is sensitive to extreme values. In other words, configurations with low or high frequencies are respectively underestimated or overestimated and are therefore unreliable. To solve this problem and to better evaluate these probability values, we propose a novel approach based on the probabilistic logic (PL) paradigm. For the sake of illustration, we focus in this paper on events such as word trigrams (w_3; w_1; w_2) or word/POS-tag trigrams ((w_3; t_3); (w_1; t_1); (w_2; t_2)). These latter entities are the basic objects used in speech or handwriting recognition. To distinguish between, for example, "replace the fun" and "replace the floor", an accurate estimation of these two trigrams is needed. The ML estimation is equival...
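As a rough illustration of the ML baseline this abstract criticises (not the authors' probabilistic logic method), the usual relative-frequency estimate of a word trigram is P(w_3 | w_1, w_2) = c(w_1, w_2, w_3) / c(w_1, w_2). The sketch below assumes a toy corpus and a hypothetical helper name, `ml_trigram_estimator`; neither comes from the paper.

```python
from collections import Counter


def ml_trigram_estimator(tokens):
    """Return a function giving the ML estimate P(w3 | w1, w2) from raw counts.

    Hypothetical helper for illustration; the name is not from the paper.
    """
    # Count every consecutive triple (w1, w2, w3).
    trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    # Count each pair (w1, w2) only where a third word follows it,
    # so the estimates for a given context sum to 1.
    context_counts = Counter(zip(tokens, tokens[1:-1]))

    def prob(w1, w2, w3):
        context = context_counts[(w1, w2)]
        if context == 0:
            return 0.0  # unseen context: the plain ML estimate is undefined, reported as 0 here
        return trigram_counts[(w1, w2, w3)] / context

    return prob


# Toy corpus (an assumption made for this example only).
corpus = (
    "please replace the floor and then replace the floor tiles "
    "before we replace the fun"
).split()

p = ml_trigram_estimator(corpus)
p_floor = p("replace", "the", "floor")  # seen twice after "replace the"
p_fun = p("replace", "the", "fun")      # seen once after "replace the"
p_door = p("replace", "the", "door")    # never seen: estimate collapses to 0
```

With so few observations, a single extra occurrence shifts the estimates substantially and any unseen trigram receives probability zero, which is the sensitivity to sparse counts that the abstract describes.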

