Results 1–10 of 36
A Probabilistic Learning Approach for Document Indexing
 ACM TRANSACTIONS ON INFORMATION SYSTEMS
, 1991
Abstract

Cited by 98 (13 self)
We describe a method for probabilistic document indexing using relevance feedback data that has been collected from a set of queries. Our approach is based on three new concepts: (1) Abstraction from specific terms and documents, which overcomes the restriction of limited relevance information for parameter estimation. (2) Flexibility of the representation, which allows the integration of new text analysis and knowledge-based methods in our approach as well as the consideration of document structures or different types of terms. (3) Probabilistic learning or classification methods for the estimation of the indexing weights, making better use of the available relevance information. Our approach can be applied under restrictions that hold for real applications. We give experimental results for five test collections which show improvements over other indexing methods.
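The core estimation step — pooling relevance judgements over abstract feature descriptions of (term, document) pairs rather than over the pairs themselves — can be sketched as a relative-frequency estimate. The feature encoding and all names below are illustrative assumptions of this sketch, not taken from the paper:

```python
from collections import defaultdict

def estimate_indexing_weights(feedback):
    """Estimate P(relevant | feature description) by relative frequency.

    `feedback` holds (features, relevant) pairs, where `features` is a
    hashable description of a (term, document) pair, pooled across all
    queries so that sparse relevance data is shared between pairs with
    the same description.
    """
    counts = defaultdict(lambda: [0, 0])   # features -> [relevant, total]
    for features, relevant in feedback:
        counts[features][1] += 1
        if relevant:
            counts[features][0] += 1
    return {x: r / n for x, (r, n) in counts.items()}

weights = estimate_indexing_weights([
    (("title-match", "noun"), True),
    (("title-match", "noun"), True),
    (("title-match", "noun"), False),
    (("body-match", "verb"), False),
])
# weights[("title-match", "noun")] is 2/3
```

Pooling over descriptions is what makes the estimate feasible: each description accumulates judgements from many different term–document pairs.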
A probabilistic framework for vague queries and imprecise information in databases
 PROCEEDINGS OF THE 16TH INTERNATIONAL CONFERENCE ON VERY LARGE DATABASES
, 1990
Abstract

Cited by 60 (13 self)
A probabilistic learning model for vague queries and missing or imprecise information in databases is described. Instead of retrieving only a set of answers, our approach yields a ranking of objects from the database in response to a query. By using relevance judgements from the user about the objects retrieved, the ranking for the actual query as well as the overall retrieval quality of the system can be further improved. For specifying different kinds of conditions in vague queries, the notion of vague predicates is introduced. Based on the underlying probabilistic model, imprecise or missing attribute values can also be treated easily. In addition, the corresponding formulas can be applied in combination with standard predicates (from two-valued logic), thus extending standard database systems to cope with missing or imprecise data.
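As a rough illustration of the idea (the predicate shape, the neutral score for missing values, and all names are assumptions of this sketch, not the paper's probabilistic model), a vague predicate can map each object to a score in [0, 1], and the answer then becomes a ranking rather than a set:

```python
def vague_equals(value, target, tolerance):
    """Vague predicate: 1.0 at an exact match, decaying linearly to 0.0
    at `tolerance` distance. A missing value gets a neutral 0.5 score.
    Both choices are illustrative assumptions of this sketch."""
    if value is None:
        return 0.5
    return max(0.0, 1.0 - abs(value - target) / tolerance)

def rank_by_price(objects, target, tolerance):
    """Rank database objects by descending predicate score."""
    return sorted(objects,
                  key=lambda o: -vague_equals(o.get("price"), target, tolerance))

cars = [{"id": 1, "price": 10000},
        {"id": 2, "price": 14000},
        {"id": 3, "price": None}]
ranking = rank_by_price(cars, target=10000, tolerance=5000)
# scores: car 1 -> 1.0, car 3 -> 0.5, car 2 -> 0.2
```

An object with a missing attribute value is not dropped from the answer; it simply lands in the middle of the ranking, which is the behaviour the abstract describes for imprecise data.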
AIR/X: a Rule-Based Multistage Indexing System for Large Subject Fields
 PROCEEDINGS OF RIAO'91
, 1991
Abstract

Cited by 53 (5 self)
AIR/X is a rule-based system for indexing with terms (descriptors) from a prescribed vocabulary. For this task, an indexing dictionary with rules for mapping terms from the text onto descriptors is required, which can be derived automatically from a set of manually indexed documents. Based on the Darmstadt Indexing Approach, the indexing task is divided into a description step and a decision step. First, terms (single words or phrases) are identified in the document text. With term-descriptor rules from the dictionary, descriptor indications are formed. The set of all indications from a document leading to the same descriptor is called a relevance description. A probabilistic classification procedure computes indexing weights for each relevance description. Since the whole system is rule-based, it can be adapted to different subject fields by appropriate modifications of the rule bases. A major application of AIR/X is the AIR/PHYS system developed for a large physics database. This application is described in more detail along with experimental results.
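The description and decision steps can be sketched as follows; the rule entries, the weights, and the noisy-OR combination are all invented for illustration (the real AIR/X derives its dictionary from manually indexed documents and uses a trained probabilistic classifier for the decision step):

```python
import math

# Hypothetical term -> descriptor rules with indication weights.
RULES = {
    "neutron":    [("nuclear physics", 0.8)],
    "scattering": [("nuclear physics", 0.5), ("optics", 0.4)],
    "laser":      [("optics", 0.9)],
}

def index_document(terms):
    """Description step: collect descriptor indications from the rules.
    All indications for the same descriptor form its relevance description.
    Decision step (here a simple noisy-OR): combine them into one weight."""
    descriptions = {}
    for term in terms:
        for descriptor, w in RULES.get(term, []):
            descriptions.setdefault(descriptor, []).append(w)
    return {d: 1.0 - math.prod(1.0 - w for w in ws)
            for d, ws in descriptions.items()}

weights = index_document(["neutron", "scattering"])
# nuclear physics: 1 - 0.2 * 0.5 = 0.9; optics: 0.4
```

Because all subject-field knowledge lives in the rule base rather than in code, swapping `RULES` is the only change needed to adapt the system to a different field, which is the adaptability the abstract claims.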
Global discretization of continuous attributes as preprocessing for machine learning
 International Journal of Approximate Reasoning
, 1996
Abstract

Cited by 49 (3 self)
Real-life data usually are presented in databases by real numbers. On the other hand, most inductive learning methods require a small number of attribute values. Thus it is necessary to convert input data sets with continuous attributes into input data sets with discrete attributes. Methods of discretization restricted to single continuous attributes will be called local, while methods that simultaneously convert all continuous attributes will be called global. In this paper, a method of transforming any local discretization method into a global one is presented. A global discretization method, based on cluster analysis, is presented and compared experimentally with three known local methods transformed into global ones. Experiments include tenfold cross-validation and leaving-one-out methods for ten real-life data sets.
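For contrast with the global methods studied in the paper, the simplest local method — equal-width binning of a single continuous attribute — can be sketched as below. This is a generic illustration of what "local" means here, not the paper's cluster-analysis method:

```python
def equal_width_bins(values, k):
    """Local equal-width discretization of one continuous attribute:
    split [min, max] into k intervals of equal width. A global method
    would instead consider all continuous attributes simultaneously."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    def bin_of(v):
        return min(int((v - lo) / width), k - 1)  # clamp the maximum into the last bin
    return [bin_of(v) for v in values]

bins = equal_width_bins([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 2)
# -> [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```

A local method like this is applied to each attribute in isolation; the paper's contribution is a scheme for lifting any such method to a global one that treats the attributes jointly.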
CAIM Discretization Algorithm
, 2003
Abstract

Cited by 29 (2 self)
The task of extracting knowledge from databases is quite often performed by machine learning algorithms. The majority of these algorithms can be applied only to data described by discrete numerical or nominal attributes (features). In the case of continuous attributes, there is a need for a discretization algorithm that transforms continuous attributes into discrete ones. This paper describes such an algorithm, called CAIM (class-attribute interdependence maximization), which is designed to work with supervised data. The goal of the CAIM algorithm is to maximize the class-attribute interdependence and to generate a (possibly) minimal number of discrete intervals. The algorithm does not require the user to predefine the number of intervals, as opposed to some other discretization algorithms. The tests performed using CAIM and six other state-of-the-art discretization algorithms show that discrete attributes generated by the CAIM algorithm almost always have the lowest number of intervals and the highest class-attribute interdependency. Two machine learning algorithms, the CLIP4 rule algorithm and the decision tree algorithm, are used to generate classification rules from data discretized by CAIM. For both the CLIP4 and decision tree algorithms, the accuracy of the generated rules is higher and the number of the rules is lower for data discretized using the CAIM algorithm when compared to data discretized using the six other discretization algorithms. The highest classification accuracy was always achieved for datasets discretized using the CAIM algorithm, as compared with the other six algorithms. Of the four supervised algorithms used for comparison, the CAIM algorithm is comparable in speed to the two fastest.
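The criterion and the greedy search can be sketched compactly: the CAIM value of a discretization scheme is the mean over intervals of (largest class count)² / (interval total), and boundaries are added one at a time while the criterion improves. The variable names and the exact stopping rule below are this sketch's simplifications, not a faithful reimplementation:

```python
import bisect

def caim_value(cuts, xs, ys):
    """CAIM criterion: mean over intervals of max_class_count**2 / interval_total."""
    counts = [{} for _ in range(len(cuts) + 1)]
    for x, y in zip(xs, ys):
        c = counts[bisect.bisect_right(cuts, x)]
        c[y] = c.get(y, 0) + 1
    return sum(max(c.values()) ** 2 / sum(c.values())
               for c in counts if c) / len(counts)

def caim_discretize(xs, ys):
    """Greedy CAIM: candidate boundaries are midpoints of adjacent distinct
    values; repeatedly add the one that maximizes the criterion, stopping
    once no candidate improves it and there are at least as many intervals
    as classes."""
    vals = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(vals, vals[1:])]
    cuts, best = [], caim_value([], xs, ys)
    n_classes = len(set(ys))
    while candidates:
        gain, cut = max((caim_value(sorted(cuts + [c]), xs, ys), c)
                        for c in candidates)
        if gain <= best and len(cuts) + 1 >= n_classes:
            break
        cuts = sorted(cuts + [cut])
        candidates.remove(cut)
        best = gain
    return cuts

cuts = caim_discretize([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
# a single boundary at 6.5 cleanly separates the two classes
```

Note how the stopping rule realizes the abstract's claim that no interval count is specified up front: the search itself decides when adding another boundary stops paying off.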
Abstraction of High Level Concepts from Numerical Values in Databases
In: Proceedings of the Knowledge Discovery in Data Mining Workshop (KDD)
, 1994
Abstract

Cited by 24 (12 self)
A conceptual clustering method is proposed for discovering high-level concepts of numerical attribute values from databases. The method considers both frequency and value distributions of the data, and is thus able to discover relevant concepts from numerical attributes. The discovered knowledge can be used for representing data semantically and for providing approximate answers when exact ones are not available. Our knowledge discovery approach is to partition the data set of one or more attributes into clusters that minimize the relaxation error. An algorithm is developed which finds the best binary partition in O(n) time and generates a concept hierarchy in O(n²) time, where n is the number of distinct values of the attribute. The effectiveness of our clustering method is demonstrated by applying it to a large transportation database for approximate query answering. Key words: approximate query answering, type abstraction hierarchy, conceptual clustering, discretization, data summarization, knowledge discovery in databases.
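The O(n) binary-partition step can be sketched with running sums over the sorted values. Note that this sketch minimizes total within-cluster variance as a stand-in for the paper's relaxation error (which additionally weights distances by value frequency); the one-pass structure is the point being illustrated:

```python
def best_binary_split(values):
    """Split numeric values into two clusters minimizing total
    within-cluster variance (a stand-in for relaxation error),
    found in a single O(n) pass over prefix sums after sorting."""
    vs = sorted(values)
    n = len(vs)
    total, total_sq = sum(vs), sum(v * v for v in vs)
    left_sum = left_sq = 0.0
    best_cost, best_k = float("inf"), 1
    for k in range(1, n):                 # left cluster = vs[:k]
        left_sum += vs[k - 1]
        left_sq += vs[k - 1] ** 2
        right_sum, right_sq, m = total - left_sum, total_sq - left_sq, n - k
        # sum of squared deviations for each side, via the shortcut
        # formula sum(v^2) - (sum v)^2 / count
        cost = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / m)
        if cost < best_cost:
            best_cost, best_k = cost, k
    return vs[:best_k], vs[best_k:]

left, right = best_binary_split([1, 2, 3, 10, 11, 12])
# -> ([1, 2, 3], [10, 11, 12])
```

Applying the split recursively to each resulting cluster yields the concept hierarchy, with the O(n²) total cost the abstract states.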
Proportional k-interval discretization for naive-Bayes classifiers
 Proc. of the Twelfth European Conf. on Machine Learning
, 2001
Abstract

Cited by 16 (6 self)
This paper argues that two commonly used discretization approaches, fixed k-interval discretization and entropy-based discretization, have suboptimal characteristics for naive-Bayes classification. This analysis leads to a new discretization method, Proportional k-Interval Discretization (PKID), which adjusts the number and size of discretized intervals to the number of training instances, and thus seeks an appropriate trade-off between the bias and variance of the probability estimation for naive-Bayes classifiers. We justify PKID in theory, as well as test it on a wide cross-section of datasets. Our experimental results suggest that in comparison to its alternatives, PKID provides naive-Bayes classifiers with competitive classification performance for smaller datasets and better classification performance for larger datasets.
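The adjustment rule is simple to sketch: with N training values, form roughly √N equal-frequency intervals of roughly √N values each, so that both interval count and interval size grow with the data. Details such as tie handling are omitted in this sketch:

```python
import math

def pkid_intervals(values):
    """Proportional k-interval discretization: about sqrt(N)
    equal-frequency intervals of about sqrt(N) values each.
    More intervals reduce estimation bias; more values per
    interval reduce estimation variance, and PKID grows both."""
    vs = sorted(values)
    size = max(1, round(math.sqrt(len(vs))))
    return [vs[i:i + size] for i in range(0, len(vs), size)]

parts = pkid_intervals(range(9))
# -> [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
```

Contrast this with fixed k-interval discretization, where k stays constant no matter how much training data arrives, so extra data only ever reduces variance, never bias.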
Discretization Algorithm that Uses Class-Attribute Interdependence Maximization
 Proc. of the 2001 International Conference on Artificial Intelligence (ICAI2001): Las Vegas
, 2001
Abstract

Cited by 11 (2 self)
Most of the existing machine learning algorithms are able to extract knowledge from databases that store discrete attributes (features). If the attributes are continuous, the algorithms can be integrated with a discretization algorithm that transforms them into discrete attributes. The paper describes an algorithm, called CAIM (class-attribute interdependence maximization), for discretization of continuous attributes that is designed to work with supervised learning algorithms. The algorithm maximizes the class-attribute interdependence and, at the same time, generates a (possibly) minimal number of discrete intervals. A major advantage is that it does not require the user to predefine the number of intervals, in contrast to many existing discretization algorithms. The CAIM algorithm and five other state-of-the-art discretization algorithms were tested on well-known machine learning datasets consisting of continuous and mixed-mode attributes. The tests show that the proposed algorithm generates discrete attributes with, almost always, the highest class-attribute interdependency when compared with other algorithms, and at the same time it always generates the lowest number of intervals. The discretized datasets were used in conjunction with the CLIP4 machine learning algorithm. The accuracy of the rules generated by CLIP4 shows that the proposed algorithm significantly improves classification performance; it also performs best in comparison with the other five discretization algorithms. The CAIM algorithm's speed is comparable to the simplest unsupervised algorithms and outperforms other supervised discretization algorithms. Keywords: discretization, class-attribute interdependency maximization, CAIM algorithm, machine learning, classification, CLIP4 algorithm
An Empirical Comparison of Discretization Methods
 In Proc. of the 10th Int. Symp. on Computer and Information Sciences
, 1995
Abstract

Cited by 9 (2 self)
Many machine learning and neurally inspired algorithms are limited, at least in their pure form, to working with nominal data. However, for many real-world problems, some provision must be made to support processing of continuously valued data. This paper presents empirical results obtained by using six different discretization methods as preprocessors to three different supervised learners on several real-world problems. No discretization technique clearly outperforms the others. Also, discretization as a preprocessing step is in many cases found to be inferior to direct handling of continuously valued data. These results suggest that machine learning algorithms should be designed to directly handle continuously valued data rather than relying on preprocessing or ad hoc techniques.