Results 1 - 3 of 3
Towards parameter-free data mining
- In: Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2004
Cited by 86 (15 self)
Most data mining algorithms require the setting of many input parameters. Two main dangers of working with parameter-laden algorithms are the following. First, incorrect settings may cause an algorithm to fail in finding the true patterns. Second, a perhaps more insidious problem is that the algorithm may report spurious patterns that do not really exist, or greatly overestimate the significance of the reported patterns. This is especially likely when the user fails to understand the role of parameters in the data mining process. Data mining algorithms should have as few parameters as possible, ideally none. A parameter-free algorithm would limit our ability to impose our prejudices, expectations, and presumptions on the problem at hand, and would let the data itself speak to us. In this work, we show that recent results in bioinformatics and computational theory hold great promise for a parameter-free data mining paradigm. The results are motivated by observations in Kolmogorov complexity theory. However, as a practical matter, they can be implemented using any off-the-shelf compression algorithm with the addition of just a dozen or so lines of code. We will show that this approach is competitive with or superior to the state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering with empirical tests on time series/DNA/text/video datasets.
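The abstract does not give the measure itself, so the following is only a minimal sketch of how a compression-based dissimilarity could be built on an off-the-shelf compressor. It assumes the ratio C(xy) / (C(x) + C(y)), where C is the compressed size in bytes; the function names and the choice of zlib are illustrative assumptions, not taken from the abstract.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression (assumed compressor)."""
    return len(zlib.compress(data, 9))


def compression_dissimilarity(x: bytes, y: bytes) -> float:
    """Assumed ratio C(xy) / (C(x) + C(y)).

    If x and y share structure, compressing them together costs little more
    than compressing one alone, so the ratio falls well below 1. If they
    share nothing, the ratio stays close to 1.
    """
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))
```

A pairwise matrix of such values could then be passed to a standard clustering or nearest-neighbour routine; no parameter other than the choice of compressor is involved.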
A simple and fast DNA compressor
- Software - Practice and Experience, 2004
Cited by 4 (0 self)
In this paper we consider the problem of DNA compression. It is well known that one of the main features of DNA sequences is that they contain substrings which are duplicated except for a few random mutations. For this reason most DNA compressors work by searching and encoding approximate repeats. We depart from this strategy by searching and encoding only exact repeats. However, we use an encoding designed to take advantage of the possible presence of approximate repeats. Our approach leads to an algorithm which is an order of magnitude faster than any other algorithm and achieves a compression ratio very close to the best DNA compressors. Another important feature of our algorithm is its small space occupancy which makes it possible to compress sequences hundreds of megabytes long, well beyond the range of any previous DNA compressor.
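To make the idea of encoding exact repeats concrete, here is a minimal sketch of a greedy parse that replaces a substring with a reference to an earlier identical occurrence. It is not the authors' algorithm: the abstract does not describe their search structure or their encoding, and this naive quadratic search would not scale to long sequences. The function names and the `min_len` threshold are hypothetical.

```python
def parse_exact_repeats(seq: str, min_len: int = 4):
    """Greedily split seq into literals and (start, length) copies.

    A copy token refers to an earlier, non-overlapping exact occurrence.
    min_len is a hypothetical threshold below which a repeat is not
    worth encoding as a copy.
    """
    tokens = []
    i = 0
    while i < len(seq):
        best_start, best_len = -1, 0
        # Naive search over all earlier start positions.
        for j in range(i):
            k = 0
            while i + k < len(seq) and j + k < i and seq[j + k] == seq[i + k]:
                k += 1
            if k > best_len:
                best_start, best_len = j, k
        if best_len >= min_len:
            tokens.append(("copy", best_start, best_len))
            i += best_len
        else:
            tokens.append(("lit", seq[i]))
            i += 1
    return tokens


def decode(tokens) -> str:
    """Rebuild the original sequence from literal and copy tokens."""
    out = []
    for token in tokens:
        if token[0] == "copy":
            _, start, length = token
            out.extend(out[start:start + length])
        else:
            out.append(token[1])
    return "".join(out)
```

A practical compressor would also need a compact binary encoding of the tokens, which this sketch leaves out.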
Using Item Descriptors in Recommender Systems
One of the earliest and most successful technologies used in recommender systems is known as collaborative filtering, a technique that predicts the preferences of one user based on the preferences of other similar users. We present here a different approach that uses a simple learning algorithm to identify and store patterns about items, and a noisy-OR function in order to find recommendations. The technique represents knowledge in item descriptors, which are record-like structures that store knowledge on when to recommend each item. A recommender system keeps several item descriptors that compete when a recommendation is requested. Besides showing good performance, the item descriptors have the advantage of making it easy to understand and monitor the system's knowledge. This paper details the item descriptors as well as the way they are used to identify users' preferences. Preliminary results are presented, and directions for future work are indicated.
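The noisy-OR combination mentioned above has a standard form: given independent pieces of evidence with probabilities p_i, the combined probability is 1 - ∏(1 - p_i). The sketch below applies it to rank competing items. How the paper's item descriptors actually derive the individual probabilities is not stated in the abstract, so the `evidence_by_item` mapping is a hypothetical stand-in.

```python
from math import prod


def noisy_or(probabilities) -> float:
    """Combine independent evidence: 1 - product of (1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def rank_items(evidence_by_item: dict) -> list:
    """Order items by noisy-OR score, highest first.

    evidence_by_item maps an item name to a list of probabilities, one per
    matching piece of evidence (a hypothetical input format).
    """
    scores = {item: noisy_or(ps) for item, ps in evidence_by_item.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Because each additional piece of evidence can only raise the score, an item supported by several weak signals can outrank one supported by a single moderate signal.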

