On Issues of Instance Selection (2002)
| Citations: | 8 - 0 self |
BibTeX
@MISC{Liu02onissues,
author = {Huan Liu and Hiroshi Motoda},
title = {On Issues of Instance Selection},
year = {2002}
}
Abstract
Digital technologies and advances in computing, together with the booming use of the internet, have led to massive data collection (corporate data, data warehouses, and the Web, to name a few) and an explosion of information (or misinformation). Szalay and Gray described this phenomenon as “drowning in data” (Szalay and Gray, 1999). They reported that each year the detectors at the CERN particle collider in Switzerland record 1 petabyte of data, and that researchers in areas of science from astronomy to the human genome face the same problem and are choking on information. A very natural question is: “Now that we have gathered so much data, what do we do with it?” Raw data is rarely of direct use, and manual analysis simply cannot keep pace with the fast growth of data. Knowledge discovery and data mining (KDD), an emerging field that draws on disciplines such as databases, statistics, and machine learning, comes to the rescue. KDD attempts to turn raw data into nuggets of knowledge and to create a competitive edge in an ever more competitive world of scientific discovery and business intelligence. Fayyad et al. (1996) define the KDD process as the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The process includes data selection, preprocessing, data mining, and interpretation and evaluation.







