## Using sample size to limit exposure to data mining

### Other Repositories/Bibliography

Venue: Journal of Computer Security

Citations: 39 (8 self)

### BibTeX

@ARTICLE{Clifton_usingsample,
  author = {Chris Clifton},
  title = {Using sample size to limit exposure to data mining},
  journal = {Journal of Computer Security},
  year = {2000}
}

### Abstract

Data mining introduces new problems in database security. The basic problem of using non-sensitive data to infer sensitive data is made more difficult by the “probabilistic” inferences possible with data mining. This paper shows how lower bounds from pattern recognition theory can be used to determine sample sizes where data mining tools cannot obtain reliable results.
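The sample-size idea in the abstract can be illustrated with a small sketch. It assumes the lower bound takes the form given as a VC-dimension result in Devroye, Györfi and Lugosi's textbook: for a classifier class of VC dimension $V$ and $n \ge V - 1$, the worst-case expected error is at least $\frac{V-1}{2en}\left(1 - \frac{1}{n}\right)$. The excerpt of that theorem on this page is truncated, so the exact form used here is an assumption, and the function names `vc_error_lower_bound` and `max_safe_sample_size` are hypothetical, not taken from the paper.

```python
import math
from typing import Optional


def vc_error_lower_bound(vc_dim: int, n: int) -> float:
    """Assumed worst-case lower bound on expected error:
    (V - 1) / (2 e n) * (1 - 1/n), stated for n >= V - 1."""
    if vc_dim < 2:
        raise ValueError("vc_dim must be at least 2 for a non-trivial bound")
    if n < max(vc_dim - 1, 1):
        raise ValueError("the bound is only stated for n >= vc_dim - 1")
    return (vc_dim - 1) / (2 * math.e * n) * (1 - 1 / n)


def max_safe_sample_size(vc_dim: int, min_error: float) -> Optional[int]:
    """Largest sample size n for which the assumed bound still guarantees
    a worst-case expected error of at least min_error.

    Returns None when even the smallest admissible sample falls below
    min_error. The bound is decreasing in n for n >= 2, so a linear scan
    from the smallest admissible n is sufficient.
    """
    if min_error <= 0:
        raise ValueError("min_error must be positive")
    n = max(vc_dim - 1, 2)
    if vc_error_lower_bound(vc_dim, n) < min_error:
        return None
    while vc_error_lower_bound(vc_dim, n + 1) >= min_error:
        n += 1
    return n


if __name__ == "__main__":
    # Hypothetical policy: keep the adversary's worst-case error above 5%
    # for classifiers of VC dimension 10.
    print(max_safe_sample_size(vc_dim=10, min_error=0.05))
```

The bound is a worst case over data distributions, so a sample no larger than the returned size cannot guarantee the adversary a smaller error on every distribution; it does not say the adversary will do that badly on any particular database.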

### Citations

3245 | Self-Organizing Maps - Kohonen - 1995 |
Citation Context: ...s, the estimation error will be a better estimate of the total error. This is an area for further study. One solution would be to use clustering techniques on the database (e.g., self-organizing maps [Koh90] with thresholds on nearest neighbor) to give a likely value for the vc-dimension of a reasonable classifier. This idea is based on grouping the potential rule left-hand sides into similar groups, wit...

184 | Consistent nonparametric regression - Stone - 1977 |
Citation Context: ...rge sample, i.e., $E L_n \to L^*$. However, the classifier chosen may be dependent on the data or on n, for example this holds for a nearest neighbor classifier if the number of classes $k \to \infty$ and $k/n \to 0$ ([Sto77]). Note that this is heavily data dependent; the following theorem states that we can always find data where we will do poorly for any given sample size: Theorem 1 [DGL96]: Let $\epsilon > 0$ be an arbitrarily...

100 | Theory of Pattern Recognition - Vapnik, Chervonenkis - 1974 |
Citation Context: ...versary can expect they will be off by the given amount with a sample of size n randomly chosen over that distribution. The following theorem gives us a way to make use of this information: Theorem 5 [VC74]: Let $\mathcal{C}$ be a class of discrimination functions with vc dimension $V$. Let $\mathcal{X}$ be the set of all random variables $(X, Y)$ for which $L_{\mathcal{C}} = 0$. Then, for every discrimination rule $g_n$ based upon $X_1, Y_1, \ldots$ ...

78 | Disclosure limitation of sensitive rules - Atallah, Bertino, et al. - 1999 |

69 | Secure statistical databases with random sample queries - Denning - 1980 |
Citation Context: ...aggregates from individual tuples. Although the basic problem is quite different, as we move toward non-random samples the two areas may overlap. Of particular note is work on random sampling queries [Den80]; this may provide tools to implement policies governing the creation of non-random samples. Another possible starting point for this is artificial intelligence work on selection of training data [CKB...

27 | A Probabilistic Theory of Pattern Recognition - Devroye, Györfi, Lugosi - 1996 |

23 | Wizard: A database inference analysis and detection system - Delugach, Hinke - 1996 |
Citation Context: ...re ordered by SB – from this we can infer that there must be a SSA at SB. Most of the work in preventing inference in multi-level secure databases has concentrated on preventing such “provable” facts [DH96]. Recent work has extended this to capturing data-level, rather than schema-level, functional dependencies [YL98]. However, data mining provides what could be viewed as probabilistic inferences. These...

22 | Inference in MLS database systems - Marks - 1996 |
Citation Context: ...acking access over time, as a collection of small independent samples must be treated as a single large sample for our purposes. This is a problem faced with all inference protection mechanisms; Marks [Mar96] solves this for “normal” inference with a mechanism for tracking and limiting what a given user has seen over time. 2 Motivating Example In this section we will give a more detailed presentation of t...

16 | Protecting databases from inference attacks - Hinke, Delugach, et al. - 1997 |

11 | Protecting confidentiality in small population health and environmental statistics - Cox - 1996 |

11 | Aerie: An inference modeling and detection approach for databases - Hinke, Delugach - 1992 |

9 | The design and implementation of a data level database inference detection system - Yip, Levitt - 1998 |
Citation Context: ...e in multi-level secure databases has concentrated on preventing such “provable” facts [DH96]. Recent work has extended this to capturing data-level, rather than schema-level, functional dependencies [YL98]. However, data mining provides what could be viewed as probabilistic inferences. These are relationships that do not always hold true (are not a functional dependency), but hold true substantially mo...

8 | A Framework for Inference-Directed Data Mining - Hinke, Delugach, et al. - 1996 |

7 | Logical vs. numerical inference on statistical databases - Chowdhury, Duncan, et al. |

5 | Protecting Databases From Inference Attacks (Computers and Security) - Hinke, Delugach, et al. - 1997 |

4 | Estimation of Dependences Based on Empirical Data - Vapnik - 1982 |
Citation Context: ... database). We show that we can draw a relationship between the sample size and the likelihood that the rules are correct. We base this on the fact that inference rules can be used to classify. Vapnik [Vap82] has shown error expectations in classification on samples. This is due to expectations of a random sample of a population having a different distribution with respect to any classification informatio...

3 | Lower bounds in pattern recognition and learning - Devroye, Lugosi - 1995 |

2 | A Probabilistic Theory of Pattern Recognition - Devroye, Györfi, Lugosi - 1996 |
Citation Context: ...f classes $k \to \infty$ and $k/n \to 0$ ([Sto77]). Note that this is heavily data dependent; the following theorem states that we can always find data where we will do poorly for any given sample size: Theorem 1 [DGL96]: Let $\epsilon > 0$ be an arbitrarily small number. For any integer $n$ and classification rule $g_n$, there exists a distribution of $(X, Y)$ with Bayes risk $L^* = 0$ such that $E L_n \ge 1/2 - \epsilon$. ...

2 | Towards a universal relation interface - Osborn - 1979 |
Citation Context: ...mining technology. For a complete database, independent analyses can be performed for each table (or collection of tables that describe a single type of entity), and the notion of a universal relation [Osb79] can be used to analyze “global” relationships. We need other information that is more difficult to provide; part of the focus of this work is defining this information in a form reasonable for securi...

1 | DEXTER: A system that experiments with choices of training data using expert knowledge in the domain of DNA hydration - 1995 |

1 | Bayesian methods to the database inference problem - Chang, Moskowitz - 1998 |

1 | Impact of decision-region based classification algorithms on database security - Johnsten, Raghavan - 1999 |