Results 1 - 10
of
28
Fast Data Anonymization with Low Information Loss
- in VLDB, 2007
, 2007
"... Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy preserving paradigms of k-anonymity and ℓ-diversity. k-anonymity protects against the identification of an individual’s record. ℓ-diversity, in addition, safeguards against the ..."
Abstract
-
Cited by 29 (5 self)
- Add to MetaCart
Recent research studied the problem of publishing microdata without revealing sensitive information, leading to the privacy preserving paradigms of k-anonymity and ℓ-diversity. k-anonymity protects against the identification of an individual’s record. ℓ-diversity, in addition, safeguards against the association of an individual with specific sensitive information. However, existing approaches suffer from at least one of the following drawbacks: (i) The information loss metrics are counter-intuitive and fail to capture data inaccuracies inflicted for the sake of privacy. (ii) ℓ-diversity is solved by techniques developed for the simpler k-anonymity problem, which introduces unnecessary inaccuracies. (iii) The anonymization process is inefficient in terms of computation and I/O cost. In this paper we propose a framework for efficient privacy preservation that addresses these deficiencies. First, we focus on one-dimensional (i.e., single attribute) quasiidentifiers, and study the properties of optimal solutions for k-anonymity and ℓ-diversity, based on meaningful information loss metrics. Guided by these properties, we develop efficient heuristics to solve the one-dimensional problems in linear time. Finally, we generalize our solutions to multi-dimensional quasi-identifiers using space-mapping techniques. Extensive experimental evaluation shows that our techniques clearly outperform the state-of-the-art, in terms of execution time and information loss. 1.
Anonymizing Bipartite Graph Data using Safe Groupings
, 2008
"... Private data often comes in the form of associations between entities, such as customers and products bought from a pharmacy, which are naturally represented in the form of a large, sparse bipartite graph. As with tabular data, it is desirable to be able to publish anonymized versions of such data, ..."
Abstract
-
Cited by 29 (2 self)
- Add to MetaCart
Private data often comes in the form of associations between entities, such as customers and products bought from a pharmacy, which are naturally represented in the form of a large, sparse bipartite graph. As with tabular data, it is desirable to be able to publish anonymized versions of such data, to allow others to perform ad hoc analysis of aggregate graph properties. However, existing tabular anonymization techniques do not give useful or meaningful results when applied to graphs: small changes or masking of the edge structure can radically change aggregate graph properties. We introduce a new family of anonymizations, for bipartite graph data, called (k, ℓ)-groupings. These groupings preserve the underlying graph structure perfectly, and instead anonymize the mapping from entities to nodes of the graph. We identify a class of “safe ” (k, ℓ)-groupings that have provable guarantees to resist a variety of attacks, and show how to find such safe groupings. We perform experiments on real bipartite graph data to study the utility of the anonymized version, and the impact of publishing alternate groupings of the same graph data. Our experiments demonstrate that (k, ℓ)-groupings offer strong tradeoffs between privacy and utility.
t-closeness: Privacy beyond k-anonymity and ℓ-diversity
- In ICDE
, 2007
"... The k-anonymity privacy requirement for publishing microdata requires that each equivalence class (i.e., a set of records that are indistinguishable from each other with respect to certain “identifying ” attributes) contains at least k records. Recently, several authors have recognized that k-anonym ..."
Abstract
-
Cited by 27 (5 self)
- Add to MetaCart
The k-anonymity privacy requirement for publishing microdata requires that each equivalence class (i.e., a set of records that are indistinguishable from each other with respect to certain “identifying ” attributes) contains at least k records. Recently, several authors have recognized that k-anonymity cannot prevent attribute disclosure. The notion of ℓ-diversity has been proposed to address this; ℓ-diversity requires that each equivalence class has at least ℓ well-represented values for each sensitive attribute. In this paper we show that ℓ-diversity has a number of limitations. In particular, it is neither necessary nor sufficient to prevent attribute disclosure. We propose a novel privacy notion called t-closeness, which requires that the distribution of a sensitive attribute in any equivalence class is close to the distribution of the attribute in the overall table (i.e., the distance between the two distributions should be no more than a threshold t). We choose to use the Earth Mover Distance measure for our t-closeness requirement. We discuss the rationale for t-closeness and illustrate its advantages through examples and experiments. 1.
Privacy: Theory meets Practice on the Map
"... Abstract — In this paper, we propose the first formal privacy analysis of a data anonymization process known as the synthetic data generation, a technique becoming popular in the statistics community. The target application for this work is a mapping program that shows the commuting patterns of the ..."
Abstract
-
Cited by 22 (6 self)
- Add to MetaCart
Abstract — In this paper, we propose the first formal privacy analysis of a data anonymization process known as the synthetic data generation, a technique becoming popular in the statistics community. The target application for this work is a mapping program that shows the commuting patterns of the population of the United States. The source data for this application were collected by the U.S. Census Bureau, but due to privacy constraints, they cannot be used directly by the mapping program. Instead, we generate synthetic data that statistically mimic the original data while providing privacy guarantees. We use these synthetic data as a surrogate for the original data. We find that while some existing definitions of privacy are inapplicable to our target application, others are too conservative and render the synthetic data useless since they guard against privacy breaches that are very unlikely. Moreover, the data in our target application is sparse, and none of the existing solutions are tailored to anonymize sparse data. In this paper, we propose solutions to address the above issues. I.
On the Anonymization of Sparse High-Dimensional Data
"... Abstract — Existing research on privacy-preserving data publishing focuses on relational data: in this context, the objective is to enforce privacy-preserving paradigms, such as kanonymity and ℓ-diversity, while minimizing the information loss incurred in the anonymizing process (i.e. maximize data ..."
Abstract
-
Cited by 15 (1 self)
- Add to MetaCart
Abstract — Existing research on privacy-preserving data publishing focuses on relational data: in this context, the objective is to enforce privacy-preserving paradigms, such as kanonymity and ℓ-diversity, while minimizing the information loss incurred in the anonymizing process (i.e. maximize data utility). However, existing techniques adopt an indexing- or clusteringbased approach, and work well for fixed-schema data, with low dimensionality. Nevertheless, certain applications require privacy-preserving publishing of transaction data (or basket data), which involves hundreds or even thousands of dimensions, rendering existing methods unusable. We propose a novel anonymization method for sparse highdimensional data. We employ a particular representation that captures the correlation in the underlying data, and facilitates the formation of anonymized groups with low information loss. We propose an efficient anonymization algorithm based on this representation. We show experimentally, using real-life datasets, that our method clearly outperforms existing state-of-the-art in terms of both data utility and computational overhead. I.
On Anti-Corruption Privacy Preserving Publication
"... Abstract — This paper deals with a new type of privacy threat, called “corruption”, in anonymized data publication. Specifically, an adversary is said to have corrupted some individuals, if s/he has already obtained their sensitive values before consulting the released information. Conventional gene ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
Abstract — This paper deals with a new type of privacy threat, called “corruption”, in anonymized data publication. Specifically, an adversary is said to have corrupted some individuals, if s/he has already obtained their sensitive values before consulting the released information. Conventional generalization may lead to severe privacy disclosure in the presence of corruption. Motivated by this, we advocate an alternative anonymization technique that integrates generalization with perturbation and stratified sampling. The integration provides strong privacy guarantees, even if an adversary has corrupted any number of individuals. We verify the effectiveness of the proposed technique through experiments with real data. I.
Generalizing PIR for Practical Private Retrieval of Public Data
"... Abstract. Private retrieval of public data is useful when a client wants to query a public data service without revealing the specific query data to the server. Computational Private Information Retrieval (cPIR) is able to achieve complete privacy for a client, but is deemed impractical since it inv ..."
Abstract
-
Cited by 3 (2 self)
- Add to MetaCart
Abstract. Private retrieval of public data is useful when a client wants to query a public data service without revealing the specific query data to the server. Computational Private Information Retrieval (cPIR) is able to achieve complete privacy for a client, but is deemed impractical since it involves expensive computation on all the data on the server. Besides, it is inflexible if the server wants to charge the client based on the service data that is exposed. k-Anonymity, on the other hand, is flexible and cheap for anonymizing the querying process, but is vulnerable to privacy and security threats. In this paper, we propose a practical and flexible approach for the private retrieval of public data called Bounding-Box PIR (bbPIR). Using bbPIR, a client specifies both privacy requirement and service charge budget. The server satisfies the client’s requirements, and at the same time achieves overall good performance in terms of computation and communication costs. bbPIR generalizes cPIR and k-Anonymity in that the bounding box can include as much as all the data on the server or as little as just k data items. The effectiveness of bbPIR compared to cPIR and k-Anonymity is verified using extensive experimental evaluation. 1
Optimal Random Perturbation at Multiple Privacy Levels
"... Random perturbation is a popular method of computing anonymized data for privacy preserving data mining. It is simple to apply, ensures strong privacy protection, and permits effective mining of a large variety of data patterns. However, all the existing studies with good privacy guarantees focus on ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Random perturbation is a popular method of computing anonymized data for privacy preserving data mining. It is simple to apply, ensures strong privacy protection, and permits effective mining of a large variety of data patterns. However, all the existing studies with good privacy guarantees focus on perturbation at a single privacy level. Namely, a fixed degree of privacy protection is imposed on all anonymized data released by the data holder. This drawback seriously limits the applicability of random perturbation in scenarios where the holder has numerous recipients to which different privacy levels apply. Motivated by this, we study the problem of multi-level perturbation, whose objective is to release multiple versions of a dataset anonymized at different privacy levels. The challenge is that various recipients may collude by sharing their data to infer privacy beyond their permitted levels. Our solution overcomes this obstacle, and achieves two crucial properties. First, collusion is useless, meaning that the colluding recipients cannot learn anything more than what the most trustable recipient (among the colluding recipients) already knows alone. Second, the data each recipient receives can be regarded (and hence, analyzed in the same way) as the output of conventional uniform perturbation. Besides its solid theoretical foundation, the proposed technique is both space economical and computationally efficient. It requires O(n+m) expected space, and produces a new anonymized version in O(n+logm) expected time, where n is the cardinality of the original dataset, and m the number of versions released previously. Both bounds are optimal under the realistic assumption that n ≫ m. 1.
Can the utility of anonymized data be used for privacy breaches
- TKDD
"... Group based anonymization is the most widely studied approach for privacy-preserving data publishing. Privacy models/definitions using group based anonymization includes k-anonymity, l-diversity, and t-closeness, to name a few. The goal of this article is to raise a fundamental issue regarding the p ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Group based anonymization is the most widely studied approach for privacy-preserving data publishing. Privacy models/definitions using group based anonymization includes k-anonymity, l-diversity, and t-closeness, to name a few. The goal of this article is to raise a fundamental issue regarding the privacy exposure of the approaches using group based anonymization. This has been overlooked in the past. The group based anonymization approach by bucketization basically hides each individual record behind a group to preserve data privacy. If not properly anonymized, patterns can actually be derived from the published data and be used by an adversary to breach individual privacy. For example, from the medical records released, if patterns such as that people from certain countries rarely suffer from some disease can be derived, then the information can be used to imply linkage of other people in an anonymized group with this disease with higher likelihood. We call the derived patterns from the published data the foreground knowledge. This is in contrast to the background knowledge that the adversary may obtain from other channels, as studied in some previous work. Finally, our experimental results show such an attack is realistic in the privacy benchmark dataset under the traditional group based anonymization approach.
Transparent Anonymization: Thwarting Adversaries Who Know the Algorithm
, 2010
"... Numerous generalization techniques have been proposed for privacy-preserving data publishing. Most existing techniques, however, implicitly assume that the adversary knows little about the anonymization algorithm adopted by the data publisher. Consequently, they cannot guard against privacy attacks ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Numerous generalization techniques have been proposed for privacy-preserving data publishing. Most existing techniques, however, implicitly assume that the adversary knows little about the anonymization algorithm adopted by the data publisher. Consequently, they cannot guard against privacy attacks that exploit various characteristics of the anonymization mechanism. This article provides a practical solution to this problem. First, we propose an analytical model for evaluating disclosure risks, when an adversary knows everything in the anonymization process, except the sensitive values. Based on this model, we develop a privacy principle, transparent l-diversity,which ensures privacy protection against such powerful adversaries. We identify three algorithms that achieve transparent l-diversity, and verify their effectiveness and efficiency through extensive experiments with real data.

