Differentially Private Data Cubes: Optimizing Noise Sources and Consistency
Cited by 40 (3 self)
Data cubes play an essential role in data analysis and decision support. In a data cube, data from a fact table is aggregated on subsets of the table’s dimensions, forming a collection of smaller tables called cuboids. When the fact table includes sensitive data such as salary or diagnosis, publishing even a subset of its cuboids may compromise individuals’ privacy. In this paper, we address this problem using differential privacy (DP), which provides provable privacy guarantees for individuals by adding noise to query answers. We choose an initial subset of cuboids to compute directly from the fact table, injecting DP noise as usual, and then compute the remaining cuboids from the initial set. Given a fixed privacy guarantee, we show that it is NP-hard to choose the initial set of cuboids so that the maximal noise over all published cuboids is minimized, or so that the number of cuboids with noise below a given threshold (precise cuboids) is maximized. We provide an efficient procedure with running time polynomial in the number of cuboids to select the initial set of cuboids, such that the maximal noise in all published cuboids will be within a factor $(\ln |L| + 1)^2$ of the optimal, where $|L|$ is the number of cuboids to be published, or the number of precise cuboids will be within a factor $(1 - 1/e)$ of the optimal. We also show how to enforce consistency in the published cuboids while simultaneously improving their utility (reducing error). In an empirical evaluation on real and synthetic data, we report the amounts of error of different publishing algorithms, and show that our approaches outperform baselines significantly.
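The publishing scheme in this abstract (noise an initial cuboid once, then derive coarser cuboids from it) can be illustrated with a minimal Python sketch. It assumes count cuboids, the Laplace mechanism with sensitivity 1, and a single initial cuboid that receives the whole privacy budget `epsilon`. The function names, the `domains` argument, and the toy data are hypothetical, and the sketch does not implement the paper's cuboid-selection or consistency procedures.

```python
import random
from collections import Counter
from itertools import product


def laplace(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_cuboid(rows, dims, domains, epsilon, rng):
    """Count rows per cell of the cuboid over `dims` and add Laplace noise.

    Every cell of the domain gets noise, including empty ones.
    Assumption: adding or removing one row changes one count by 1,
    so the noise scale is 1 / epsilon.
    """
    exact = Counter(tuple(row[d] for d in dims) for row in rows)
    cells = product(*(domains[d] for d in dims))
    return {cell: exact[cell] + laplace(1.0 / epsilon, rng) for cell in cells}


def roll_up(noisy, keep):
    """Derive a coarser cuboid from a noisy one by summing its cells.

    `keep` lists the positions (within each cell tuple) to retain.
    No fresh noise is added and the fact table is not touched again.
    """
    coarse = {}
    for cell, value in noisy.items():
        key = tuple(cell[i] for i in keep)
        coarse[key] = coarse.get(key, 0.0) + value
    return coarse


if __name__ == "__main__":
    rng = random.Random(7)
    rows = [("F", "NY"), ("M", "NY"), ("F", "LA")]  # toy fact table
    domains = [["F", "M"], ["NY", "LA"]]            # values per dimension
    base = noisy_cuboid(rows, (0, 1), domains, epsilon=0.5, rng=rng)
    print(roll_up(base, keep=(0,)))                 # noisy counts by first dimension
```

Each rolled-up cell sums the noise of the base cells it covers, so coarser cuboids derived this way carry more noise. That trade-off is what makes the choice of initial cuboids matter.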
Answering top-k queries with multidimensional selections: The ranking cube approach
In VLDB, 2006
Cited by 23 (7 self)
Observed in many real applications, a top-k query often consists of two components to reflect a user’s preference: a selection condition and a ranking function. A user may not only propose ad hoc ranking functions, but also use different interesting subsets of the data. In many cases, a user may want to have a thorough study of the data by initiating a multidimensional analysis of the top-k query results. Previous work on top-k query processing mainly focuses on optimizing data access according to the ranking function only. The problem of efficiently answering top-k queries with multidimensional selections has not been well addressed yet. This paper proposes a new computational model, called ranking cube, for efficiently answering top-k queries with multidimensional selections. We define a rank-aware measure for the cube, capturing our goal of responding to multidimensional ranking analysis. Based on the ranking cube, an efficient query algorithm is developed which progressively retrieves data blocks until the top-k results are found. The curse of dimensionality is a well-known challenge for the data cube and we cope with this difficulty by introducing a new technique of ranking fragments. Our experiments on Microsoft’s SQL Server 2005 show that our proposed approaches have significant improvement over the previous methods.
Graph OLAP: Towards online analytical processing on graphs
In: Proc. 2008 Int. Conf. on Data Mining (ICDM 2008), 2008
Cited by 22 (7 self)
OLAP (On-Line Analytical Processing) is an important notion in data analysis. Recently, more and more graph or networked data sources come into being. There exists a similar need to deploy graph analysis from different perspectives and with multiple granularities. However, traditional OLAP technology cannot handle such demands because it does not consider the links among individual data tuples. In this paper, we develop a novel graph OLAP framework, which presents a multi-dimensional and multi-level view over graphs. The contributions of this work are twofold. First, starting from basic definitions, i.e., what are dimensions and measures in the graph OLAP scenario, we develop a conceptual framework for data cubes on graphs. We also look into different semantics of OLAP operations, and classify the framework into two major subcases: informational OLAP and topological OLAP. Then, with more emphasis on informational OLAP (topological OLAP will be covered in a future study due to the lack of space), we show how a graph cube can be materialized by calculating a special kind of measure called aggregated graph and how to implement it efficiently. This includes both full materialization and partial materialization where constraints are enforced to obtain an iceberg cube. We can see that the aggregated graphs, which depend on the graph properties of underlying networks, are much harder to compute than their traditional OLAP counterparts, due to the increased structural complexity of data. Empirical studies show insightful results on real datasets and demonstrate the efficiency of our proposed optimizations.
OLAP on Sequence Data
Cited by 20 (1 self)
Many kinds of real-life data exhibit logical ordering among their data items and are thus sequential in nature. However, traditional online analytical processing (OLAP) systems and techniques were not designed for sequence data and they are incapable of supporting sequence data analysis. In this paper, we propose the concept of Sequence OLAP, or S-OLAP for short. The biggest distinction of S-OLAP from traditional OLAP is that a sequence can be characterized not only by the attributes’ values of its constituting items, but also by the subsequence/substring patterns it possesses. This paper studies many aspects related to Sequence OLAP. The concepts of sequence cuboid and sequence data cube are introduced. A prototype S-OLAP system is built in order to validate the proposed concepts. The prototype is able to support “pattern-based” grouping and aggregation, which is currently not supported by any OLAP system. The implementation details of the prototype system as well as experimental results are presented.
Graph Cube: On Warehousing and OLAP Multidimensional Networks
Cited by 17 (2 self)
We consider extending decision support facilities toward large sophisticated networks, upon which multidimensional attributes are associated with network entities, thereby forming the so-called multidimensional networks. Data warehouses and OLAP (Online Analytical Processing) technology have proven to be effective tools for decision support on relational data. However, they are not well-equipped to handle the new yet important multidimensional networks. In this paper, we introduce Graph Cube, a new data warehousing model that supports OLAP queries effectively on large multidimensional networks. By taking account of both attribute aggregation and structure summarization of the networks, Graph Cube goes beyond the traditional data cube model involved solely with numeric value based group-by’s, thus resulting in a more insightful and structure-enriched aggregate network within every possible multidimensional space. Besides traditional cuboid queries, a new class of OLAP queries, crossboid, is introduced that is uniquely useful in multidimensional networks and has not been studied before. We implement Graph Cube by combining special characteristics of multidimensional networks with the existing well-studied data cube techniques. We perform extensive experimental studies on a series of real-world data sets and Graph Cube is shown to be a powerful and efficient tool for decision support on large multidimensional networks.
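The idea of an aggregate network, combining attribute aggregation with structure summarization, can be sketched in a few lines of Python. The sketch assumes an undirected network and uses plain counts as the measure on both aggregated nodes and aggregated edges. The function `graph_cuboid`, its input layout, and the toy network are hypothetical and are not taken from the Graph Cube paper.

```python
from collections import Counter


def graph_cuboid(node_attrs, edges, dims):
    """Summarize an attributed, undirected network along the attributes in `dims`.

    node_attrs: {node_id: {attribute_name: value}}
    edges:      iterable of (u, v) node-id pairs
    Returns (node_weight, edge_weight): how many original nodes fall into
    each attribute group, and how many original edges connect each pair
    of groups (pairs are stored in sorted order).
    """
    group = {n: tuple(attrs[d] for d in dims) for n, attrs in node_attrs.items()}
    node_weight = Counter(group.values())
    edge_weight = Counter(tuple(sorted((group[u], group[v]))) for u, v in edges)
    return node_weight, edge_weight


if __name__ == "__main__":
    nodes = {
        1: {"gender": "F", "city": "NY"},
        2: {"gender": "F", "city": "LA"},
        3: {"gender": "M", "city": "NY"},
        4: {"gender": "M", "city": "LA"},
    }
    edges = [(1, 2), (2, 3), (3, 4), (1, 3)]
    print(graph_cuboid(nodes, edges, dims=("gender",)))
    print(graph_cuboid(nodes, edges, dims=("gender", "city")))
```

Varying `dims` yields one aggregate network per subset of attributes, the graph analogue of the cuboids in a relational data cube.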
Mining approximate top-k subspace anomalies in multidimensional time-series data
In VLDB, 2007
Cited by 12 (3 self)
Market analysis is a representative data analysis process with many applications. In such an analysis, critical numerical measures, such as profit and sales, fluctuate over time and form time-series data. Moreover, the time-series data correspond to market segments, which are described by a set of attributes, such as age, gender, education, income level, and product category, that form a multidimensional structure. To better understand market dynamics and predict future trends, it is crucial to study the dynamics of time series in multidimensional market segments. This is a topic that has been largely ignored in time-series and data cube research. In this study, we examine the issues of anomaly detection in multidimensional time-series data. We propose a time-series data cube to capture the multidimensional space formed by the attribute structure. This facilitates the detection of anomalies based on expected values derived from higher-level, “more general” time series. Anomaly detection in a time-series data cube poses computational challenges, especially for high-dimensional, large data sets. To this end, we also propose an efficient search algorithm to iteratively select subspaces in the original high-dimensional space and detect anomalies within each one. Our experiments with both synthetic and real-world data demonstrate the effectiveness and efficiency of the proposed solution.
Graph OLAP: a multidimensional framework for graph data analysis
Knowl Inf Syst, 2009
Cited by 10 (2 self)
Databases and data warehouse systems have been evolving from handling normalized spreadsheets stored in relational databases, to managing and analyzing diverse application-oriented data with complex interconnecting structures. Responding to this emerging trend, graphs have been growing rapidly and showing their critical importance in many applications, such as the analysis of XML, social networks, Web, biological data, multimedia data and spatiotemporal data. Can we extend useful functions of databases and data warehouse systems to handle graph structured data? In particular, OLAP (On-Line Analytical Processing) has been a popular tool for fast and user-friendly multidimensional analysis of data warehouses. Can we OLAP graphs? Unfortunately, to our best knowledge, there are no OLAP tools available that can interactively view and analyze graph data from different perspectives and with multiple granularities. In this paper, we argue that it is critically important to OLAP graph structured data and propose a novel Graph OLAP framework. According to this framework, given a graph dataset with its nodes and edges associated with respective attributes, a multidimensional model can be built to enable efficient on
Distributed Cube Materialization on Holistic Measures
Cited by 9 (0 self)
Cube computation over massive datasets is critical for many important analyses done in the real world. Unlike commonly studied algebraic measures such as SUM that are amenable to parallel computation, efficient cube computation of holistic measures such as TOP-K is non-trivial and often impossible with current methods. In this paper we detail real-world challenges in cube materialization tasks on Web-scale datasets. Specifically, we identify an important subset of holistic measures and introduce MR-Cube, a MapReduce-based framework for efficient cube computation on these measures. We provide extensive experimental analyses over both real and synthetic data. We demonstrate that, unlike existing techniques which cannot scale to the 100 million tuple mark for our datasets, MR-Cube successfully and efficiently computes cubes with holistic measures over billion-tuple datasets.
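The contrast between measures that parallelize easily and holistic ones can be seen in a small Python sketch. It uses the median as a stand-in holistic measure, which is an assumption made for illustration (the abstract names TOP-K), and it does not reproduce MR-Cube. The function names and the toy partitions are hypothetical.

```python
from statistics import median


def sum_from_partials(partitions):
    """SUM: each partition is reduced to one number, and the partial
    results combine into the exact global answer."""
    return sum(sum(part) for part in partitions)


def median_from_partials(partitions):
    """Naive attempt to treat the median the same way. Combining the
    per-partition medians does NOT give the global median in general."""
    return median(median(part) for part in partitions)


def exact_median(partitions):
    """A holistic measure needs the values themselves, not a fixed-size
    summary per partition, so the partitions are merged first."""
    return median(value for part in partitions for value in part)


if __name__ == "__main__":
    partitions = [[1, 1, 1], [2, 9], [10]]  # data split across three workers
    print(sum_from_partials(partitions))
    print(median_from_partials(partitions), exact_median(partitions))
```

On this toy split the partial sums recombine exactly, while the median of per-partition medians differs from the true median. A framework for holistic measures therefore has to move or group the raw values carefully instead of pre-aggregating them.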
Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic
In Extending Database Technology (EDBT), 2008
Cited by 8 (0 self)
Estimating the number of distinct elements in a large multiset has several applications, and hence has attracted active research in the past two decades. Several sampling and sketching algorithms have been proposed to accurately solve this problem. The goal of the literature has always been to estimate the number of distinct elements while using minimal resources. However, in some modern applications, the accuracy of the estimate is of primal importance, and businesses are willing to trade more resources for better accuracy. Throughout our experience with building a distinct count system at a major search engine, Ask.com, we reviewed the literature of approximating distinct counts, and compared most algorithms in the literature. We deduced that Linear Counting, one of the least used algorithms, has unique and impressive advantages when the accuracy of the distinct count is critical to the business. For other estimators to attain comparable accuracy, they need more space than Linear Counting. We have supported our analytical results through comprehensive experiments. The experimental results highly favor Linear Counting when the number of distinct elements is large and the error tolerance is low.
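For readers unfamiliar with Linear Counting, a minimal Python sketch of the standard estimator follows: hash every element into a bitmap of m slots and estimate the distinct count as -m ln(V), where V is the fraction of slots left empty. The SHA-256 hash, the one-byte-per-slot bitmap, and the function name `linear_count` are choices made for this illustration and are not details of the system described in the abstract.

```python
import hashlib
import math


def linear_count(items, m):
    """Estimate the number of distinct items using a bitmap of m slots.

    Each item sets the slot chosen by its hash. With V the fraction of
    empty slots, the estimate is -m * ln(V). Duplicates hash to the
    same slot, so they do not change the result.
    """
    bitmap = bytearray(m)  # one byte per slot for simplicity; a packed bit array would use m bits
    for item in items:
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        slot = int.from_bytes(digest[:8], "big") % m
        bitmap[slot] = 1
    empty = m - sum(bitmap)
    if empty == 0:
        raise ValueError("bitmap is saturated; choose a larger m")
    return -m * math.log(empty / m)


if __name__ == "__main__":
    # 10,000 events drawn from 1,000 distinct values
    stream = [i % 1000 for i in range(10_000)]
    print(round(linear_count(stream, m=8192)))
```

The bitmap size m has to grow roughly in proportion to the number of distinct elements, which is the space cost the abstract says some businesses are willing to pay for accuracy.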