Results 1 - 10 of 44
KDD-Cup 2000 organizers’ report: Peeling the onion
- SIGKDD EXPLORATIONS, 2000
Cited by 74 (2 self)
We describe KDD-Cup 2000, the yearly competition in data mining. For the first time the Cup included insight problems in addition to prediction problems, thus posing new challenges in both the knowledge discovery and the evaluation criteria, and highlighting the need to “peel the onion” and drill deeper into the reasons for the initial patterns found. We chronicle the data generation phase starting from the collection at the site through its conversion to a star schema in a warehouse through data cleansing, data obfuscation for privacy protection, and data aggregation. We describe the information given to the participants, including the questions, site structure, the marketing calendar, and the data schema. Finally, we discuss interesting insights, common mistakes, and lessons learned. Three winners were announced and they describe their own experiences and lessons in the pages following this paper.
Conceptual Modeling for ETL Processes
- 2002
Cited by 28 (9 self)
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we focus on the problem of the definition of ETL activities and provide formal foundations for their conceptual representation. The proposed conceptual model is (a) customized for the tracing of inter-attribute relationships and the respective ETL activities in the early stages of a data warehouse project; (b) enriched with a 'palette' of frequently used ETL activities, like the assignment of surrogate keys, the check for null values, etc.; and (c) constructed in a customizable and extensible manner, so that the designer can enrich it with his own recurring patterns for ETL activities.
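The two 'palette' activities named in this abstract, the null-value check and surrogate key assignment, can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names, the row representation as dictionaries, and the in-memory lookup table are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Row = Dict[str, object]


def check_not_null(rows: Iterable[Row], column: str) -> Tuple[List[Row], List[Row]]:
    """Split rows into (accepted, rejected) depending on whether `column` is non-null."""
    accepted: List[Row] = []
    rejected: List[Row] = []
    for row in rows:
        if row.get(column) is not None:
            accepted.append(row)
        else:
            rejected.append(row)
    return accepted, rejected


def assign_surrogate_keys(
    rows: Iterable[Row],
    natural_key: str,
    lookup: Dict[object, int],
    target: str = "sk",
) -> List[Row]:
    """Attach an integer surrogate key to each row.

    `lookup` maps natural-key values to surrogate keys and is extended in
    place, so the same natural key always receives the same surrogate key.
    """
    keyed: List[Row] = []
    for row in rows:
        value = row[natural_key]
        if value not in lookup:
            lookup[value] = len(lookup) + 1  # next unused key (hypothetical policy)
        keyed.append({**row, target: lookup[value]})
    return keyed
```

Chaining the two functions (reject nulls first, then key the accepted rows) mirrors how such activities are typically composed in an ETL flow; a real warehouse would keep the lookup in a persistent table rather than a dictionary.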
Why is the Snowflake Schema a Good Data Warehouse Design?
- Information Systems
Cited by 19 (0 self)
Database design for data warehouses is based on the notion of the snowflake schema and its important special case, the star schema. The snowflake schema represents a dimensional model which is composed of a central fact table and a set of constituent dimension tables which can be further broken up into subdimension tables. We formalise the concept of a snowflake schema in terms of an acyclic database schema whose join tree satisfies certain structural properties. We then define a normal form for snowflake schemas which captures its intuitive meaning with respect to a set of functional and inclusion dependencies. We show that snowflake schemas in this normal form are independent as well as separable when the relation schemas are pairwise incomparable. This implies that relations in the data warehouse can be updated independently of each other as long as referential integrity is maintained. In addition, we show that a data warehouse in snowflake normal form can be queried by joining the relation over the fact table with the relations over its dimension and subdimension tables. We also examine an information-theoretic interpretation of the snowflake schema and show that the redundancy of the primary key of the fact table is zero. Key words. Data warehouse design, star and snowflake schema, independent and separable database schema, acyclic database schema.
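To make the structure described in this abstract concrete, the following sketch builds a tiny snowflake schema in an in-memory SQLite database and queries it by joining the fact table with its dimension and subdimension tables. The table names (`sales`, `product`, `category`), columns, and sample rows are invented for illustration and do not come from the paper.

```python
import sqlite3

# Hypothetical schema: `sales` is the fact table, `product` a dimension
# table, and `category` a subdimension table of `product`.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE category (
        category_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE product (
        product_id  INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        category_id INTEGER NOT NULL REFERENCES category(category_id)
    );
    CREATE TABLE sales (
        product_id INTEGER NOT NULL REFERENCES product(product_id),
        day        TEXT NOT NULL,
        amount     REAL NOT NULL,
        PRIMARY KEY (product_id, day)
    );
    INSERT INTO category VALUES (1, 'books'), (2, 'games');
    INSERT INTO product VALUES (10, 'atlas', 1), (11, 'novel', 1), (20, 'chess', 2);
    INSERT INTO sales VALUES
        (10, '2000-01-01', 5.0),
        (11, '2000-01-01', 7.0),
        (20, '2000-01-02', 3.0);
    """
)


def sales_by_category(connection: sqlite3.Connection) -> list:
    """Total sales per category: fact table joined with dimension and subdimension."""
    query = """
        SELECT c.name, SUM(s.amount)
        FROM sales AS s
        JOIN product  AS p ON p.product_id  = s.product_id
        JOIN category AS c ON c.category_id = p.category_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return connection.execute(query).fetchall()
```

Collapsing `category` into `product` would turn this example into a star schema, the special case mentioned in the abstract.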
Lessons and challenges from mining retail e-commerce data
- MACHINE LEARNING, 2004
Cited by 18 (0 self)
Abstract. The architecture of Blue Martini Software’s e-commerce suite has supported data collection, transformation, and data mining since its inception. With clickstreams being collected at the application-server layer, high-level events being logged, and data automatically transformed into a data warehouse using meta-data, common problems plaguing data mining using weblogs (e.g. sessionization and conflating multi-sourced data) were obviated, thus allowing us to concentrate on actual data mining goals. We briefly review the architecture and discuss many lessons learned over the last four years and the challenges still facing us. The lessons and challenges are presented across two dimensions: business-level vs. technical, and throughout the data mining lifecycle stages of data collection, data warehouse construction, business intelligence, and deployment. The lessons and challenges are more widely applicable to data mining domains outside retail e-commerce. Keywords: Data mining, data analysis, business intelligence, web analytics, web mining, OLAP,
Modeling ETL activities as graphs
- In Proc. 4th Intl. Workshop on Design and Management of Data Warehouses (DMDW), 2002
Cited by 14 (10 self)
Abstract. Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we focus on the logical design of the ETL scenario of a data warehouse. Based on a formal logical model that includes the data stores, activities and their constituent parts, we model an ETL scenario as a graph, which we call the Architecture Graph. We model all the aforementioned entities as nodes and four different kinds of relationships (instance-of, part-of, regulator and provider relationships) as edges. In addition, we provide simple graph transformations that reduce the complexity of the graph. Finally, in order to support the engineering of the design and the evolution of the warehouse, we introduce specific importance metrics, namely dependence and responsibility, to measure the degree to which entities are bound to each other.
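A graph of this kind can be sketched in a few lines of Python. The four edge kinds below are the ones named in the abstract; everything else is an assumption, including the class name, the node labels, and `upstream_providers`, which is only a simplified stand-in for the idea of measuring how strongly a node depends on others and is not the paper's definition of the dependence metric.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, List, Set, Tuple


class EdgeKind(Enum):
    INSTANCE_OF = "instance-of"
    PART_OF = "part-of"
    REGULATOR = "regulator"
    PROVIDER = "provider"


class ArchitectureGraph:
    """Entities (data stores, activities, attributes) as nodes, typed relationships as edges."""

    def __init__(self) -> None:
        self.nodes: Set[str] = set()
        self.edges: List[Tuple[str, str, EdgeKind]] = []

    def add_edge(self, source: str, target: str, kind: EdgeKind) -> None:
        self.nodes.update((source, target))
        self.edges.append((source, target, kind))

    def upstream_providers(self, node: str) -> Set[str]:
        """Return every node that reaches `node` through provider edges only."""
        incoming: Dict[str, Set[str]] = defaultdict(set)
        for source, target, kind in self.edges:
            if kind is EdgeKind.PROVIDER:
                incoming[target].add(source)
        seen: Set[str] = set()
        stack = [node]
        while stack:
            current = stack.pop()
            for predecessor in incoming[current]:
                if predecessor not in seen:
                    seen.add(predecessor)
                    stack.append(predecessor)
        return seen
```

The size of the returned set gives a rough count of how many entities a node transitively receives data from, which is one plausible way to approximate "degree of binding" in such a graph.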
Using AutoMed Metadata in Data Warehousing Environments
- Methodologies of Component-Based Enterprise Systems, European Journal of Information Systems, 2003
Cited by 10 (4 self)
What kind of metadata can be used for expressing the multiplicity of data models and the data transformation and integration processes in data warehousing environments? How can this metadata be further used for supporting other data warehouse activities? We examine how these questions are addressed by AutoMed, a system for expressing data transformation and integration processes in heterogeneous database environments.
A Decision Theoretic Cost Model for Dynamic Plans
- IEEE DATA ENGINEERING BULLETIN, 2000
Cited by 5 (0 self)
Since the classic optimization work in System R, query optimization has completely preceded query evaluation. Unfortunately, errors in cost model parameters such as selectivity estimation compromise the optimality of query evaluation plans optimized at compile time. The only promising remedy is to interleave strategy selection and data access using run-time-dynamic plans. Based on the principles of decision theory, our cost model enables careful query analysis and prepares alternative query evaluation plans at compile time, delaying relatively few, selected decisions until run time. In our prototype optimizer, these run-time decisions are based only on those materialized intermediate results for which materialization costs are expected to be less than the benefits from the improved decision quality.
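The trade-off in the last sentence of this abstract can be illustrated with a small expected-cost calculation. This is a generic decision-theoretic sketch, not the paper's cost model: the scenario representation, the function names, and the cost functions used in the usage note are assumptions.

```python
from typing import Callable, Dict, Sequence, Tuple

Scenario = Tuple[float, float]      # (probability, selectivity)
CostFn = Callable[[float], float]   # plan cost as a function of selectivity


def best_static_cost(plans: Dict[str, CostFn], scenarios: Sequence[Scenario]) -> float:
    """Expected cost of the single plan that is best on average (chosen at compile time)."""
    return min(
        sum(prob * cost(sel) for prob, sel in scenarios)
        for cost in plans.values()
    )


def expected_dynamic_cost(plans: Dict[str, CostFn], scenarios: Sequence[Scenario]) -> float:
    """Expected cost when the best plan is picked after the selectivity is observed."""
    return sum(
        prob * min(cost(sel) for cost in plans.values())
        for prob, sel in scenarios
    )


def should_delay_decision(
    plans: Dict[str, CostFn],
    scenarios: Sequence[Scenario],
    materialization_cost: float,
) -> bool:
    """Delay the choice to run time only if the expected benefit exceeds the observation cost."""
    benefit = best_static_cost(plans, scenarios) - expected_dynamic_cost(plans, scenarios)
    return benefit > materialization_cost
```

For example, with a hypothetical index scan costing `10 + 1000 * selectivity`, a full scan costing a flat `200`, and selectivities 0.01 and 0.9 equally likely, the best static plan costs 200 on average while choosing at run time costs 110 on average, so delaying pays off only when observing the selectivity costs less than 90.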
A Methodology for the Conceptual Modeling of ETL Processes
- In Proc. of DSE’03, 2002
Cited by 4 (2 self)
Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we propose a methodology for the earliest stages of the data warehouse design, with the goal of tracing the analysis of the structure and content of the existing data sources and their intentional mapping to the common conceptual data warehouse model. The methodology comprises a set of steps that can be summarized as follows: (a) identification of the proper data stores; (b) candidates and active candidates for the involved data stores; (c) attribute mapping between the providers and the consumers, and (d) annotation of the diagram with runtime constraints.
Multidimensional Design by Examples
- Proc. of 8th Int. Conf. on Data Warehousing and Knowledge Discovery (DaWaK 2006). Volume 4081 of LNCS, 2006
Cited by 4 (3 self)
Abstract. In this paper we present a method to validate user multidimensional requirements expressed in terms of SQL queries. Furthermore, our approach automatically generates and proposes the set of multidimensional schemas satisfying the user requirements, from the organizational operational schemas. If no multidimensional schema is generated for a query, we can state that the requirement is not multidimensional. Keywords: Multidimensional Design, Design by Examples, DW.
Business Process Oriented Development of Data Warehouse Structures
- DATA WAREHOUSING 2000 – METHODEN, ANWENDUNGEN, STRATEGIEN; PHYSICA: HEIDELBERG 2000, 2000
Cited by 4 (0 self)
In recent years data warehouse projects have become popular with most companies. Unfortunately, many of these projects come to grief due to missing engineering strategies and modeling standards. The common method for developing multidimensional data structures is to derive relevant datasets from underlying operational data sources. But in fact developing large data schemata for decision support applications requires more knowledge about the underlying business domain. It is obvious that for efficient decision making an approach is required which additionally focuses on goals and strategies of the company. This information cannot be extracted by analyzing the operational data sources. We realized that commonly applied business process models can be used to handle this problem. In this paper we present a completely new approach to warehouse development – the derivation of data warehouse structures from business process models.

