## Clustering Structured Web Sources: a Schema-based, Model-Differentiation Approach (2004)



Venue: EDBT’04 ClustWeb Workshop

Citations: 11 (6 self)

### BibTeX

@INPROCEEDINGS{He04clusteringstructured,
    author = {Bin He and Tao Tao and Kevin Chen-Chuan Chang},
    title = {Clustering Structured Web Sources: a Schema-based, Model-Differentiation Approach},
    booktitle = {EDBT'04 ClustWeb Workshop},
    year = {2004},
    pages = {536--546}
}


### Abstract

The Web has been rapidly "deepened" with the prevalence of databases online.

### Citations

1308 | Data clustering: A review
- Jain, Murty, et al.
- 1999
Citation Context: ..., $C_G$) = [right-hand side of Equation (6), in terms of $D^2$ and $s(df)$]. 3.3 HAC Algorithm and MD-Based Similarity Measure. For constructing domain hierarchy, we adopt the general HAC clustering approach, which is widely used for data clustering [19]. Figure 4 illustrates the general HAC framework [7]. In HAC, we need to measure the similarity of clusters. That is, given a set of clusters, $C_1, \ldots, C_V$, we compute all the pairwise values $s(k, l)$, w...
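The general HAC loop described in this excerpt can be sketched as follows. This is a minimal illustration, not the paper's MDhac algorithm: the excerpt does not show the MD-based similarity $s(k, l)$, so the sketch takes the similarity as a caller-supplied function and demonstrates it with a Jaccard overlap of attribute names, which is a stand-in chosen here. The names `hac` and `jaccard_similarity` are hypothetical.

```python
from itertools import combinations


def hac(items, similarity, target_clusters=1):
    """Generic agglomerative clustering: repeatedly merge the most similar pair.

    `similarity(cluster_a, cluster_b)` is supplied by the caller; the paper's
    MD-based measure is NOT reproduced here.
    """
    clusters = [[item] for item in items]  # start from singleton clusters
    while len(clusters) > target_clusters:
        # Compute all pairwise similarities and pick the best pair (k, l).
        k, l = max(
            combinations(range(len(clusters)), 2),
            key=lambda pair: similarity(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[k] + clusters[l]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (k, l)]
        clusters.append(merged)
    return clusters


def jaccard_similarity(cluster_a, cluster_b):
    """Stand-in similarity (an assumption): overlap of attribute vocabularies."""
    attrs_a = set().union(*cluster_a)
    attrs_b = set().union(*cluster_b)
    return len(attrs_a & attrs_b) / len(attrs_a | attrs_b)


# Toy query schemas, invented for illustration.
schemas = [
    frozenset({"title", "author"}),
    frozenset({"title", "isbn"}),
    frozenset({"make", "model"}),
    frozenset({"make", "price"}),
]
result = hac(schemas, jaccard_similarity, target_clusters=2)
```

Recording each merge instead of stopping at `target_clusters` would yield the full hierarchy rather than a flat partition.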

1212 | Categorical Data Analysis
- Agresti
- 1990
Citation Context: ...andom variable $$D^2(C_1, \ldots, C_m) = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{(O_{ij} - X_i Y_j / S)^2}{X_i Y_j / S}. \quad (4)$$ It can be shown that $D^2$ has asymptotically a $\chi^2$ distribution with $(n-1)(m-1)$ degrees of freedom, denoted by $df$ [18]. Note that we have to use both the values of $D^2$ and $df$ to decide how similar the $m$ clusters are. $D^2$ value itself is not a valid indicator for the similarity of clusters without being qualified the ...
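As a rough illustration of Equation (4), the statistic can be computed from a table of observed counts. The excerpt does not show how $X_i$, $Y_j$, and $S$ are defined, so this sketch assumes the usual reading for a chi-square test of homogeneity: $X_i$ is the total count for cluster $i$, $Y_j$ the total count for attribute $j$, and $S$ the grand total. The function name and the sample tables are hypothetical.

```python
def homogeneity_statistic(table):
    """Return (D^2, df) for table[i][j] = O_ij, the count of attribute j in cluster i.

    Assumes X_i = row total, Y_j = column total, S = grand total.
    """
    m = len(table)       # number of clusters
    n = len(table[0])    # number of attributes
    row_totals = [sum(row) for row in table]                                  # X_i
    col_totals = [sum(table[i][j] for i in range(m)) for j in range(n)]       # Y_j
    grand_total = sum(row_totals)                                             # S

    d2 = 0.0
    for i in range(m):
        for j in range(n):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected > 0:  # skip cells with no expected mass
                d2 += (table[i][j] - expected) ** 2 / expected
    df = (n - 1) * (m - 1)
    return d2, df


# Two clusters with identical attribute usage: no evidence of heterogeneity.
same_d2, same_df = homogeneity_statistic([[10, 10], [10, 10]])
# Two clusters with disjoint attribute usage: strongly heterogeneous.
diff_d2, diff_df = homogeneity_statistic([[10, 0], [0, 10]])
```

Because the reference distribution depends on $df$, values of $D^2$ from tables of different shapes are not directly comparable, which is the point the excerpt makes about using both quantities.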

885 | A language modeling approach to information retrieval - Ponte, Croft - 1998

338 | ROCK: A Robust Clustering Algorithm for Categorical Attributes
- Guha, Rastogi, et al.
- 2000
Citation Context: ...sparse in a high-dimensional space, conventional clustering based on similarity measures does not work well. Several recent efforts have thus developed new objective functions, e.g., context-linkages [2] and entropy [3]. In this paper, we pursue model-based clustering with a new objective function, motivated by our observations on the query schemas. In particular, we collected a dataset of deep Web s...

313 | Model-based Gaussian and non-Gaussian clustering
- Banfield, Raftery
- 1993
Citation Context: ...tion measure, which maximizes the statistical heterogeneity among clusters. Section 4 compares these related approaches. Our statistical approach belongs to the general idea of model-based clustering [5, 6]. In general, such clustering assumes that data is generated from a mixture of distributions, each of which defines a cluster. This general approach is traditionally not specific to categorical data ...

251 | Survey of clustering data mining techniques
- Berkhin
- 2002
Citation Context: ...ontext linkage (ROCK) [2] based approaches using HAC algorithm. Also, we show the domain hierarchy built by MDhac. To measure the result of clustering, we adopt the conditional entropy, introduced in [20]. For a given number of clusters $G$, the value of the conditional entropy is within the range from 0 to $\log G$, where 0 denotes the 100% correct clustering, $\log G$ denotes the totally messing up result. ...
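The evaluation measure mentioned in this excerpt can be sketched as below. The excerpt does not give the formula, so this assumes the standard conditional entropy of the true domain given the assigned cluster, $H(\text{domain} \mid \text{cluster})$; with $G$ equally sized domains it reaches $\log G$ when clusters carry no information about domains. The function name and labels are hypothetical.

```python
import math
from collections import Counter


def conditional_entropy(true_labels, cluster_ids):
    """H(true label | cluster), in nats. 0 means every cluster is pure."""
    n = len(true_labels)
    joint = Counter(zip(cluster_ids, true_labels))  # counts of (cluster, label)
    cluster_sizes = Counter(cluster_ids)            # counts per cluster
    h = 0.0
    for (cluster, _label), count in joint.items():
        # p(cluster, label) * log p(label | cluster)
        h -= (count / n) * math.log(count / cluster_sizes[cluster])
    return h


labels = ["books", "books", "autos", "autos"]
perfect = conditional_entropy(labels, [0, 0, 1, 1])   # each cluster is one domain
shuffled = conditional_entropy(labels, [0, 1, 0, 1])  # each cluster mixes both
```

In the second call each cluster holds one source from each of the two domains, so the value equals $\log 2$, the upper end of the range for $G = 2$.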

144 | Clustering categorical data: An approach based on dynamical systems
- Gibson, Kleinberg, et al.
- 1998
Citation Context: ...atabases. Second, in terms of the techniques, this paper proposes model-differentiation for clustering schema data. Clustering of categorical data has recently been more actively studied, e.g., STIRR [13], CACTUS [14], ROCK [2], and COOLCAT [3]. STIRR treats clustering as a partitioning problem of hypergraph and solves it based on non-linear dynamical systems. CACTUS considers a cluster as a set of pa...

132 | Statistical schema matching across web query interfaces
- He, Chang
- 2003
Citation Context: ...nder this multinomial view, we can express $C$ as aggregate attribute frequencies, i.e., $C = \{A_1{:}z_1, \ldots, A_N{:}z_N\}$. More discussion about this multinomial modeling and its comparison with the model in MGS [17] can be found in our extended report [15]. 3.2 Model-Differentiation: A New Objective Function. A clustering must be guided by some objective function that specifies the property of the ideal clusters....
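The aggregate representation $C = \{A_1{:}z_1, \ldots, A_N{:}z_N\}$ in this excerpt can be pictured as a frequency table over attribute names. The sketch below is only one plausible reading: it assumes each query schema contributes at most one occurrence of each attribute, and the function name and sample schemas are hypothetical.

```python
from collections import Counter


def aggregate_frequencies(schemas):
    """Map each attribute A_i to z_i, its number of occurrences in the cluster."""
    counts = Counter()
    for schema in schemas:
        # Assumption: an attribute counts once per schema, so deduplicate first.
        counts.update(set(schema))
    return counts


# A toy cluster of three book-search query schemas, invented for illustration.
cluster = [
    ["title", "author", "isbn"],
    ["title", "author", "price"],
    ["title", "keyword"],
]
freq = aggregate_frequencies(cluster)
```

Merging two clusters under this view is just the sum of their counters, which is convenient for agglomerative clustering.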

125 | MedMaker: A mediation system based on declarative specifications
- Papakonstantinou, Garcia-Molina, et al.
- 1996
Citation Context: ...” has largely been unexplored. On one hand, for structured sources, information integration has mainly assumed relatively small-scaled, pre-configured systems (e.g., Information Manifold [8], TSIMMIS [9]). On the other hand, research efforts on large-scale search has mostly focused on text sources [10–12]. Our focus mixes both of the above: We aim to enable large-scale metaquery over structured datab...

111 | Automatic discovery of language models for text databases - Callan, Connell, et al. - 1999

93 | CACTUS: Clustering categorical data using summaries
- Ganti, Gehrke, et al.
- 1999
Citation Context: ...ond, in terms of the techniques, this paper proposes model-differentiation for clustering schema data. Clustering of categorical data has recently been more actively studied, e.g., STIRR [13], CACTUS [14], ROCK [2], and COOLCAT [3]. STIRR treats clustering as a partitioning problem of hypergraph and solves it based on non-linear dynamical systems. CACTUS considers a cluster as a set of pairwise strong...

79 | An experimental comparison of several clustering and initialization methods
- Meilă, Heckerman
- 1998
Citation Context: ...ring assumes that data is generated from a mixture of distributions, each of which defines a cluster. This general approach is traditionally not specific to categorical data. More recently, reference [7] proposes a multivariate multinomial distribution (in which each feature is an independent multinomial distribution) for categorical data. In comparison, the model we propose for schema data (or trans...

68 | COOLCAT: an entropy-based algorithm for categorical clustering
- Barbara, Couto, et al.
- 2002
Citation Context: ...-dimensional space, conventional clustering based on similarity measures does not work well. Several recent efforts have thus developed new objective functions, e.g., context-linkages [2] and entropy [3]. In this paper, we pursue model-based clustering with a new objective function, motivated by our observations on the query schemas. In particular, we collected a dataset of deep Web s...

65 | Structured databases on the web: observations and implications
- Chang, He, et al.
Citation Context: ...ation. On this “deep Web” (database-backed web sources), numerous online databases provide dynamic query-based data access through their query interfaces, instead of static URL links. Our recent survey [1] in December 2002 estimated between 127,000 to 330,000 deep Web sources. The deep Web thus presents challenges for large-scale information integration: While there are myriad useful databases, how can...

56 | Algorithms for model-based Gaussian hierarchical clustering
- Fraley
- 1997
Citation Context: ...tion measure, which maximizes the statistical heterogeneity among clusters. Section 4 compares these related approaches. Our statistical approach belongs to the general idea of model-based clustering [5, 6]. In general, such clustering assumes that data is generated from a mixture of distributions, each of which defines a cluster. This general approach is traditionally not specific to categorical data ...

39 | Determining text databases to search in the internet - Meng, Liu, et al. - 1998

24 | Querying Heterogeneous Information Sources Using Source Descriptions
- Levy, Rajaraman, Ordille
- 1996
Citation Context: ...or “metaquery” has largely been unexplored. On one hand, for structured sources, information integration has mainly assumed relatively small-scaled, pre-configured systems (e.g., Information Manifold [8], TSIMMIS [9]). On the other hand, research efforts on large-scale search has mostly focused on text sources [10–12]. Our focus mixes both of the above: We aim to enable large-scale metaquery over str...

21 | An Introduction to Mathematical Statistics
- Brunk
- 1965
Citation Context: ...h seeks to maximize statistical heterogeneity among clusters. Rather than relying on ad-hoc cluster-similarity measures, MD takes principled statistical hypothesis testing, called test of homogeneity [4], to evaluate if multiple clusters are generated from homogeneous distributions. Specifically, we develop Algorithm MDhac for clustering query schemas to build a domain hierarchy. First, we develop th...

1 | Probe, count, and classify: Categorizing hidden web databases - Ipeirotis, Gravano, Sahami - 2001