## A probabilistic validation algorithm for Web users’ clusters (2004)


Venue: | In Proceedings of the IEEE international conference on systems, man and cybernetics (SMC |

Citations: | 3 - 1 self |

### BibTeX

@INPROCEEDINGS{Pallis04aprobabilistic,

author = {George Pallis and Lefteris Angelis and Athena Vakali},

title = {A probabilistic validation algorithm for Web users’ clusters},

booktitle = {In Proceedings of the IEEE international conference on systems, man and cybernetics (SMC},

year = {2004},

pages = {4129--4134},

publisher = {IEEE}

}

### Abstract

Cluster analysis is one of the most important aspects in the data mining process for discovering groups and identifying interesting distributions or patterns over the considered data sets. In the context of Web data mining, model-based clustering algorithms are often used to cluster similar users’ sessions in order to determine Website access behaviors. An important issue in cluster analysis is the evaluation of clustering results to find the partitioning that best fits the underlying data. In this paper, we present a novel validation technique for model-based clustering approaches.

### Citations

8089 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ... we can use it to assign each user to a cluster or fractionally to a set of clusters. The parameters can be learned using the Expectation-Maximization (EM) algorithm. The EM algorithm originates from [8] and in [3] a method for employing EM on users’ sessions is proposed. In particular, the EM algorithm is an iterative procedure that finds the maximum likelihood estimates of the parameter vector b...

438 | Data Preparation for Mining World Wide Web Browsing Patterns
- Cooley, Mobasher, et al.
- 1999
Citation Context ...files get millions of lines every day. Figure 1 presents a sample of a Web server log file. These data undergo pre-processing, such as invalid data cleaning and session identification [6]. Data cleaning removes the records which do not include useful information about the users’ behavior, such as graphics, javascripts, small pictures of buttons, advertisements etc. Figure 1. A sample of...

308 | Automatic personalization based on web usage mining
- Mobasher, Cooley, et al.
Citation Context ... (s)he moves through a Web site and each one reflects an individual user’s behavior. Such knowledge is especially useful for customizing a Web site to the needs of a particular user or a set of users [15]. On the other hand, clustering of Web pages tends to establish groups of pages based either on their content or on their hyperlink information [14], [16]. In this paper, we focus on validating the We...

277 | How many clusters? Which clustering method? Answers via model-based Cluster Analysis
- Fraley, Raftery
- 1998
Citation Context ...ters with some probability. The number of clusters may be determined by using several probabilistic models, such as BIC (Bayesian Information Criterion), Bayesian approximations, or bootstrap methods [10]. • The behavior of each cluster is governed by a statistical model and the user’s behavior is generated from this model to that cluster. In general, each cluster has a data-generating model with diff...

188 | Data Clustering
- Jain, Murty, et al.
- 1999
Citation Context ...earching. In general, clustering is one of the most important practices in the data mining process for discovering groups and identifying interesting distributions and patterns in the underlying data [13]. The clustering problem is about partitioning a given data set into clusters (groups) such that the data points in the cluster are more similar to each other than points in different clusters. In the...

182 | On clustering validation techniques
- Halkidi, Batistakis, et al.
- 2001
Citation Context ... Related Work and Paper’s Contribution The problem of evaluating the results of a clustering algorithm is one of the most important issues in cluster analysis and has attracted research interest [9], [11], [22]. The problem is related to the question: After applying a cluster algorithm, how can one assess the quality of the clusters returned? Clustering schemes always produce a partition of the given ...

153 | Knowledge discovery from users web-page navigation
- Shahabi, Zarkesh, et al.
- 1999
Citation Context ...eb users’ clusters. Specifically, several clustering approaches have been proposed in the past, assigning the sessions (users’ behaviors) with common characteristics into the same cluster [16], [17], [18]. These may be classified into two schemes: • Similarity-based: It uses distance functions (e.g. Euclidean, Manhattan, cosine etc.) to measure similarities among sessions [2], [20]. Distance functions...

116 | Link prediction and path analysis using Markov chains
- Sarukkai
- 2000
Citation Context ... needs of a particular user or a set of users [15]. On the other hand, clustering of Web pages tends to establish groups of pages based either on their content or on their hyperlink information [14], [16]. In this paper, we focus on validating the Web users’ clusters. Specifically, several clustering approaches have been proposed in the past, assigning the sessions (users’ behaviors) with common chara...

78 | Mining the Web - Chakrabarti - 2003 |

57 | A unified framework for model-based clustering
- Zhong, Ghosh
- 2003
Citation Context ...ed Work and Paper’s Contribution The problem of evaluating the results of a clustering algorithm is one of the most important issues in cluster analysis and has attracted research interest [9], [11], [22]. The problem is related to the question: After applying a cluster algorithm, how can one assess the quality of the clusters returned? Clustering schemes always produce a partition of the given data s...

53 | Model-based clustering and visualization of navigation patterns on a web site
- Cadez, Heckerman, et al.
Citation Context ...ce it improves the data management and in addition eliminates the complexity of the underlying problem (since the number of page categories is smaller than the number of Web pages in a Web site) [1], [3], [12]. In particular, the individual pages are grouped into semantically similar groups. Scanning for specific keywords that occur in the URL string of page request makes the assignment of the page r...

52 | Clickstream clustering using weighted longest common subsequences
- Banerjee, Ghosh
- 2001
Citation Context ...same cluster [16], [17], [18]. These may be classified into two schemes: • Similarity-based: It uses distance functions (e.g. Euclidean, Manhattan, cosine etc.) to measure similarities among sessions [2], [20]. Distance functions can be determined either directly, or indirectly, although the latter is more common in most applications. Hierarchical and partitional are the most indicative approaches th...
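
The three distance functions named in this context can be sketched as follows. This is an illustration only: it assumes each session has already been encoded as a fixed-length numeric vector (for example, visit counts per page category), and the function names are hypothetical, not taken from the cited papers.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two session vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical sessions: visit counts over two page categories.
session_a = [3.0, 0.0]
session_b = [0.0, 4.0]
```

Two sessions that visit disjoint categories, as above, sit at the maximum cosine distance for non-negative count vectors, while Euclidean and Manhattan distances also grow with the number of visits.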

30 | Why so many clustering algorithms: A position paper
- Estivill-Castro
Citation Context ...ks. 2 Related Work and Paper’s Contribution The problem of evaluating the results of a clustering algorithm is one of the most important issues in cluster analysis and has attracted research interest [9], [11], [22]. The problem is related to the question: After applying a cluster algorithm, how can one assess the quality of the clusters returned? Clustering schemes always produce a partition of the ...

28 | Categorization of web pages and user clustering with mixtures of hidden markov models
- Ypma, Heskes
- 2002
Citation Context ...-generating model with different parameters for each cluster. Model-based schemes are usually preferred by the Web community since they can efficiently describe the dynamic evolution of the Web [1], [21]. In this paper, we also use a model-based algorithm to cluster the users’ sessions. The rest of the paper is organized as follows. In Section 2, we provide a brief discussion of related work, and cla...

18 | Statistical Methods (eighth edition)
- Snedecor, Cochran
- 1989
Citation Context ...chnical contribution of the paper can be summarized in the following: • Suggestion of a validation algorithm for model-based clustering. This algorithm is based on a statistical chi-square test (χ²) [19] employed to each cluster. • The statistic criterion provided does not depend on tunable parameters. • The algorithm was tested on a real data set collected from an educational Web server (the Departm...
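
The excerpt does not say how the chi-square test is applied to each cluster, so the following is only a generic sketch of Pearson's chi-square statistic, not the paper's validation algorithm. It assumes that, for one cluster, observed visit counts per page category can be compared with the counts expected under that cluster's model; all names and numbers are hypothetical.

```python
from typing import Sequence


def chi_square_statistic(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Pearson's statistic: sum over categories of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for one cluster over three page categories.
observed_visits = [30, 50, 20]
expected_visits = [25.0, 50.0, 25.0]

statistic = chi_square_statistic(observed_visits, expected_visits)
degrees_of_freedom = len(observed_visits) - 1
```

The statistic would then be compared with a chi-square critical value for the given degrees of freedom (5.991 for 2 degrees of freedom at the 5% level); a value below the threshold gives no evidence that the observed counts deviate from the model.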

12 | Optimal algorithms for finding user access sessions from very large web logs
- Tong
Citation Context ...n the URL string of page request makes the assignment of the page requests to a category. In order to identify the users’ sessions, heuristic methods are usually used based on IP and session timeouts [5]. In this paper, we first consider that we have an ordered set of traces with respect to the IPs. Therefore, a new session is created when a new IP address is encountered or if the visiting page time ...
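
The IP-and-timeout heuristic described in this context can be sketched as below. The excerpt is cut off before it states the timeout rule, so the 30-minute threshold and the comparison against the previous request time are assumptions, as are the trace layout and the function name.

```python
from typing import List, Optional, Tuple

# Assumed trace layout: (ip address, timestamp in seconds, page category).
Trace = Tuple[str, float, str]


def split_sessions(traces: List[Trace], timeout: float = 1800.0) -> List[List[str]]:
    """Split traces, already ordered by IP and then by time, into sessions.

    A new session starts when the IP changes or when the gap since the
    previous request exceeds `timeout` (an assumed 30-minute default).
    """
    sessions: List[List[str]] = []
    prev_ip: Optional[str] = None
    prev_time = 0.0
    for ip, timestamp, page in traces:
        if ip != prev_ip or (timestamp - prev_time) > timeout:
            sessions.append([])
        sessions[-1].append(page)
        prev_ip, prev_time = ip, timestamp
    return sessions


# Hypothetical traces from two visitors.
example_traces = [
    ("10.0.0.1", 0.0, "home"),
    ("10.0.0.1", 60.0, "courses"),
    ("10.0.0.1", 4000.0, "home"),
    ("10.0.0.2", 4010.0, "staff"),
]
example_sessions = split_sessions(example_traces)
```

Here the first visitor produces two sessions because of the long gap before the third request, and the second visitor starts a third session.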

10 | A cube model for web access sessions and cluster analysis
- Huang
- 2001
Citation Context ...ut the clustering structure of this data set. • Internal approach: It evaluates the clustering result in terms of quantities obtained from the data set itself. This approach is used by the authors in [12] in order to validate clusters of users’ sessions. In particular, their method provides only a sense about how similar are the sessions within the cluster. This is indicated by using the Frobenius nor...

9 | Clustering of web users using session-based similarity measures
- Xiao, Zhang
Citation Context ...cluster [16], [17], [18]. These may be classified into two schemes: • Similarity-based: It uses distance functions (e.g. Euclidean, Manhattan, cosine etc.) to measure similarities among sessions [2], [20]. Distance functions can be determined either directly, or indirectly, although the latter is more common in most applications. Hierarchical and partitional are the most indicative approaches that bel...

6 | Predicting a Web user’s next request based on log data
- Sen, Hansen
- 2003
Citation Context ... the Web users’ clusters. Specifically, several clustering approaches have been proposed in the past, assigning the sessions (users’ behaviors) with common characteristics into the same cluster [16], [17], [18]. These may be classified into two schemes: • Similarity-based: It uses distance functions (e.g. Euclidean, Manhattan, cosine etc.) to measure similarities among sessions [2], [20]. Distance fun...

4 | Web page grouping based on parameterized connectivity
- Masada, Takasu, et al.
- 2000
Citation Context ...to the needs of a particular user or a set of users [15]. On the other hand, clustering of Web pages tends to establish groups of pages based either on their content or on their hyperlink information [14], [16]. In this paper, we focus on validating the Web users’ clusters. Specifically, several clustering approaches have been proposed in the past, assigning the sessions (users’ behaviors) with common...

2 | The theory of stochastic processes
- Cox, Miller
- 1997
Citation Context ... a unique vector $f = (f_1, \ldots, f_V)$ such that $\lim_{n \to \infty} P^n = \begin{pmatrix} f \\ f \\ \vdots \\ f \end{pmatrix}$. A thorough study and classification of finite Markov chains and the proof of this theorem is given in [7]. This theorem offers us a way of approximately evaluating the access frequencies of the nodes, by simply calculating powers of the transition matrix. It gives us a way to evaluate the relative freque...
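
The idea of approximating access frequencies by taking powers of the transition matrix can be sketched as below. The conditions of the theorem are not shown in the excerpt, so this assumes a chain for which the limit exists; the two-state matrix, the step count, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Multiply two square matrices of the same size."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def limiting_frequencies(p: Matrix, steps: int = 50) -> List[float]:
    """Approximate the limiting vector f by computing P^steps.

    When the limit exists, every row of P^n approaches f,
    so the first row of the power is returned.
    """
    power = p
    for _ in range(steps - 1):
        power = mat_mul(power, p)
    return power[0]


# Hypothetical two-page site: rows are "from" pages, columns are "to" pages.
transition = [
    [0.9, 0.1],
    [0.5, 0.5],
]
frequencies = limiting_frequencies(transition)
```

For this matrix the exact limit is f = (5/6, 1/6), which can be checked by solving f = fP with the entries of f summing to one.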

1 | Modeling the Internet and the Web - Baldi, Frasconi, et al. - 2003 |