## Clustering Web Sessions by Sequence Alignment (2002)

Venue: Proceedings of the 13th International Workshop on Database and Expert Systems Applications (DEXA 2002), Aix-en-Provence

Citations: 20 (0 self)

### BibTeX

@INPROCEEDINGS{Wang02clusteringweb,

author = {Weinan Wang and Osmar R. Zaïane},

title = {Clustering Web Sessions by Sequence Alignment},

booktitle = {Proceedings of the 13th International Workshop on Database and Expert Systems Applications (DEXA 2002), Aix-en-Provence},

year = {2002},

pages = {394--398},

publisher = {Springer-Verlag}

}

### Abstract

Clustering means grouping similar objects such that objects within the same group bear similarity to each other while objects in different groups are dissimilar to each other. As an important component of data mining, much research on clustering has been conducted in different disciplines. In the context of web mining, clustering could be used to group similar clickstreams to determine learning behaviours in the case of e-learning, or general site access behaviours in e-commerce or other on-line applications. Most of the algorithms presented in the literature to deal with clustering web sessions treat sessions as sets of visited pages within a time period and do not consider the sequence of the click-stream visitation. This has a significant consequence when comparing similarities between web sessions. We propose in this paper a new algorithm based on sequence alignment to measure similarities between web sessions, where sessions are chronologically ordered sequences of page accesses.
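To illustrate why page order matters, the following minimal Python sketch contrasts a set-based measure with a global alignment score in the style of Needleman and Wunsch. It is not the paper's similarity measure: the function names `jaccard` and `alignment_score`, the single-letter page identifiers, and the scoring values (match 2, mismatch -1, gap -1) are all assumptions chosen for the example.

```python
def jaccard(s, t):
    """Set-based similarity: ignores the order in which pages were visited."""
    a, b = set(s), set(t)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def alignment_score(s, t, match=2, mismatch=-1, gap=-1):
    """Global alignment score of two sessions via dynamic programming.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    rows, cols = len(s) + 1, len(t) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against the empty sequence costs one gap per page.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two pages
                score[i - 1][j] + gap,       # gap in t
                score[i][j - 1] + gap,       # gap in s
            )
    return score[-1][-1]


if __name__ == "__main__":
    # Each letter stands for one (hypothetical) page in a session.
    for other in ("abcd", "dcba"):
        print(other, jaccard("abcd", other), alignment_score("abcd", other))
```

Under these assumed scores, the set-based measure rates the reversed session `dcba` as identical to `abcd`, while the alignment score is lower for the reversed session than for the identical one.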

### Citations

1410 | A general method applicable to the search for similarities in the amino acid sequence of two proteins
- Needleman, Wunsch
- 1970
Citation Context ...session “abcd”, “bcad” and “abdc”. Using our session sequence similarity measure, it can tell you that the three are different, and “abcd” is more similar to “abdc” than to “bcad”. There are many papers [17] [19] in the area of bioinformatics talking about sequence alignment. Their objects are DNA or protein sequences instead of web page sequences. One difference between web page sequences and DNA s... |

1096 | A density-based algorithm for discovering clusters in large spatial databases with noise
- Ester, Kriegel, et al.
- 1996
Citation Context ...rom our sequence similarity measure, also considering the common problem of k-means family algorithms, which assumes clusters of spherical shapes, we did not try k-mode in our implementation. DBSCAN [2] and WaveCluster [16] could also be applied to some special categorical data sets, however, they require that any dimension of the categorical data space be somehow converted into numerical order. In ... |

436 | BIRCH: an efficient data clustering method for very large databases - Zhang, Ramakrishnan, et al. - 1996 |

370 | Web usage mining: Discovery and applications of usage patterns from web data
- Srivastava, Cooley, et al.
- 2000
Citation Context ... web sessions is part of a larger work of web usage mining which is the application of data mining techniques to discover usage patterns from Web data typically collected by web servers in large logs [18]. Data mining from web access logs is a process consisting of three consecutive steps: data gathering and pre-processing for filtering and formatting the log entries, pattern discovery which consists ... |

338 | ROCK: A Robust Clustering Algorithm for Categorical Attributes
- Guha, Rastogi, et al.
- 2000
Citation Context ...e usage by the learners. In order to cluster sessions, after identifying the sessions in a pre-processing phase, we used clustering algorithms known for their ability to handle categorical data: ROCK [5] an algorithm that acts on a sample of the dataset, CHAMELEON [9], which is based on graph partitioning, and a new algorithm TURN for discrete distributions that we introduced in [3]. All of these alg... |

308 | Automatic personalization based on web usage mining
- Mobasher, Cooley, et al.
Citation Context ... of data input. The paper does not discuss in detail how they measure the closeness between sessions and how they set the similarity threshold which are very important for clustering. Mobasher et al. [12] used clustering on a web log using the Cosine coefficient and a threshold of 0.5. No detail is mentioned of the actual clustering algorithm used as the paper is principally on Association Rule mining... |

216 | BIRCH: an efficient data clustering method for very large databases
- Zhang, Ramakrishnan, et al.
- 1996
Citation Context ...ers based on clustering web sessions. Their work employed attribute oriented induction to transfer the web session data into a space of generalized sessions, then apply the clustering algorithm BIRCH [22] to this generalized session space. Their method scaled well over increasing large data. However, problems of BIRCH include that it needs the setting of a similarity threshold and it is sensitive to t... |

212 | CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
- Karypis, Han, et al.
- 1999
Citation Context ...tifying the sessions in a pre-processing phase, we used clustering algorithms known for their ability to handle categorical data: ROCK [5] an algorithm that acts on a sample of the dataset, CHAMELEON [9], which is based on graph partitioning, and a new algorithm TURN for discrete distributions that we introduced in [3]. All of these algorithms, when used in the past for clustering web sessions, have ... |

182 | On clustering validation techniques
- Halkidi, Batistakis, et al.
- 2001
Citation Context ...portant issue is how to evaluate the quality of clusters in the result. Clustering Validation is a field where attempts have been made to find rules for quantifying the quality of a clustering result [11]. This issue, however, is a difficult one and typically people evaluate clustering results visually or compare to known manually clustered data. Visually inspecting clusters in 2-dimensional numerical... |

172 | WaveCluster: a multi-resolution clustering approach for very large spatial databases
- Sheikholeslami, Chatterjee, et al.
- 1998
Citation Context ...ilarity measure, also considering the common problem of k-means family algorithms, which assumes clusters of spherical shapes, we did not try k-mode in our implementation. DBSCAN [2] and WaveCluster [16] could also be applied to some special categorical data sets, however, they require that any dimension of the categorical data space be somehow converted into numerical order. In this sense, they are ... |

159 | Mining longest repeating subsequences to predict world wide web surfing
- Pitkow, Pirolli
Citation Context ...wo session sequences. In our method, similarity between sessions are measured through their Best Matching. Other works indirectly related to the topic of web session clustering include: Pitkow et al. [8] explored predictive modeling techniques by introducing a statistic Longest Repeating Sub-sequence model which can be used for modeling and predicting user surfing paths. Spiliopoulou et al. [14] buil... |

156 | Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
- Huang
- 1998
Citation Context ...nd the session clusters. For the known clustering algorithms, we tried ROCK [5], CHAMELEON [9] and TURN [3] on our testing data set. K-modes is also applicable for categorical data [7], but its similarity measure is tightly based on vector space which is different from our sequence similarity measure, also considering the common problem of k-means family algorithms, which assumes c... |

153 | Knowledge discovery from users web-page navigation
- Shahabi, Zarkesh, et al.
- 1999
Citation Context ...he studies in the area of web usage mining are very new, and the topic of clustering web sessions has recently become popular in the field of real application of clustering techniques. Shahabi et al. [15] introduced the idea of Path Feature Space to represent all the navigation paths. Similarity between each two paths in the Path Feature Space is measured by the definition of Path Angle which is actua... |

52 | Clickstream clustering using weighted longest common subsequences
- Banerjee, Ghosh
- 2001
Citation Context ...detail is mentioned of the actual clustering algorithm used as the paper is principally on Association Rule mining. One recent paper which bears some similarity to our work is by Banerjee and Ghosh [1]. This paper introduced a new method for measuring similarity between web sessions: The longest common sub-sequences between two sessions is first found through dynamic programming, then the similarit... |

46 | Clustering of Web users based on access patterns
- Fu, Sandhu, et al.
- 2002
Citation Context ...d by the definition of Path Angle which is actually based on the Cosine similarity between two vectors. In this work, k-means cluster method is utilized to cluster user navigation patterns. Fu et al. [4] cluster users based on clustering web sessions. Their work employed attribute oriented induction to transfer the web session data into a space of generalized sessions, then apply the clustering algor... |

35 | Global partial orders from sequential data
- Mannila, Meek
- 2000
Citation Context ... of interesting navigation patterns. In their system, interestingness criteria for navigation patterns are dynamically specified by the human expert using WUM’s mining language MINT. Mannila and Meek [6] presented a method for finding partial orders that describe the ordering relationships between the events in a collection of sequences. Their method can be applied to the discovery of partial orders ... |

32 | Web Usage Mining for a Better Web-based Learning Environment
- Zaïane
- 2001
Citation Context ...sites, sessions could be accurately determined by identifying idle times between page accesses [13]. While we do not advocate this sessionizing practice for web sessions in the context of e-learning [20], we have adopted this method in this paper for the sake of simplicity and proof of concept. In this study, user sessions have been identified using a 25-minute timeout threshold between page access. ... |

17 | A non-parametric approach to Web log analysis
- Foss, Wang, et al.
- 2001
Citation Context ...orical data: ROCK [5] an algorithm that acts on a sample of the dataset, CHAMELEON [9], which is based on graph partitioning, and a new algorithm TURN for discrete distributions that we introduced in [3]. All of these algorithms, when used in the past for clustering web sessions, have treated sessions as unordered sets of clicks. The similarity measures used to compare sessions were simply based on i... |

14 | Sequence alignment using FastLSA
- Charter, Schaeffer, et al.
- 2000
Citation Context ... and apply sequence similarity measure to measure similarity between sessions. Sequence alignment actually is not a new topic; there exist several algorithms for solving sequence alignment problems [10]. Our method for measuring similarity between session sequences borrows the basic ideas from these algorithms. However, most sequence alignment algorithms for DNA sequencing consider very long sequenc... |

14 | Towards evaluating learners’ behaviour in a Web-based distance learning environment
- Zaïane, Luo
- 2001
Citation Context ...ssification on the transformed data in order to discover relevant and potentially useful patterns, and finally, pattern analysis during which the user retrieves and interprets the patterns discovered [21]. Session cluster discovery is an important part of web data mining. In the context of e-learning, our application of interest, the function of clustering can have a myriad uses, such as grouping lear... |

1 | Measuring the accuracy of sessionisers for web usage mining
- Mobasher, Spiliopoulou, et al.
- 2001
Citation Context ...xies, and requests reset by visitors, etc. It was demonstrated that in the general context of e-commerce sites, sessions could be accurately determined by identifying idle times between page accesses [13]. While we do not advocate this sessionizing practice for web sessions in the context of e-learning [20], we have adopted this method in this paper for the sake of simplicity and proof of concept. In... |

1 |
Gno04] “Usability and Readability Considerations For Technical Documentation,” http://developer.gnome.org/ documents/usability/usability-readability.html. Accessed June 2004. [HHS03] United States Department of Health and Human Services
- unknown authors
- 1999
Citation Context ...t al. [8] explored predictive modeling techniques by introducing a statistic Longest Repeating Sub-sequence model which can be used for modeling and predicting user surfing paths. Spiliopoulou et al. [14] built a mining system, WUM, for the discovery of interesting navigation patterns. In their system, interestingness criteria for navigation patterns are dynamically specified by the human expert using... |