Results 1-10 of 15
Clickstream Clustering Using Weighted Longest Common Subsequences
In Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, 2001
Cited by 52 (4 self)
Categorizing visitors based on their interactions with a website is a key problem in web usage mining. The clickstreams generated by various users often follow distinct patterns, the knowledge of which may help in providing customized content. In this paper, we propose a novel and effective algorithm for clustering web users based on a function of the longest common subsequence of their clickstreams that takes into account both the trajectory taken through a website and the time spent at each page. Results are presented on web logs of www.sulekha.com to illustrate the techniques.
Longest Common Subsequences
In Proc. of 19th MFCS, number 841 in LNCS, 1994
Cited by 29 (1 self)
The length of a longest common subsequence (LLCS) of two or more strings is a useful measure of their similarity. The LLCS of a pair of strings is related to the 'edit distance', or number of mutations/errors/editing steps required in passing from one string to the other. In this talk, we explore some of the combinatorial properties of the sub- and supersequence relations, survey various algorithms for computing the LLCS, and introduce some results on the expected LLCS for pairs of random strings. 1 Introduction. The set $\Sigma^*$ of finite strings over an unordered finite alphabet $\Sigma$ admits of several natural partial orders. Some, such as the substring, prefix, and suffix relations, depend on contiguity and lead to many interesting combinatorial questions with practical applications to string-matching. An excellent survey is given by Aho in [1]. In this talk however we will focus on the 'subsequence' partial order. We say that $u = u_1 \cdots u_m$ is a subsequence of ...
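The link between the LLCS and edit distance can be made concrete in one special case. If the only editing steps allowed are single-character insertions and deletions (no substitutions), the distance between strings of lengths m and n equals m + n - 2 * LLCS. The following is a minimal sketch that computes both quantities independently so the identity can be checked; the function names `llcs` and `indel_distance` are illustrative and do not come from the listed paper.

```python
def llcs(a: str, b: str) -> int:
    """Length of a longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend a common subsequence of the two shorter prefixes.
                cur.append(prev[j - 1] + 1)
            else:
                # Drop the last character of one string or the other.
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions and deletions
    needed to turn a into b (substitutions are NOT allowed here)."""
    prev = list(range(len(b) + 1))  # turning "" into b[:j] costs j insertions
    for i, x in enumerate(a, start=1):
        cur = [i]  # turning a[:i] into "" costs i deletions
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1])
            else:
                cur.append(1 + min(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    a, b = "longest", "large"
    print(llcs(a, b), indel_distance(a, b), len(a) + len(b) - 2 * llcs(a, b))
```

With substitutions allowed at unit cost (the usual Levenshtein distance) the relation becomes an inequality rather than an identity, so this sketch covers only the insertion/deletion case.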
Upper Bounds for the Expected Length of Longest Common Subsequences
1996
Cited by 19 (3 self)
Let $f(n)$ be the expected length of a longest common subsequence of two random sequences over a fixed alphabet of size $k$. It is known that $f(n) \to c_k n$ for some constant $c_k$. We define a collation as a pair of sequences with marked matches. A dominated collation is a collation that is not matched optimally. Upper bounds for $c_k$ can be derived from upper bounds for the number of nondominated collations. Using local properties of matches we can eliminate many nondominated collations and improve upper bounds for $c_k$. 1 Introduction. The problem of finding longest common subsequences arises in various situations. As typical we can mention approximate string matching and text comparisons (e.g. the diff function in UNIX) [1, 11]. Another important area where the longest common subsequence problem appears is molecular biology. The longest common subsequence problem is a special case of the more general sequence alignment problem. A survey on the longest common subsequence problem can be found in...
Bounding the Expected Length of Longest Common Subsequences and Forests
Proc. of WSP'96, 1999
Cited by 14 (0 self)
We present two techniques to find lower and upper bounds for the expected length of longest common subsequences and forests of two random sequences of the same length, over a fixed size, uniformly distributed alphabet. We emphasize the power of the methods used, which are Markov chains and Kolmogorov complexity. As a corollary, we obtain some new lower and upper bounds for the problems mentioned. 1 Introduction. The longest common subsequence (LCS) of two strings is one of the main problems in combinatorial pattern matching. The LCS problem is related to DNA or protein alignments, file comparison, speech recognition, etc. We say that x is a subsequence of u if we can obtain x by deleting zero or more characters of u. The LCS of two strings u and v of length n is defined as the longest subsequence x common to u and v. For example, the LCS of "longest" and "large" is "lge". An open problem related to the LCS is its expected length for two random strings of length n over a uniformly distrib...
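The definition above can be sketched in a few lines. The following is a minimal illustration, not code from the listed paper: `is_subsequence` checks the "delete zero or more characters" definition directly, and `lcs` recovers one longest common subsequence with the textbook dynamic-programming table followed by a traceback. Both function names are assumptions chosen for this example.

```python
def is_subsequence(x: str, u: str) -> bool:
    """True if x can be obtained from u by deleting zero or more characters."""
    it = iter(u)
    # Each character of x must be found in u, scanning strictly left to right.
    return all(ch in it for ch in x)


def lcs(u: str, v: str) -> str:
    """Return one longest common subsequence of u and v."""
    m, n = len(u), len(v)
    # table[i][j] holds the LCS length of the prefixes u[:i] and v[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if u[i - 1] == v[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner, collecting matched characters.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if u[i - 1] == v[j - 1]:
            out.append(u[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs("longest", "large"))  # prints "lge", the example from the abstract
```

When several longest common subsequences exist, the tie-breaking rule in the traceback decides which one is returned; for "longest" and "large" the answer is unique.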
An Analytic Study of the Phase Transition Line in Local Sequence Alignment with Gaps
Appl. Math, 1999
Cited by 10 (5 self)
A detailed analytic study of the log-linear phase transition of the Smith-Waterman local alignment algorithm is presented. A rectangular alignment lattice is introduced to facilitate the statistical analysis for alignment with gaps. With a few simplifying assumptions, we obtain an analytic expression for the loci of the phase transition line. Our result reproduces the exact and conjectured values for the very large and very small gap costs; the latter corresponds to the related problem of the longest common subsequence. For intermediate values of gap costs, our result is not exact, although a comparison to numerical results yielded a difference of no more than several percent.
Extensive Simulations for Longest Common Subsequences: Finite Size Scaling, a Cavity Solution, and Configuration Space Properties
1998
Cited by 6 (0 self)
Given two strings X and Y of N and M characters respectively, the Longest Common Subsequence (LCS) Problem asks for the longest sequence of (non-contiguous) matches between X and Y. Let $L_N$ be the length of a LCS of two random strings of size N. Using extensive Monte Carlo simulations for this problem, we find a finite size scaling law of the form $E(L_N)/N = \gamma_S + A_S/(\ln N \sqrt{N}) + \ldots$, where $\gamma_S$ and $A_S$ are constants depending on S, the alphabet size. We provide precise estimates of $\gamma_S$ for $2 \le S \le 15$. We also study the related Bernoulli Matching model where the different entries of the "strings" are matched independently with probability $1/S$. Let $L^B_{NM}$ be the length of a longest sequence of matches in this case, for a given instance of size $N \times M$. On the basis of a cavity-like analysis we find $\gamma^B_S(r) = (2\sqrt{rS} - r - 1)/(S - 1)$, where $\gamma^B_S(r)$ is the limit of $E(L^B_{NM})/N$ as $N \to \infty$, the ratio $r = M/N$ being fixed. This formula agrees very we...
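The Bernoulli Matching model and its closed-form limit are easy to experiment with. The sketch below is an illustration under stated assumptions, not the authors' simulation code: `gamma_bernoulli` simply evaluates the formula quoted in the abstract, and `bernoulli_match_length` draws one random N x M instance in which every cell is a match independently with probability 1/S, then finds the longest increasing chain of matches using the same recurrence as the LCS table. The function names, the seed, and the instance size are hypothetical choices.

```python
import math
import random


def gamma_bernoulli(S: float, r: float = 1.0) -> float:
    """Evaluate (2*sqrt(r*S) - r - 1) / (S - 1), the formula from the abstract."""
    return (2.0 * math.sqrt(r * S) - r - 1.0) / (S - 1.0)


def bernoulli_match_length(n: int, m: int, S: int, rng: random.Random) -> int:
    """Longest sequence of matches in one random n x m Bernoulli instance.

    Each cell (i, j) is a match independently with probability 1/S.
    The recurrence is the LCS recurrence with the string comparison
    replaced by an independent coin flip.
    """
    p = 1.0 / S
    prev = [0] * (m + 1)
    for _ in range(n):
        cur = [0]
        for j in range(1, m + 1):
            if rng.random() < p:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only for repeatability
    n, S = 300, 4
    estimate = bernoulli_match_length(n, n, S, rng) / n
    print(f"single-sample estimate: {estimate:.3f}")
    print(f"formula value for r=1:  {gamma_bernoulli(S):.3f}")
```

A single finite instance only approximates the limit; averaging over many instances and increasing n would be needed before comparing seriously against the formula.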
Speeding-up Hirschberg and Hunt-Szymanski LCS Algorithms
2003
Cited by 5 (0 self)
Two algorithms are presented that solve the problem of recovering the longest common subsequence of two strings. The first algorithm is an improvement of Hirschberg’s divide-and-conquer algorithm. The second algorithm is an improvement of Hunt-Szymanski algorithm based on an efficient computation of all dominant match points. These two algorithms use bit-vector operations and are shown to work very efficiently in practice.
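To give a feel for how bit-vector operations apply to LCS computation, here is a minimal sketch of a well-known bit-parallel recurrence for the LCS length. It is not either of the two algorithms from the listed paper (it computes only the length, and does not recover the subsequence); it is a generic illustration in which one row of the dynamic-programming table is packed into a single integer. The function name `llcs_bitparallel` is an assumption.

```python
def llcs_bitparallel(a: str, b: str) -> int:
    """LCS length of a and b using one big-integer bit vector per row."""
    m = len(a)
    if m == 0:
        return 0
    full = (1 << m) - 1  # keeps the vector to exactly m bits

    # match[c] has bit i set exactly when a[i] == c.
    match = {}
    for i, c in enumerate(a):
        match[c] = match.get(c, 0) | (1 << i)

    # v starts as all ones; each zero bit that appears later marks a
    # position where the LCS length of the processed prefix increases.
    v = full
    for c in b:
        u = v & match.get(c, 0)
        # u is a subset of v's set bits, so v - u clears exactly those bits.
        v = ((v + u) | (v - u)) & full

    # The LCS length is the number of zero bits among the low m bits.
    return m - bin(v).count("1")


if __name__ == "__main__":
    print(llcs_bitparallel("longest", "large"))
```

Python integers have arbitrary precision, so the whole row fits in one value; in a fixed-width language the same recurrence would be applied word by word with carry propagation.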
Shift Error Detection in Standardized Exams
In Proc. 11th Symp. Combinatorial Pattern Matching (CPM), 2000
Cited by 2 (1 self)
Hundreds of millions of multiple choice exams are given every year in the United States. These exams permit form-filling shift errors, where an absent-minded mismarking displaces a long run of correct answers. A shift error can substantially alter the exam's score, and thus invalidate it. In this paper, we develop algorithms to accurately detect and correct shift errors, while guaranteeing few false detections. We propose a shift error model, and probabilistic methods to identify shifted exam regions. We describe the results of our search for shift errors in undergraduate Stony Brook exam sets, and in over 100,000 Scholastic Amplitude Tests. These results suggest that approximately 2% of all tests contain shift errors. Extrapolating these results over all multiple choice exams and forms leads us to conclude that exam takers make millions of undetected shift errors each year. Employing probabilistic shift correcting systems is inherently dangerous. Such systems may be taken advantage of by clever examinees, who seek to increase the probability of correct guessing. We conclude our paper with a short study of optimal guessing strategies when faced with a generous shift error correcting system.
A Practical Approach to Significance Assessment in Alignment with Gaps
Cited by 2 (0 self)
Current numerical methods for assessing the statistical significance of local alignments with gaps are time consuming. Analytical solutions thus far have been limited to specific cases. Here, we present a new line of attack to the problem of statistical significance assessment. We combine this new approach with known properties of the dynamics of the global alignment algorithm and high performance numerical techniques and present a novel method for assessing significance of gaps within practical time scales. The results and performance of these new methods test very well against tried methods with drastically less effort.