Expected length of longest common subsequences (1994)

by V Dančík
Results 1 - 10 of 14

Clickstream Clustering Using Weighted Longest Common Subsequences

by Arindam Banerjee, Joydeep Ghosh - In Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, 2001
Cited by 41 (3 self)
Categorizing visitors based on their interactions with a website is a key problem in web usage mining. The clickstreams generated by various users often follow distinct patterns, the knowledge of which may help in providing customized content. In this paper, we propose a novel and effective algorithm for clustering web users based on a function of the longest common subsequence of their clickstreams that takes into account both the trajectory taken through a website and the time spent at each page. Results are presented on weblogs of www.sulekha.com to illustrate the techniques.

Longest Common Subsequences

by Mike Paterson, Vlado Dancik - In Proc. of 19th MFCS, number 841 in LNCS, 1994
Cited by 25 (1 self)
The length of a longest common subsequence (LLCS) of two or more strings is a useful measure of their similarity. The LLCS of a pair of strings is related to the 'edit distance', or number of mutations/errors/editing steps required in passing from one string to the other. In this talk, we explore some of the combinatorial properties of the sub- and super-sequence relations, survey various algorithms for computing the LLCS, and introduce some results on the expected LLCS for pairs of random strings.

1 Introduction

The set $\Sigma^*$ of finite strings over an unordered finite alphabet $\Sigma$ admits of several natural partial orders. Some, such as the substring, prefix, and suffix relations, depend on contiguity and lead to many interesting combinatorial questions with practical applications to string-matching. An excellent survey is given by Aho in [1]. In this talk however we will focus on the 'subsequence' partial order. We say that $u = u_1 \cdots u_m$ is a subsequence of ...
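
The relation between the LLCS and edit distance mentioned in the abstract can be made concrete under one extra assumption: if only insertions and deletions are allowed as editing steps (no substitutions), the distance between strings of lengths m and n is m + n - 2 * LLCS. The sketch below is illustrative only; the function names `llcs` and `indel_distance` are hypothetical and not taken from the paper.

```python
def llcs(u, v):
    """Length of a longest common subsequence, via the textbook
    dynamic program kept to two rows (O(len(u) * len(v)) time)."""
    prev = [0] * (len(v) + 1)
    for x in u:
        cur = [0]
        for j, y in enumerate(v, 1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(u, v):
    """Edit distance when only insertions and deletions are allowed
    (an assumption of this sketch): every symbol outside a longest
    common subsequence is deleted from u or inserted from v."""
    return len(u) + len(v) - 2 * llcs(u, v)


print(llcs("kitten", "sitting"))            # common subsequence "ittn"
print(indel_distance("kitten", "sitting"))  # delete k, e; insert s, i, g
```

With substitutions allowed, the usual Levenshtein distance can be smaller than this value, so the identity holds only for the insert/delete model assumed here.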

Upper Bounds for the Expected Length of Longest Common Subsequences

by Vlado Dancik, 1996
Cited by 16 (3 self)
Let $f(n)$ be the expected length of a longest common subsequence of two random sequences over a fixed alphabet of size $k$. It is known that $f(n) \to c_k n$ for some constant $c_k$. We define a collation as a pair of sequences with marked matches. A dominated collation is a collation that is not matched optimally. Upper bounds for $c_k$ can be derived from upper bounds for the number of nondominated collations. Using local properties of matches we can eliminate many nondominated collations and improve upper bounds for $c_k$.

1 Introduction

The problem of finding longest common subsequences arises in various situations. As typical we can mention approximate string matching and text comparisons (e.g. the diff function in UNIX) [1, 11]. Another important area where the longest common subsequence problem appears is molecular biology. The longest common subsequence problem is a special case of the more general sequence alignment problem. A survey on the longest common subsequence problem can be found in...

Bounding the Expected Length of Longest Common Subsequences and Forests

by Ricardo A. Baeza-Yates, Ricard Gavalda, Gonzalo Navarro - Proc. of WSP'96, 1999
Cited by 12 (0 self)
We present two techniques to find lower and upper bounds for the expected length of longest common subsequences and forests of two random sequences of the same length, over a fixed size, uniformly distributed alphabet. We emphasize the power of the methods used, which are Markov chains and Kolmogorov complexity. As a corollary, we obtain some new lower and upper bounds for the problems mentioned.

1 Introduction

The longest common subsequence (LCS) of two strings is one of the main problems in combinatorial pattern matching. The LCS problem is related to DNA or protein alignments, file comparison, speech recognition, etc. We say that x is a subsequence of u if we can obtain x by deleting zero or more characters of u. The LCS of two strings u and v of length n is defined as the longest subsequence x common to u and v. For example, the LCS of longest and large is lge. An open problem related to the LCS is its expected length for two random strings of length n over a uniformly distrib...
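
The definitions in this introduction (subsequence by deletion, LCS as the longest common one) can be checked on the abstract's own example. The following is a minimal sketch using the standard quadratic dynamic program with backtracking; it is not one of the paper's bounding techniques, and the names `is_subsequence` and `lcs` are hypothetical.

```python
def is_subsequence(x, u):
    """True if x can be obtained from u by deleting zero or more characters."""
    it = iter(u)
    # `ch in it` consumes the iterator up to the first match,
    # so the characters of x must appear in u in order.
    return all(ch in it for ch in x)


def lcs(u, v):
    """Return one longest common subsequence of u and v."""
    m, n = len(u), len(v)
    # t[i][j] = LCS length of the prefixes u[:i] and v[:j]
    t = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if u[i] == v[j]:
                t[i + 1][j + 1] = t[i][j] + 1
            else:
                t[i + 1][j + 1] = max(t[i][j + 1], t[i + 1][j])
    # Walk back from the bottom-right corner to recover the characters.
    out = []
    i, j = m, n
    while i and j:
        if u[i - 1] == v[j - 1]:
            out.append(u[i - 1])
            i -= 1
            j -= 1
        elif t[i - 1][j] >= t[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(lcs("longest", "large"))  # the example from the text: lge
```

When several longest common subsequences exist, this sketch returns only one of them; which one depends on the tie-breaking in the backtracking step.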

An Analytic Study of the Phase Transition Line in Local Sequence Alignment with Gaps

by R. Bundschuh, T. Hwa - Appl. Math, 1999
Cited by 9 (5 self)
A detailed analytic study of the log-linear phase transition of the Smith-Waterman local alignment algorithm is presented. A rectangular alignment lattice is introduced to facilitate the statistical analysis for alignment with gaps. With a few simplifying assumptions, we obtain an analytic expression for the loci of the phase transition line. Our result reproduces the exact and conjectured values for the very large and very small gap costs; the latter corresponds to the related problem of the longest common subsequence. For intermediate values of gap costs, our result is not exact, although a comparison to numerical results yielded a difference of no more than several percent.
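
The two limits named in the abstract can be illustrated with a small Smith-Waterman scorer. This is a sketch under stated assumptions: a linear gap cost, a single match score and a single mismatch score, with parameter names and default values chosen here for illustration rather than taken from the paper.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=1):
    """Best local alignment score of a and b under a linear gap cost.

    `gap` is the (non-negative) cost charged per gap position."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(
                0,                 # start a fresh local alignment
                prev[j - 1] + s,   # align a[i-1] with b[j-1]
                prev[j] - gap,     # gap in b
                cur[j - 1] - gap,  # gap in a
            )
            best = max(best, cur[j])
        prev = cur
    return best


# Zero gap cost: gaps are free, so the score equals the LCS length.
print(smith_waterman_score("longest", "large", gap=0))
# Prohibitive gap and mismatch costs: only contiguous exact matches
# survive, so the score is the length of the longest common substring.
print(smith_waterman_score("longest", "large", mismatch=-100, gap=100))
```

With `gap=0` and a negative mismatch score the recurrence reduces to the LCS dynamic program, which is the small-gap-cost limit the abstract refers to.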

Expected Length of the Longest Common Subsequence for Large Alphabets

by Marcos Kiwi (Depto. Ing. Matemática), Jiri Matousek, Martin Loebl
Cited by 6 (1 self)
We consider the length $L$ of the longest common subsequence of two randomly uniformly and independently chosen $n$ character words over a $k$-ary alphabet. Subadditivity arguments yield that $E[L]/n$ converges to a constant $\gamma_k$. We prove a conjecture of Sankoff and Mainville from the early 80's claiming that $\gamma_k \sqrt{k} \to 2$ as $k \to \infty$.

Extensive Simulations for Longest Common Subsequences: Finite Size Scaling, a Cavity Solution, and Configuration Space Properties

by J. Boutet de Monvel, 1998
Cited by 5 (0 self)
Given two strings X and Y of N and M characters respectively, the Longest Common Subsequence (LCS) Problem asks for the longest sequence of (non-contiguous) matches between X and Y. Let $L_N$ be the length of a LCS of two random strings of size $N$. Using extensive Monte Carlo simulations for this problem, we find a finite size scaling law of the form $E(L_N)/N = \gamma_S + A_S/(\sqrt{N}\,\ln N) + \cdots$, where $\gamma_S$ and $A_S$ are constants depending on $S$, the alphabet size. We provide precise estimates of $\gamma_S$ for $2 \le S \le 15$. We also study the related Bernoulli Matching model where the different entries of the "strings" are matched independently with probability $1/S$. Let $L^B_{NM}$ be the length of a longest sequence of matches in this case, for a given instance of size $N \times M$. On the basis of a cavity-like analysis we find $\gamma^B_S(r) = (2\sqrt{rS} - r - 1)/(S - 1)$, where $\gamma^B_S(r)$ is the limit of $E(L^B_{NM})/N$ as $N \to \infty$, the ratio $r = M/N$ being fixed. This formula agrees very we...
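
A small-scale version of the experiment described in this abstract can be sketched as follows. The code assumes the Bernoulli Matching formula reads (2*sqrt(r*S) - r - 1)/(S - 1), estimates E(L_N)/N for random strings by plain Monte Carlo sampling, and uses hypothetical names (`gamma_bernoulli`, `estimate_ratio`) and tiny sample sizes that are nowhere near the scale of the paper's simulations.

```python
import math
import random


def gamma_bernoulli(S, r=1.0):
    """Cavity prediction for the Bernoulli Matching model, as read
    from the abstract: (2*sqrt(r*S) - r - 1) / (S - 1)."""
    return (2.0 * math.sqrt(r * S) - r - 1.0) / (S - 1.0)


def llcs(x, y):
    """LCS length by the textbook two-row dynamic program."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def estimate_ratio(N, S, trials, seed=0):
    """Monte Carlo estimate of E(L_N)/N for two uniform random
    strings of length N over an alphabet of size S."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = [rng.randrange(S) for _ in range(N)]
        y = [rng.randrange(S) for _ in range(N)]
        total += llcs(x, y)
    return total / (trials * N)


for S in (2, 4):
    print(S, round(estimate_ratio(100, S, trials=20), 3),
          round(gamma_bernoulli(S), 3))
```

The two printed columns are not expected to coincide: the formula describes the Bernoulli Matching model rather than random strings, and at small N the finite-size correction in the scaling law is still large.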

Speeding-up Hirschberg and Hunt-Szymanski LCS Algorithms

by Maxime Crochemore, Costas S. Iliopoulos, Yoan J. Pinzon, 2003
Cited by 5 (0 self)
Two algorithms are presented that solve the problem of recovering the longest common subsequence of two strings. The first algorithm is an improvement of Hirschberg’s divide-and-conquer algorithm. The second algorithm is an improvement of Hunt-Szymanski algorithm based on an efficient computation of all dominant match points. These two algorithms use bit-vector operations and are shown to work very efficiently in practice.
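
To give a flavour of what bit-vector operations buy for LCS computation, here is a minimal sketch of a well-known bit-parallel recurrence for the LCS length. It is not claimed to be either of the two algorithms in the paper (which recover the subsequence itself); it only computes the length, and it leans on Python's arbitrary-precision integers in place of machine words. The function name is hypothetical.

```python
def llcs_bitparallel(a, b):
    """LCS length of a and b using one bit per position of a.

    After processing each character of b, the number of zero bits
    in v equals the LCS length of a and the prefix of b seen so far."""
    m = len(a)
    mask = (1 << m) - 1
    # match[ch] has bit i set exactly when a[i] == ch
    match = {}
    for i, ch in enumerate(a):
        match[ch] = match.get(ch, 0) | (1 << i)
    v = mask  # all ones: nothing matched yet
    for ch in b:
        u = v & match.get(ch, 0)
        # One addition, one subtraction and one OR update a whole column
        # of the dynamic-programming table at once.
        v = ((v + u) | (v - u)) & mask
    return m - bin(v).count("1")


print(llcs_bitparallel("longest", "large"))
print(llcs_bitparallel("AGGTAB", "GXTXAYB"))
```

Each character of `b` costs a constant number of word operations per machine word of `a`, which is where the practical speed-up over the cell-by-cell dynamic program comes from.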

Shift Error Detection in Standardized Exams

by Steven Skiena, Pavel Sumazin - In Proc. 11th Symp. Combinatorial Pattern Matching (CPM), 2000
Cited by 2 (1 self)
Hundreds of millions of multiple choice exams are given every year in the United States. These exams permit form-filling shift errors, where an absent-minded mismarking displaces a long run of correct answers. A shift error can substantially alter the exam's score, and thus invalidate it. In this paper, we develop algorithms to accurately detect and correct shift errors, while guaranteeing few false detections. We propose a shift error model, and probabilistic methods to identify shifted exam regions. We describe the results of our search for shift errors in undergraduate Stony Brook exam sets, and in over 100,000 Scholastic Aptitude Tests. These results suggest that approximately 2% of all tests contain shift errors. Extrapolating these results over all multiple choice exams and forms leads us to conclude that exam takers make millions of undetected shift errors each year. Employing probabilistic shift correcting systems is inherently dangerous. Such systems may be taken advantage of by clever examinees, who seek to increase the probability of correct guessing. We conclude our paper with a short study of optimal guessing strategies when faced with a generous shift error correcting system.
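
The notion of a shift error can be illustrated with a deliberately naive detector: for a given offset, find the longest run of questions whose correct answer appears on the answer sheet displaced by that offset. This is only a sketch of the idea; it is not the probabilistic method of the paper, it offers no guarantee about false detections, and the function name and sample data are made up for illustration.

```python
def longest_shifted_run(answers, key, shift):
    """Return (start, length) of the longest run of consecutive
    questions i for which answers[i + shift] == key[i]."""
    best_start, best_len, run = 0, 0, 0
    for i in range(len(key)):
        j = i + shift
        if 0 <= j < len(answers) and answers[j] == key[i]:
            run += 1
            if run > best_len:
                best_len = run
                best_start = i - run + 1
        else:
            run = 0
    return best_start, best_len


key = "ABDCABDCAD"
# Hypothetical sheet: the first three answers are in place, one line
# is mismarked, and every later answer lands one line too low.
answers = "ABD?CABDCA"

print(longest_shifted_run(answers, key, 0))  # unshifted agreement
print(longest_shifted_run(answers, key, 1))  # agreement after shifting by one
```

A long run at a non-zero offset next to a short run at offset zero is the pattern a shift error leaves behind; deciding when such a run is too long to be chance is the part that needs the probabilistic model.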

A Practical Approach to Significance Assessment in Alignment with Gaps

by Nicholas Chia, Ralf Bundschuh
Cited by 1 (0 self)
Abstract. Current numerical methods for assessing the statistical significance of local alignments with gaps are time consuming. Analytical solutions thus far have been limited to specific cases. Here, we present a new line of attack to the problem of statistical significance assessment. We combine this new approach with known properties of the dynamics of the global alignment algorithm and high performance numerical techniques and present a novel method for assessing significance of gaps within practical time scales. The results and performance of these new methods test very well against tried methods with drastically less effort.