A greedy algorithm for aligning DNA sequences
 J. COMPUT. BIOL
, 2000
For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources, a greedy algorithm can be much faster than traditional dynamic programming approaches and yet produce an alignment that is guaranteed to be theoretically optimal. We introduce a new greedy alignment algorithm with particularly good performance and show that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data. An implementation of this algorithm is currently used in a program that assembles the UniGene database at the National Center for Biotechnology Information.
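The greedy idea can be illustrated with the classic furthest-reaching-diagonal method for unit-cost edit distance. This is a minimal sketch under that assumption, not the algorithm of the paper (which works with alignment scores); the function name `greedy_edit_distance` is hypothetical. For each error count d it records how far each diagonal can be followed, so the work grows with the number of differences rather than with the full dynamic programming table.

```python
def greedy_edit_distance(a, b):
    """Unit-cost edit distance via furthest-reaching points on diagonals."""
    n, m = len(a), len(b)

    def slide(i, k):
        # Follow diagonal k (column j = i + k) for free while characters match.
        while i < n and i + k < m and a[i] == b[i + k]:
            i += 1
        return i

    # prev[k] = furthest row reached on diagonal k using d edits.
    prev = {0: slide(0, 0)}
    d = 0
    while prev.get(m - n, -1) < n:
        d += 1
        curr = {}
        for k in range(max(-d, -n), min(d, m) + 1):
            i = max(
                prev.get(k, -1) + 1,      # substitution stays on diagonal k
                prev.get(k - 1, -1),      # insertion: consume one char of b
                prev.get(k + 1, -1) + 1,  # deletion: consume one char of a
            )
            i = min(i, n, m - k)          # stay inside the grid
            curr[k] = slide(i, k)
        prev = curr
    return d


print(greedy_edit_distance("ACGTACGT", "ACGTTCGT"))
```

When the two sequences differ in only a few positions, the loop runs for only a few values of d, which is the situation the abstract describes for sequencing errors.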
Optimal alignments in linear space
 CABIOS
, 1988
Space, not time, is often the limiting factor when computing optimal sequence alignments, and a number of recent papers in the biology literature have proposed space-saving strategies. However, a 1975 computer science paper by Hirschberg presented a method that is superior to the newer proposals, both in theory and in practice. The goal of this note is to give Hirschberg’s idea the visibility it deserves by developing a linear-space version of Gotoh’s algorithm, which accommodates affine gap penalties. A portable C software package implementing this algorithm is available on the BIONET free of charge.
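As a rough illustration of the linear-space idea, the sketch below computes only the optimal cost of a global alignment with affine gap penalties, keeping one row of the table at a time. It is an assumption-laden sketch, not the package described above: the function name and the cost values (`mismatch`, `gap_open`, `gap_extend`) are hypothetical, and recovering the alignment itself would additionally need Hirschberg's divide-and-conquer step, which repeats passes like this one.

```python
import math


def affine_gap_cost(a, b, mismatch=1, gap_open=2, gap_extend=1):
    """Minimum global alignment cost; a gap of length L costs gap_open + L * gap_extend."""
    n = len(b)
    # D[j]: best cost for a[:i] vs b[:j]; initialised for row i = 0.
    D = [0] + [gap_open + gap_extend * j for j in range(1, n + 1)]
    # P[j]: best cost for a[:i] vs b[:j] ending with a character of a against a gap.
    P = [math.inf] * (n + 1)
    for i in range(1, len(a) + 1):
        diag = D[0]                      # D[i-1][0]
        D[0] = gap_open + gap_extend * i
        q = math.inf                     # best cost ending with a character of b against a gap
        for j in range(1, n + 1):
            # D[j] still holds row i-1 here; D[j-1] already holds row i.
            P[j] = min(D[j] + gap_open + gap_extend, P[j] + gap_extend)
            q = min(D[j - 1] + gap_open + gap_extend, q + gap_extend)
            sub = diag + (0 if a[i - 1] == b[j - 1] else mismatch)
            diag = D[j]                  # save D[i-1][j] for the next column
            D[j] = min(sub, P[j], q)
    return D[n]


print(affine_gap_cost("ACGTACGT", "ACGT"))
```

Memory use is proportional to the length of `b`, independent of the length of `a`.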
Mining email social networks
 in Proceedings of the 3rd International Workshop on Mining Software Repositories
, 2006
Communication & Coordination activities are central to large software projects, but are difficult to observe and study in traditional (closed-source, commercial) settings because of the prevalence of informal, direct communication modes. OSS projects, on the other hand, use the internet as the communication medium, and typically conduct discussions in an open, public manner. As a result, the email archives of OSS projects provide a useful trace of the communication and coordination activities of the participants. However, there are various challenges that must be addressed before this data can be effectively mined. Once this is done, we can construct social networks of email correspondents, and begin to address some interesting questions. These include questions relating to participation in the email; the social status of different types of OSS participants; the relationship of email activity and commit activity (in the CVS repositories) and the relationship of social status with commit activity. In this paper, we begin with a discussion of our infrastructure and then discuss our approach to mining the email archives; and finally we present some preliminary results from our data analysis.
An Algorithm for Approximate Tandem Repeats
 In Proceedings of the 4th Annual Symposium on Combinatorial Pattern Matching (CPM), volume 684 of Lecture Notes in Computer Science
, 1993
A perfect single tandem repeat is defined as a nonempty string that can be divided into two identical substrings, e.g. abcabc. An approximate single tandem repeat is one in which the substrings are similar, but not identical, e.g. abcdaacd.
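The two definitions can be made concrete with a short sketch. It assumes that "similar" is measured by unit-cost edit distance, which the abstract does not state; the helper names are hypothetical, and the brute-force search over split points is for illustration only, not the algorithm of the paper.

```python
def is_perfect_tandem_repeat(s):
    """True if s is nonempty and its two halves are identical."""
    half, rem = divmod(len(s), 2)
    return len(s) > 0 and rem == 0 and s[:half] == s[half:]


def edit_distance(a, b):
    """Unit-cost edit distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def best_split(s):
    """Return (distance, split index) for the most similar pair of adjacent parts.

    Requires len(s) >= 2 so that both parts are nonempty.
    """
    return min((edit_distance(s[:i], s[i:]), i) for i in range(1, len(s)))


print(is_perfect_tandem_repeat("abcabc"))
print(best_split("abcdaacd"))
```

For the example string from the abstract, the best split is in the middle, where "abcd" and "aacd" differ by a single substitution.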
Faster Approximate String Matching
 Algorithmica
, 1999
We present a new algorithm for online approximate string matching. The algorithm is based on the simulation of a nondeterministic finite automaton built from the pattern and using the text as input. This simulation uses bit operations on a RAM machine with word length w = Ω(log n) bits, where n is the text size. This is essentially similar to the model used in Wu and Manber's work, although we improve the search time by packing the automaton states differently. The running time achieved is O(n) for small patterns (i.e. whenever mk = O(log n)), where m is the pattern length and k < m the number of allowed errors. This is in contrast with the result of Wu and Manber, which is O(kn) for m = O(log n). Longer patterns can be processed by partitioning the automaton into many machine words, at O((mk/w) n) search cost. We allow generalizations in the pattern, such as classes of characters, gaps and others, at essentially the same search cost. We then explore other novel techniques t...
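To show what simulating the automaton with bit operations looks like, here is a minimal sketch in the row-wise style usually attributed to Wu and Manber, which costs O(kn) word operations. It does not implement the different state packing proposed in this paper. The function name is hypothetical, and Python integers stand in for machine words, so the word-length limit is not modeled.

```python
def bitparallel_search(pattern, text, k):
    """Return text positions where a match with at most k errors ends."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be nonempty")
    mask = (1 << m) - 1
    accept = 1 << (m - 1)
    # B[c] has bit j set when pattern[j] == c.
    B = {}
    for j, ch in enumerate(pattern):
        B[ch] = B.get(ch, 0) | (1 << j)
    # R[i] bit j: pattern[:j+1] matches a suffix of the text read so far with <= i errors.
    R = [((1 << i) - 1) & mask for i in range(k + 1)]
    ends = []
    for pos, ch in enumerate(text):
        bc = B.get(ch, 0)
        prev_old = R[0]
        R[0] = ((R[0] << 1) | 1) & bc
        for i in range(1, k + 1):
            old = R[i]
            R[i] = (
                (((old << 1) | 1) & bc)    # match
                | prev_old                 # insertion (extra text character)
                | ((prev_old << 1) | 1)    # substitution
                | ((R[i - 1] << 1) | 1)    # deletion (R[i-1] is already updated)
            ) & mask
            prev_old = old
        if R[k] & accept:
            ends.append(pos)
    return ends


print(bitparallel_search("abc", "xabdx", 1))
```

Each text character updates k + 1 bit vectors, one per allowed error count, which is where the O(kn) cost of this formulation comes from.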
Communication networks from the Enron email corpus: "It's always about the people. Enron is no different"
 Computational & Mathematical Organization Theory
, 2005
The Enron email corpus is appealing to researchers because it is a) a large scale email collection from b) a real organization c) over a period of 3.5 years. In this paper we contribute to the initial investigation of the Enron email dataset from a social network analytic perspective. We report on how we enhanced and refined the Enron corpus with respect to relational data and how we extracted communication networks from it. We apply various network analytic techniques in order to explore structural properties of the networks in Enron and to identify key players across time. Our initial results indicate that during the Enron crisis the network had been denser, more centralized and more connected than during normal times. Our data also suggests that during the crisis the communication among Enron’s employees had been more diverse with respect to people’s formal positions, and that top executives had formed a tight clique with mutual support and highly brokered interactions with the rest of organization. The insights gained with the analyses we perform and propose are of potential further benefit for modeling the development of crisis scenarios in organizations and the investigation of indicators of failure.
Faster Algorithms for String Matching with k Mismatches
 J. OF ALGORITHMS
, 2000
The string matching with mismatches problem is that of finding the number of mismatches between a pattern P of length m and every length m substring of the text T. Currently, the fastest algorithms for this problem are the following. The Landau-Vishkin algorithm finds all locations where the pattern has at most k errors (where k is part of the input) in time O(nk). The Abrahamson algorithm finds the number of mismatches at every location in time O(n √(m log m)). We present
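The problem statement itself can be pinned down with a naive O(nm) baseline. This sketch only defines the quantities involved; it is none of the algorithms named above, and both function names are hypothetical.

```python
def mismatch_counts(pattern, text):
    """Number of mismatches between pattern and each length-m substring of text."""
    m = len(pattern)
    return [
        sum(p != t for p, t in zip(pattern, text[i:i + m]))
        for i in range(len(text) - m + 1)
    ]


def positions_with_at_most_k(pattern, text, k):
    """Start positions where the pattern matches with at most k mismatches."""
    return [i for i, c in enumerate(mismatch_counts(pattern, text)) if c <= k]


print(mismatch_counts("abc", "abcabd"))
print(positions_with_at_most_k("abc", "abcabd", 1))
```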
Efficient similarity joins for near duplicate detection
 In WWW
, 2008
With the increasing amount of data and the need to integrate data from multiple data sources, one of the challenging issues is to identify near duplicate records efficiently. In this paper, we focus on efficient algorithms to find pairs of records such that their similarities are no less than a given threshold. Several existing algorithms rely on the prefix filtering principle to avoid computing similarity values for all possible pairs of records. We propose new filtering techniques by exploiting the token ordering information; they are integrated into the existing methods and drastically reduce the candidate sizes and hence improve the efficiency. We have also studied the implementation of our proposed algorithm in standalone and RDBMS-based settings. Experimental results show our proposed algorithms can outperform previous algorithms on several real datasets.
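The prefix filtering principle mentioned in the abstract can be sketched as follows, assuming Jaccard similarity over token sets and a global token order by ascending frequency. Both choices, and all names here, are assumptions for illustration. This is the baseline principle only, not the new filtering techniques the paper proposes.

```python
import math
from collections import Counter, defaultdict


def jaccard(a, b):
    return len(a & b) / len(a | b)


def prefix_filter_join(records, threshold):
    """Return (id1, id2, similarity) for record pairs with Jaccard >= threshold."""
    # Global token order: rare tokens first, ties broken by the token itself.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]
    index = defaultdict(list)            # token -> ids of records with it in their prefix
    results = []
    for rid, toks in enumerate(ordered):
        # Two records can reach the threshold only if their prefixes share a token.
        # The small epsilon guards against floating-point round-up in the product.
        prefix_len = len(toks) - math.ceil(threshold * len(toks) - 1e-9) + 1
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)
        for cid in sorted(candidates):   # verify only the surviving candidates
            sim = jaccard(set(toks), set(ordered[cid]))
            if sim >= threshold:
                results.append((cid, rid, sim))
    return results


records = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
print(prefix_filter_join(records, 0.5))
```

Only pairs that share a prefix token are verified, so the third record in the usage example is never compared against the first two.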
Block Edit Models for Approximate String Matching
 Theoretical Computer Science
, 1997
In this paper we examine string block edit distance, in which two strings A and B are compared by extracting collections of substrings and placing them into correspondence. This model accounts for certain phenomena encountered in important real-world applications, including pen computing and molecular biology. The basic problem admits a family of variations depending on whether the strings must be matched in their entireties, and whether overlap is permitted. We show that several variants are NP-complete, and give polynomial-time algorithms for solving the remainder. Keywords: block edit distance, approximate string matching, sequence comparison, approximate ink matching, dynamic programming. 1 Introduction The edit distance model for string comparison [Lev66, NW70, WF74] has found widespread application in fields ranging from molecular biology to bird song classification [SK83]. A great deal of research has been devoted to this area, and numerous algorithms have been proposed for com...
Divide-and-conquer frontier search applied to optimal sequence alignment
 In National Conference on Artificial Intelligence (AAAI)
, 2000
We present a new algorithm that reduces the space complexity of heuristic search. It is most effective for problem spaces that grow polynomially with problem size, but contain large numbers of short cycles. For example, the problem of finding an optimal global alignment of several DNA or amino-acid sequences can be solved by finding a lowest-cost corner-to-corner path in a d-dimensional grid. A previous algorithm, called divide-and-conquer bidirectional search (Korf 1999), saves memory by storing only the Open lists and not the Closed lists. We show that this idea can be applied in a unidirectional search as well. This extends the technique to problems where bidirectional search is not applicable, and is more efficient in both time and space than the bidirectional version. If n is the length of the strings, and d is the number of strings, this algorithm can reduce the memory requirement from O(n^d) to O(n^(d-1)). While our current implementation of DCFS is somewhat slower than existing dynamic programming approaches for optimal alignment of multiple gene sequences, DCFS is a more general algorithm.