Prediction of complete gene structures in human genomic DNA
 J. Mol. Biol
, 1997
Cited by 748 (7 self)
The problem of identifying genes in genomic DNA sequences by computational methods has attracted considerable research attention in recent years. From one point of view, the problem is closely
Current methods of gene prediction, their strengths and weaknesses
, 2002
Cited by 64 (4 self)
While the genomes of many organisms have been sequenced over the last few years, transforming such raw sequence data into knowledge remains a hard task. A great number of prediction programs have been developed that try to address one part of this problem, which consists of locating the genes along a genome. This paper reviews the existing approaches to predicting genes in eukaryotic genomes and underlines their intrinsic advantages and limitations. The main mathematical models and computational algorithms adopted are also briefly described and the resulting software classified according to both the method and the type of evidence used. Finally, the several difficulties and pitfalls encountered by the programs are detailed, showing that improvements are needed and that new directions must be considered.
Evaluation of gene-finding programs on mammalian sequences
 Genome Res
, 2001
Identification of Genes in Human Genomic DNA
, 1997
Cited by 30 (1 self)
A general probabilistic model of the gene structural and compositional properties of human genomic DNA is introduced and applied to the problem of identifying genes in unannotated human genomic sequences. The model uses a "Hidden semi-Markov" or semi-Markov source architecture which incorporates probabilistic descriptions of fundamental transcriptional, translational and splicing signals, as well as length distributions and compositional features of exons, introns and intergenic regions. Distinct sets of model parameters are derived which account for many of the substantial differences in gene density and structure observed in distinct C+G compositional regions ("isochores") of the human genome. A novel model building procedure, termed Maximal Dependence Decomposition, is introduced which captures potentially important dependencies between non-adjacent as well as adjacent positions in a biological signal. Application of this model to the donor splice signal not only gives better discrimination of potential donor sites than previous probabilistic models, but also reveals subtle properties of this signal which suggest aspects of its biochemical function. Acceptor
Nature and structure of human genes that generate retropseudogenes
 Genome Res 10: 672–678
, 2000
An optimal algorithm for the maximum-density segment problem
 SIAM Journal on Computing
Cited by 14 (0 self)
We address a fundamental problem arising from analysis of biomolecular sequences. The input consists of two numbers $w_{\min}$ and $w_{\max}$ and a sequence $S$ of $n$ number pairs $(a_i, w_i)$ with $w_i > 0$. Let segment $S(i, j)$ of $S$ be the consecutive subsequence of $S$ between indices $i$ and $j$. The density of $S(i, j)$ is $d(i, j) = (a_i + a_{i+1} + \cdots + a_j)/(w_i + w_{i+1} + \cdots + w_j)$. The maximum-density segment problem is to find a maximum-density segment over all segments $S(i, j)$ with $w_{\min} \le w_i + w_{i+1} + \cdots + w_j \le w_{\max}$. The best previously known algorithm for the problem, due to Goldwasser, Kao, and Lu, runs in $O(n \log(w_{\max} - w_{\min} + 1))$ time. In the present paper, we solve the problem in $O(n)$ time. Our approach bypasses the complicated right-skew decomposition, introduced by Lin, Jiang, and Chao. As a result, our algorithm has the capability to process the input sequence in an online manner, which is an important feature for dealing with genome-scale sequences. Moreover, for a type of input sequences $S$ representable in $O(m)$ space, we show how to exploit the sparsity of $S$ and solve the maximum-density segment problem for $S$ in $O(m)$ time.
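To make the problem statement in this abstract concrete, here is a minimal brute-force sketch in Python. It is not the O(n) algorithm the paper describes; it is a quadratic-time reference that only illustrates the definition of density and the width constraint. The function name `max_density_segment` and the 0-based inclusive indexing are assumptions made for this illustration.

```python
from itertools import accumulate


def max_density_segment(pairs, wmin, wmax):
    """Brute-force reference for the maximum-density segment problem.

    pairs: list of (a, w) with every w > 0.
    Returns (density, i, j) with 0-based inclusive indices, or None when
    no segment has total width in [wmin, wmax].
    This is O(n^2) and only illustrates the definition; it is NOT the
    linear-time algorithm from the paper.
    """
    # Prefix sums so that any segment sum is a difference of two entries.
    a_prefix = [0] + list(accumulate(a for a, _ in pairs))
    w_prefix = [0] + list(accumulate(w for _, w in pairs))

    best = None
    n = len(pairs)
    for i in range(n):
        for j in range(i, n):
            width = w_prefix[j + 1] - w_prefix[i]
            if width > wmax:
                break  # widths only grow with j because every w > 0
            if width >= wmin:
                density = (a_prefix[j + 1] - a_prefix[i]) / width
                if best is None or density > best[0]:
                    best = (density, i, j)
    return best
```

For example, with unit widths and values 1, 4, 2, 5, 0, calling `max_density_segment([(1, 1), (4, 1), (2, 1), (5, 1), (0, 1)], 2, 3)` selects the segment covering indices 1 through 3, whose density is 11/3.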
Linear-Time Algorithms for Computing Maximum-Density Sequence Segments with Bioinformatics Applications
Cited by 13 (1 self)
We study an abstract optimization problem arising from biomolecular sequence analysis. For a sequence $A$ of pairs $(a_i, w_i)$ for $i = 1, \ldots, n$ and $w_i > 0$, a segment $A(i, j)$ is a consecutive subsequence of $A$ starting with index $i$ and ending with index $j$. The width of $A(i, j)$ is $w(i, j) = \sum_{i \le k \le j} w_k$, and the density is $\left(\sum_{i \le k \le j} a_k\right)/w(i, j)$. The maximum-density segment problem takes $A$ and two values $L$ and $U$ as input and asks for a segment of $A$ with the largest possible density among those of width at least $L$ and at most $U$. When $U$ is unbounded, we provide a relatively simple, $O(n)$-time algorithm, improving upon the $O(n \log L)$-time algorithm by Lin, Jiang and Chao. When both $L$ and $U$ are specified, there are no previous nontrivial results. We solve the problem in $O(n)$ time if $w_i = 1$ for all $i$, and more generally in $O(n + n \log(U - L + 1))$ time when $w_i \ge 1$ for all $i$.
Statistical Properties of Open Reading Frames in Complete Genome Sequences
, 1999
Cited by 10 (1 self)
Some statistical properties of open reading frames in all currently available complete genome sequences are analyzed (seventeen prokaryotic genomes, and 16 chromosome sequences from the yeast genome). The size distribution of open reading frames is characterized by various techniques, such as quantile tables, QQ-plots, rank-size plots (Zipf's plots), and spatial densities. The issue of the influence of CG% on the size distribution is addressed. When yeast chromosomes are compared with archaeal and eubacterial genomes, they tend to have more long open reading frames. There is little or no evidence to reject the null hypothesis that open reading frames on six different reading frames and two strands distribute similarly. A topic of current interest, the base composition asymmetry in open reading frames between the two strands, is studied using regression analysis. The base composition asymmetry at three codon positions is analyzed separately. It was shown in these genome sequences that the first codon position is G- and A-rich (i.e. purine-rich); there is a coexistence of A- and T-rich branches at the second codon position; and the third codon position is weakly T-rich.
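The per-codon-position base composition mentioned in this abstract can be illustrated with a short Python sketch. This is an assumed, simplified calculation for a single in-frame sequence, not the paper's regression analysis; the function name `codon_position_composition` is hypothetical.

```python
from collections import Counter


def codon_position_composition(orf):
    """Return base frequencies at codon positions 1, 2 and 3.

    orf: an in-frame DNA string (assumption: it starts at codon position 1).
    Trailing bases that do not complete a codon are ignored.
    Symbols other than A, C, G, T count toward the totals but are not reported.
    """
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3  # drop an incomplete final codon
    counts = [Counter(), Counter(), Counter()]
    for k in range(usable):
        counts[k % 3][orf[k]] += 1

    result = []
    for c in counts:
        total = sum(c.values())
        result.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return result
```

For the toy sequence `"ATGGCTGAA"` (codons ATG, GCT, GAA), the first codon position holds A, G, G, so it is entirely purine in this tiny example. Real analyses would aggregate over many open reading frames.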
Fast Algorithms for Finding Maximum-Density Segments of a Sequence with Applications to Bioinformatics
 in Proceedings of the Second International Workshop on Algorithms in Bioinformatics
, 2002
Cited by 4 (1 self)
We study an abstract optimization problem arising from biomolecular sequence analysis. For a sequence $A = \langle a_1, a_2, \ldots, a_n \rangle$ of real numbers, a segment $S$ is a consecutive subsequence $\langle a_i, a_{i+1}, \ldots, a_j \rangle$.
Identification of a candidate regulatory region in the human CD8 gene complex by colocalization of DNase I hypersensitive sites and matrix attachment regions which bind SATB1
, 2002
Cited by 3 (0 self)
To locate elements regulating the human CD8 gene complex, we mapped nuclear matrix attachment regions (MARs) and DNase I hypersensitive (HS) sites over a 100-kb region that included the CD8B gene, the intergenic region, and the CD8A gene. MARs facilitate long-range chromatin remodeling required for enhancer activity and have been found closely linked to several lymphoid enhancers. Within the human CD8 gene complex, we identified six DNase HS clusters, four strong MARs, and several weaker MARs. Three of the strong MARs were closely linked to two tissue-specific DNase HS clusters (III and IV) at the 3′ end of the CD8B gene. To further establish the importance of this region, we obtained 19 kb of sequence and screened for potential binding sites for the MAR-binding protein, SATB1, and for GATA3, both of which are critical for T cell development. By gel shift analysis we identified two strong SATB1 binding sites, located 4.5 kb apart, in strong MARs. We also detected strong GATA3 binding to an oligonucleotide containing two GATA3 motifs located at an HS site in cluster IV. This clustering of DNase HS sites and MARs capable of binding SATB1 and GATA3 at the 3′ end of the CD8B gene suggests that this region is an epigenetic regulator of CD8 expression. The Journal of Immunology, 2002, 168: 3915–3922. As thymocytes progress through development, they undergo induction and repression of a number of cell surface molecules. Changes in the expression of the CD4 and CD8 T cell surface glycoproteins best characterize the stages