Computational Tools for Protein-DNA Interactions
BibTeX
@MISC{Kauffman_computationaltools,
author = {Christopher Kauffman and George Karypis},
title = {Computational Tools for Protein-DNA Interactions},
year = {}
}
Abstract
Interactions between DNA and proteins are central to living systems, and characterizing how and when they occur would greatly enhance our understanding of working genomes. We review the different computational problems associated with protein-DNA interactions and the various methods used to solve them. A wide range of topics is covered, including physics-based models for direct and indirect recognition, identification of transcription factor binding sites, and methods to predict DNA-binding proteins. Our goal is to introduce this important problem domain to data mining researchers by identifying the key issues and challenges inherent to the area and by providing directions for fruitful future research.

Interactions between deoxyribonucleic acid (DNA) and proteins are widely recognized as central to living systems. These interactions come in a variety of forms, including repair of damaged DNA and transcription of genes into RNA. More recently it has been found that, by binding to certain DNA segments, proteins can promote or repress the transcription of genes in the vicinity of the binding site. Proteins of this kind are referred to as transcription factors (TFs). The number of TFs in an organism appears to be related to the complexity of the underlying genome: as the number of genes increases, the number of TFs increases according to a power law.

Characterizing how and when protein-DNA interactions occur would greatly enhance our understanding of the genome at work. A full picture of the interactions will eventually allow characterization of which genes are transcribed at any given time in order for the organism to react dynamically to a changing environment. Protein-DNA interactions are studied both in the wet lab and computationally. Here a synergy exists: lab experiments provide data and problems for computational methods to solve, while computation provides hypotheses which guide additional lab experiments.
The goal of this article is to review three major areas of interest for computational studies of protein-DNA interactions: (1) physics-based studies of protein-DNA interaction, (2) identification of transcription factor binding sites, and (3) identification of DNA-binding proteins.

How Many Binding Proteins Exist?

Accounts of how many DNA-binding proteins exist vary through the literature. Attention is particularly focused on transcription factors. Older sources estimated that 2-3% of a prokaryotic genome and 6-7% of a eukaryotic genome encode DNA-binding proteins. According to gene ontology annotations in PEDANT, there are currently 1714 genes in the human genome identified as coding for DNA-binding proteins, with 885 of them identified as transcription factors. This is slightly smaller than the numbers currently in the AmiGO gene ontology browser, which are given in the table below.

GO Term                          Count    %all   %func
All gene products                18269   100.0   115.6
Molecular Function Given         15801    86.5   100.0
DNA-binding                       2375    13.0    15.0
Transcription factor activity      969     5.3     6.1

Most proteins are composed of several independent units called domains. A domain which interacts with DNA is referred to as a DNA-binding domain and contains a structural motif that enables binding.

Physical Models and Energetics

Insight can be gained about DNA-protein interactions by studying them using physics models. Approaches in the literature examine bound protein-DNA complexes and either apply existing software to obtain interaction energy or develop new energy functions. Both approaches make use of complexed structures from the PDB. The goals of such studies are usually to establish why binding happens, to quantify energy changes between the bound and unbound states, and to understand how mutation in either protein or DNA may affect binding affinity.
Basic understanding of binding physics guides both the development of transcription factor binding site models and the generation of protein and DNA features used in machine learning.

Early Work

An early review of the structure motifs used by transcription factors provided a number of principles used by the proteins to recognize DNA. Once a sufficient number of different DNA-binding protein families became available, it became apparent that various protein structures use diverse means of binding and of achieving binding specificity to targeted sequences of DNA, calling for more complex modeling techniques.

Physics of Recognition Mechanisms

Protein-DNA binding is thought to occur because the bound pair has lower free energy than the unbound molecules. A variety of factors governing free energy change are considered by Jayaram and coworkers.

Specificity Tests: Mutating DNA and Protein Sequences

A common use of binding energetics models is to study DNA mutations and their effects on binding energy. Determining which DNA sequences result in low-energy binding to the protein indicates the protein's likely binding sites on the genome.

Transcription Factor Binding Site Identification

Transcription factors (TFs) are DNA-binding proteins whose primary purpose is to regulate the transcription of genes. Though there are some exceptions, many TFs accomplish regulation by binding to DNA at specific sites. The presence of the bound TF will attract or obstruct RNA polymerase, thus promoting or repressing gene expression, respectively. TFs appear in greater abundance in eukaryotes and higher animals, allowing more complex regulatory control of how and when genes are transcribed.

In order to form a picture of the working genome, it has become important to identify the genes that TFs affect by finding the genomic locations to which they bind. Computational tools comprise an important part of this discovery process.
Reviews of TF Binding Site Discovery

Transcription factor binding site identification is a well-studied area but continues to develop rapidly. Here we mention a few good reviews of the area which are useful for understanding the data and tools available for analysis. Narlikar and Ovcharenko provide one such review. Hannenhalli gives a review of current computational techniques for various representations of TF binding sites and how they are derived. Charoensawan and coworkers give a current review of the resources available for study of TFs, including databases of TFs with known binding sites and the types of annotations available for the TFs [1]. Finally, Das and Dai surveyed motif discovery algorithms, which may be of use in determining appropriate algorithms for a particular task.

Motif Identification

Typically biologists are interested in which genes a TF regulates. This can be determined by identifying the genomic locations to which the TF binds. In motif identification, one starts with a collection of DNA sequences thought to contain TF binding sites. The computational task is to identify the TF binding site amongst these DNA sequences. Early approaches used simple models such as exact DNA motif sequences. These have largely been supplanted by position weight matrices (PWMs, alternatively referred to as position specific scoring matrices, PSSMs), as they more accurately model the probabilistic nature of binding. The assumption of independent contributions from each position of a PWM is not entirely realistic, however. The newest models incorporate additional information specific to the experimental technique used to derive the DNA sequence collection.

An alternative to direct motif detection is phylogenetic footprinting. Homologous genomes are aligned to identify conserved noncoding regions which are likely to assume regulatory roles such as acting as TF binding sites. A number of such approaches have been reviewed.

The function of a new gene can be inferred from the TFs associated with it.
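To make the position weight matrix model mentioned above concrete, the following is a minimal Python sketch, not a method taken from any of the reviewed tools. It assumes a small set of already aligned, equal-length binding sites (the example sites and the scanned sequence are hypothetical), a uniform background base frequency of 0.25, and a pseudocount of 1. The function names build_pwm, score_window and best_site are illustrative only. Scores are log-odds values summed over positions, which is exactly the independence assumption noted in the text.

```python
import math

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds PWM from aligned sites (uppercase A/C/G/T only).

    Returns one dict per motif position mapping base -> score.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pwm = []
    for pos in range(length):
        # Start every base at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({base: math.log2((counts[base] / total) / background)
                    for base in BASES})
    return pwm

def score_window(pwm, window):
    """Sum per-position scores; positions are treated as independent."""
    return sum(column[base] for column, base in zip(pwm, window))

def best_site(pwm, sequence):
    """Slide the PWM along the sequence and return (best score, start index)."""
    width = len(pwm)
    scored = [(score_window(pwm, sequence[i:i + width]), i)
              for i in range(len(sequence) - width + 1)]
    return max(scored)

# Hypothetical aligned binding sites and a sequence to scan.
sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pwm = build_pwm(sites)
score, start = best_site(pwm, "GGCGTATAATGCC")
print(start, round(score, 2))
```

In practice the aligned sites are not known in advance and must themselves be discovered, which is the harder part of motif identification; this sketch only covers the scoring step once a matrix is available.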
Using a library of transcription factor binding sites, one can detect TF binding sites in the noncoding region near a gene. Enrichment of a particular TF indicates the gene may share a function with other genes that the TF affects.

Obtaining DNA Sequences for Motif Identification: Experimental Methods

Computational motif identification requires a collection of DNA sequences which contain a DNA-binding motif. Several wet lab techniques can provide such a collection by determining the approximate genomic location of TF binding sites. Chromatin immunoprecipitation (ChIP) is a fundamental tool used in most wet lab TF binding site identification techniques. ChIP allows an in vivo snapshot of the proteins bound to DNA to be obtained. Traditionally ChIP was followed by microarray analysis, together called ChIP-chip.

Alternatively, co-regulated genes may be used as a source for approximate TF binding sites. Genes that are up- and down-regulated together are typically affected by the same TFs. Thus, the noncoding regions near these genes constitute a collection of DNA sequences which are likely to contain binding sites for a TF.

Identification of Binding Proteins and Binding Residues

While studies of transcription factors tend to focus on DNA motifs and binding locations in the genome, attributes specific to DNA-binding proteins are also of interest. After isolating a new protein, biologists frequently want to discern its function. Data mining may be used to distinguish DNA-binding proteins from other types. Once it is established that a given protein interacts with DNA, a biologist may be interested in which of the protein's residues are involved in binding. Computational methods are of service here again to perform binding residue identification.

Both binding protein and binding residue identification may be addressed using techniques from supervised machine learning.
The goal is to train a model which differentiates between the binding (positive) class and the nonbinding (negative) class. The classes may represent either whole proteins or individual residues. The usual process for supervised learning is the following: establish a set of proteins as training examples, determine which features of the proteins will be given to the computational model as input, and then train the model to discriminate between binding and nonbinding classes. Predictive performance is evaluated on proteins which are excluded from the training process in order to judge the method's capabilities on future data.

Whole Protein versus Residue-level Predictions

Most methods focus on predicting at either the whole protein level or the residue level. Some methods accomplish both tasks simultaneously, but for the most part, addressing these two problems calls for different techniques. In the first case, the task is to identify DNA-binding proteins amongst proteins with other functions. This has increasing relevance as both sequencing and structural genomics projects have dramatically increased the number of proteins with unknown function. A variety of methods have been developed to accomplish this task.

Prediction of DNA-binding residues assumes that the protein under scrutiny binds DNA and predicts which residues are involved at the interface. Again, a wide array of approaches utilizing both sequence and structure features have been developed for residue-level prediction.

While DNA-binding protein predictions are used primarily to elucidate the function of a new protein, there are several uses for DNA-binding residue predictions. They may be used to guide wet lab mutation experiments that affect binding affinity between protein and DNA. Rather than trying every residue in the protein, attention may be focused on mutating only residues which are predicted to play a role in binding.
When structure is available but unbound, it may be possible to use predicted binding residues to help identify the geometric binding site on the protein, as has been done for small ligands.

Prediction of DNA-binding Function

In the current literature, most methods approach binding protein and binding residue identification assuming either (1) the protein of interest has known structure, or (2) only the protein's sequence is available. A third class of methods, known as homology modeling or threading, makes predictions by assessing the compatibility of a target protein with DNA-binding structures.

Prediction from Structure

Knowledge of the protein's structure can be very helpful in determining its DNA-binding status. The structure may come from several sources. Traditionally, a protein's structure has been determined experimentally due to specific interest in how it fulfills its role in a biological system. Thus X-ray diffraction is used to determine the structure of protein-DNA complexes, and this information is deposited in structural databases, primarily the PDB. These database entries provide examples for learning predictive models, as the protein's function is typically well characterized. In some cases, two structures of the protein are available: the bound complex which has DNA present (holo protein conformation) and the unbound protein with no DNA present (apo conformation). Though studies of single proteins have traditionally been the source for structure information, structural genomics projects are producing the structures of many new proteins for which no function information is available.

A very simple method of determining whether a protein is DNA-binding is to identify similar structures of known function using any of a number of structure alignment methods. However, the presence of a good structural match does not definitively establish the function of the protein, as proteins with similar structure but different function exist.
DNA-binding residues can be inferred as those structurally aligned to known binding residues.

Rather than rely directly on structural similarity to known DNA-binding proteins to classify the function of new proteins, there are several lines of research which exploit structure features for identification of DNA-binding proteins. Examples of these include direct use of structural motifs and electrostatics to predict function, or the encoding of structural information into features amenable to machine learning methods.

Prediction from Sequence

Difficulty in determining a protein's structure has motivated the development of binding predictors which utilize only sequence information. Such methods predict whether a protein binds DNA and which residues are involved in the process without relying on the geometry of the protein. Aside from using standard sequence database searches such as BLAST and PSI-BLAST, few purely sequence-based methods are available for binding protein prediction. There have been some claims that these "template-free" models, which do not consider structural aspects of the protein, give inferior performance to their structure-based counterparts.

Homology Modeling and Threading

A technique that has proved effective for DNA-binding prediction but does not constitute traditional machine learning is homology modeling and its relative, threading. In both techniques, a target protein with unknown structure is modeled by identifying a template protein of known structure. The target sequence is then mapped onto the known template structure and refined. Threading methods can handle both whole protein and residue-wise binding prediction.

When no structure is available for a target protein, in some cases it may be possible to generate a full three-dimensional model using homology modeling or threading.
In most cases, homology models are not entirely accurate, but for the purpose of determining whether the protein binds DNA, recent work has demonstrated that the use of homology models has promise. Producing a homology model of the protein's structure may fail for several reasons, most commonly because no suitable template is available. Dependence on a good structural template is the primary disadvantage of template-based methods.

Machine Learning Features

Numerous features have been employed in prediction schemes for binding proteins and binding residues. These are divided into structure and sequence features. There is mild overlap in some cases: for instance, secondary structure is available from the protein's structure or it may be predicted from sequence.

Structural Features

• Electrostatic Potentials: Molecular dynamics software is used to compute the charge on each atom, and these charges are usually averaged to assign an electrostatic score to each residue.
• Dipole and quadrupole moments: Charge moments measure how widely distributed electric charge is across the protein. Fairly simple methods can calculate the electric dipole and quadrupole from structure and, according to the cited study, dipoles in combination with overall charge make a fairly discriminatory feature between binding and nonbinding proteins.
• Structural Motifs: Certain structural motifs (patterns) are known for interaction with DNA.
Identifying such a motif in a novel protein can lend support to its classification as a DNA-binding protein.
• Structural Neighborhood: A simple representation of residue environment is to count the other amino acids inside a ball centered on the residue of interest.
• Surface Curvature: In order to accommodate bound DNA, proteins may exhibit certain curvature, at least locally at the binding site.
• Secondary Structure: Proteins assume local, repeated geometric patterns called secondary structure, which may be calculated from the protein's coordinates.
• Solvent Accessible Surface Area (SASA): Binding residues are almost always well exposed to solvent to enable them to form contacts with DNA, making SASA a useful predictive feature. Like secondary structure, SASA can be calculated from the protein structure or predicted from sequence. Some studies limit their focus to only surface residues from the outset.

Sequence Features

• Amino Acid Sequence: The most common feature of sequence-based predictors, the protein's amino acid sequence provides baseline information to the predictor. Each residue of the raw sequence is usually encoded as a 20-dimensional binary vector. Positively charged residues such as arginine are more likely to interact with the negatively charged backbone of DNA according to both physical and statistical studies.
• Residue Class/Type: The twenty amino acids may be grouped according to physical properties such as charge and hydrophobicity, and the group is then used as an additional sequence feature; one such grouping uses six classes.
• Sequence Profiles: The majority of machine learning approaches to bioinformatics problems now employ sequence profiles rather than raw sequence, as profiles are generally acknowledged to provide better information. Profiles are usually generated using PSI-BLAST.
• Global Composition of AAs: When attempting to identify DNA-binding proteins, counts or frequencies of each type of amino acid are often used, typically as a 20-dimensional vector.
Pairs of adjacent residues have also been used as a compositional feature.
• Hydrophobicity: Measures of residue hydrophobicity, the degree to which the residue is repelled by water, are a commonly used feature, typically taken from a residue hydrophobicity scale.
• Evolutionarily Conserved Residues: Residues that mediate interactions between proteins and DNA are usually conserved through evolution. Thus identifying conserved residues can yield a powerful feature. This may be done using only sequence or combined with structural information to yield collections of conserved residues which are proximal in space.

Additionally, sequence features are commonly augmented via sliding windows to capture the local sequence environment of a residue. Features of residues immediately to the left and right are concatenated onto those of a central residue before being presented to the machine learner. Window sizes between one (only the central residue) and eleven (five residues on either side) are commonly used. Many of the features described above are used in sliding windows in the approaches that describe them.

Machine Learning Tools

Most standard machine learning tools have been applied to DNA-binding protein and DNA-binding residue prediction. The short list includes support vector machines (SVMs).

Data Sets

If possible, new studies of DNA-protein interactions should employ a data set that has already been used in the literature. This facilitates direct comparison to previous efforts. Several data sets are in common use. For new data sets, authors should report the maximum level of sequence similarity amongst proteins in the set. The similarity level should be kept at or below 30-35% to be comparable to current methods. This can be accomplished using a sequence clustering program such as blastclust (available from NCBI) to group similar sequences and then selecting a single representative from each cluster.
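Selecting one representative per cluster can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than part of any tool: it assumes the clustering output lists one cluster per line with whitespace-separated sequence identifiers, which should be checked against the output of the clustering program actually used. Keeping the first identifier on each line is an arbitrary policy, and the function name cluster_representatives and the identifiers in the example are hypothetical.

```python
def cluster_representatives(lines):
    """Return the first identifier from each non-empty cluster line.

    `lines` is any iterable of strings, for example an open file object.
    Assumed format: one cluster per line, identifiers separated by whitespace.
    """
    representatives = []
    for line in lines:
        identifiers = line.split()
        if identifiers:  # skip blank lines
            representatives.append(identifiers[0])
    return representatives

# Hypothetical cluster listing with three clusters.
example = ["P1 P7 P9", "P2 P4", "P3"]
print(cluster_representatives(example))  # ['P1', 'P2', 'P3']
```

Other policies, such as keeping the longest sequence or the best-annotated structure in each cluster, may be preferable depending on the study.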
It is also important to eliminate proteins that are subsequences of other proteins in a dataset, which can also be done with blastclust. For example, the following use of blastclust will cluster sequences at 35% identity and detect subsequences that are as little as 10% of the length of other sequences.

    blastclust -i seq.fa -o seq.bc -S 35 -L 0.1

This is the mechanism that was used to analyze the sequence redundancy of the datasets.

When dividing the data set for cross-validation, ensure divisions are done at the protein level even for binding residue prediction: residues from the same protein should not appear in both training and testing sets. When reporting performance, a variety of measures should be included, particularly an ROC analysis.

Current State of the Art

The current crop of DNA-binding protein predictors provides good results when sequences homologous to the target protein are available. DBD-Threader provides a state-of-the-art threading approach which is likely amongst the best predictors when good templates are available for a target.

Most current DNA-binding classification methods rely upon the availability of similar proteins, either explicitly in the case of threading methods, or implicitly through the similarity measures used in machine learning methods and sequence comparison. When homologs to the target protein are not available, the task of identifying DNA-binding proteins and residues is significantly more difficult. The number of experimentally verified DNA-binding structures is likely to continue increasing, which will extend the capabilities of similarity-based methods. However, until homologs are available for all protein families, predicting DNA-binding attributes of new proteins is likely to remain a challenge.

Future Directions

Machine learning has already impacted the study of protein-DNA interactions, particularly the identification of DNA-binding proteins. These innovations are set to continue down a number of avenues.
The capability of machine learning to identify binding residues in a protein may be used to guide physical simulations of protein-DNA interactions. This capability has been utilized in some studies of protein interactions with small molecules to guide ligand docking.

Another avenue of pursuit is applying machine learning to identify the genomic binding sites for transcription factors. There has already been some work done to develop models for various structural classes of TFs.

Finally, a true head-to-head comparison of the various methods for DNA-binding protein identification and DNA-binding residue prediction would guide further development in this area. Dividing a benchmark into sequence-based and structure-based predictions would elucidate how much inference capability is gained when a protein's structure is available.
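As a closing illustration of two practical points made earlier, the sliding-window encoding of sequence features and the recommendation to divide cross-validation data at the protein level, the following Python sketch prepares residue-level examples. It is a minimal sketch under stated assumptions, not the procedure of any particular published method: residues are one-hot encoded as 20-dimensional binary vectors, positions beyond the sequence ends are padded with all-zero vectors, and whole proteins are assigned to folds round-robin. The protein identifiers, sequences, and the function names one_hot, window_features and protein_level_folds are all hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(residue):
    """20-dimensional binary vector; padding or unknown residues give all zeros."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]

def window_features(sequence, half_width=5):
    """One vector per residue: one-hot codes of the residue and its neighbors.

    half_width=5 gives a window of eleven (five residues on either side).
    """
    padded = "-" * half_width + sequence + "-" * half_width
    features = []
    for i in range(len(sequence)):
        window = padded[i:i + 2 * half_width + 1]
        vector = []
        for residue in window:
            vector.extend(one_hot(residue))
        features.append(vector)
    return features

def protein_level_folds(protein_ids, n_folds=5):
    """Assign each whole protein to a single fold (round-robin over sorted ids)."""
    return {pid: k % n_folds for k, pid in enumerate(sorted(protein_ids))}

# Hypothetical proteins; real studies would use a non-redundant data set.
proteins = {"protA": "MKRRS", "protB": "GSHMK", "protC": "AKRTL"}
folds = protein_level_folds(proteins, n_folds=3)

# Every residue inherits the fold of its protein, so residues from one
# protein never appear in both the training and the testing split.
examples = []
for pid, seq in proteins.items():
    for pos, vector in enumerate(window_features(seq, half_width=1)):
        examples.append((folds[pid], pid, pos, vector))

test_fold = 0
train = [e for e in examples if e[0] != test_fold]
test = [e for e in examples if e[0] == test_fold]
print(len(train), len(test))
```

Round-robin assignment is the simplest choice; when proteins differ greatly in length or in the number of binding residues, balancing folds by residue count or class ratio may give more stable estimates.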