Amino Acid Sequences Indicators Of Evolution

9/8/2019


Abstract

Along protein sequences, co-evolution analysis identifies residue pairs showing either a specific co-adaptation, where changes in one residue are compensated by changes in the other during evolution, or a less specific external force that affects the evolutionary rates of both residues to a similar extent. In both cases, independently of the underlying cause, co-evolutionary signatures within or between proteins serve as markers of physical interactions and/or functional relationships. Depending on the type of protein under study, the set of available homologous sequences may differ greatly in size and amino acid variability. BIS2Analyzer is an openly accessible web server providing the online analysis of co-evolving amino-acid pairs in protein alignments. It is especially designed for vertebrate and viral protein families, which typically display a small number of highly similar sequences. It is based on BIS 2, a fast re-implementation of the co-evolution analysis tool Blocks in Sequences (BIS). BIS2Analyzer provides a rich and interactive graphical interface to ease biological interpretation of the results.

INTRODUCTION

In recent years, particular attention has been paid to the study of co-evolving residues within a protein and among proteins.

Co-evolving residues in a protein structure, possibly a complex, are groups of residues whose mutations have arisen simultaneously during the evolution of different species. Several reasons involving the 3D shape of the protein can explain this: functional interactions, conformational changes and folding. Several studies have addressed the problem of extracting signals of co-evolution between residues. These methods provide sets of co-evolved residues that are usually physically close in the 3D structure and form connected networks covering roughly a third of the entire structure. Co-evolved residues have been demonstrated, for a few protein complexes for which experimental data are available, to play a crucial role in allosteric mechanisms, to maintain short paths in network communication and to mediate signaling. Methods such as Direct Coupling Analysis (DCA), EVcouplings and PSICOV are applicable to protein families displaying a large number of evolutionarily related sequences and sufficient divergence; these requirements are the bottleneck of today's co-evolution analysis methods. The requirement for a large number of sequences has been dropped in recently developed methods, but sequence divergence remains a mandatory constraint.

For many proteins characteristic of vertebrate or viral species, the statistics that current co-evolution methods require (to estimate the ‘background noise’ and the relevance of the co-evolution signals) are not applicable because of the reduced number of sequences, whether coming from species or from populations, and because of their conservation. Hence, alternative paradigms should be followed.


Provide historical insights into the process of evolution by identifying the period and lineage in which a function changed, the mutations that arose in that interval and the sequence sites that retain putative signatures of selection. Protein structures can help to identify historical amino-acid replacements that are likely to be involved in. At all suggests evolution.

EVIDENCE FROM BIOCHEMISTRY (MOLECULAR BIOLOGY)

Amino acid sequences of certain proteins can be used to determine how closely related different species are. If the amino acid sequences for a certain protein are very similar in two species, one can assume that those two species had a common ancestor.

To overcome these difficulties, we developed a fast algorithm for the co-evolution analysis of relatively small sets of sequences (where ‘small’ means.

In the first two examples below, BIS2Analyzer was compared to several web servers and programs: EVcouplings (with option PLM, the pseudo-likelihood maximization approach), DCA and PSICOV, all of which produce a list of predicted pairs of co-evolving positions ranked by confidence value. Given that the reliability of statistical methods strongly depends on the number of input sequences, we ran the above tools on two MSAs consisting of larger sets of sequences, in addition to the ones described in Table. For each method, we considered the top 50 predicted pairs. Two further comparisons were made with CAPS and ContactMap.

Fragments of residues in contact within a protein

Amyloid β is a peptide playing a crucial role in Alzheimer's disease. There is experimental evidence that six regions (32 aa in total) of the protein sequence play a role in the disease.

BIS2Analyzer finds 5 co-evolution clusters, for a total of 30 aa, and 26 of these residues overlap the 6 regions with known function. On their 50 top-scored pairs, EVcouplings, DCA and PSICOV do not provide successful results compared to BIS2Analyzer, as reported in Table. Similarly, low performance is reported for CAPS and ContactMap.

Table. Performances of various co-evolution analysis methods.

                 Amyloid β peptide                  B domain/protein A
Method           C/G/P (TP)   Pr (TP)    (R)        C/G/P (TP)   Pr (TP)
BIS 2            5 (4)        30 (26)    (6)        5 (2)        28 (10)
CAPS             3 (2)        14 (6)     (1)        6 (0)        22 (1)
ContactMap       50 (13)      28 (14)    (1)        50 (16)      25 (11)
DCA              50 (8)       29 (14)    (3)        50 (1)       30 (4)
DCA a            50 (26)      37 (23)    (5)        50 (2)       48 (11)
DCA b            50 (13)      29 (15)    (3)        50 (1)       42 (10)
EVcouplings      50 (2)       28 (13)    (1)        50 (0)       32 (3)
EVcouplings a    50 (2)       29 (19)    (1)        50 (0)       26 (1)
EVcouplings b    50 (2)       27 (16)    (1)        50 (3)       46 (9)
PSICOV           50 (11)      33 (20)    (3)        50 (1)       24 (4)
PSICOV a         50 (20)      30 (26)    (3)        50 (4)       45 (8)
PSICOV b         -                                  50 (4)       44 (12)

Hotspot residues within a protein

BIS2Analyzer applied to the B domain of protein A identifies 28 co-evolving residues organized in 5 clusters and finds co-evolution among 10 of the 13 hotspots known to be important for the folding of the protein. Among the 50 top-scored pairs, EVcouplings, PSICOV and DCA do not perform well, as shown in Table.

Note, however, that on the MSA described in Table, DCA detects 26 contacts in the 3D structure out of 50 predictions. CAPS performance is very low, while ContactMap identifies 11 hotspots within a relatively low number of predicted residues.

Finding correlations in unfolded structures

c-KIT is a receptor tyrosine kinase of type III implicated in signaling pathways crucial for cell growth, differentiation and survival. The juxtamembrane region (JMR) is folded in the inactive form of c-KIT and becomes unfolded in the active form.

BIS2Analyzer highlights a cluster of three co-evolving residues, lying in the JMR, that are in physical contact in the inactive form (Figure, left). Visualizing both structures in BIS2Analyzer helps to reason about the structural role of residues in disordered regions.

Figure. Visualization of the c-KIT tyrosine kinase analysis on BIS2Analyzer. (A) Part of the sequence alignment where cluster 9 is localized. (B) Description of the three hits comprising cluster 9. (C) Display of cluster 9 on the c-KIT inactive form (left, 1T45) and on its active form (right, 1PKG). Note that the active form has an unfolded N-terminal that has been partially removed in the crystal (right). (D) Plot of cluster 9 (green dots) on a multiple sequence alignment (MSA) sequence.

Finding long distance correlations

BIS2Analyzer analysis of HCV genotype 1b-MD sequences of the zinc-binding phosphoprotein NS5A highlighted two clusters of co-evolving residues (orange and violet in Figure) localized in the same two regions of the protein. The co-localization of the residues makes it possible to propose biologically interesting hypotheses that explain the correlations.

We can hypothesize a conformational change of the protein and a potential functional role of the co-evolved residue pairs in the possible allosteric movement (Figure). Note that a third independent pair of co-evolving residues localized in the same regions was found in genotype 2b sequences (see Figure 9 in ), adding confidence to the hypothesis. Note also that the two orange residues face each other at the interface of the D1 domain of NS5A (Figure), suggesting a structural role of these residues in the dimeric contact (Figure).

Figure. (A) Visualization of two clusters (orange and violet) obtained by BIS2Analyzer on Hepatitis C Virus (HCV) protein NS5A (1ZH1). These co-evolving residues are localized in the same regions of the protein, suggesting a conformational change (see schema in (C)). (B) The orange co-evolving residues in (A), localized far apart in the monomer structure of NS5A, are found in close proximity in the dimer (1ZH1, see schema in (D)). (F) Co-evolving residues (green; P-value.

Figure. (A) BIS2Analyzer detects six co-evolving residues (blue, P-value ≤ 1.2e−5) located at close distance on the surface of c-KIT tyrosine kinase (1T45). (B) Schema illustrating the distances among the 6 co-evolving residues in (A). (D) Three co-evolving residues (red) lie on opposite faces of protein N (chain A, 4XJN, right). They identify inter-protein contacts at the interface of the mononegavirales protein N assembly (4XJN, left) (see schema in (C)).

Finding distant correlations justified by a large complex assembly

BIS2Analyzer analysis of the mononegavirales protein N (Table) highlighted a cluster comprising three co-evolving residues (red in Figure, right) localized on two opposite faces of the protein.

The co-localization of the residues is visualized in the large structure of the parainfluenza virus 5 nucleocapsid–RNA complex (4XJN), formed by 13 homodimers, where the three residues come into contact after dimerization, as illustrated in Figure, left.

Exploring large sets of divergent sequences with BIS2Analyzer

BIS2Analyzer can be used to explore large sets of divergent sequences, such as the ribosomal L3 protein family. We considered 2414 sequences (alignment length 390 aa) and analyzed 14 subtrees of its distance tree (constructed with BioNJ), selected to contain at least 20 sequences and to display non-trivial clusters of co-evolving residues (Figure). A cluster is considered trivial if (i) it is conserved (P-value = 1), or (ii) the co-evolution pattern comprises only one amino acid occurring more than once, or (iii) the co-evolution pattern is due only to the presence of gaps. After applying BIS2Analyzer, we retained 16 co-evolution clusters, belonging to 11 subtrees, with a P-value.

Figure. (A) Clustering of the phylogenetic tree constructed from a dataset of RL3 sequences (2414 sequences; subset of the UniRef90 dataset in UniProt from which overly divergent sequences have been eliminated). Selected subtrees (shown in color) contain at least 20 sequences and at least one non-trivial co-evolution cluster (see text). (B) BIS 2 co-evolution clusters on the 3D structure (PDB ID: 4U26, chain BD), colored according to the subtree they belong to; the six co-evolution clusters shown above belong to five subtrees and link the ordered extension loop with the structured region.

HOW TO RUN BIS2Analyzer

The web server provides a job submission page (‘Submit’). It requires input files formatted in a standard way, and most of the parameters are set automatically, so basic usage should be straightforward. To get a sense of the type of input required, sample inputs can be loaded for intra- and inter-protein co-evolution analysis. For information on how to customize the default behavior, a detailed tutorial is available online at the ‘Tutorial’ page. Below, we give an overview of the usage of the web server.

Details on the input data

BIS2Analyzer accepts as input an MSA in FASTA format, either copy-pasted or uploaded as a file. Sequences must contain only upper-case characters, dashes and dots. There is no restriction on sequence names. Once the job is submitted, a web link based on a randomly generated jobID is provided, allowing the user to access the data at a later time. Optionally, an e-mail address can be provided, in which case job queuing, start and completion are notified.
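The input constraints stated above (FASTA format, only upper-case characters, dashes and dots) can be checked locally before submitting a job. The following is a minimal sketch and not part of BIS2Analyzer: the function names `parse_fasta` and `check_msa` are hypothetical, and the equal-length check is an added assumption (an alignment is expected to have sequences of identical length).

```python
import re

# Allowed sequence content according to the text: upper-case letters, '-' and '.'.
VALID_SEQUENCE = re.compile(r"[A-Z.\-]+")


def parse_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def check_msa(records):
    """Return a list of problems; an empty list means the input looks acceptable."""
    if not records:
        return ["no sequences found"]
    problems = []
    lengths = set()
    for name, seq in records:
        if not VALID_SEQUENCE.fullmatch(seq):
            problems.append(f"{name}: only upper-case letters, '-' and '.' are allowed")
        lengths.add(len(seq))
    # Assumption of this sketch: aligned sequences must all have the same length.
    if len(lengths) > 1:
        problems.append("sequences differ in length, so this is not an alignment")
    return problems
```

Calling `check_msa(parse_fasta(text))` on the pasted alignment returns an empty list when nothing looks wrong.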

The e-mail reports a mnemonic job name, either chosen by the user or generated automatically.

Guidelines on input sequences

We recommend applying BIS 2 either to tens of sequences, or to a few hundred sequences with relatively high API.

In the latter case, we distinguish very high API (∼80% and above), where BIS 2 can be run with default parameters, from moderately high API (∼60–80%), where BIS 2 could be run with higher dimensions D or with the alphabet reduction option enabled; with alphabet reduction, amino acid variability within the same class is neglected.

BIS2Analyzer default parameters

BIS2Analyzer generates a rooted phylogenetic tree with BIONJ by default. First, it computes the distance matrix, based on the Jones–Taylor–Thornton distance model, with Protdist from PHYLIP version 3.696; then, it uses BIONJ to build the phylogenetic tree and SeaView to re-root the tree. The dimension parameter D is set to 2.
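The identity-based guidelines on input sequences can be turned into a quick pre-check. The sketch below assumes that API denotes an average pairwise identity, which this text does not spell out, and it computes identity over columns where neither sequence has a gap, which is one of several possible conventions. All function names are hypothetical, and the cut-offs of 80 and 60 are the approximate values quoted in the guidelines.

```python
from itertools import combinations

GAP_CHARS = "-."


def pairwise_identity(a, b):
    """Percentage of identical residues over columns where neither sequence has a gap."""
    compared = identical = 0
    for x, y in zip(a, b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        identical += (x == y)
    return 100.0 * identical / compared if compared else 0.0


def average_pairwise_identity(seqs):
    """Mean of pairwise_identity over all unordered pairs of aligned sequences."""
    pairs = list(combinations(seqs, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def suggest_settings(api):
    """Map an identity percentage to the regime described in the guidelines."""
    if api >= 80:
        return "default parameters"
    if api >= 60:
        return "higher dimension D or alphabet reduction"
    return "outside the identity ranges discussed in the guidelines"
```

The thresholds are stated as approximate in the text, so values close to 60% or 80% should be treated as borderline.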

By default, BIS2Analyzer enables the block mode; this means that hits are extended by conservation on neighboring positions. The alphabet reduction option is disabled.

BIS2Analyzer options

The user can provide a phylogenetic tree in NEWICK format (either copy-pasted or uploaded as a file) or select PhyML in place of BIONJ. The tree must be rooted (SeaView can be used for this purpose). The dimension ‘D’ option sets the maximum number of allowed exceptions (with maximum allowed value D ≤ 10).

The ‘block’ option can be disabled to force BIS2 to report co-evolving hits only, without extending a hit into a block. When enabled, the ‘pc’ option reduces the amino-acid alphabet from 20 to 8 letters representing physico-chemical classes of residues, where each residue in a class is assigned the same letter. The eight physico-chemical classes are defined by default as follows: hydrophobic (VILMFWA), negatively charged (DE), positively charged (KR), aromatic (YH), polar (NSTQ), with C, G and P each considered as special. The user can provide a custom definition of amino acid classes by typing, in the dedicated box, a string containing the 20 amino acids with classes separated by commas (for instance: KR,AFILMVW,NQST,HY,C,DE,P,G).

Guidelines for different analyses

Co-evolution analysis within a protein complex is of paramount importance to dissect an interface or to get clues on potentially interacting residues when the structure of the complex is not available. The procedure for such an analysis with BIS2Analyzer is described in the ‘Tutorial’ page. The computational strategy for analyzing large datasets of divergent sequences is also reported in the ‘Tutorial’ page.
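To illustrate the alphabet reduction performed by the ‘pc’ option, the sketch below parses a class string in the comma-separated format quoted above and rewrites a sequence over the reduced alphabet. It is an illustration only: the function names are hypothetical, and using the first residue of each class as the class letter is an arbitrary choice of this sketch, since the text does not say which letter BIS2 assigns.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def build_reduction_map(classes="KR,AFILMVW,NQST,HY,C,DE,P,G"):
    """Map every amino acid to a representative letter of its class."""
    groups = classes.split(",")
    letters = "".join(groups)
    if len(letters) != 20 or set(letters) != AMINO_ACIDS:
        raise ValueError("the class string must contain each of the 20 amino acids exactly once")
    # Assumption: the first residue of a class stands for the whole class.
    return {residue: group[0] for group in groups for residue in group}


def reduce_sequence(seq, mapping):
    """Rewrite a sequence over the reduced alphabet.

    Characters outside the mapping (gaps, dots, non-standard letters) are kept as they are.
    """
    return "".join(mapping.get(ch, ch) for ch in seq)
```

With the default classes, K and R collapse to one letter, as do D and E, so a K-to-R substitution in a column no longer counts as variability.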

DISPLAY OF THE RESULTS

Output

BIS2Analyzer supplies a graphical interface to inspect the co-evolution clusters predicted by BIS 2.

For each dimension considered in the analysis (d ≤ D), BIS2Analyzer displays the MSA labeled with all co-evolution clusters of that dimension (see Figure). At the bottom of the MSA, a histogram reports the conservation level of the most frequent character occurring at each position.
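The per-position conservation shown in the histogram can be approximated offline as the fraction of sequences carrying the most frequent character in each column. This is a sketch under assumptions: the function name is hypothetical, and gaps are counted like any other character, which may differ from what the server does.

```python
from collections import Counter


def column_conservation(seqs):
    """Return, for each alignment column, (most frequent character, its frequency)."""
    if not seqs or len({len(s) for s in seqs}) != 1:
        raise ValueError("expected a non-empty list of equal-length aligned sequences")
    profile = []
    for column in zip(*seqs):
        char, count = Counter(column).most_common(1)[0]
        profile.append((char, count / len(seqs)))
    return profile
```

A frequency of 1.0 marks a fully conserved column, which cannot carry a co-evolution signal by itself.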

A graphical ruler describing each cluster helps to browse the MSA and easily identify the positions belonging to the cluster (‘H’ labels a hit and ‘E’ labels a block extension; Figure).

For each cluster, BIS2Analyzer allows visualization of residue types, physico-chemical properties and MSA positions (Figure). Three scores are provided for each cluster: symmetric, environmental and P-value. The first two scores vary in the interval [0, 1] and are computed by the clustering algorithm CLAG.

They express the degree of ‘similarity’ of co-evolution of the positions in a cluster with respect to all other analysed positions. In particular, scores equal to 1 correspond to a cluster where all positions show an identical co-evolution pattern with all other analysed positions. High scores support confidence in a cluster and, because of this, BIS2Analyzer outputs only clusters for which both scores pass the 0.5 threshold.

The P-value score is computed with a Fisher test on a diagonal matrix, where the elements of the diagonal represent the co-evolution pattern satisfied by all positions in a cluster; for example, 77−3−1 in Figure is a pattern representing three distinct amino acids on three MSA positions that occur on subsets of 77, 3 and 1 sequences, respectively. The subsets are the same for the three positions and, in this case, we talk about a perfect pattern. When the pattern is not perfect, the P-value is computed on the maximum set of aligned sequences displaying a perfect pattern.

Output visualization

The user can visualize each prediction on a sequence through the ‘Mapping to sequence’ page, or on the 3D structure through the ‘Mapping to structure’ page.
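The notion of a perfect pattern used for the P-value score can be made concrete with a short sketch. The first function checks whether several MSA columns split the sequences into exactly the same subsets and returns the subset sizes (for instance 77, 3, 1). The second computes the hypergeometric probability of observing the diagonal contingency table for a pair of positions with fixed margins, which is the basic quantity a Fisher exact test is built from. Both function names are hypothetical, and the text does not confirm that this probability equals the P-value reported by BIS2Analyzer.

```python
from fractions import Fraction
from math import factorial


def perfect_pattern(columns):
    """Return subset sizes (largest first) if all columns induce the same
    partition of the sequences, otherwise None."""
    def partition(column):
        groups = {}
        for index, char in enumerate(column):
            groups.setdefault(char, []).append(index)
        return frozenset(frozenset(g) for g in groups.values())

    partitions = {partition(c) for c in columns}
    if len(partitions) != 1:
        return None
    return sorted((len(g) for g in partitions.pop()), reverse=True)


def diagonal_table_probability(sizes):
    """Probability of a diagonal table with entries `sizes` under fixed margins.

    For a diagonal table the row and column sums both equal the diagonal
    entries, so the general formula reduces to prod(n_i!) / N!.
    """
    numerator = 1
    for size in sizes:
        numerator *= factorial(size)
    return Fraction(numerator, factorial(sum(sizes)))
```

For the 77−3−1 pattern this gives 77!·3!·1!/81! = 1/6654960, about 1.5e−7.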

In addition to the listing of clusters, the web server provides interactive ways to inspect them on one or two proteins of interest.

Mapping on a reference sequence (Figure) can be done on the MSA consensus sequence, on any sequence present in the MSA, or on a new sequence provided by the user as a FASTA file. All co-evolution clusters are viewable on the sequence; they are labeled with different colors and can be enabled or disabled globally or one by one. If the sequence is among the ones in the MSA, it is displayed with the alignment gaps removed. Otherwise, if a new sequence is provided, the Smith–Waterman algorithm is applied to the new sequence and the consensus sequence computed from the MSA. We adopted the same scoring scheme as PSIBLAST: the BLOSUM62 matrix for match and mismatch scores, and penalties of −11 and −1 for gap opening and gap extension, respectively.

A comparison between non-standard amino acids is scored 1 for a match and −4 for a mismatch. Gaps at the beginning or at the end of the alignment are scored 0, so that the sequence provided can be much shorter or longer than the MSA.

Mapping on a reference structure (Figure) is done by providing a PDB file or PDB ID, possibly containing multiple chains. Chains can be enabled or disabled for display. Residues are mapped onto the MSA by retrieving the sequence of each chain of the PDB and aligning each of them independently to the MSA consensus. An alignment score cut-off is set (at 0), and chains that align to the consensus of the MSA with a score lower than the cut-off are not considered. The user can enable or disable each cluster for visualization. The colors used to identify clusters on the structure and on the sequence are consistent.
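The local alignment used to map a new sequence or a PDB chain onto the MSA consensus can be sketched with the standard Smith–Waterman recurrences extended to affine gaps (Gotoh's formulation). The sketch below is not the server's implementation. To stay self-contained it scores residues with the simple 1/−4 scheme quoted for non-standard amino acids instead of BLOSUM62, which would be passed in through the `score` parameter. It also assumes that a gap of length k costs 11 + k, the usual BLAST convention, which the text does not state explicitly. All names are hypothetical.

```python
def simple_score(a, b, match=1, mismatch=-4):
    """Stand-in for a substitution matrix such as BLOSUM62."""
    return match if a == b else mismatch


def local_alignment_score(s, t, score=simple_score, gap_open=-11, gap_extend=-1):
    """Best local alignment score between s and t with affine gap penalties."""
    neg_inf = float("-inf")
    cols = len(t) + 1
    prev_h = [0] * cols          # best score of an alignment ending at (i-1, j)
    prev_f = [neg_inf] * cols    # same, but ending with a gap that skips residues of s
    best = 0
    for i in range(1, len(s) + 1):
        cur_h = [0] * cols
        cur_f = [neg_inf] * cols
        e = neg_inf              # ending with a gap that skips residues of t
        for j in range(1, cols):
            e = max(e + gap_extend, cur_h[j - 1] + gap_open + gap_extend)
            cur_f[j] = max(prev_f[j] + gap_extend, prev_h[j] + gap_open + gap_extend)
            diagonal = prev_h[j - 1] + score(s[i - 1], t[j - 1])
            # The floor at 0 makes the alignment local: unaligned ends cost nothing.
            cur_h[j] = max(0, diagonal, e, cur_f[j])
            best = max(best, cur_h[j])
        prev_h, prev_f = cur_h, cur_f
    return best
```

Because every cell is floored at 0, unaligned flanking regions are free, which matches the statement that gaps at the beginning or end of the alignment are scored 0. A chain would then be kept when its score against the consensus is not below the cut-off.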

Finally, the user can upload up to two PDB structures for the same co-evolution analysis. This is a useful feature when exploring either protein interactions or different foldings of the same protein (e.g. disordered versus ordered regions). The graphical interface is implemented with Protein Viewer (PV), a WebGL-based viewer for proteins and other biological macromolecules that is very fast and usable on smartphones.

DISCUSSION

BIS2Analyzer offers an automatic, though detailed and highly customizable, pipeline; it provides the scientific community with an established method for the co-evolution analysis of very few and/or highly conserved sequences. BIS2Analyzer can be used by biologists to foster hypotheses on protein behavior and new strategies for the design of experiments.

FUNDING

Institut Universitaire de France; French Government Funds, at UPMC, for HPC resources ‘Equip@Meso project - ANR-10-EQPX-29-01’; French Government Excellence Program ‘Investissement d'Avenir’ in Bioinformatics ‘MAPPING project-ANR-11-BINF-0003’. Funding for open access charge: French Government Excellence Program ‘Investissement d'Avenir’ in Bioinformatics ‘MAPPING project-ANR-11-BINF-0003’.

Conflict of interest statement. None declared.
