Prediction of amphipathic in-plane membrane anchors in monotopic proteins using a SVM classifier.
BMC Bioinformatics. 2006 May 16;7(1):255
Sapay N, Guermeur Y, Deleage G.

BACKGROUND: Membrane proteins are estimated to represent about 25% of open reading frames in fully sequenced genomes. However, the experimental study of proteins remains difficult. Considerable efforts have thus been made to develop prediction methods. Most of these were conceived to detect transmembrane helices in polytopic proteins. Alternatively, a membrane protein can be monotopic and anchored via an amphipathic helix inserted in a parallel way to the membrane interface, so-called in-plane membrane (IPM) anchors. This type of membrane anchor is still poorly understood and no suitable prediction method is currently available. RESULTS: We report here the "AmphipaSeeK" method developed to predict IPM anchors. It uses a set of 21 reported examples of IPM anchored proteins. The method is based on a pattern recognition Support Vector Machine with a dedicated kernel and multiple alignments. CONCLUSIONS: AmphipaSeeK was shown to be highly specific, in contrast with classically used methods (e.g. hydrophobic moment). Additionally, it has been able to retrieve IPM anchors in naively tested sets of transmembrane proteins (e.g. PagP). AmphipaSeeK and the list of the 21 IPM anchored proteins are available on NPS@, our protein sequence analysis server.
Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.
Nucleic Acids Res. 1997 Sep 1;25(17):3389-3402
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA.

The BLAST programs are widely used tools for searching protein and DNA databases for sequence similarities. For protein comparisons, a variety of definitional, algorithmic and statistical refinements described here permits the execution time of the BLAST programs to be decreased substantially while enhancing their sensitivity to weak similarities. A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original. In addition, a method is introduced for automatically combining statistically significant alignments produced by BLAST into a position-specific score matrix, and searching the database using this matrix. The resulting Position-Specific Iterated BLAST (PSI-BLAST) program runs at approximately the same speed per iteration as gapped BLAST, but in many cases is much more sensitive to weak but biologically relevant sequence similarities. PSI-BLAST is used to uncover several new and interesting members of the BRCT superfamily.
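The idea of turning a set of alignments into a position-specific score matrix can be illustrated with a deliberately simplified sketch. It assumes ungapped, equal-length aligned segments, a uniform background frequency and a flat pseudocount; the function names `build_pssm` and `score_segment` are hypothetical, and PSI-BLAST itself uses sequence weighting and a more elaborate pseudocount scheme that are not reproduced here.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    Assumptions (not from the paper): no gaps, uniform background,
    and the same flat pseudocount for every residue.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned segments must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores of a segment as long as the matrix."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


pssm = build_pssm(["ACD", "ACE", "ACD"])
print(round(score_segment(pssm, "ACD"), 2))
```

A segment that resembles the alignment scores higher than an unrelated one, which is the property an iterated search exploits when it rescans the database with the matrix.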
CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.
Nucleic Acids Res 1994 Nov 11;22(22):4673-4680
Thompson JD, Higgins DG, Gibson TJ
European Molecular Biology Laboratory, Heidelberg, Germany.

The sensitivity of the commonly used progressive multiple sequence alignment method has been greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones. Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned. Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions. These modifications are incorporated into a new program, CLUSTAL W which is freely available.
Coiled-coil prediction
Predicting coiled coils from protein sequences.
Science 1991 May 24;252(5010):1162-1164
Lupas A, Van Dyke M, Stock J
Department of Molecular Biology, Princeton University, NJ 08544.

The probability that a residue in a protein is part of a coiled-coil structure was assessed by comparison of its flanking sequences with sequences of known coiled-coil proteins. This method was used to delineate coiled-coil domains in otherwise globular proteins, such as the leucine zipper domains in transcriptional regulators, and to predict regions of discontinuity within coiled-coil structures, such as the hinge region in myosin. More than 200 proteins that probably have coiled-coil domains were identified in GenBank, including alpha- and beta-tubulins, flagellins, G protein beta subunits, some bacterial transfer RNA synthetases, and members of the heat shock protein (Hsp70) family.
An algorithm for protein secondary structure prediction based on class prediction.
Protein Eng 1987 Aug;1(4):289-294
Deleage G, Roux B
Laboratoire de Physico-Chimie Biologique, LBTM-CNRS UM 24, Universite Claude Bernard, Villeurbanne, France.

An algorithm has been developed to improve the success rate in the prediction of the secondary structure of proteins by taking into account the predicted class of the proteins. This method has been called the 'double prediction method' and consists of a first prediction of the secondary structure from a new algorithm which uses parameters of the type described by Chou and Fasman, and the prediction of the class of the proteins from their amino acid composition. These two independent predictions allow one to optimize the parameters calculated over the secondary structure database to provide the final prediction of secondary structure. This method has been tested on 59 proteins in the database (i.e. 10,322 residues) and yields 72% success in class prediction, 61.3% of residues correctly predicted for three states (helix, sheet and coil) and a good agreement between observed and predicted contents in secondary structure.
Identification and application of the concepts important for accurate and reliable protein secondary structure prediction
Protein Sci 1996 Nov;5(11):2298-310
King RD, Sternberg MJ
Biomolecular Modelling Laboratory, Imperial Cancer Research Fund, London, United Kingdom.

A protein secondary structure prediction method from multiply aligned homologous sequences is presented with an overall per residue three-state accuracy of 70.1%. There are two aims: to obtain high accuracy by identification of a set of concepts important for prediction followed by use of linear statistics; and to provide insight into the folding process. The important concepts in secondary structure prediction are identified as: residue conformational propensities, sequence edge effects, moments of hydrophobicity, position of insertions and deletions in aligned homologous sequence, moments of conservation, auto-correlation, residue ratios, secondary structure feedback effects, and filtering. Explicit use of edge effects, moments of conservation, and auto-correlation are new to this paper. The relative importance of the concepts used in prediction was analyzed by stepwise addition of information and examination of weights in the discrimination function. The simple and explicit structure of the prediction allows the method to be reimplemented easily. The accuracy of a prediction is predictable a priori. This permits evaluation of the utility of the prediction: 10% of the chains predicted were identified correctly as having a mean accuracy of > 80%. Existing high-accuracy prediction methods are "black-box" predictors based on complex nonlinear statistics (e.g., neural networks in PHD: Rost & Sander, 1993a). For medium- to short-length chains (> or = 90 residues and < 170 residues), the prediction method is significantly more accurate (P < 0.01) than the PHD algorithm (probably the most commonly used algorithm). In combination with the PHD, an algorithm is formed that is significantly more accurate than either method, with an estimated overall three-state accuracy of 72.4%, the highest accuracy reported for any prediction method.
Dictionary of protein secondary structure : pattern recognition of hydrogen-bonded and geometrical features
Biopolymers 1983, 22: 2577-2637
Kabsch W & Sander C

Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms.
Genomics (1991) 11:635-650
Pearson WR
Department of Biochemistry, University of Virginia, Charlottesville 22908.

The sensitivity and selectivity of the FASTA and the Smith-Waterman protein sequence comparison algorithms were evaluated using the superfamily classification provided in the National Biomedical Research Foundation/Protein Identification Resource (PIR) protein sequence database. Sequences from each of the 34 superfamilies in the PIR database with 20 or more members were compared against the protein sequence database. The similarity scores of the related and unrelated sequences were determined using either the FASTA program or the Smith-Waterman local similarity algorithm. These two sets of similarity scores were used to evaluate the ability of the two comparison algorithms to identify distantly related protein sequences. The FASTA program using the ktup = 2 sensitivity setting performed as well as the Smith-Waterman algorithm for 19 of the 34 superfamilies. Increasing the sensitivity by setting ktup = 1 allowed FASTA to perform as well as Smith-Waterman on an additional 7 superfamilies. The rigorous Smith-Waterman method performed better than FASTA with ktup = 1 on 8 superfamilies, including the globins, immunoglobulin variable regions, calmodulins, and plastocyanins. Several strategies for improving the sensitivity of FASTA were examined. The greatest improvement in sensitivity was achieved by optimizing a band around the best initial region found for every library sequence. For every superfamily except the globins and immunoglobulin variable regions, this strategy was as sensitive as a full Smith-Waterman. For some sequences, additional sensitivity was achieved by including conserved but nonidentical residues in the lookup table used to identify the initial region.
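For readers unfamiliar with the Smith-Waterman local similarity algorithm that serves as the rigorous baseline in this comparison, here is a minimal sketch of its scoring recurrence. The match, mismatch and linear gap values are illustrative assumptions; the study itself used amino acid scoring matrices, and the function name `smith_waterman_score` is hypothetical.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and a simple match/mismatch score
    (assumed values, chosen only for illustration).
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACDEFG", "XXCDEYY"))
```

The full dynamic-programming table is what makes the method rigorous but slow; FASTA's ktup lookup restricts the search to regions around initial word matches.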

Improved tools for biological sequence comparison.
PNAS (1988) 85:2444-2448
Pearson WR, Lipman DJ
Department of Biochemistry, University of Virginia, Charlottesville 22908.

We have developed three computer programs for comparisons of protein and DNA sequences. They can be used to search sequence data bases, evaluate similarity scores, and identify periodic structures based on local sequence similarity. The FASTA program is a more sensitive derivative of the FASTP program, which can be used to search protein or DNA sequence data bases and can compare a protein sequence to a DNA sequence data base by translating the DNA data base as it is searched. FASTA includes an additional step in the calculation of the initial pairwise similarity score that allows multiple regions of similarity to be joined to increase the score of related sequences. The RDF2 program can be used to evaluate the significance of similarity scores using a shuffling method that preserves local sequence composition. The LFASTA program can display all the regions of local similarity between two sequences with scores greater than a threshold, using the same scoring parameters and a similar alignment algorithm; these local similarities can be displayed as a "graphic matrix" plot or as individual alignments. In addition, these programs have been generalized to allow comparison of DNA or protein sequences based on a variety of alternative scoring matrices.
Generalized profiles
A flexible motif search technique based on generalized profile
Comput Chem 1996 Mar;20(1):3-23
Bucher P, Karplus K, Moeri N, Hofmann K
Swiss Institute for Experimental Cancer Research, Epalinges, Switzerland.

A flexible motif search technique is presented which has two major components: (1) a generalized profile syntax serving as a motif definition language; and (2) a motif search method specifically adapted to the problem of finding multiple instances of a motif in the same sequence. The new profile structure, which is the core of the generalized profile syntax, combines the functions of a variety of motif descriptors implemented in other methods, including regular expression-like patterns, weight matrices, previously used profiles, and certain types of hidden Markov models (HMMs). The relationship between generalized profiles and other biomolecular motif descriptors is analyzed in detail, with special attention to HMMs. Generalized profiles are shown to be equivalent to a particular class of HMMs, and conversion procedures in both directions are given. The conversion procedures provide an interpretation for local alignment in the framework of stochastic models, allowing for clear, simple significance tests. A mathematical statement of the motif search problem defines the new method exactly without linking it to a specific algorithmic solution. Part of the definition includes a new definition of disjointness of alignments.
Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins.
J Mol Biol 1978 Mar 25;120(1):97-120
Garnier J, Osguthorpe DJ, Robson B

1) Co-operation between a laboratory interested in developing the theory for protein secondary structure prediction methods and a laboratory interested in applying and comparing such methods has led to the development of a simple predictive algorithm.
2) Four-state predictions, in which each residue is unambiguously assigned one conformational state of alpha-helix, extended chain, reverse turn or coil, predict 49% of residue states correctly (in a sample of 26 proteins) when the overall helix and extended-chain content is not taken into account.
3) When the relative abundances of helix, extended chain, reverse turn and coil observed by X-ray crystallography are taken into account, a single constant for each protein and type of conformation can be used to bias the prediction. When predictions are optimized in this way, 63% of all residue states are unambiguously and correctly assigned.
4) By analysing the nature of the bias required, proteins can be classified into helix-rich types, pleated-sheet-rich types, and so on. It is shown that, if the type of protein can be determined even approximately by circular dichroism, 57% of residue states can be correctly predicted without taking into account the X-ray structure. Further, comparable predictions can be obtained if, instead of circular dichroism, preliminary predictions are made to assess the protein type.
5) It is emphasized that the numbers quoted here depend on the method used to assess accuracy, and the algorithm is shown to be at least as good as, and usually superior to, the reported prediction methods assessed in the same way.
6) Ways of further enhancing predictions by the use of additional information from hydrophobic triplets and homologous sequences are also explored. Hydrophobic triplet information does not significantly improve predictive power and it is concluded that this information is used by proteins in the next stage of folding. On the other hand, the use of homologous sequences appears to be very promising.
7) The implication of these results in protein folding is discussed.

Further developments of protein secondary structure prediction using information theory. New parameters and consideration of residue pairs.
J Mol Biol 1987 Dec 5;198(3):425-443
Gibrat JF, Garnier J, Robson B
Laboratoire de Biochimie-Physique, INRA, Universite de Paris-Sud, Orsay, France.

We have re-evaluated the information used in the Garnier-Osguthorpe-Robson (GOR) method of secondary structure prediction with the currently available database. The framework of information theory provides a means to formulate the influence of local sequence upon the conformation of a given residue, in a rigorous manner. However, the existing database does not allow the evaluation of parameters required for an exact treatment of the problem. The validity of the approximations drawn from the theory is examined. It is shown that the first-level approximation, involving single-residue parameters, is only marginally improved by an increase in the database. The second-level approximation, involving pairs of residues, provides a better model. However, in this case the database is not big enough and this method might lead to parameters with deficiencies. Attention is therefore given to overcoming this lack of data. We have determined the significant pairs and the number of dummy observations necessary to obtain the best result for the prediction. This new version of the GOR method increases the accuracy of prediction by 7%, bringing the amount of residues correctly predicted to 63% for three states and 68 proteins, each protein to be predicted being removed from the database and the parameters derived from the other proteins. If the protein to be predicted is kept in the database the accuracy goes up to 69.7%.
GOR secondary structure prediction method version IV
Methods in Enzymology 1996 R.F. Doolittle Ed., vol 266, 540-553
Garnier J, Gibrat J-F, Robson B

The GOR method is based on information theory and was developed by J. Garnier, D. Osguthorpe and B. Robson (J. Mol. Biol. 120, 97, 1978). The present version, GOR IV, uses all possible pair frequencies within a window of 17 amino acid residues and is reported by J. Garnier, J.F. Gibrat and B. Robson in Methods in Enzymology, vol 266, p 540-553 (1996). After cross-validation on a database of 267 proteins, version IV of GOR has a mean accuracy of 64.4% for a three-state prediction (Q3). The program gives two outputs: the first is eye-friendly and gives the sequence and the predicted secondary structure in rows (H=helix, E=extended or beta strand, C=coil); the second gives the probability values for each secondary structure at each amino acid position. The predicted secondary structure is the one of highest probability compatible with a predicted helix segment of at least four residues and a predicted extended segment of at least two residues.
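The three-state accuracy (Q3) quoted above, and the minimum segment lengths, can be made concrete with a short sketch. This is not the GOR IV program; `q3_accuracy` and `short_segments` are hypothetical helper names, and the strings use the H/E/C alphabet described in the entry.

```python
from itertools import groupby


def q3_accuracy(observed, predicted):
    """Percentage of residues whose predicted state (H, E or C) is correct."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty strings of equal length")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def short_segments(predicted, min_len=None):
    """List (state, start, length) for runs shorter than the minimum length.

    Default minimum lengths follow the entry: 4 for helix, 2 for extended.
    """
    if min_len is None:
        min_len = {"H": 4, "E": 2}
    found, start = [], 0
    for state, run in groupby(predicted):
        length = len(list(run))
        if length < min_len.get(state, 1):
            found.append((state, start, length))
        start += length
    return found


print(q3_accuracy("HHHHEEEECC", "HHHCEEECCC"))
print(short_segments("CCHHHCCEC"))
```

A prediction that passes the second check contains no helix shorter than four residues and no extended segment shorter than two, which is the compatibility condition stated in the entry.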
Profile hidden Markov models
Bioinformatics 1998;14(9):755-763
Eddy SR
Department of Genetics, Washington University School of Medicine, St Louis, USA.

The recent literature on profile hidden Markov model (profile HMM) methods and software is reviewed. Profile HMMs turn a multiple sequence alignment into a position-specific scoring system suitable for searching databases for remotely homologous sequences. Profile HMM analyses complement standard pairwise comparison methods for large-scale sequence analysis. Several software implementations and two large libraries of profile HMMs of common protein domains are available. HMM methods performed comparably to threading methods in the CASP2 structure prediction exercise.
Combinaison de classifieurs statistiques, Application a la prediction de structure secondaire des proteines
PhD Thesis
Guermeur, Y
Model combination has recently been at the origin of significant improvements in the field of statistical learning, both for regression and pattern recognition tasks. However, fundamental questions have remained virtually untackled. Few criteria have thus been developed to motivate the choice of a specific method, whereas no independent result has been derived in the field of discrimination. This dissertation deals with one of the most commonly used combination techniques: linear regression. We first characterize the regularizing effect of the "stacked regression" method introduced by Breiman. We then study the application of the multivariate linear regression model to the combination of discriminant experts, the outputs of which are estimates of the class posterior probabilities. This question is successively considered from the point of view of optimization and complexity control. The latter point involves the computation of generalized Vapnik-Chervonenkis dimensions. The study is followed up with the description of a non-parametric method for Bayes' error rate estimation. Our ensemble method is assessed on an open biological sequence processing problem: the problem of globular protein secondary structure prediction. To perform this discrimination task, we introduce a hierarchical and modular approach in which combination is used at an intermediate level.
Helix-turn-helix DNA-binding motifs prediction
Improved detection of helix-turn-helix DNA-binding motifs in protein sequences.
Nucleic Acids Res 1990 Sep 11;18(17):5019-5026
Dodd IB, Egan JB
Department of Biochemistry, University of Adelaide, Australia.

We present an update of our method for systematic detection and evaluation of potential helix-turn-helix DNA-binding motifs in protein sequences [Dodd, I. and Egan, J. B. (1987) J. Mol. Biol. 194, 557-564]. The new method is considerably more powerful, detecting approximately 50% more likely helix-turn-helix sequences without an increase in false predictions. This improvement is due almost entirely to the use of a much larger reference set of 91 presumed helix-turn-helix sequences. The scoring matrix derived from this reference set has been calibrated against a large protein sequence database so that the score obtained by a sequence can be used to give a practical estimation of the probability that the sequence is a helix-turn-helix motif.
Improved Performance in Protein Secondary Structure Prediction by Inhomogeneous Score Combination
Bioinformatics vol. 15 no. 5 1999 pp 413-421
Guermeur Y, Geourjon C, Gallinari P, & Deleage G
In many fields of pattern recognition, combination has proved efficient to increase the generalization performance of individual prediction methods. Numerous systems have been developed for protein secondary structure prediction, based on different principles. Finding better ensemble methods for this task may thus become crucial. In addition, efforts need to be made to help the biologist in the post-processing of the outputs.
Results: An ensemble method has been designed to post-process the outputs of protein secondary structure prediction methods, in order to obtain an improvement of prediction accuracy while generating class posterior probability estimates. Experimental results establish that it can increase the recognition rate of methods that provide inhomogeneous scores, even if their individual prediction successes are largely different. This combination thus constitutes a help for the biologist, who can use it confidently on top of any set of prediction methods. Furthermore, the resulting estimates can be used in various ways, for instance to determine which residues are predicted with a given high level of reliability.
Availability: Free availability over the internet on the Network Protein Sequence @nalysis (NPS@) WWW server. The method is proposed as the default choice.
Contact: Neural networks and ensemble method; server and software.
MPSA: integrated system for multiple protein sequence analysis with client/server capabilities.
Bioinformatics 2000 Mar;16(3):286-7

Blanchet C, Combet C, Geourjon C, Deleage G
Summary: MPSA is a stand-alone software package intended for protein sequence analysis, with a high integration level and Web client/server capabilities. It provides many methods and tools, which are integrated into an interactive graphical user interface. It is available for most Unix/Linux and non-Unix systems. MPSA is able to connect to a Web server in order to perform large-scale sequence comparison on up-to-date databanks. Availability: Free to academic users.
Multiple sequence alignment with hierarchical clustering.
Nucleic Acids Res 1988 Nov 25;16(22):10881-10890
Corpet F
Laboratoire de Genetique Cellulaire, INRA Toulouse, France.

An algorithm is presented for the multiple alignment of sequences, either proteins or nucleic acids, that is both accurate and easy to use on microcomputers. The approach is based on the conventional dynamic-programming method of pairwise alignment. Initially, a hierarchical clustering of the sequences is performed using the matrix of the pairwise alignment scores. The closest sequences are aligned creating groups of aligned sequences. Then close groups are aligned until all sequences are aligned in one group. The pairwise alignments included in the multiple alignment form a new matrix that is used to produce a hierarchical clustering. If it is different from the first one, iteration of the process can be performed. The method is illustrated by an example: a global alignment of 39 sequences of cytochrome c.
NPS@: Network Protein Sequence Analysis
TIBS 2000 March Vol. 25, No 3 [291]:147-150
Combet C., Blanchet C., Geourjon C. and Deléage G.

P-SEA: a new efficient assignment of secondary structure from C alpha trace of proteins.
Comput Appl Biosci 1997 Jun;13(3):291-5
Labesse G, Colloc'h N, Pothier J, Mornon JP
MOTIVATION: The secondary structure is a key element of architectural organization in proteins. Accurate assignment of the secondary structure elements (SSE) (helix, strand, coil) is an essential step for the analysis and modelling of protein structure. Various methods have been proposed to assign secondary structure. Comparative studies of their results have shown some of their drawbacks, pointing out the difficulties in the task of SSE assignment.
RESULTS: We have designed a new automatic method, named P-SEA, to assign efficiently secondary structure from the sole C alpha position. Some advantages of the new algorithm are discussed.
AVAILABILITY: The program P-SEA is available by anonymous ftp: directory: pub/.
Prediction of protein secondary structure at better than 70% accuracy.
J Mol Biol 1993 Jul 20;232(2):584-99
Rost B, Sander C
European Molecular Biology Laboratory, Heidelberg, Germany.

We have trained a two-layered feed-forward neural network on a non-redundant data base of 130 protein chains to predict the secondary structure of water-soluble proteins. A new key aspect is the use of evolutionary information in the form of multiple sequence alignments that are used as input in place of single sequences. The inclusion of protein family information in this form increases the prediction accuracy by six to eight percentage points. A combination of three levels of networks results in an overall three-state accuracy of 70.8% for globular proteins (sustained performance). If four membrane protein chains are included in the evaluation, the overall accuracy drops to 70.2%. The prediction is well balanced between alpha-helix, beta-strand and loop: 65% of the observed strand residues are predicted correctly. The accuracy in predicting the content of three secondary structure types is comparable to that of circular dichroism spectroscopy. The performance accuracy is verified by a sevenfold cross-validation test, and an additional test on 26 recently solved proteins. Of particular practical importance is the definition of a position-specific reliability index. For half of the residues predicted with a high level of reliability the overall accuracy increases to better than 82%. A further strength of the method is the more realistic prediction of segment length. The protein family prediction method is available for testing by academic researchers via an electronic mail server.

Combining evolutionary information and neural networks to predict protein secondary structure.
Proteins 1994 May;19(1):55-72
Rost B, Sander C
European Molecular Biology Laboratory, Heidelberg, Germany.

Using evolutionary information contained in multiple sequence alignments as input to neural networks, secondary structure can be predicted at significantly increased accuracy. Here, we extend our previous three-level system of neural networks by using additional input information derived from multiple alignments. Using a position-specific conservation weight as part of the input increases performance. Using the number of insertions and deletions reduces the tendency for overprediction and increases overall accuracy. Addition of the global amino acid content yields a further improvement, mainly in predicting structural class. The final network system has sustained overall accuracy of 71.6% in a multiple cross-validation test on 126 unique protein chains. A test on a new set of 124 recently solved protein structures that have no significant sequence similarity to the learning set confirms the high level of accuracy. The average cross-validated accuracy for all 250 sequence-unique chains is above 72%. Using various data sets, the method is compared to alternative prediction methods, some of which also use multiple alignments: the performance advantage of the network system is at least 6 percentage points in three-state accuracy. In addition, the network estimates secondary structure content from multiple sequence alignments about as well as circular dichroism spectroscopy on a single protein and classifies 75% of the 250 proteins correctly into one of four protein structural classes. Of particular practical importance is the definition of a position-specific reliability index. For 40% of all residues the method has a sustained three-state accuracy of 88%, as high as the overall average for homology modelling. A further strength of the method is greatly increased accuracy in predicting the placement of secondary structure segments.
Physico-chemical profiles
A computer program for predicting protein antigenic determinants.
Mol Immunol 1983 Apr;20(4):483-489
Hopp TP, Woods KR

A computerized method for predicting the locations of protein antigenic determinants is presented, which requires only the amino acid sequence of a protein, and no other information. This procedure has been used to predict the major antigenic determinant of the hepatitis B surface antigen, as well as antigenic sites on a series of test proteins of known antigenic structure [Hopp & Woods (1981) Proc. Nat. Acad. Sci. U.S.A. 78, 3824-3828.] The method is suitable for use in smaller personal computers, and is written in the BASIC language, in order to make it available to investigators with limited computer experience and/or resources. A means of locating multiple antigenic sites on a homologous series of proteins is demonstrated using the influenza hemagglutinin as an example.

A simple method for displaying the hydropathic character of a protein.
J Mol Biol 1982 May 5;157(1):105-132
Kyte J, Doolittle RF

A computer program that progressively evaluates the hydrophilicity and hydrophobicity of a protein along its amino acid sequence has been devised. For this purpose, a hydropathy scale has been composed wherein the hydrophilic and hydrophobic properties of each of the 20 amino acid side-chains are taken into consideration. The scale is based on an amalgam of experimental observations derived from the literature. The program uses a moving-segment approach that continuously determines the average hydropathy within a segment of predetermined length as it advances through the sequence. The consecutive scores are plotted from the amino to the carboxy terminus. At the same time, a midpoint line is printed that corresponds to the grand average of the hydropathy of the amino acid compositions found in most of the sequenced proteins. In the case of soluble, globular proteins there is a remarkable correspondence between the interior portions of their sequence and the regions appearing on the hydrophobic side of the midpoint line, as well as the exterior portions and the regions on the hydrophilic side. The correlation was demonstrated by comparisons between the plotted values and known structures determined by crystallography. In the case of membrane-bound proteins, the portions of their sequences that are located within the lipid bilayer are also clearly delineated by large uninterrupted areas on the hydrophobic side of the midpoint line. As such, the membrane-spanning segments of these proteins can be identified by this procedure. Although the method is not unique and embodies principles that have long been appreciated, its simplicity and its graphic nature make it a very useful tool for the evaluation of protein structures.
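The moving-segment average described above can be sketched in a few lines. The scale values are the published Kyte-Doolittle hydropathy indices; the function name `hydropathy_profile` and the default window of 9 residues are assumptions made for illustration, since the abstract only says the segment length is predetermined.

```python
# Kyte-Doolittle hydropathy index for the 20 amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Average hydropathy of every full window along the sequence.

    Returns one value per window position, from the amino to the
    carboxy terminus; an empty list if the sequence is too short.
    """
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if window < 1 or window > len(values):
        return []
    total = sum(values[:window])
    profile = [total / window]
    for i in range(window, len(values)):
        # Slide the window: add the entering residue, drop the leaving one.
        total += values[i] - values[i - window]
        profile.append(total / window)
    return profile


print([round(v, 2) for v in hydropathy_profile("AAAKKK", window=3)])
```

Plotting these values against residue position gives the hydropathy plot; long stretches well above the midpoint are the candidate membrane-spanning segments.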

Prediction of chain flexibility in proteins
Naturwissenschaften (1985), 72, 212-213
Karplus, P.A. & Schulz, G.E

No summary available yet

New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites.
Biochemistry 1986 Sep 23;25(19):5425-5432
Parker JM, Guo D, Hodges RS

A new set of hydrophilicity high-performance liquid chromatography (HPLC) parameters is presented. These parameters were derived from the retention times of 20 model synthetic peptides, Ac-Gly-X-X-(Leu)3-(Lys)2-amide, where X was substituted with the 20 amino acids found in proteins. Since hydrophilicity parameters have been used extensively in algorithms to predict which amino acid residues are antigenic, we have compared the profiles generated by our new set of hydrophilic HPLC parameters on the same scale as nine other sets of parameters. Generally, it was found that the HPLC parameters obtained in this study correlated best with antigenicity. In addition, it was shown that a combination of the three best parameters for predicting antigenicity further improved the predictions. These predicted surface sites or, in other words, the hydrophilic, accessible, or mobile regions were then correlated to the known antigenic sites from immunological studies and accessible sites determined by X-ray crystallographic data for several proteins.

Structural prediction of membrane-bound proteins.
Eur J Biochem 1982 Nov 15;128(2-3):565-575
Argos P, Rao JK, Hargrave PA

A prediction algorithm based on physical characteristics of the twenty amino acids and refined by comparison to the proposed bacteriorhodopsin structure was devised to delineate likely membrane-buried regions in the primary sequences of proteins known to interact with the lipid bilayer. Application of the method to the sequence of the carboxyl terminal one-third of bovine rhodopsin predicted a membrane-buried helical hairpin structure. With the use of lipid-buried segments in bacteriorhodopsin as well as regions predicted by the algorithm in other membrane-bound proteins, a hierarchical ranking of the twenty amino acids in their preferences to be in lipid contact was calculated. A helical wheel analysis of the predicted regions suggests which helical faces are within the protein interior and which are in contact with the lipid bilayer.
Incorporation of non-local interactions in protein secondary structure prediction from the amino acid sequence.
Protein Eng 1996 Feb;9(2):133-142
Frishman D, Argos P
European Molecular Biology Laboratory, Heidelberg, Germany.

Existing approaches to protein secondary structure prediction from the amino acid sequence usually rely on the statistics of local residue interactions within a sliding window and the secondary structural state of the central residue. The practically achieved accuracy limit of such single residue and single sequence prediction methods is 65% in three structural states (alpha-helix, beta-strand and coil). Further improvement in the prediction quality is likely to require exploitation of various aspects of three-dimensional protein architecture. Here we make such an attempt and present an accurate algorithm for secondary structure prediction based on recognition of potentially hydrogen-bonded residues in a single amino acid sequence. The unique feature of our approach involves database-derived statistics on residue type occurrences in different classes of beta-bridges to delineate interacting beta-strands. The alpha-helical structures are also recognized on the basis of amino acid occurrences in hydrogen-bonded pairs (i,i + 4). The algorithm has a prediction accuracy of 68% in three structural states, relies only on a single protein sequence as input and has the potential to be improved by 5-7% if homologous aligned sequences are also considered.
Secondary consensus prediction
Protein structure prediction. Implications for the biologist.
Biochimie 1997 Nov;79(11):681-686
Deleage G, Blanchet C, Geourjon C
Institute of Biology and Chemistry of Proteins, Lyon, France.

Recent improvements in the prediction of protein secondary structure are described, particularly those methods using the information contained in multiple alignments. In this respect, the prediction accuracy has been checked and methods that take into account multiple alignments are 70% correct for a three-state description of secondary structure. This quality is obtained by a 'leave-one-out' procedure on a reference database of proteins sharing less than 25% identity. Biological applications such as 'protein domain design' and structural phylogeny are given. The biologist's point of view is also considered and joint predictions are encouraged in order to derive an amino acid based accuracy. All the tools described in this paper are available for biologists on the Web (
An algorithm for secondary structure determination in proteins based on sequence similarity.
FEBS Lett 1986 Sep 15;205(2):303-308
Levin JM, Robson B, Garnier J

A secondary structure prediction algorithm is proposed on the hypothesis that short homologous sequences of amino acids have the same secondary structure tendencies. Comparisons are made with the secondary structure assignments of Kabsch and Sander from X-ray data [(1983) Biopolymers 22, 2577-2637] and an empirically determined similarity matrix which assigns a sequence similarity score between any two sequences of 7 residues in length. This similarity matrix differs in many respects from that of the Dayhoff substitution matrix [(1978) in: Atlas of Protein Sequence and Structure, (Dayhoff, M.O. ed). vol. 5. suppl. 3, pp. 353-358, National Biomedical Research Foundation, Washington, DC]. This homologue method had a prediction accuracy of 62.2% over 3 states for 61 proteins and 63.6% for a new set of 7 proteins not in the original data base.

Exploring the limits of nearest neighbour secondary structure prediction.
Protein Eng. (1997),7, 771-776

SIMPA is a nearest neighbour method for predicting secondary structures using a similarity matrix, in its latest version the BLOSUM 62, an optimized similarity threshold, a window of 13 to 17 residues and a database of observed secondary structures. In version simpa96 used here, the database contains circa 300 proteins and the window is 13 residues long. Its crossvalidated accuracy was a Q3 of 67.7% for a single sequence and 72.8% when using multiple alignments of homologous sequences.
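
The nearest neighbour idea can be sketched in a few lines of Python. This is a simplified illustration, not SIMPA itself: the identity-based score (+2 for a match, -1 for a mismatch) stands in for the BLOSUM 62 matrix, the window of 7 and threshold of 7 are arbitrary toy values rather than the optimized ones, and positions that receive no vote default to coil. All function names are hypothetical.

```python
from collections import defaultdict


def similarity(segment_a, segment_b):
    """Toy similarity score (assumption): +2 per identity, -1 per mismatch."""
    return sum(2 if a == b else -1 for a, b in zip(segment_a, segment_b))


def predict_nearest_neighbour(query, database, window=7, threshold=7):
    """Predict one state per query residue from similar database windows.

    database is a list of (sequence, observed_states) pairs of equal length.
    Each database window scoring at or above the threshold adds its score
    to the votes for its observed states at the aligned query positions.
    """
    votes = [defaultdict(float) for _ in query]
    for q_start in range(len(query) - window + 1):
        q_window = query[q_start:q_start + window]
        for sequence, states in database:
            for d_start in range(len(sequence) - window + 1):
                score = similarity(q_window, sequence[d_start:d_start + window])
                if score >= threshold:
                    for offset in range(window):
                        votes[q_start + offset][states[d_start + offset]] += score
    # Residues without any similar neighbour default to coil (assumption).
    return "".join(max(v, key=v.get) if v else "C" for v in votes)


if __name__ == "__main__":
    # Hypothetical two-entry database of observed structures (H, E, C).
    toy_database = [
        ("AEELLKKAEELLKK", "HHHHHHHHHHHHHH"),
        ("TVTVTVTGSGS", "EEEEEEECCCC"),
    ]
    print(predict_nearest_neighbour("AEELLKK", toy_database))
```

A real implementation would replace similarity() with a lookup in a substitution matrix and would tune the window and threshold by cross-validation, as the references listed for this entry describe.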

Major references:
- J. LEVIN, B. ROBSON, J. GARNIER. An Algorithm for secondary structure determination in proteins based on sequence similarity. FEBS, 205, (1986) 303-308. This describes the basic algorithm.
- J. LEVIN, J. GARNIER. Improvements in a secondary structure prediction method based on a search for local sequence homologies and its use as a model building tool. Biochim. Biophys. Acta, (1988) 955, 283-295. Here the window and threshold are optimized and the results are crossvalidated by jack knife process.
- J. LEVIN. Exploring the limits of nearest neighbour secondary structure prediction. Protein Eng. (1997),7, 771-776. This corresponds to simpa96.
SOPM: a self-optimized method for protein secondary structure prediction.
Protein Eng 1994 Feb;7(2):157-164
Geourjon C, Deleage G
Institut de Biologie et de Chimie des Proteines, UPR 412-CNRS, Lyon, France.

A new method called the self-optimized prediction method (SOPM) has been developed to improve the success rate in the prediction of the secondary structure of proteins. This new method has been checked against an updated release of the Kabsch and Sander database, 'DATABASE.DSSP', comprising 239 protein chains. The first step of the SOPM is to build sub-databases of protein sequences and their known secondary structures drawn from 'DATABASE.DSSP' by (i) making binary comparisons of all protein sequences and (ii) taking into account the prediction of structural classes of proteins. The second step is to submit each protein of the sub-database to a secondary structure prediction using a predictive algorithm based on sequence similarity. The third step is to iteratively determine the predictive parameters that optimize the prediction quality on the whole sub-database. The last step is to apply the final parameters to the query sequence. This new method correctly predicts 69% of amino acids for a three-state description of the secondary structure (alpha helix, beta sheet and coil) in the whole database (46,011 amino acids). The correlation coefficients are C alpha = 0.54, C beta = 0.50 and Cc = 0.48. Root mean square deviations of 10% in the secondary structure content are obtained. Implications for the users are drawn so as to derive an accuracy at the amino acid level and provide the user with a guide for secondary structure prediction. The SOPM method is available by anonymous ftp to
SOPMA: significant improvements in protein secondary structure prediction by consensus prediction from multiple alignments.
Comput Appl Biosci 1995 Dec;11(6):681-684
Geourjon C, Deleage G
Institut de Biologie et de Chimie des Proteines, UPR 412-CNRS, Lyon, France.

Recently a new method called the self-optimized prediction method (SOPM) has been described to improve the success rate in the prediction of the secondary structure of proteins. In this paper we report improvements brought about by predicting all the sequences of a set of aligned proteins belonging to the same family. This improved SOPM method (SOPMA) correctly predicts 69.5% of amino acids for a three-state description of the secondary structure (alpha-helix, beta-sheet and coil) in a whole database containing 126 chains of non-homologous (less than 25% identity) proteins. Joint prediction with SOPMA and a neural networks method (PHD) correctly predicts 82.2% of residues for 74% of co-predicted amino acids. Predictions are available by Email to or on a Web page (
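
The two kinds of figures quoted in this abstract (a three-state accuracy, and a joint accuracy restricted to co-predicted residues) can be made concrete with a short Python sketch. The function names q3 and joint_prediction and the toy strings are assumptions for illustration; they are not taken from SOPMA or PHD.

```python
def q3(predicted, observed):
    """Percentage of residues whose predicted state equals the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def joint_prediction(pred_a, pred_b, observed):
    """Return (coverage, accuracy) for residues on which two methods agree.

    coverage: percentage of residues given the same state by both methods.
    accuracy: percentage of those co-predicted residues that are correct.
    """
    if not (len(pred_a) == len(pred_b) == len(observed)) or not observed:
        raise ValueError("strings must be non-empty and of equal length")
    agreed = [(a, o) for a, b, o in zip(pred_a, pred_b, observed) if a == b]
    coverage = 100.0 * len(agreed) / len(observed)
    if not agreed:
        return coverage, 0.0
    accuracy = 100.0 * sum(a == o for a, o in agreed) / len(agreed)
    return coverage, accuracy


if __name__ == "__main__":
    # Hypothetical 10-residue example (H = helix, E = strand, C = coil).
    observed = "HHHHCCEEEE"
    method_a = "HHHCCCEEEC"
    method_b = "HHHHCCCEEC"
    print(q3(method_a, observed), q3(method_b, observed))
    print(joint_prediction(method_a, method_b, observed))
```

In this toy case each method alone is 80% correct, while the residues on which both agree are 87.5% correct at 80% coverage, which is the trade-off between accuracy and coverage that a joint prediction reports.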
Identification of common molecular subsequences.
J. Mol. Biol. (1981) 147:195-197
Smith TF, Waterman MS

No summary available yet
Knowledge-based secondary structure assignment
Proteins: structure, function and genetics (1995), 23, 566-579
Frishman D & Argos P

Transmembrane helices prediction
Transmembrane helices predicted at 95% accuracy.
Protein Sci 1995 Mar;4(3):521-33
Rost B, Casadio R, Fariselli P, Sander C
Protein Design Group, EMBL Heidelberg, Germany.

We describe a neural network system that predicts the locations of transmembrane helices in integral membrane proteins. By using evolutionary information as input to the network system, the method significantly improved on a previously published neural network prediction method that had been based on single sequence information. The input data were derived from multiple alignments for each position in a window of 13 adjacent residues: amino acid frequency, conservation weights, number of insertions and deletions, and position of the window with respect to the ends of the protein chain. Additional input was the amino acid composition and length of the whole protein. A rigorous cross-validation test on 69 proteins with experimentally determined locations of transmembrane segments yielded an overall two-state per-residue accuracy of 95%. About 94% of all segments were predicted correctly. When applied to known globular proteins as a negative control, the network system incorrectly predicted fewer than 5% of globular proteins as having transmembrane helices. The method was applied to all 269 open reading frames from the complete yeast VIII chromosome. For 59 of these, at least two transmembrane helices were predicted. Thus, the prediction is that about one-fourth of all proteins from yeast VIII contain one transmembrane helix, and some 20%, more than one.


The PROSITE database, its status in 1997.
Nucleic Acids Res 1997 Jan 1;25(1):217-221
Bairoch A, Bucher P, Hofmann K
Department of Medical Biochemistry, University of Geneva, 1 rue Michel Servet, 1211 Geneva 4, Switzerland.

The PROSITE database consists of biologically significant patterns and profiles formulated in such a way that with appropriate computational tools it can help to determine to which known family of protein (if any) a new sequence belongs, or which known domain(s) it contains.
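
As an illustration of how such patterns can be used with a computational tool, the following Python sketch converts a pattern written in PROSITE syntax into a regular expression and scans a sequence with it. It is a minimal sketch under stated assumptions: the function names prosite_to_regex and find_sites are hypothetical, only the common syntax elements are handled (x, [..], {..}, repetition counts, and the < and > anchors outside brackets), and profiles are not covered at all.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles: '-' separators, x (any residue), [ABC] (allowed residues),
    {ABC} (forbidden residues), (n) and (n,m) repeats, and the < and >
    terminal anchors. Anchors inside brackets are not supported.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # forbidden set
        element = element.replace("(", "{").replace(")", "}")   # repetition
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # termini
        parts.append(element)
    return "".join(parts)


def find_sites(sequence, pattern):
    """Return 1-based start positions of all (possibly overlapping) matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [match.start() + 1 for match in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    # N-glycosylation site pattern; the sequence is a hypothetical toy example.
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(find_sites("MKNGSAANPSA", "N-{P}-[ST]-{P}."))
```

In the toy sequence the motif starting at position 3 is reported, while the asparagine at position 8 is rejected because it is followed by a proline.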
The SWISS-PROT protein sequence data bank and its supplement TrEMBL.
Nucleic Acids Res 1997 Jan 1;25(1):31-36
Bairoch A, Apweiler R
Department of Medical Biochemistry, University of Geneva, 1 rue Michel Servet, 1211 Geneva 4, Switzerland.

SWISS-PROT is a curated protein sequence database which strives to provide a high level of annotations (such as the description of the function of a protein, structure of its domains, post-translational modifications, variants, etc.), a minimal level of redundancy and high level of integration with other databases. Recent developments of the database include: an increase in the number and scope of model organisms; cross-references to two additional databases; a variety of new documentation files and the creation of TrEMBL, a computer annotated supplement to SWISS-PROT. This supplement consists of entries in SWISS-PROT-like format derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except the CDS already included in SWISS-PROT.
Sov parameter
Identification of related proteins with weak sequence identity using secondary structure information.
Protein Sci 2001 Apr;10(4):788-97
Geourjon C, Combet C, Blanchet C, Deleage G
Molecular modeling of proteins is confronted with the problem of finding homologous proteins, especially when few identities remain after the process of molecular evolution. Using even the most recent methods based on sequence identity detection, structural relationships are still difficult to establish with high reliability. As protein structures are more conserved than sequences, we investigated the possibility of using protein secondary structure comparison (observed or predicted structures) to discriminate between related and unrelated protein sequences in the range of 10%-30% sequence identity. Pairwise comparisons of secondary structures have been measured using the structural overlap (Sov) parameter. In this article, we show that if the secondary structure likeness is >50%, most of the pairs are structurally related. Taking into account the secondary structures of proteins that have been detected by BLAST, FASTA, or SSEARCH in the noisy region (with high E-value), we show that distantly related protein sequences (even with <20% identity) can still be identified. This strategy can be used to identify three-dimensional templates in homology modeling by finding unexpected related proteins and to select proteins for experimental investigation in a structural genomic approach, as well as for genome annotation.