BioinformaticsTheMachineLearningApproach Chap 6.
- NeuralNetwork Application 
History of application of NeuralNetwork to BiologicalSequenceAnalysis.
- 1982 perceptron prediction of ribosome binding site
- Stormo et.al. prediction of E.coli translational initiation site -- no hidden layer
- 1988 Qian and Sejnowski prediction of protein secondary structure -- MLP with backpropagation
- Adaptation of the NetTalk MLP from speech recognition to sequence analysis 
6.1 Sequence Encoding and Output Interpretation
Importance of Input representation
- Clever input representation  --> The problem at least can be solved by simple linear methods 
- The contractive nature of most prediction problem  --> To add extra information does'nt always help 
During training of MLP, networks tries to segregate the input space into decision regions using hyperplanes.
a window of size W :
- symmetric
- asymmetric window --> signal peptide cleavage site, intron splice site 
- windows with hole --> promoter, transcriptional initiation site, beta-sheet partner, distance constraint between two amino acids. 
|A| possible monomers for each position
- orthogonal (local) encoding : orthogonal binary vectors (log2|A|)
- advantage of not introducing any algebraic correlations between monomers
- compact encoding --> drastically increased nonlinearity. 
trade-offs between different encoding scheme
- complexity of the space in which the input windows live,
- the network architecture size,
- ease of learning
real-numbered quantification --> harmful impact on the input space
- preprocessed version of the original sequence segment - ex) stastics of certain words present in the window - average hydrophobicity over the window
 
 - ex) from the optimal scheme generated by a simulated annealing, to discover physicochemical properties relevant to the formation of secondary structure
 
- ex) stastics of certain words present in the window 
decreasing nonlinearity of a prediction problem
- to switch representation from monomer to dimer or trimer  - --> gain in significant correlation with negative impact of the increased nonlinearity - DNA, RNA : base stacking energy protein : a bias associated with steric hindrance, translation kinetics, etc
 - overlapping triplet --> hidden units directly receive context information rather then learn it from training process. 
 
- to group monomers to form new alphabet in that the pattern will be clear. - --> using the orthogonal vector representation, - reducing the dimensionality of the input space, reducing the number of the adjustable parameter. meaningful grouping from the physicochemical property or estimated mutation rates.
 
 - ex) a protein can maintain its folded structure, even if the total number of amino acids in composition is reduced from 20 to 5. - --> to construct simpler sequence spaces that can be better covered by the limited amounts of data available. 
 
 
In some aplication, the whole sequence or large segments of it is casted into global preprocessed measure that can be used as input information.
- ex) prediction of fold class from the frequency of 400 dipeptide. - discrimination between exon and intron from 6-mer statistics, GC composition, sequence vocabulary
 
Output interpretation or postprocessing
- intelligent postprocessing may be as important as selecting optimal network architecture in terms of the smallest numerical generalization error. - orthogonal binary vector  - --> the number of output neurons == the number of output class 
 
 - ex) alpha helice have a minimum length of 4 amino acids  - --> small helices that are predicted can often be removed. 
 
 
- orthogonal binary vector  
6.2 Sequence Correlations and Neural Networks
ProteinStructure can be highly conserved despite a very low sequence similarity
- => global or local cooperativity of the sequence, not independent contribution of individual position 
NN can sense this cooperativity through its ability to correlate the different input values to each other. NN can complement what one can obtain by weight matrices an by HMM.
6.3 Prediction of Protein Secondary Structure
The assignment fo the secondary structure categories to the experimentally determined 3D structure
- DSSP --> repetitive pattern of potential hydrogen bonds from 3D coordinate of backbone atoms. 
- STRIDE --> hydrogen bond energy and backbon dihedral angle 
- DEFINE --> difference distance matrices for evaluating the match of interatomic distances in the protein to those from idealized secondary structures. 
6.3.1 Secondary Structure Prediction Using MLPs
Qian, Sejnowski -- fully connected MLP with a single hidden layer.
- input layer  - input window size W = 13
- the number of orthogonal binary vector |A| = 21 (20 amino acids, one terminator)
- total 13 x 21 = 273 units
 
- hidden layer : 40 sigmoidal units
- output layer : 3 sigmoidal units (alpha-helix, beta-sheet, coil classes) - winner-take-all principle --> adding non-linearity 
 
- the total number of parameters : 273 x 40 + 40 x 3 + 40 + 3 = 11,083
- the networks are initialized using random uniform weights in [-0.3, 0.3] interval. 
- trained using backpropagation with the LMS error function (recommended : normalized exponential output layer with the relative entropy as error function)
- the typical size of the training set is 20,000 residues extracted form the PDB
- a random presentation order of input windows across the training set is used to avoid performance oscillations associated with the use of contiguous window.
- performance : 33% --> 60%, Q3 : 62.7% ,Calpha = 0.35, Cbeta = 0.29, Ccoil = 0.38 - --> correlation coefficient is recommended due to the imbalance of the amounts of helix, sheet, and coil. 
 - failure  - increasing size of input
- adding additional information
- using finer secondary structure classification schemes,
- using higher order or recurrent network, or pruning method
 - second network - input window size 13 : 13 successive outputs of the first network
- input layer size 13 x 3
- hidden layer with 40 units, output layer with 3 units.
- performance Q3 = 64.3%, Calpha = 0.41, Cbeta = 0.31, Ccoil = 0.41
- clean up the output of the first network, by removing isolated assignments.
 
 
 
6.3.2 Prediction Based on Evolutionary Information and Amino Acid Composition
NN + Chou-Fasman rule
- Chou-Fasman rules -> initialize a network 
- extra data from PDB -> train extra free connection 
- performance slightly enhanced.
bayesian
- unphysical assumption that the probability of an amino acid occurring in each position in the protein is independent of the amino acid occurring elsewhere
- predictive accuracy is less than the accuracy of NN.
NN + bayesian method.
- output neurons represent the conditional probabilities of structural classes.
- introduction of a new objective function, the mutual information, that translate the notion of correlation as a measure of predictive accuracy into a useful training measure
- predictive accuracy is nearly same as the method using mean-square error, but the accuracy on the training set is significantly higher.
PHD prediction server (Rost and Samder) : 1996 Asilomar competition CASP2 --> 65-68% accuracy
- realization that sequence families contain much more useful information than single sequence [evolutionary information]  - --> > 72% accuracy, Calpha = 0.64, Cbeta = 0.53 
 
- scheme - a DB of known sequences was scanned by alignment method for similar sequences.
- the list of sequences found was filtered by a length-dependent threshold for significant sequence identity.
- a profile of amino acid profile was compiled based on all probable 3D homologues.
- this profile is used for prediction
 
- input is based on a profile made from amino acid occurrence in columns of a multiple alignment of sequences with high similarity to the query sequence. - the profles were taken from the HSSP database (Homology-derived Secondary Structure of Proteins)
- the number of 20 amino acid at each position + the number of deletion + the number of insertion + the conservation weight.
- the 13 adjacent column
 
- overall structure : 2 network layer + one layer averaging over independently trained networks. - the backpropagation training of the networks was either balanced or unbalanced.
- in the final network system, a mixture of networks trained by these two approaches was used. - --> balanced approach shows more reliable prediction of the sheet category 
 
 
- preventing the overfitting problem, one of main dangers of the Qian-Sejnowski architecture. - early stopping
- ensemble average
 
6.3.3 Network Ensembles and Adaptive Encoding
Riis and Krogh address the overfitting problem by careful design of the NN architecture
- The large number of parameters is due to the large input layer - The number of parameters is reduced by an adaptive encoding of amino acids.
- zero vector represent the N- and C-terminal spacer --> orthogonal encoding with 20 units. 
- weight sharing :  - the input layer of W x 20 is connected to a first hidden layer of M x W units , with a translation-invariant connectivity pattern.
- weight sharing is enforced during training by a obvious modification of backpropagation that weight updates are summed for weights sharing same value
- connection between input and representation layer is only 21 x M
 
 
- design a different network for each of the 3 class - alpha helices : exploit helix periodicity between first and second hidden layer, second hidden layer is fully connected to output layer ~160 parameters
- beta sheet and coli : first layer is fully connected to second layer. 300 ~ 500 parameter
- balaned training sets with the same number of positive and negative example
 
- use ensembles of networks and filtering to improve the prediction - different networks for each type of structure at each position : 15 x 3 x 5 = 225 input units
- connectivity is restricted by having one hidden unit per position and per ensemble class    :  3 x 15 = 45 units - hidden layer is fully connected to three softmax output units.
- relative entropy is the object function
 
 
- use multiple alignments with a weighting scheme - instead of profile, for which the correlations between amino acids in the window are lost,  - predictions are first from single sequences and are then combined using multiple alignments with weighting scheme
 - The weighting scheme used to compensate for DB bias is the maximum entropy weighting scheme. - the individual score in a given column can be combined by weighted average or weighted majority
 
 
 
- instead of profile, for which the correlations between amino acids in the window are lost,  
result
- the architecture with its local encoding avoid overfitting problem
- perfomance is not improved by using a number of additional inputs.
- the improvement resulting from each algorithmic component is quantified
- the network output can be interpreted as classification probabilites.
6.3.4 Secondary Structure Prediction Based on Profiles Made by Position-Specific Scoring Matrices
the profile quality depends on the alignment approach used to select the sequences behind the profile. replacing HSSP profile in PHD method
- PSI-BLAST : sequences in the initial scan of a DB based on single sequence are used to generate new search profile which in turn is used to pick up additional sequences
- PSIPRED : clever way of generating profiles for use  as improved input to the network scheme. based on position-specific scoring matrices. - increase the predictive power of the NN.
- 1998 Asilomar CASP3 performance 77%, 73%
 
6.3.5 Prediction by Averaging over 800 Different Networks
- as single networks, the large and small windows will perform worse
- when combining many differnet networks, the critical issue is how to benefit from those networks that overall are suboptimal. - --> simple averaging will make the noise from the suboptimal networks become destructive 
 
- Peterson : benefit from as many as 800 networks in an ensemble with strong architectural diversity (different window size, different numbers of hidden unit)
- by averging over the highly confident prediction only, it's possible to use many networks and prevent the noise from the suboptimal networks.
- enhancement of PSIPRED form 77.2% to 80.2%
Output Expansion
- Peterson scheme : prediction for the secondary structure in the input window, simultaneous prediction for the neighboring residues - constructing networks by training using hints that essentially further constrain the network weights.
 
- multitask learning -- network use these tasks as inductive bias for one another and thus learn better
- In proetin secondary structure, hints will be conformational categories for the adjacent residues, surface exposure of the residue and hydrophobicity
6.4 Prediction of Signal Peptides and Their Cleavage Sites
The identification problem is to some extent organism-specific. NN-based prediction is successful when Gram(+), Gram(-) and eukaryote is treated separately
- cleavage site or noncleavage site : singal sequene or mature protein :
6.4.1 SignalP
- S-score : signal peptide/nonsignal peptide network : abrupt change : symmetric window :
- C-score : cleavage site/noncleavage site network : sharp peaks : asymmetric window :
- if there are several C-score peaks of comparable strength, the true cleavage site may often be found by inspecting the S-score curve. - Y-score : simple geometric average of the C-score and a smoothed derivative of the S-score. - Yi = root(Ci Delta-d Si)
- Delta-d Si = 1/d (sigma(j=1 -> d)Si-j - sigma(j=0 -> d-1)Si+j) 
 
- Y-score is used for cleavage site and assignment of the initiation methionine
 
- Y-score : simple geometric average of the C-score and a smoothed derivative of the S-score. 
6.5 Application for DNA and RNA Nucleotide Sequences
6.5.1 The Structure and Origin of the Genetic Code
The codon assignments are correlated to the physical properties of the amino acids in a systematic and error-correcting manner.
- The first codon position -- amino acid biosynthetic pathway
- The second codon position -- hydropathic property
- The third codon position -- molecular weight or size of amino acid
- error correction (lower the chance that mutation has a hazardous effect) - degeneration --> abundance of amino acid 
- similar amino acid --> similar codon 
 
NN to the genetic code is unbiased and completely data-driven. --> no a priori realationship
- input : 64 triplet --> 12 units, output : 20 amino acids --> 20 units 
- 2 hidden units : 12 x 2 + 2 x 20 = 64 weights, 2 + 20 = 22 thresholds --> adaptive training 
backpropagation to train the feed-forward architecture --> to get a low analog network error E, but not necessarily a low classification error Ec.
- to get a low classification error - use a high learning rate for wrong classified example and a low learning rate for correctly classified example. - ex) In the first phase of training, most examples are wrongly classified.
 
- modify the presentation frequency for the different categories so that balanced situation results - ex) The single methionine codon should be included in the set six times, each cysteine codon should appear three times.
 
- have an adaptive training set, where the training examples are included or excluded after determining whether they are classified correctly.--> introduce more noise in the learning process, which helps to avoid local minima - adaptive training procedure - initialize the first and second parts of the pool with the training examples
- choose an example randomly from the pool and present it to the network
- train the network by backpropagation
- if the example is classified correctly, the go to 2> 
- if the example is misclassified, put it in the second part of the pool, thus replacing a randomly chosen example.
- repeat until Ec = 0
 
 
- adaptive training procedure 
 
- use a high learning rate for wrong classified example and a low learning rate for correctly classified example. 
internal representation of 2 hidden units --> 3 groups dividd the GES scale of transfer free energy into 3 interval (except arginine)
genetic code is inherently nonlinear.
The weight of the trained network
- the second codon position > the first > the third 
- 2 hidden units share the discrimination task
- the correlation of the second codon to hydrophobicity  - --> Blobel and Cavalier-Smith : genes and ribosomes associted with the surface of liposome-like vesicle => cotranslational insertion of membrane protein 
 
6.5.2 Eukaryotic Gene Finding and Intron Splice Site Prediction
- prediction of splice donor and acceptor site : 15-60 nucleotide
- prediction of coding region and non-coding region : exon 100-150 nucleotide
- the pattern strength or regularity is the major factor influencing the potential accuracy of their detection
- codon usage specific to a given organism creates a fairy strong 3-periodicity   - --> mammal : third position, bacteria : first position 
 
- compensating realtionship between the strength of the donor and acceptor site pattern and the strength of the pattern present in the associated coding region - --> NetGene : 2 local splice site network, exon prediction network (301 nucleotide window) 
 
6.5.3 Examples of Gene Structure Prediction by Sensor Integration
The use of a combination of sensors for detection of various signals related to a complex object has a long histroy in the theory of pattern recognition
GRAIL
- NN-based exmple of sensor integration used for coding-region recognition
- GRAIL II : discrete coding -region candidates, rather than a fixed-size sliding window - one of input feature, a measure of the length of the coding-region candidate
 
- performace improved by more complex sensor indicators, but not NN. - a fifth-order nonhomogeneous Markov chain for 6-mer based evaluation of the coding potential in DNA
 
- recognition of coding-region candidate, exon assembly, detection of indel error, detection of CpG islands, recognition of PolIII promoter, polyadenylation site
- intron/exon and splice-site indicator are weighted by NN to approxiamte the log-likelihood.
- to find the combination of introns and exons that maximize the likihood, dynamic programming alogrithm is used.
- robustness of the method to substitution and frame-shift errors - score the sequence for splice site, codon usage, local composition complexity, 6-tuple frequency, length distribution, periodic asymmetry
- DP to enforce the constraint that introns and exons must be adjacent and nonoverlapping and find the highscored combination
 
6.5.4 Prediction of Intron Splice Sites by Combining Local and Global Sequence Information
1991 NetGene : 3 networks
- global network (301nt)  - combining 2 lcoal donor (15nt) and acceptor site's(41nt) network
- perform a coding/noncoding prediction
 
- the sharpness of the transition of the global exon-to-intron signal regulates the actual threshold - --> improve the ratio between true and false positive and acceptor site assigned 
 - ex) with abruptly decreasing exon activity, donor site should be promoted and acceptor site be suppressed.
 
- coding/nocoding output :  - the first derivative(delta) of the coding output neuron  - summing activities to the right of a given point
- subtracting the corresponding sum for the left side, then dividing this difference by the number of addends.
 
- in order to reduce the nubmer of cases where the individual sums covered both introns and exons,  75bp , half the average length of the internal exons in the training set, was used as the scope when summing - To assess possible weighting between the strength of the exon signal and the output - assign splicing donor site : Odonor > eDdelta + cD 
- assign splicing acceptor site : Oacceptor > eAdelta + cA - cA, cD : cutoff parameter
- eA, eD : control the impact of the exon signal
 
- optimization : when cutoff assignment was controlled by the exon signal, false positive was lowered
- exons smaller than 75bp had weak exon level but strong donor and acceptor level.
 
 
- To assess possible weighting between the strength of the exon signal and the output 
 
- the first derivative(delta) of the coding output neuron  
Arabidopsis thaliana NetPlantGene
6.5.5 Doing Sequence Analysis by Inspecting the Order in Which Neural Networks Learn
- Important information on the internal structure of the training material can be obtained.
- NN learn first the linearly separable part of the data and outliers that deviate from the prevailing patterns later. - the first phase : linearly separable prat of the data
- the second phase : networks often unlearn some of the example learned in the first phase.
 
- the order in which a set of example is learned frist by network reveals information on the relative nonlinearity of  each example. - --> used to identify abnormal examples that deviate strongly from the prevailing patterns. 
 
- the success of the learning  was monitored  - taking the training set as a whole : monitor the decrease in the total network error E
- inspecting each window configuration : monitor cutoff value (mostly 0.5) separating the two output category assignment  - -> point to those inputs not being classified correctly 
 
 
6.6 Prediction Performance Evaluation
It is often relevant to measure accuracy of prediction at different levels.
- ex) protein secondary structure may be evalutaed at the mean per-chain level
At higher levels, the measures tend to be more complicated and problem-specific.
- ex) For secondary structure prediction, SOV(segment overlap measure) can be applied.
How do we assess the accuracy of M or how do we compare M to D?
- The issue of prediction accuracy is strongly related to the frequency of occurrence of each class.
- the amino acid positions are weighted and treated equally (the independence and equivalence assumption) - true positve(TP), true negative(TN), false positive(FP), false negative(FN)
- even with 4 numbers alone, it is not immediately clear how a given prediction method fares.
 
6.7 Different Performance Measures
6.7.1 Percentages
- PCP(D,M) = 100 x TP / (TP + FN)
- PCN(D,M) = 100 x TN / (TN + FP)
- Qalpha, Qbeta, Qcoil
6.7.2 Hamming Distance
- in the binary case - HD(D,M) = Sum(i)|di - mi|
- equivalent to the total number of errors FP + FN
 
- in the non-binary case, the Hamming distance is called the L1 distance
6.7.3 Quadratic "Distance"
- quadratic, Euclidean, LMS(least mean square)
- Q(D,M) = (D-M)2 = Sum(i)(di - mi)2
- binary case , Hamming distance
- non-binary case, negative log-likelihood approach for a Gaussian model - P(di|mi) = 1/(sigma root(2 phi)) exp(-(di - mi)2/2sigma2)
 
- drawback - Gaussian model is not relevant for prediction problem
- LMS has limited dynamic range - => logarithmic variation - LQ(Q,M) = -Sum(i)log[1-(di-mi)2] 
 
 
 
6.7.4 Lp Distances
- LP(D,M) = [Sum(i)|di -mi|p]1/p 
- applied to any numerical values. - p=1 Hamming distance
- p=2 Euclidean distance
- p->infin max(i)|di - mi| 
 
- binary case (FP + FN)1/p
6.7.5 Correlation
- Pearson correlation coefficient - stastistics
- Matthews correlation coefficient - secondary structure
- C(D,M) = Sum(i)(di - davg)(mi - mavg)/(sigmaD sigmaM) - davg = Sum(i)di/N mavg = Sum(i)mi/N
 - -1 : total disagreement
- 0 : total random
- 1 : total agreement
 
- the correlation coefficient has a global form
- binary case - C(D,M) = (TPxTN - FPxFN)/root((TP + FN)(TP + FP)(TN + FP)(TN + FN))
- much more balanced evaluation of the prediction than percentage
- simple test for deciding whether the prediction is more correlated with the data than a random guess. - chi2 = N x C2(D,M)
 
 
- C is symmetric to FP and FN
6.7.6 Approximate Correlation
6.7.7 Relative Entropy
- relative entropy, cross entropy, KL(Kullback-Leibler)
- probability vector X = (x1..xM), Y = (y1..yM), xi, yi >= 0, Sum(i)xi = Sum(i)yi = 1 - H(X,Y) = Sum(i=1 -> M)(xi log(xi/yi)) = -H(X) - Sum(i)(xi logyi) 
- positive, 0 if X = Y, not symmetric, global measure
 
- H(X,X+ep) = -Sum(i)xi[log(1+epi/xi)] ~ Sum(i)(epi2/xi) - if X is uniform, relative entropy is LMS
 
- in secondary structure - H(D,M)=Sum(i=1->N)[di log(di/mi) + (1-di)log((1-di)/(1-mi))] 
- just the sum of the relative entropies at each position i
 
6.7.8 Mutual Information
- Z = (X,Y) joint random variable - I(X,Y) = H(Z,XY)
- the reduction in uncertainty of one variable when the other is observed.
- between the priori and posterior distribution
 
- I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(Z) = I(Y,X)
- in secondary structure - I(D,M) = -H(TP/N, TN/N, FP/N, FN/N) - TP/Nlog[davg mavg] - FN/Nlog[davg(1-mavg)] - FP/Nlog[(1-davg)mavg] - TP/Nlog[(1-davg)(1-mavg)] 
 
- 0 <= I(D,M) <= H(D) - normalized mutual information coefficient
- 0 <= IC(D,M) = I(D,M)/H(D) <= 1 
- mutual information is symmetric in FP and FN, mutual information coefficinet isn't symmetric
 
6.7.9 Sensitivity and Specificity
- sensitivity x = TP/(TP + FN)
- sensitivity of negative category TN/(FP + TN)
- specificity y = TP/(TP + FP)
- C(D,M) = (Nxy - TP)/root((Nx - TP)(Ny - TP))
6.7.10 Summary
- in binary case.   - Percentage, Hamming distance, quadratic distance, Lp : two out of 4 number TP, FP, TN, FN
- correlation coefficient, mutual information : 4 number
 
- in continuous case. - correlation coefficient, relative entropy.
 
 BioHackersNet
 BioHackersNet