Distances creates a table of the pairwise distances within a group of aligned sequences.
Distances writes a matrix of the pairwise evolutionary distances between aligned sequences. The distances are expressed as substitutions per 100 bases or amino acids. Several methods may be chosen to correct the distances for multiple substitutions at a site. For nucleic acid sequences, these methods are Kimura's two-parameter method, the Tajima-Nei method, the Jin-Nei gamma distance method, and the Tamura method; for protein sequences, the Kimura method; and for either type of sequence, the Jukes-Cantor method. It is also possible to obtain an uncorrected distance.
Here is a session using Distances to determine distances between the aligned sequences in the file hum_gtr.msf.
% distances
DISTANCES for what aligned sequences ? hum_gtr.msf{*}
Reading sequences...
gtr1_human: 548 total, 548 read
gtr1_human: 548 total, 548 read
gtr1_human: 548 total, 548 read
gtr1_human: 548 total, 548 read
gtr1_human: 548 total, 548 read
Distances will be computed for 5 protein sequences.
Which distance correction method to use ?
1 Uncorrected distance
2 Jukes-Cantor distance
3 Kimura protein distance
Choose the method to use: (* 3 *)
What should I call the distance matrix file (* hum_gtr.distances *) ?
Computing distances using Kimura method...
1 x 2: 48.61 1 x 3: 45.50
1 x 4: 65.74 1 x 5: 107.70
2 x 3: 61.53 2 x 4: 74.57
2 x 5: 113.82 3 x 4: 68.93
3 x 5: 104.43 4 x 5: 110.86
Statistics on pairwise distances:
5 of 10 pairs have distances exceeding 70.0.
%
Here is the 5 x 5 distance matrix created during the example session:
DISTANCES between protein sequences in: hum_gtr.msf{*} October 20, 1998 13:00
Correction method: Kimura protein distance
Distances are: estimated number of substitutions per 100 amino acids
Symmatrix version 1
Number of matrices: 1
//
Matrix 1, dimension: 5
Key for column and row indices:
1 gtr1_human
2 gtr3_human
3 gtr4_human
4 gtr2_human
5 gtr5_human
Matrix 1: Part 1
                  1        2        3        4        5
____________________________________________________________ ..
|   1 |        0.00    48.61    45.50    65.74   107.70
|   2 |                 0.00    61.53    74.57   113.82
|   3 |                          0.00    68.93   104.43
|   4 |                                   0.00   110.86
|   5 |                                            0.00
If you are interested in putting your own distance information into
this matrix format, for example to draw the tree for non-sequence derived
distances using GrowTree, the easiest way to
do so would be to make a template matrix with some short random sequences
(one character in length is enough) and then replace the data points in
the matrix with your own data points.
If you plan to do this frequently, or have so many data points that a script to convert your distance matrix to the GCG format would save you time, here is the basic format of a GCG distance matrix:
Heading: You can put your own comments at the top. The heading must then contain a line giving the version, "Symmatrix version 1" (the format described here is version 1), followed by a line giving the number of matrices in the file, "Number of matrices: 1" (GrowTree currently processes only one matrix per file). After that you can put any number of comment lines, followed by two slashes ("//") on a line by themselves. Next, give the matrix number M and the number of dimensions D (for example, the number of sequences) in the form "Matrix M, dimension: D". More comments may follow. Then list which entity (for example, which sequence) is assigned to which column number: start with a line beginning "Key for column", follow it with a blank line, and then add as many lines as there are dimensions, each giving a column number followed by an entity name. End the heading section with two dots ("..").
Matrix: If you have more than 12 dimensions the matrix is split into several parts, each having 12 columns of data points. Each part has as many rows as there are dimensions. This is important, but might easily be missed, since some of the bottom rows in some parts of a multipart matrix will be empty. Each row has the first 10 characters reserved for labeling, after that it contains the not yet listed data points for the respective columns separated by white space. Each row needs to be on one line only. A line containing two dots ".." separates each part of a multipart matrix. You can have comments after one part and before the two dots.
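The format description above can be sketched as a short Python function. This is a minimal sketch, not a GCG utility: the function name symmatrix_lines, the 9-character column width, and the layout of the comment lines are assumptions modeled on the sample file shown earlier, and the output has not been checked against GrowTree.

```python
def symmatrix_lines(names, dist, cols_per_part=12):
    """Return the lines of a 'Symmatrix version 1' file holding one matrix.

    names: entity names, in column order.
    dist:  square list of lists; only the upper triangle is written.
    """
    d = len(names)
    lines = [
        "Symmatrix version 1",
        "Number of matrices: 1",
        "//",
        f"Matrix 1, dimension: {d}",
        "Key for column and row indices:",
        "",  # blank line required after the key line
    ]
    lines += [f"{i + 1:4d} {name}" for i, name in enumerate(names)]
    for part, start in enumerate(range(0, d, cols_per_part), 1):
        stop = min(start + cols_per_part, d)
        # Comment lines, then the two dots that end the heading
        # (part 1) or separate this part from the previous one.
        lines.append(f"Matrix 1: Part {part}")
        lines.append(" " * 10 + "".join(f"{j + 1:9d}" for j in range(start, stop)))
        lines.append("_" * 60 + " ..")
        for i in range(d):
            # The first 10 characters of each row are reserved for the label.
            label = f"|{i + 1:4d} |".ljust(10)
            cells = "".join(
                f"{dist[i][j]:9.2f}" if j >= i else " " * 9
                for j in range(start, stop)
            )
            # Rows with no data in this part are still written (label only).
            lines.append((label + cells).rstrip())
    return lines
```

Writing "\n".join(symmatrix_lines(names, dist)) to a file gives a starting point that you can compare against a template matrix produced by Distances itself.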
Distances accepts multiple sequences (two or more) all of the same type. You can specify multiple sequences in a number of ways: by using a list file, for example @project.list; by using an MSF or RSF file, for example project.msf{*}; or by using a sequence specification with an asterisk (*) wildcard, for example GenEMBL:*. The function of Distances depends on whether your input sequence(s) are protein or nucleotide. Programs determine the type of a sequence by the presence of either Type: N or Type: P on the last line of the text heading just above the sequence. If your sequence(s) are not the correct type, turn to Appendix VI for information on how to change or set the type of a sequence.
PileUp creates a multiple sequence alignment from a group of related sequences using progressive pairwise alignments. It can also plot a tree showing the clustering relationships used to create the alignment. LineUp creates and edits multiple sequence alignments. Pretty displays multiple sequence alignments.
The Wisconsin Package includes several programs for evolutionary analysis of multiple sequence alignments. Distances creates a matrix of pairwise distances between the sequences in a multiple sequence alignment. Diverge measures the number of synonymous and nonsynonymous substitutions per site of two or more aligned protein coding regions and can output matrices of these values. GrowTree reconstructs a tree from a distance matrix or a matrix of synonymous or nonsynonymous substitutions. PAUPSearch reconstructs phylogenetic trees from a multiple sequence alignment using parsimony, distance, or maximum likelihood criteria; PAUPDisplay can manipulate and display the trees output by PAUPSearch and can also plot the trees output by GrowTree.
The sequences must be aligned properly for Distances to work. Since Distances does not create alignments, it is your responsibility to ensure that the sequences specified by a list file or wild-card file specification are in alignment before using them as input to Distances. One way to verify this is to use Pretty to display the sequences; if the Pretty output shows an acceptable alignment, the sequences are suitable for use with Distances.
Distances examines each pair of aligned sequences symbol-by-symbol and counts the number of exact matches, partial matches, and gap symbols. If the sequences are nucleic acids, transitions (purine-purine or pyrimidine-pyrimidine substitutions) and transversions (purine-pyrimidine substitutions) are also tallied. These counts are used, where appropriate, to compute the distance.
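The symbol-by-symbol tally described above might look like the following Python sketch for nucleic acid sequences. It is an illustration only, not the program's code: the function name tally_pair, the set of gap characters, and the decision to skip positions where both sequences have a gap are assumptions, and ambiguous symbols are simply counted as "other" rather than scored as partial matches.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
GAP_CHARS = {".", "-", "~"}  # assumed gap symbols


def tally_pair(seq1, seq2):
    """Count matches, transitions, transversions and gaps for two aligned sequences."""
    counts = {"matches": 0, "transitions": 0, "transversions": 0,
              "gaps": 0, "other": 0}
    for x, y in zip(seq1.upper(), seq2.upper()):
        x_gap, y_gap = x in GAP_CHARS, y in GAP_CHARS
        if x_gap and y_gap:
            continue  # assumption: a gap opposite a gap is not scored
        if x_gap or y_gap:
            counts["gaps"] += 1
        elif x not in PURINES | PYRIMIDINES or y not in PURINES | PYRIMIDINES:
            counts["other"] += 1  # ambiguous or unknown symbol
        elif x == y:
            counts["matches"] += 1
        elif (x in PURINES) == (y in PURINES):
            counts["transitions"] += 1  # purine-purine or pyrimidine-pyrimidine
        else:
            counts["transversions"] += 1  # purine-pyrimidine
    return counts
```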
When sequences are very closely related, the observed distance and
the actual distance between two sequences are equivalent. As the time since
the sequences diverged increases, the probability that more than one substitution
occurred at a single site also increases. Therefore for all but closely
related sequences, the observed distance between the sequences underestimates
the true distance.
In order to construct a valid tree, the observed distances must be corrected to account for multiple substitutions at a single site. A number of methods have been devised to make this correction. Each makes different assumptions about the substitution process.
The uncorrected distance method computes the observed distance between sequences, with
no correction for multiple substitutions. This uncorrected distance
is sometimes referred to as the p-distance. It can be used for either
nucleic acid or protein sequences, and gap positions can be factored into
the calculation or ignored. A match score is computed by summing the number
of exact matches. If -AMBIGuous is used, partial matches
between ambiguous symbols also contribute to the match score as fractional
scores (for example, the nucleotide W matched with A would score 0.5, while
N matched with A would score 0.25). The similarity S is computed
by dividing the match score
by the number of positions scored plus the
number of gap positions times the gap penalty. The distance is 1 - S.
Gaps are ignored unless a nonzero value is specified for -GAPweight.
End gaps are penalized as much as internal gaps, so if you choose to apply
a gap penalty and gaps exist at the beginning and/or end of some of the
sequences in the alignment, make sure to set the beginning and ending coordinates
to exclude these regions.
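The calculation described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function name uncorrected_distance and the gap characters are hypothetical, only exact matches are scored (no -AMBIGuous handling), and positions where both sequences have a gap are skipped. The result is the fraction 1 - S; multiply by 100 to express it as substitutions per 100 positions.

```python
GAP_CHARS = {".", "-", "~"}  # assumed gap symbols


def uncorrected_distance(seq1, seq2, gap_weight=0.0):
    """Return the uncorrected distance (p-distance) 1 - S for two aligned sequences."""
    matches = scored = gaps = 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        x_gap, y_gap = x in GAP_CHARS, y in GAP_CHARS
        if x_gap and y_gap:
            continue  # assumption: gap opposite gap is not counted
        if x_gap or y_gap:
            gaps += 1
            continue
        scored += 1
        matches += x == y
    # Similarity: match score / (positions scored + gap positions * gap penalty).
    denominator = scored + gaps * gap_weight
    if denominator == 0:
        raise ValueError("no positions to score")
    return 1.0 - matches / denominator
```

With a gap weight of 0.0 the two gap positions in uncorrected_distance("ACGTAC--", "ACGAACGT") are ignored; with a gap weight of 1.0 they count as fully as a mismatch.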
The Jukes-Cantor method for correcting distances can be used for nucleic acid
or protein sequences. Gap positions can be factored into the equation by
specifying a nonzero value for -GAPweight, and partial
matches between ambiguous symbols can contribute to the match score if
-AMBIGuous is used. The uncorrected distance D is computed
and then corrected to account for multiple substitutions at a site using
the equation
$$\text{distance} = -b \ln\left(1 - \frac{D}{b}\right)$$
The parameter b is 3/4 for nucleic acid sequences and
19/20 for protein sequences. End gaps are penalized as much as internal
gaps, so if you choose to apply a gap penalty and gaps exist at the beginning
and/or end of some of the sequences in the alignment, make sure to set
the beginning and ending coordinates to exclude these regions.
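As a worked illustration of the Jukes-Cantor correction, distance = -b ln(1 - D/b), here is a short Python sketch. The function name jukes_cantor is hypothetical; D is taken as a fraction (0 to 1), so multiply the result by 100 to get substitutions per 100 positions.

```python
import math


def jukes_cantor(d, protein=False):
    """Correct an uncorrected distance d (a fraction) for multiple substitutions."""
    b = 19 / 20 if protein else 3 / 4
    if d >= b:
        # The logarithm is undefined once D reaches b (sequences are saturated).
        raise ValueError("uncorrected distance too large to correct")
    return -b * math.log(1 - d / b)
```

For example, an uncorrected nucleotide distance of 0.10 corrects to about 0.107, which shows how small the correction is for closely related sequences.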
The Tajima-Nei method applies to nucleic acid sequences only. It uses the
same equation as the Jukes-Cantor method, except that the parameters are
calculated somewhat differently: the value of the parameter b varies
with the base composition of the sequence pairs. In addition, only exact
matches are considered in computing the match score, and gap positions
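are always ignored. In the equations below, the bases are indexed A=1, T=2, C=3, G=4;
fraction[i] is the average frequency of base i in the two sequences, pairfreq[i,k] is
the fraction of scored positions at which the two sequences carry bases i and k, and
D is the uncorrected distance.
$$b = \frac{1}{2}\left(1 - \sum_{i=1}^{4} \text{fraction}[i]^{2} + \frac{D^{2}}{h}\right)$$
$$h = \sum_{i=1}^{3} \sum_{k=i+1}^{4} \frac{1}{2} \cdot \frac{\text{pairfreq}[i,k]^{2}}{\text{fraction}[i] \cdot \text{fraction}[k]}$$
$$\text{distance} = -b \ln\left(1 - \frac{D}{b}\right)$$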
Tajima and Nei, Mol. Biol. Evol. 1; 269-285 (1984), equation 6.
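The following Python sketch illustrates the Tajima-Nei calculation as published by Tajima and Nei (1984). It is not taken from the program: the function name tajima_nei is hypothetical, and the sketch assumes that only positions where both sequences carry an unambiguous base are scored. The result is in substitutions per site.

```python
import math
from itertools import combinations

BASES = "ATCG"  # index order A=1, T=2, C=3, G=4


def tajima_nei(seq1, seq2):
    """Tajima-Nei distance for two aligned nucleic acid sequences."""
    pairs = [(x, y) for x, y in zip(seq1.upper(), seq2.upper())
             if x in BASES and y in BASES]
    n = len(pairs)
    if n == 0:
        raise ValueError("no positions to score")
    d = sum(x != y for x, y in pairs) / n  # uncorrected distance D
    if d == 0:
        return 0.0
    # Average frequency of each base over both sequences.
    fraction = {
        b: (sum(x == b for x, _ in pairs) + sum(y == b for _, y in pairs)) / (2 * n)
        for b in BASES
    }
    # Frequency of each unordered pair of different bases.
    pairfreq = {
        (i, k): sum({x, y} == {i, k} for x, y in pairs) / n
        for i, k in combinations(BASES, 2)
    }
    h = sum(f ** 2 / (2 * fraction[i] * fraction[k])
            for (i, k), f in pairfreq.items() if f > 0)
    b = 0.5 * (1 - sum(q ** 2 for q in fraction.values()) + d ** 2 / h)
    # math.log raises ValueError if d >= b (distance cannot be estimated).
    return -b * math.log(1 - d / b)
```

When all four bases are equally frequent and all six kinds of mismatch are equally common, b works out to 3/4 and the result equals the Jukes-Cantor distance.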
The Kimura two-parameter method applies only to nucleic acids and takes into consideration
the fact that transition substitutions (purine-purine or pyrimidine-pyrimidine)
often occur much more frequently than transversion substitutions (purine-pyrimidine).
Gap positions and ambiguous symbols other than R (purine) and Y (pyrimidine)
are not scored.
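The equation for this method is not reproduced in this text, so the sketch below uses the standard published form of Kimura's two-parameter distance, d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), as an assumption about what the program computes. P and Q are the fractions of scored positions that show transitions and transversions; the function name kimura_2p is hypothetical.

```python
import math


def kimura_2p(p, q):
    """Kimura two-parameter distance (substitutions per site).

    p: transitions / positions scored
    q: transversions / positions scored
    """
    # math.log raises ValueError when the sequences are too divergent
    # (1 - 2p - q <= 0 or 1 - 2q <= 0).
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```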
The Tamura method applies only to nucleic acids and assumes that substitution
occurs at any site along the sequence with equal probability. It takes
different rates of transitions and transversions into account and also
takes into account deviation of G+C content from the expected value of
50 percent. Gap positions and ambiguous symbols are not scored.
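In the equations below, P = transitions / nScored and Q = transversions / nScored.
$$\theta_1 = \text{fraction of G+C in sequence 1}$$
$$\theta_2 = \text{fraction of G+C in sequence 2}$$
$$C = \theta_1 + \theta_2 - 2\,\theta_1 \theta_2$$
$$\text{distance} = -C \ln\left(1 - \frac{P}{C} - Q\right) - \frac{1}{2}(1 - C) \ln(1 - 2Q)$$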
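The Tamura equations above translate directly into Python. In this sketch the function name tamura is hypothetical, and P and Q are assumed to be the fractions of scored positions showing transitions and transversions. The result is in substitutions per site.

```python
import math


def tamura(p, q, theta1, theta2):
    """Tamura distance from transition/transversion fractions and G+C content.

    p, q:           transitions and transversions per scored position
    theta1, theta2: fraction of G+C in sequence 1 and sequence 2
    """
    c = theta1 + theta2 - 2 * theta1 * theta2
    return -c * math.log(1 - p / c - q) - 0.5 * (1 - c) * math.log(1 - 2 * q)
```

When both sequences have a G+C content of exactly 50 percent, C is 0.5 and the expression reduces to the usual two-parameter form with separate transition and transversion terms.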
The Jin-Nei gamma distance is another method that applies only to nucleic acids and that
takes transitions and transversions into account. Gap positions and ambiguous
symbols other than R and Y are not scored. This method is designed to be
used when the substitution rate varies extensively from site to site. The
shape parameter a is the square of the inverse of the coefficient
of variation.
$$P = \frac{\text{transitions}}{\text{nScored}}$$
$$Q = \frac{\text{transversions}}{\text{nScored}}$$
$$\text{distance} = \frac{a}{2}\left[(1 - 2P - Q)^{-1/a} + \frac{1}{2}(1 - 2Q)^{-1/a} - \frac{3}{2}\right]$$
Jin and Nei, Mol. Biol. Evol. 7; 82-102 (1990).
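A short Python sketch of the Jin-Nei gamma distance follows, using distance = (a/2) [ (1 - 2P - Q)^(-1/a) + (1/2)(1 - 2Q)^(-1/a) - 3/2 ]. The function name jin_nei_gamma is hypothetical; the default a = 1.0 mirrors the default shown for -APARAMeter. The result is in substitutions per site.

```python
def jin_nei_gamma(p, q, a=1.0):
    """Jin-Nei gamma distance.

    p: transitions / positions scored
    q: transversions / positions scored
    a: shape parameter of the gamma distribution of rates
    """
    if 1 - 2 * p - q <= 0 or 1 - 2 * q <= 0:
        raise ValueError("sequences too divergent for this correction")
    return 0.5 * a * (
        (1 - 2 * p - q) ** (-1 / a)
        + 0.5 * (1 - 2 * q) ** (-1 / a)
        - 1.5
    )
```

Smaller values of a correspond to stronger rate variation among sites and give larger corrected distances for the same P and Q.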
The Kimura protein method applies only to proteins. The formula calculates distances
based on the relationship between observed amino acid substitutions and
actual (corrected) substitutions that was derived by Dayhoff and coworkers.
Gap positions are ignored, and only exact matches contribute to the match
score.
M. Kimura, The Neutral Theory of Molecular Evolution, Cambridge
University Press, Cambridge, 1983.
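The formula itself is not reproduced in this text. The sketch below assumes the approximation published in the cited book by Kimura (1983), distance = -ln(1 - D - 0.2 D^2), where D is the uncorrected fraction of differing amino acids; whether the program uses exactly this form is an assumption. The function name kimura_protein is hypothetical.

```python
import math


def kimura_protein(d):
    """Kimura protein distance (substitutions per site) from uncorrected distance d."""
    remaining = 1 - d - 0.2 * d * d
    if remaining <= 0:
        raise ValueError("uncorrected distance too large to correct")
    return -math.log(remaining)
```

Multiply the result by 100 to compare it with the values in the sample matrix, which are expressed per 100 amino acids.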
The single most critical step in tree reconstruction is the sequence alignment. If the alignment is poorly done, no amount of care or tweaking of analysis parameters will guarantee a correct tree. Multiple alignments that are created by computerized methods such as PileUp will need to be inspected and edited by hand, using an editor such as LineUp or SeqLab. Be especially careful with nucleic acid sequences that are coding regions, since computerized alignment methods have no knowledge of codon boundaries. They may insert a gap whose length is not a multiple of three or may insert a gap in the middle of a codon, for example.
Once the alignment is satisfactory, you must decide whether to use the entire alignment, or only portions of it. Only homologous regions of the sequences should be used to reconstruct a tree. Any regions of an alignment that contain data for which no homologs occur in the other sequences should be eliminated from consideration. For example, if there are gap characters at the beginning or end of one or more sequences in the alignment, the sequence data at the extremes of the alignment should not be used, since the longer sequences contain regions that have no homologs in the shorter sequences. Similarly, regions in the interior of the alignment that contain gaps in some of the sequences should probably be edited out of the alignment before trying to reconstruct a tree.
Some biological phenomena can interfere with tree reconstruction. Gene duplication is one of them. When genes are duplicated (by polyploidy or by regional duplication), one of the copies often accumulates mutations and either acquires a different function than the original gene or becomes a pseudogene. In this situation, it is often unclear which of the alternative loci will give the correct tree for the functional gene. Another complication is recombination: if recombination has occurred between sequences in the data set, no single tree can correctly explain the data.
Some data sets can also confound the existing methods for tree construction. For example, a set of sequences consisting of mostly closely related sequences with a few very divergent sequences cannot be analyzed using parsimony or a distance method based on an improperly corrected distance matrix. These methods will systematically group the widely diverged sequences together as sister groups, even if they actually belong to different lineages. If you don't want to drop the diverged sequences from the analysis, you will need to add sequences to the alignment that bridge the distance between the more distant sequences and the group of closely related sequences, or use a distance method based on a properly corrected distance matrix.
Another consideration when computing distances between coding regions is whether to use all three nucleotides in each codon or just the first or second. The substitution rate at the third codon position is usually much higher than that at the other two positions because of the degeneracy of the genetic code. In these cases, it might be best to use just the first position or just the first two positions of each codon to compute the distances.
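Restricting an aligned coding sequence to particular codon positions can be sketched as follows. The function name select_codon_positions is hypothetical; the numeric codes follow the values documented for -POSition (1, 2, or 3 for a single position, 4 for the first and second positions, 5 for all three). The sketch assumes the sequence starts at the first position of a codon and that gaps do not shift the reading frame.

```python
def select_codon_positions(seq, position=5):
    """Keep only the requested codon position(s) of an in-frame coding sequence."""
    wanted = {1: (0,), 2: (1,), 3: (2,), 4: (0, 1), 5: (0, 1, 2)}[position]
    return "".join(c for i, c in enumerate(seq) if i % 3 in wanted)
```

Applying the same selection to every sequence in the alignment keeps the columns in register, so the reduced sequences can be compared position by position.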
It is important to use the proper correction method when computing distances, unless the sequences are all very closely related. Some guidelines for choosing a correction method are listed under the SUGGESTIONS topic.
If the aligned sequences are not in an MSF file format, use Pretty to display the aligned sequences you pass to Distances. If they look properly aligned in the Pretty display, they will work sensibly with Distances.
To get the best nucleotide alignments of coding regions, you also should align the sequences at the protein level and adjust the nucleotide alignment to conform to the amino acid alignment. You can do this manually using LineUp or SeqLab.
One way of detecting the presence of recombination in your sequence set is to reconstruct trees from different sections of the alignment. If different trees are found for different sections, it's possible that recombination has occurred.
To check the distance distribution of your sequences, create an uncorrected distance matrix from the alignment (using Distances) and examine the contents. If there are mostly closely related sequences with a few very divergent sequences, you must either add sequences to the alignment to bridge the distance between the more distant sequences and the group of closely related sequences, or you must use a distance method based on a properly corrected distance matrix.
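One simple way to examine the distribution is to count how many pairwise distances exceed a threshold, much as the program's own screen summary does. The Python sketch below is illustrative only: the helper names are hypothetical, and the values are copied from the sample matrix shown earlier (which is Kimura-corrected, not uncorrected) purely to have numbers to work with.

```python
# Upper triangle of the sample 5 x 5 matrix, one list per row,
# each starting with the 0.00 on the diagonal.
UPPER = [
    [0.00, 48.61, 45.50, 65.74, 107.70],
    [0.00, 61.53, 74.57, 113.82],
    [0.00, 68.93, 104.43],
    [0.00, 110.86],
    [0.00],
]


def pairwise_values(upper_rows):
    """Return the off-diagonal distances from an upper-triangular matrix."""
    return [value for row in upper_rows for value in row[1:]]


def count_exceeding(values, threshold):
    """Count the distances strictly greater than the threshold."""
    return sum(value > threshold for value in values)


values = pairwise_values(UPPER)
print(f"{count_exceeding(values, 70.0)} of {len(values)} pairs exceed 70.0")
print(f"smallest {min(values):.2f}, largest {max(values):.2f}")
```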
Jin and Nei, Mol. Biol. Evol. 7; 82-102 (1990), give a set of guidelines for choosing a distance correction method for nucleic acid sequences. Here is a summary of their suggestions.
First compute the distances using the Jukes-Cantor method. If all the distances are less than or equal to 10 substitutions per 100 bases, there is no need to use another method (all the correction methods calculate about the same distances for closely related sequences). If the distances are greater than 10 substitutions per 100 bases, choose a correction method based on the following criteria:
- If the Jukes-Cantor distances are between 30 and 100 substitutions and there is evidence that the substitution rate varies extensively from site to site, use the Jin-Nei gamma distance with -APARAMeter=1.0. If the distances lie between 30 and 100 and the frequencies of the four nucleotides deviate substantially from equality, use the Tajima-Nei distance.
- If the Jukes-Cantor distance is greater than 100 for many pairs of sequences, the tree that will be constructed from the distance data will not be reliable. Depending on your data, and the reason that you are computing the distances, one of the following suggestions may help:
b. For coding regions, align the protein sequences and compute the distances as amino acid substitutions.
c. If you know that a certain region of the sequence is evolving very rapidly compared to the rest of the sequence, edit the alignment with LineUp to eliminate this region, and recompute the distances.
All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
Minimal Syntax: % distances [-INfile=]hum_gtr.msf{*} -Default

Prompted Parameters:

[-OUTfile=]hum_gtr.distances  names the output file

Correction Methods for Nucleic Acid Sequences

-MENu=1   uncorrected distance
      2   Jukes-Cantor distance
      3   Kimura 2-parameter distance
      4   Jin-Nei gamma distance
      5   Tajima-Nei distance
      6   Tamura distance

Correction Methods for Protein Sequences

-MENu=1   uncorrected distance
      2   Jukes-Cantor distance
      3   Kimura protein distance

Local Data Files: None

Optional Parameters:

-BEGin=1 -END=100      sets the range of interest
-FILe=hum_gtr.report   names the table of counts used to calculate distances
-AMBIGuous             considers partial matches between ambiguous symbols
-POSition=5            sets base position(s) to consider
-GAPweight=0.0         sets gap penalty (uncorrected and Jukes-Cantor only)
-APARAMeter=1.0        sets 'a' parameter (Jin-Nei gamma distance only)
-NOMONitor             suppresses screen display of the progress of the analysis
None.
You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.
-MENu
sets the distance correction method to use. For nucleic acid sequences,
these are (in order): uncorrected distance, Jukes-Cantor distance, Kimura
2-parameter distance, Jin-Nei gamma distance, Tajima-Nei distance, and
Tamura distance. For protein sequences, these are: uncorrected distance,
Jukes-Cantor distance, and Kimura protein distance.
-BEGin=1
sets the beginning position for all input sequences. When the beginning
position is set from the command line, Distances ignores beginning positions
specified for individual sequences in a list file.
-END=100
sets the ending position for all input sequences. When the ending
position is set from the command line, Distances ignores ending positions
specified for sequences in a list file.
-FILe=hum_gtr.report
creates a table of the counts used to calculate the distances: number
of positions scored, exact matches, ambiguous symbol matches, transitions,
transversions, gap positions, etc.
-AMBIGuous
considers partial matches between ambiguous symbols when calculating
distances (uncorrected and Jukes-Cantor only).
-POSition=5
allows you to consider a single specified codon position (1, 2,
or 3), the first and second positions only (4), or all three codon positions
(5) when calculating distances between nucleic acid sequences.
-GAPweight=0.0
allows you to assign a gap penalty when using the Jukes-Cantor or
uncorrected distance methods.
-APARAMeter=1.0
allows you to vary the value of the shape parameter a in
the equation used by the Jin-Nei gamma distance correction method.
-NOMONitor
suppresses screen display of the progress of the analysis.
Copyright (c) 1982-2002 Accelrys Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.
Licenses and Trademarks Wisconsin Package is a trademark and GCG and the GCG logo are registered trademarks of Accelrys Inc.
All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.