PSIBLAST*

Table of Contents
FUNCTION
DESCRIPTION
EXAMPLE
OUTPUT
INTERPRETING OUTPUT
INPUT FILES
RELATED PROGRAMS
RESTRICTIONS
ALGORITHM
Building PSSMS
Composition-based Statistics
CONSIDERATIONS
SUGGESTIONS
FILTERING OUT LOW COMPLEXITY SEQUENCES
AMINO ACID SCORING
COMMAND-LINE SUMMARY
CITING BLAST
ACKNOWLEDGEMENT
LOCAL DATA FILES
PARAMETER REFERENCE

FUNCTION

PSIBLAST iteratively searches one or more protein databases for sequences similar to one or more protein query sequences. PSIBLAST is similar to BLAST except that it uses position-specific scoring matrices derived during the search.

DESCRIPTION

PSIBLAST, or Position-Specific Iterated BLAST, uses the methods described in Altschul, et al. Nucleic Acids Res. 25(17): 3389-3402 (1997) and Schaffer, et al. Nucleic Acids Res. 29(14): 2994-3005 (2001) to search for similarities between protein query sequences and all the sequences in one or more protein databases.

PSIBLAST uses position-specific scoring matrices (PSSMs) to score matches between query and database sequences, in contrast to BLAST which uses pre-defined scoring matrices such as BLOSUM62. PSIBLAST may be more sensitive than BLAST, meaning that it might be able to find distantly related sequences that are missed in a BLAST search.

PSIBLAST can repeatedly search the target databases, using a multiple alignment of high-scoring sequences found in each search round to generate a new PSSM for use in the next round of searching. PSIBLAST will iterate until no new sequences are found or the user-specified maximum number of iterations is reached, whichever comes first. Normally, the first round of searching uses a standard scoring matrix, effectively performing a blastp search.

PSIBLAST is a statistically driven search method that finds regions of similarity between your query sequence and database sequences and produces gapped alignments of those regions. Within these aligned regions, the calculated score is higher than some level that you would expect to occur by chance alone.

You are prompted to set a maximum expectation level for each search round. The expectation of a sequence is the number of times the current search would be expected to find a sequence with as good a score by chance alone. Therefore, setting the maximum expectation level to 10.0, the default, limits the reported sequences to those with scores high enough to be expected by chance ten or fewer times.

You are also prompted to specify a maximum expectation threshold that sequences can score and still be used to build PSSMs. Typically, this threshold is a smaller value than the maximum expectation level and the default is 0.005.

It is possible to bypass the initial blastp step either by providing a PSSM saved from a previous search or by specifying a set of aligned sequences which are then used to generate the initial PSSM. It is also possible to save a PSSM for use with BLAST in order to search a nucleotide database with a protein query, using the PSSM as the scoring matrix.

You can specify any number of protein databases to PSIBLAST. In the current release, if you want to specify multiple protein databases you must do so on the command line. In other words, you cannot specify more than one database from the interactive menu. For example:

% psiblast -INfile2=PIR,SWPLUS

You can also specify multiple protein queries using any valid multiple sequence specification. For example:

% psiblast -INfile1=hsp70.msf{*}

EXAMPLE

Here is a session using PSIBLAST to find the sequences in PIR with similarities to a myoglobin sequence:


% psiblast

 PSIBLAST with what query sequence(s) ? mywhp.pep

                  Begin (* 1 *) ?
                End (*   153 *) ?

 Search for query in what sequence database:

   1) pir      p Protein Information Resource
   2) swplus   p SWISS-PROT + SP-TREMBL
   3) genpept  p GenPept (Translated GenBank)
 Please choose one (* 1 *):

 Ignore hits expected to occur by chance more than (* 10.0 *) times?

 Maximum expectation for inclusion in PSSMs (* 0.005 *) ?

 Maximum number of interations (* 2 *) ?

 Limit the number of sequences in my output to (* 500 *) ?

 What should I call the output file (* mywhp.blastpgp *) ?

 1  Searching database "pir" with query "pir1:mywhp"

    CPU time (sec): 116.2
       Output file: mywhp.blastpgp

 Number of query sequences searched: 1
                     CPU time (sec): 116.4

%

OUTPUT

Below is part of the output from the search in the example session:

The output has four parts: 1) an introduction that tells where the search occurred and what database(s) and query were compared; 2) a list of the sequences in the database(s) containing HSPs (high-scoring segment pairs) whose scores were least likely to have occurred by chance (the entries in this list have begin and end ranges on them unless -NOFRAGments is specified); 3) a display of the alignments of the HSPs showing identical and similar residues; and 4) a complete list of the parameter settings used for the search.

The list and the alignments of high-scoring sequences are ordered by search round: the matches from the first round come first, followed by the matches from each successive round. Within each later round, sequences that were used in the model and found again are listed before sequences that were not found previously.

Immediately before the display of the results of the final search round, there is a separator line which reads:

     Final Round  ..

Only the sequences listed below this line are treated as list file members. If you wish to include sequences from earlier rounds in the list file, or to exclude some of the existing members, you must manually edit the PSIBLAST output.

By default, PSIBLAST looks for alignments that contain gaps. If you only look for alignments that do not contain gaps, there will often be more than one segment pair associated with each database sequence.


///////////////////////////////////////////////////////////////////////////////

BLASTP 2.2.1 [Aug-1-2001]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs",  Nucleic Acids Res. 25:3389-3402.
Query= PIR1:MYWHP
         (153 letters)

Database: pir
           219,241 sequences; 76,174,552 total letters

Searching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .done

Results from round 1

                                                         Score    E
 Sequences producing significant alignments:             (bits)  Value ..

PIR1:MYWHP  Begin: 1 End: 153
!myoglobin [validated] - sperm whale                        268  3e-72
PIR1:MYWHW  Begin: 1 End: 153
!myoglobin - dwarf sperm whale                              258  4e-69
///////////////////////////////////////////////////////////////////////////////

PIR2:S20270  Begin: 3 End: 145
!hemoglobin alpha chain - Antarctic dragonfish (Gymno...     39  0.004
PIR1:HAKOAW  Begin: 7 End: 146
!hemoglobin alpha-A chain - white stork                      39  0.004

!Searching. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .done

!Results from round 2

!                                                        Score    E
!Sequences producing significant alignments:             (bits)  Value

!Sequences used in model and found again:

PIR2:A29392
!hemoglobin alpha chain - Indian short-nosed fruit bat      235  4e-62
PIR2:A29702
!hemoglobin alpha chain - pallid bat                        235  4e-62
PIR2:A29391
///////////////////////////////////////////////////////////////////////////////

PIR1:MYTTM
!myoglobin - map turtle                                     202  2e-52
PIR1:MYOY
!myoglobin - aardvark                                       202  2e-52

!Sequences not found previously or not previously below threshold:

PIR1:HAEMA
!hemoglobin alpha chain - Amazon manatee                    227  7e-60
PIR1:HAMQP
!hemoglobin alpha chain - hanuman langur                    227  8e-60
///////////////////////////////////////////////////////////////////////////////

PIR1:HBLRS
!hemoglobin beta chain - slow loris                         203  1e-52
PIR1:HBHO
!hemoglobin beta chain [validated] - horse                  203  2e-52
\\End of List

Results from round 1

>PIR1:MYWHP myoglobin [validated] - sperm whale
          Length = 153

 Score =  268 bits (686), Expect = 3e-72
 Identities = 153/153 (100%), Positives = 153/153 (100%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
Sbjct: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
Sbjct: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
           GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
Sbjct: 121 GDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153

>PIR1:MYWHW myoglobin - dwarf sperm whale
          Length = 153

 Score =  258 bits (660), Expect = 4e-69
 Identities = 148/153 (96%), Positives = 151/153 (97%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLSEGEWQLVLHVWAKVEAD+AGHGQDILIRLFK HPETLEKFDRFKHLK+EAEMKASED
Sbjct: 1   VLSEGEWQLVLHVWAKVEADIAGHGQDILIRLFKHHPETLEKFDRFKHLKSEAEMKASED 60

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
Sbjct: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYKELGYQG 153
            DFGADAQGAM+KALELFRKDIAAKYKELGYQG
Sbjct: 121 ADFGADAQGAMSKALELFRKDIAAKYKELGYQG 153

///////////////////////////////////////////////////////////////////////////////
Results from round 2

>PIR2:A29392 hemoglobin alpha chain - Indian short-nosed fruit bat
          Length = 141

 Score =  235 bits (601), Expect = 4e-62
 Identities = 37/147 (25%), Positives = 58/147 (39%), Gaps = 6/147 (4%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLS  +   V   W KV  +   +G + L R+F S P T   F  F           S
Sbjct: 1   VLSPADKTNVKAAWDKVGGNAGEYGAEALERMFLSFPTTKTYFPHFDLAH------GSPQ 54

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           +K HG  V  AL   +         L  L+  HA K ++     + +S  ++  L +  P
Sbjct: 55  VKGHGKKVGDALTNAVSHIDDLPGALSALSDLHAYKLRVDPVNFKLLSHCLLVTLANHLP 114

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYK 147
            DF      +++K L      + +KY+
Sbjct: 115 SDFTPAVHASLDKFLASVSTVLTSKYR 141

>PIR2:A29702 hemoglobin alpha chain - pallid bat
          Length = 141

 Score =  235 bits (601), Expect = 4e-62
 Identities = 40/147 (27%), Positives = 60/147 (40%), Gaps = 6/147 (4%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLS  +   V   W KV      +G + L R+F S P T   F  F      A++K
Sbjct: 1   VLSPADKTNVKAAWDKVGGHAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG--- 57

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
              HG  V  ALG  +         L  L+  HA K ++     + +S  ++  L   HP
Sbjct: 58  ---HGKKVGDALGNAVAHMDDLPGALSALSDLHAYKLRVDPVNFKLLSHCLLVTLACHHP 114

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYK 147
           GDF      +++K L      + +KY+
Sbjct: 115 GDFTPAVHASLDKFLASVSTVLVSKYR 141

///////////////////////////////////////////////////////////////////////////////
>PIR1:HAEMA hemoglobin alpha chain - Amazon manatee
          Length = 141

 Score =  227 bits (581), Expect = 7e-60
 Identities = 35/147 (23%), Positives = 57/147 (37%), Gaps = 6/147 (4%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLS+ +   V   W K+      +G + L R+F S P T   F  F           S
Sbjct: 1   VLSDEDKTNVKTFWGKIGTHTGEYGGEALERMFLSFPTTKTYFPHFDLSH------GSGQ 54

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
           +K HG  V  AL   +         L  L+  HA + ++     + +S  ++  L S
Sbjct: 55  IKAHGKKVADALTRAVGHLEDLPGTLSELSDLHAHRLRVDPVNFKLLSHCLLVTLSSHLR 114

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYK 147
            DF      +++K L      + +KY+
Sbjct: 115 EDFTPSVHASLDKFLSSVSTVLTSKYR 141

///////////////////////////////////////////////////////////////////////////////
>PIR1:HAMQP hemoglobin alpha chain - hanuman langur
          Length = 141

 Score =  227 bits (580), Expect = 8e-60
 Identities = 36/147 (24%), Positives = 58/147 (38%), Gaps = 6/147 (4%)

Query: 1   VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED 60
           VLS  +   V   W KV      +G + L R+F S P T   F  F      A++K
Sbjct: 1   VLSPADKTNVKAAWGKVGGHGGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKG--- 57

Query: 61  LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP 120
              HG  V  AL   +         L  L+  HA K ++     + +S  ++  L +  P
Sbjct: 58  ---HGKKVADALTNAVAHVDDMPHALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP 114

Query: 121 GDFGADAQGAMNKALELFRKDIAAKYK 147
            +F      +++K L      + +KY+
Sbjct: 115 AEFTPAVHASLDKFLASVSTVLTSKYR 141

///////////////////////////////////////////////////////////////////////////////
  Database: pir
    Posted date:  Aug 27, 2001  6:21 PM
  Number of letters in database: 76,174,552
  Number of sequences in database:  219,241

Lambda     K      H
   0.316    0.196    0.662

Lambda     K      H
   0.267   0.0601    0.140

Matrix: BLOSUM62
Gap Penalties: Existence: 11, Extension: 1
Number of Hits to DB: 66,397,381
Number of Sequences: 219241
Number of extensions: 3896070
Number of successful extensions: 10016
Number of sequences better than 10.0: 1540
Number of HSP's better than 10.0 without gapping: 1363
Number of HSP's successfully gapped in prelim test: 192
Number of HSP's that attempted gapping in prelim test: 7585
Number of HSP's gapped (non-prelim): 1593
length of query: 153
length of database: 76,174,552
effective HSP length: 102
effective length of query: 51
effective length of database: 53,811,970
effective search space: 2744410470
effective search space used: 2744410470
T: 11
A: 40
X1: 16 ( 7.3 bits)
X2: 38 (14.6 bits)
X3: 64 (24.7 bits)
S1: 41 (21.0 bits)
S2: 63 (28.3 bits)

The PSIBLAST output is a list file that is suitable for input to any GCG program that allows indirect file specifications. For information about indirect file specification, see Chapter 2 of the User's Guide, Using Sequence Files and Databases.
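If you want to post-process the list portion of the output outside of GCG, the entries can be read with a short script. The following is a minimal sketch in Python, not part of the package. The two regular expressions are assumptions inferred only from the sample output shown above (a name line with optional Begin and End values, followed by a "!" description line ending in the bit score and E value), so check them against your own output before relying on them.

```python
import re
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    name: str
    begin: Optional[int]
    end: Optional[int]
    description: str
    bits: int
    evalue: float


# Patterns inferred from the sample output; they are assumptions,
# not a published grammar for the list format.
NAME_LINE = re.compile(r"^(\S+:\S+)(?:\s+Begin:\s*(\d+)\s+End:\s*(\d+))?\s*$")
DESC_LINE = re.compile(r"^!(.*?)\s+(\d+)\s+(\S+)\s*$")

SAMPLE = """\
PIR1:MYWHP  Begin: 1 End: 153
!myoglobin [validated] - sperm whale                        268  3e-72
PIR2:S20270  Begin: 3 End: 145
!hemoglobin alpha chain - Antarctic dragonfish (Gymno...     39  0.004
PIR2:A29392
!hemoglobin alpha chain - Indian short-nosed fruit bat      235  4e-62
"""


def parse_hits(text: str) -> List[Hit]:
    """Collect (name line, description line) pairs from the list section."""
    hits = []
    pending = None  # match object for the most recent name line
    for line in text.splitlines():
        name = NAME_LINE.match(line)
        if name:
            pending = name
            continue
        desc = DESC_LINE.match(line)
        if pending and desc:
            raw_e = desc.group(3)
            # Defensive: a value printed as "e-100" needs a leading "1" for float().
            evalue = float("1" + raw_e if raw_e.startswith("e") else raw_e)
            begin, end = pending.group(2), pending.group(3)
            hits.append(Hit(pending.group(1),
                            int(begin) if begin else None,
                            int(end) if end else None,
                            desc.group(1).strip(), int(desc.group(2)), evalue))
        pending = None  # any other line ends the current entry
    return hits


if __name__ == "__main__":
    for hit in parse_hits(SAMPLE):
        if hit.evalue <= 0.005:  # the default PSSM inclusion threshold
            print(hit.name, hit.bits, hit.evalue)
```

Entries from later rounds have no Begin and End values in the sample, which is why those fields are optional in the sketch.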

INTERPRETING OUTPUT

Bit Score

Each aligned segment pair has a normalized score expressed in bits that lets you estimate the magnitude of the search space you would have to look through before you would expect to find an HSP score as good as or better than this one by chance. If the bit score is 30, you would have to score, on average, about 1 billion independent segment pairs (2^30) to find a score this good by chance. Each additional bit doubles the size of the search space. This bit score represents a probability; one over two raised to this power is the probability of finding such a segment by chance. Bit scores represent a probability level for sequence comparisons that is independent of the size of the search.
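The arithmetic in the paragraph above can be checked with a few lines of Python. This is only an illustrative sketch; the function names are made up for this example and are not part of PSIBLAST.

```python
def chance_probability(bit_score: float) -> float:
    """Probability of one segment pair scoring this well by chance: 1 / 2**bits."""
    return 2.0 ** -bit_score


def pairs_needed(bit_score: float) -> float:
    """Average number of independent segment pairs to examine before one
    scores this well by chance: 2**bits."""
    return 2.0 ** bit_score


if __name__ == "__main__":
    print(pairs_needed(30))                     # 1073741824.0, about 1 billion
    print(pairs_needed(31) / pairs_needed(30))  # 2.0, one more bit doubles the space
    print(chance_probability(30))
```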

The size of the search space is proportional to the product of the query sequence length times the sum of the lengths of the sequences in the database. This product, referred to as N in Altschul's publications, is multiplied by a coefficient K to get the size of the search space. When searching protein databases with protein queries, K is about 0.13. PSIBLAST uses estimates of K produced before it runs by random simulation (Altschul & Gish, Methods in Enzymology 266: 460-480 (1996)).

E Value

There is a probability associated with each pairwise comparison in the list and with each segment pair alignment. The number shown in the list is the probability that you would observe a score or group of scores as high as the observed score purely by chance when you do a search against a database of this size.

An ideal search would find hits that go from extremely unlikely to ones whose best scores should have occurred by chance alone (that is, with probabilities approaching 1.0).

PSIBLAST Parameters

At the end of the output is a listing of parameter settings along with some trace information about the search. Some of these parameters are described in this document, but to get more complete documentation on these parameters, look at the BLAST release notes on the World Wide Web at

http://www.ncbi.nlm.nih.gov/BLAST/newblast.html.

INPUT FILES

PSIBLAST accepts any number of protein sequences as input. The search set is a specially formatted database. See the GCGToBLAST entry in the Program Manual for information on how to create a local database that PSIBLAST can search from a set of sequences in GCG format.

RELATED PROGRAMS

BLAST searches one or more nucleic acid or protein databases for sequences similar to one or more query sequences of any type. BLAST can produce gapped alignments for the matches it finds.

NetBLAST searches for sequences similar to a query sequence. The query and the database searched can be either peptide or nucleic acid in any combination. NetBLAST can search only databases maintained at the National Center for Biotechnology Information (NCBI) in Bethesda, Maryland, USA.

GCGToBLAST combines any set of GCG sequences into a database that you can search with BLAST.

FastA does a Pearson and Lipman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). For nucleotide searches, FastA may be more sensitive than BLAST.

TFastA does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences. TFastA translates the nucleotide sequences in all six reading frames before performing the comparison. It is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

FastX does a Pearson and Lipman search for similarity between a nucleotide query sequence and a group of protein sequences, taking frameshifts into account. FastX translates both strands of the nucleic sequence before performing the comparison. It is designed to answer the question, "What implied protein sequences in my nucleic acid sequence are similar to sequences in a protein database?"

TFastX does a Pearson and Lipman search for similarity between a protein query sequence and any group of nucleotide sequences, taking frameshifts into account. It is designed to be a replacement for TFastA, and like TFastA, it is designed to answer the question, "What implied protein sequences in a nucleotide sequence database are similar to my protein sequence?"

SSearch does a rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type (nucleic acid or protein). This may be the most sensitive method available for similarity searches. Compared to BLAST and FastA, it can be very slow.

FrameSearch searches a group of protein sequences for similarity to one or more nucleotide query sequences, or searches a group of nucleotide sequences for similarity to one or more protein query sequences. For each sequence comparison, the program finds an optimal alignment between the protein sequence and all possible codons on each strand of the nucleotide sequence. Optimal alignments may include reading frame shifts.

WordSearch identifies sequences in the database that share large numbers of common words in the same register of comparison with your query sequence. The output of WordSearch can be displayed with Segments.

ProfileSearch and MotifSearch use a profile (derived from a set of aligned sequences) instead of a query sequence to search a collection of sequences.

HmmerSearch uses a profile hidden Markov model as a query to search a sequence database to find sequences similar to the family from which the profile HMM was built. Profile HMMs can be created using HmmerBuild.

FindPatterns uses a pattern described by a regular expression to search a collection of sequences. Motifs looks for sequence motifs by searching through proteins for the patterns defined in the PROSITE Dictionary of Protein Sites and Patterns. Motifs can display an abstract of the current literature on each of the motifs it finds.

RESTRICTIONS

You can only use protein queries and protein databases.

You cannot specify more than one query sequence if you are using the -REStorecheckpoint option.

Checkpoint files created using -SAVcheckpoint are platform-specific binary files. For this reason, checkpoint files created on one operating system will not work correctly if specified using -REStorecheckpoint when running PSIBLAST on a different type of system.

When restoring a checkpoint file you must use the exact same query sequence as was used for the search that produced the checkpoint file.

The query sequence must be present in a multiple alignment used to jumpstart a search. Both copies of the query must have the same number of sequence characters; however, they may differ in the numbers and positions of gaps.

A jumpstart alignment may not have more than 500 sequences and the total length of the alignment (including gaps) multiplied by the number of sequences may not exceed 1,000,000.
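Before submitting a jumpstart alignment, you may want to check it against the limits just described. The following Python sketch is one way to do so, under stated assumptions: the alignment is already loaded as a list of equal-length row strings, the gap symbols are ".", "-", and "~", and the query is considered present when its residues match a row exactly once gaps are removed (a stricter test than comparing character counts). The function and constant names are hypothetical.

```python
from typing import List

MAX_SEQUENCES = 500      # limit on the number of aligned sequences
MAX_CELLS = 1_000_000    # limit on alignment length (with gaps) x sequence count
GAP_CHARS = ".-~"        # assumed gap symbols; adjust for your alignment format


def ungapped(row: str) -> str:
    """Return the residues of an alignment row with gap symbols removed."""
    return "".join(ch for ch in row if ch not in GAP_CHARS).upper()


def check_jumpstart(rows: List[str], query: str) -> List[str]:
    """Return a list of problems; an empty list means the limits are met."""
    if not rows:
        return ["alignment is empty"]
    problems = []
    length = max(len(row) for row in rows)
    if len(rows) > MAX_SEQUENCES:
        problems.append(f"{len(rows)} sequences exceeds the limit of {MAX_SEQUENCES}")
    if length * len(rows) > MAX_CELLS:
        problems.append(
            f"length x sequences = {length * len(rows)} exceeds {MAX_CELLS}")
    if ungapped(query) not in {ungapped(row) for row in rows}:
        problems.append("query sequence is not present in the alignment")
    return problems


if __name__ == "__main__":
    print(check_jumpstart(["VLSE.GEW", "VLSPADKT"], "VLSEGEW"))  # []
```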

You can only restore a single checkpoint file with a single run of PSIBLAST. You can only specify a single jumpstart multiple alignment with a single run of PSIBLAST.

Because of the way PSIBLAST must estimate certain statistical parameters (see the ALGORITHM topic in the BLAST chapter), the number of scoring matrices available for use with PSIBLAST is limited. Currently, valid choices for the -MATRix parameter are BLOSUM62 (the default), BLOSUM45, BLOSUM80, PAM30, and PAM70.

Gap creation and gap extension penalties are supported in limited combinations depending upon which scoring matrix is in use. The following table shows the allowed combinations for amino acids. The first values listed are the defaults for each scoring matrix.





Scoring Matrix   Gap Opening Penalty   Gap Extension Penalty
--------------   -------------------   ---------------------
BLOSUM62         11                    1
                 7                     2
                 8                     2
                 9                     2
                 10                    1
                 12                    1

BLOSUM80         10                    1
                 6                     2
                 7                     2
                 8                     2
                 9                     1
                 11                    1

BLOSUM45         14                    2
                 10                    3
                 11                    3
                 12                    3
                 13                    3
                 12                    2
                 13                    2
                 15                    2
                 16                    1
                 17                    1
                 18                    1
                 19                    1

PAM30            9                     1
                 5                     2
                 6                     2
                 7                     2
                 8                     1
                 10                    1

PAM70            10                    1
                 6                     2
                 7                     2
                 8                     2
                 9                     1
                 11                    1

ALGORITHM

For the most part, the description of the BLAST search algorithm given in the BLAST chapter is applicable to PSIBLAST. There are three main characteristics that are unique to a PSIBLAST search: the use of PSSMs, iterative searching, and composition-based statistics.

PSSM-based searches use the PSSM as both the query sequence and the scoring matrix. For a given register of comparison between a PSSM and a sequence, the scores for the residues at each position in the target sequence come from the value corresponding to that residue at that position in the PSSM.
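To make the idea of position-specific scoring concrete, here is a small Python sketch. It is not the PSIBLAST implementation: the three-column PSSM, its scores, and the default score for residues missing from a column are all invented for illustration, and only ungapped registers with full overlap are considered.

```python
from typing import Dict, List, Tuple

Pssm = List[Dict[str, int]]  # one column of residue scores per query position

# Toy PSSM with made-up scores for a three-residue query.
TOY_PSSM: Pssm = [
    {"V": 5, "L": 2},
    {"L": 4, "I": 3},
    {"S": 4, "T": 1},
]


def score_at_register(pssm: Pssm, target: str, offset: int, default: int = -4) -> int:
    """Ungapped score when PSSM column i is aligned with target[offset + i]."""
    return sum(column.get(target[offset + i], default)
               for i, column in enumerate(pssm))


def best_register(pssm: Pssm, target: str) -> Tuple[int, int]:
    """Return (offset, score) of the best-scoring full-overlap register."""
    if len(target) < len(pssm):
        raise ValueError("target is shorter than the PSSM")
    offsets = range(len(target) - len(pssm) + 1)
    return max(((o, score_at_register(pssm, target, o)) for o in offsets),
               key=lambda pair: pair[1])


if __name__ == "__main__":
    print(best_register(TOY_PSSM, "AVLSG"))  # (1, 13)
```

Note that the same residue can score differently depending on the column it is aligned with, which is the difference from a fixed matrix such as BLOSUM62.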

Building PSSMS

After each search round, high-scoring sequences are used to create a multiple alignment that is then used to calculate match scores for the PSSM. When building the PSSMs, part of each score is based upon observed amino acid frequencies in the multiple alignment, and part is based on prior knowledge of amino acid substitutability. The prior information, represented as "pseudocounts", is derived from a standard scoring matrix, such as BLOSUM62. Pseudocounts are particularly useful when the sequences included in the multiple alignments do not constitute an adequate sample of the protein family that they represent.

You can control the relative contribution of the alignments and pseudocounts with the pseudocount constant.
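The effect of mixing observed frequencies with pseudocounts can be illustrated with a simplified Python sketch. It is an assumption-laden illustration rather than the procedure PSIBLAST uses (see Altschul, et al. (1997) for that): the weights alpha and beta, the way the prior frequencies are supplied, and the two-letter alphabet in the demonstration are all chosen for this example. Here beta plays the role of the pseudocount constant.

```python
import math
from typing import Dict


def column_scores(observed: Dict[str, float],
                  prior: Dict[str, float],
                  background: Dict[str, float],
                  alpha: float,
                  beta: float) -> Dict[str, float]:
    """Log-odds scores (in bits) for one alignment column.

    observed   -- residue frequencies seen in the alignment column
    prior      -- pseudocount frequencies (prior knowledge)
    background -- overall residue frequencies
    alpha/beta -- relative weights of observed data and pseudocounts
    """
    scores = {}
    for residue, p in background.items():
        q = (alpha * observed.get(residue, 0.0)
             + beta * prior.get(residue, 0.0)) / (alpha + beta)
        scores[residue] = math.log2(q / p) if q > 0 else float("-inf")
    return scores


if __name__ == "__main__":
    background = {"A": 0.5, "B": 0.5}
    observed = {"A": 1.0}            # only A was seen in this column
    prior = {"A": 0.5, "B": 0.5}
    print(column_scores(observed, prior, background, alpha=1, beta=0))
    print(column_scores(observed, prior, background, alpha=1, beta=1))
```

With beta set to zero, a residue never observed in the column gets a score of negative infinity; with pseudocounts it gets a finite penalty instead, which is why they help when the alignment is a small sample of the family.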

Composition-based Statistics

PSIBLAST differs from ordinary blastp by taking the amino acid compositions of the query and database sequences into account when computing E-values. This is done because, for gapped alignments, the precomputed lambda and K values used by blastp are based upon comparisons of a large number of "random protein sequences" generated using standard amino acid frequencies. With this approach, it is possible for the lambda values to be greater than is warranted for a pair of sequences under consideration, especially when the sequences have a similar, slightly biased amino acid composition. This can lead to a calculated E-value that is significantly smaller (i.e. better) than is justified. The same problem can arise when using PSSM-based comparisons.

The specific method used to take composition into account is detailed in Schaffer, et al. Nucleic Acids Res. 29(14): 2994-3005 (2001).

CONSIDERATIONS

Specifying the Number of Rounds to Iterate

When run with the default settings, PSIBLAST will perform two search rounds, the first of which is a blastp-style search. You may specify up to ten iterations, which will cause PSIBLAST to perform up to ten rounds. However, if after any round of searching no new matches were found, no more iterations are performed (a condition known as "convergence"). If you specify -ROUNDs=0, then PSIBLAST will iterate until convergence occurs. Usually, there is little to be gained by specifying more than five iterations, because the chance of finding false positive matches increases with the number of search rounds. Because each search round takes as long as a single equivalent run of BLAST, you should consider breaking the job into a series of low-round searches, saving the PSSM in a checkpoint file at each step. Then, upon examination of the output, you can decide whether to restore the PSSM and continue searching.
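The stopping rule described above can be sketched as a short loop. This Python example is schematic only: search_round stands in for one complete search round and is a hypothetical callback, the toy data are invented, and max_rounds=0 is used here to mirror the behavior described for -ROUNDs=0.

```python
from typing import Callable, Dict, Set, Tuple


def iterate_until_convergence(
    search_round: Callable[[Set[str]], Set[str]],
    max_rounds: int = 2,
) -> Tuple[Set[str], int, bool]:
    """Run rounds until no new sequences appear or max_rounds is reached.

    search_round receives the sequences currently in the model and returns
    the sequences found in this round. max_rounds=0 means "no round limit".
    Returns (sequences in the model, rounds run, whether the search converged).
    """
    included: Set[str] = set()
    rounds = 0
    while max_rounds == 0 or rounds < max_rounds:
        hits = search_round(included)
        rounds += 1
        new = hits - included
        if not new:              # convergence: nothing new was found
            return included, rounds, True
        included |= new
    return included, rounds, False


# Toy stand-in for a search: each sequence in the model reveals its neighbors.
NEIGHBORS: Dict[str, Set[str]] = {"a": {"b"}, "b": {"c"}}


def toy_round(included: Set[str]) -> Set[str]:
    found = {"q", "a"}           # what a plain first-round search would find
    for name in included:
        found |= NEIGHBORS.get(name, set())
    return found


if __name__ == "__main__":
    print(iterate_until_convergence(toy_round, max_rounds=2))
    print(iterate_until_convergence(toy_round, max_rounds=0))
```

In the toy data the search converges in the fourth round, so a two-round limit stops before distant member "c" is found; this is the trade-off between run time and sensitivity that the paragraph above describes.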

E-values change when PSSMs are used

Do not expect E-values for a given database sequence to remain constant between search rounds. This is particularly the case between the first and second search rounds. In the first search round, matches between query and database sequences are scored using a static scoring matrix. In contrast, successive search rounds determine scores by comparing database sequences to the PSSM. In addition, if more sequences are added to the set used to make PSSMs as the search iterates, the scores for matching sequences may change. It is not possible to predict whether the score for an individual sequence will increase or decrease between search rounds.

Save PSSMS in checkpoint files

The -SAVecheckpoint parameter allows you to save the PSSM and specify it in a later search using the -REStorecheckpoint parameter. This is particularly useful when you wish to change the search conditions. For example, you could search a database using a PSSM that was based on sequences found by searching a different database. The main restrictions to observe are: 1) the exact same query sequence used when the PSSM was created must be used whenever the checkpoint file is restored; and 2) the same operating system must be used for all searches involving a given checkpoint file.

Using Multiple Alignments to Jumpstart PSIBLAST

The composition of the first PSSM that is built tends to guide the direction of the search, yet the multiple alignment scheme used by PSIBLAST has some drawbacks compared to dedicated multiple alignment approaches such as the one used by PILEUP. For this reason, you may wish to create a multiple alignment and then specify it to PSIBLAST using the -JUMPstart parameter. This has the additional benefit of allowing you to use a "seed" PSSM that is not based on the content of the target database, which might be useful when searching different databases.

Bit Scores and the Size of the Search

Altschul has shown that for sequences that have diverged by a certain amount, there is an informativeness (or ability to discriminate between chance scores and significant scores) associated with each residue pair in the segment pair. This informativeness is the amount of information obtainable from each residue pair in a real alignment that can be used to distinguish the real alignment from a random one. This informativeness can be expressed in bits. The sum of the information available from each residue pair in a segment is the segment pair's score in bits. Such scores are intuitively understandable as the significance of a segment pair score. To express such scores as a fraction you would divide 1 by 2 to the number of bits in the score. For example, if a segment pair has a bit-score of 16, then the appropriate fraction (1/2^16 = 1/65,536) would suggest that you should see a score this high by chance about once for every 65,000 independent segment pairs you examine.

For nucleotide sequences that have not diverged, there should be an informativeness of about 2 bits per nucleotide pair. For protein sequences that have not diverged, the informativeness should be slightly over 4 bits per amino acid pair. (The informativeness per pair goes down as the sequences diverge and a segment pair score is maximally informative only when a scoring matrix appropriate to the extent of divergence between the sequences is used to calculate the score.)

The bit scores are absolute, but the expectation of finding any particular score depends on the size of the search space. The number of places where a segment pair might originate is proportional to the product of the length of the query times the sum of the lengths of all the sequences searched. This product is multiplied by a coefficient K to get the size of the search space. When searching protein databases with protein queries, K is approximately 0.13.

For a query sequence of length 300 aa searching a database of 12 million residues, the size of the search space would be 300 x 12,000,000 x 0.13 or 468,000,000. For a search this size, a score that only occurs once in every 65,000 potential segment pairs (that is, with a bit score of 16) would be expected to occur about 7,200 times by chance alone.
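The worked example above can be reproduced in Python. This is a sketch of the arithmetic only; the function names are invented, and K = 0.13 is the approximate value quoted in the text rather than a value PSIBLAST reports for any particular search.

```python
K_PROTEIN = 0.13  # approximate coefficient for protein-protein searches


def search_space(query_length: int, database_residues: int,
                 k: float = K_PROTEIN) -> float:
    """Size of the search space: query length x database length x K."""
    return query_length * database_residues * k


def expected_chance_hits(space: float, bit_score: float) -> float:
    """Segment pairs expected to reach this bit score by chance."""
    return space / 2.0 ** bit_score


if __name__ == "__main__":
    space = search_space(300, 12_000_000)
    print(round(space))                            # 468000000
    print(round(expected_chance_hits(space, 16)))  # 7141
```

Using the exact value 2^16 = 65,536 gives about 7,100 chance occurrences; the figure of about 7,200 in the text comes from rounding the divisor to 65,000.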

If the database being searched is highly redundant (as it might be if it contained several hundred homologous cytochromes), then the size of the search space calculated by these methods will overestimate the size of the real search space.

Increasing Program Speed Using Multithreading

This program is multithreaded. It has the potential to run faster on a machine equipped with multiple processors because different parts of the analysis can be run in parallel on different processors. By default, the program assumes you have one processor, so the analysis is performed using one thread. You can use -PROCessors to increase the number of threads up to the number of physical processors on the computer.

Under ideal conditions, the increase in speed is roughly linear with the number of processors used. But conditions are rarely ideal. If your computer is heavily used, competition for the processors can reduce the program's performance. In such an environment, try to run multithreaded programs during times when the load on the system is light.

As the number of threads increases, the amount of memory required increases substantially. You may need to ask your system administrator to increase the memory quota for your account if you want to use more than two threads.

Never use -PROCessors to set the number of threads higher than the number of physical processors that the machine has -- it does not increase program performance, but instead uses up a lot of memory needlessly and makes it harder for other users on the system to get processor time. Ask your system administrator how many processors your computer has if you aren't sure.

When Blastpgp Produces No Output

You may see an error indicating that blastpgp produced no output (blastpgp is the name of the PSIBLAST executable provided by NCBI). One of the possible causes of this condition is the presence of a file in your home directory called ".ncbirc" which contains an invalid path to the NCBI data directory. The NCBI data directory should contain seqcode.val, gc.code, BLOSUM62, and perhaps some other data files. If your home directory does indeed contain such a file, we recommend that you either rename it (the safest option), edit it to update the path to the NCBI data directory (this takes some effort, but that path is contained in the logical name "NCBI"), or delete it (the simplest option). Your system administrator should be able to help you do this if you have trouble, or you may contact support at support-us@accelrys.com.
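A quick way to test whether a ".ncbirc" file is the culprit is to check that the directory it names exists and holds the data files listed above. The Python sketch below assumes the file uses an INI-style layout with an [NCBI] section containing a Data= entry; that layout is an assumption of this example, so inspect your own file first. The function name is hypothetical.

```python
import configparser
import os
import tempfile
from typing import List

REQUIRED_FILES = ("seqcode.val", "gc.code", "BLOSUM62")  # named in the text above


def check_ncbirc(ncbirc_path: str) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    if not os.path.exists(ncbirc_path):
        return []  # no .ncbirc at all, so it cannot be the cause
    parser = configparser.ConfigParser(interpolation=None)
    try:
        parser.read(ncbirc_path)
    except configparser.Error as exc:
        return [f"cannot parse {ncbirc_path}: {exc}"]
    # Assumed layout: an [NCBI] section holding a Data= entry.
    data_dir = parser.get("NCBI", "Data", fallback=None)
    if data_dir is None:
        return ["no Data entry in an [NCBI] section"]
    if not os.path.isdir(data_dir):
        return [f"data directory does not exist: {data_dir}"]
    return [f"missing data file: {name}" for name in REQUIRED_FILES
            if not os.path.isfile(os.path.join(data_dir, name))]


if __name__ == "__main__":
    # Self-contained demonstration using a temporary, incomplete data directory.
    with tempfile.TemporaryDirectory() as tmp:
        rc_path = os.path.join(tmp, ".ncbirc")
        with open(rc_path, "w") as handle:
            handle.write(f"[NCBI]\nData={tmp}\n")
        print(check_ncbirc(rc_path))  # reports the three missing data files
```

To check your own file, call check_ncbirc(os.path.expanduser("~/.ncbirc")).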

SUGGESTIONS

Using Checkpoint files with BLAST

Checkpoint files created with PSIBLAST can be specified to BLAST using -REStorecheckpoint in order to perform single-round, PSSM-based searches of nucleotide databases. The same query and filter settings must be used for both the PSIBLAST and BLAST searches.

Jumpstarting

If the alignment used to jumpstart a search is in an MSF or RSF file, then you should consider specifying the query sequence from the same file. For example:


% psiblast -in1=calm.msf{calmhuman} -jump=calm.msf{*}

You can use -JUMPstart with multiple query sequences as long as they are all present in the multiple alignment. (You cannot specify more than one multiple alignment with a single run of PSIBLAST.)

List Size Limit

A list size that is too small to display all the significant hits is a common problem. To see the unlisted hits you must run the search again with the list size limit set high enough to include everything significant.

Segment Pair Alignment Limit

For each round, PSIBLAST displays alignments of segment pairs from the top 250 sequences in the list. You can adjust this limit with -ALIgnments. PSIBLAST will not show alignments for sequences not present in the list.
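The interaction between the list size and the alignment limit can be sketched as follows. This is an illustrative Python function (the name alignments_shown is hypothetical) that uses the default values given in this document: alignments are shown only for listed sequences, so the smaller limit wins.

```python
def alignments_shown(hits: int, listsize: int = 500, alignments: int = 250) -> int:
    """Estimate how many sequences have alignments displayed in one round."""
    # Only sequences that fit in the list can appear at all.
    listed = min(hits, listsize)
    # Alignments are shown for at most -ALIgnments of the listed sequences.
    return min(listed, alignments)


print(alignments_shown(1000))                # limited by -ALIgnments
print(alignments_shown(1000, listsize=100))  # limited by -LIStsize
```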

Sensitivity

PSIBLAST uses a word size of three for proteins, which is appropriate for a wide range of searches, but you can adjust the synonym threshold T downwards to two in order to increase sensitivity at the price of speed. Read the PARAMETER REFERENCE topic for more information on -HITEXTTHRESHold and -EXPect.

Batch Queue

Using BLAST to search a large local database can take a long time. You may want to run searches in the batch queue. You can specify that this program run at a later time in the batch queue by using -BATch. Run this way, the program prompts you for all the required parameters and then automatically submits itself to the batch or at queue. For more information, see "Using the Batch Queue" in Chapter 3, Using Programs in the User's Guide.

E-Values compared to BLAST

If a single round of searching is specified, then PSIBLAST just performs a blastp search. However, the reported scores and E-values will probably differ from those generated by performing a blastp search with BLAST. This is because PSIBLAST computes the statistical significance of a match by taking into account the composition of the query and database sequences, whereas BLAST does not. Please refer to Schaffer, et al. (Nucleic Acids Res. 29(14): 2994-3005 (2001)) for a detailed discussion of composition-based statistics.

FILTERING OUT LOW COMPLEXITY SEQUENCES

PSIBLAST always filters out regions of low complexity from database sequences using the SEG filter program (Wootton and Federhen, Computers in Chemistry 17: 149-163 (1993); Wootton and Federhen, Methods in Enzymology 266: 554-571 (1996)). For a general discussion of the role of filtering in search strategies, see Altschul et al., Nature Genetics 6: 119-129 (1994).

Short repeats and low complexity sequences, such as glutamine-rich regions, confound most database searching methods. For PSIBLAST, the random model against which the significance of segment pair scores is evaluated assumes that at each position, each residue has a probability of occurring which is proportional to its composition in the database as a whole. Low complexity or highly repetitive sequences are inconsistent with this assumption.

Amino acid characters in regions of low complexity sequence are substituted with the letter X. Here is an example of a sequence aligned to a filtered copy of itself to show which parts are filtered out:

  1 MAAKIFCLIMLLGLSASAATASIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60
  1 MAAKIFCLIMXXXXXXXXXXXXIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ 60

 61 AIAAGILPLSPLFLQQSSALLQQLPLVHLLAQNIRAQQLQQLVLANLAAYSQQQQFLPFN 120
 61 AIAAGIXXXXXXXXXXXXXXXXXXXXXXXXXXNIRXXXXXXXXXXXXXXYSQQQQFLPFN 120

121 QLAALNSAAYLQQQQLLPFSQLAAAYPRQFLPFNQLAALNSHAYVQQQQLLPFSQLAAVS 180
121 QXXXXXXXXXXXXXXXXPFSQLAAAYPRQFLPFNQLAALNSHAYVXXXXXXPFSQLAAVS 180

181 PAAFLTQQQLLPFYLHTAPNVGTLLQLQQLLPFDQLALTNPAAFYQQPIIGGALF 235
181 PAAFLTQQQLLPFYLHTAPNVGTXXXXXXXXXXXXXXXTNPAAFYQQPIIGGALF 235
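If you have both the original and the filtered copy of a sequence, as in the comparison above, the masked regions can be located programmatically. The following Python sketch is an illustration only; the function name masked_ranges is hypothetical, and it assumes the two copies have the same length.

```python
def masked_ranges(original: str, filtered: str) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) ranges that were replaced by X."""
    if len(original) != len(filtered):
        raise ValueError("sequences must have the same length")
    ranges = []
    start = None
    for pos, (a, b) in enumerate(zip(original, filtered), start=1):
        masked = b == "X" and a != "X"
        if masked and start is None:
            start = pos                      # a masked run begins here
        elif not masked and start is not None:
            ranges.append((start, pos - 1))  # the run ended at the previous position
            start = None
    if start is not None:
        ranges.append((start, len(original)))
    return ranges


# First line of the example above; residues 11-22 are masked in the filtered copy.
original = "MAAKIFCLIMLLGLSASAATASIFPQCSQAPIASLLPPYLSPAMSSVCENPILLPYRIQQ"
filtered = original[:10] + "X" * 12 + original[22:]
print(masked_ranges(original, filtered))
```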

By default PSIBLAST does not filter query sequences, in contrast to BLAST which does. You can turn query sequence filtering on using -FILter but this should be done only when you plan to use a PSSM from PSIBLAST with BLAST to perform a PSI-TBLASTN search. An alternative approach in such cases is to use -NOFILter when running the PSI-TBLASTN search.

You can also mask selected positions in the query by using -LOWercasemask, which replaces lowercase letters in the query sequence with the letter X.
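The effect of lowercase masking on a query can be previewed with a short Python sketch. This only imitates the replacement described above and is not how the program itself is implemented; the function name lowercase_mask is hypothetical.

```python
def lowercase_mask(query: str) -> str:
    """Replace every lowercase letter with X, leaving other characters alone."""
    return "".join("X" if ch.islower() else ch for ch in query)


# Residues 5-10 were typed in lowercase to mark them for masking.
print(lowercase_mask("MAAKifclimLLGL"))
```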

AMINO ACID SCORING

For the first search round, PSIBLAST normally uses the BLOSUM62 scoring matrix from Henikoff and Henikoff (Proc. Natl. Acad. Sci. USA 89; 10915-10919 (1992)) whenever the sequences being compared are proteins (including cases where nucleotide databases or query sequences are translated into protein sequences before comparison). You can use the BLOSUM45 or BLOSUM80 matrices, or the more traditional PAM70 and PAM30 matrices, with -MATrix, for example -MATrix=PAM30. Each matrix is most sensitive for finding homologs at the corresponding PAM distance. The seminal paper on this subject is Stephen Altschul's "Amino acid substitution matrices from an information theoretic perspective" (J. Mol. Biol. 219; 555-565 (1991)). If you are new to this literature, an easier place to start reading might be Altschul et al., "Issues in searching molecular sequence databases" (Nature Genetics, 6; 119-129 (1994)).

COMMAND-LINE SUMMARY

All parameters for this program may be added to the command line. Use -CHEck to view the summary below and to specify parameters before the program executes. In the summary below, the capitalized letters in the parameter names are the letters that you must type in order to use the parameter. Square brackets ([ and ]) enclose parameter values that are optional. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.


Minimal Syntax: % psiblast [-INfile1=]pir:mywhp  -Default

Prompted Parameters:

-BEGin=1 -END=153         sets the ranges of interest in  query sequences
[-INfile2=]pir            specifies database(s) to search
-EXPect=10.0              ignores scores that would occur by chance
                             more than 10 times
-THRESHold=0.005          sets e-value threshold for inclusion in a PSSM
-ROUNDs=2                 sets the number of iterations (0 for no imposed limit)
-LIStsize=500             sets maximum number of sequences listed in the output
[-OUTfile=]mywhp.blastpgp names the output file

Local Data Files:

[-DATa2=blast.ldbs]         names the list of available local databases
[-DATa3=blast.sdbs]         names the list of available site-specific databases

Optional Parameters:

-ALIgnments=250               sets number of sequences for which to show
                                 alignments
-PROCessors=1                 sets the number of processors to use
-GAPweight=0                  sets gap creation penalty
-LENgthweight=0               sets gap extension penalty
-REStorecheckpoint[=mywp.chk] read in checkpoint file
-SAVecheckpoint[=mywhp.chk]   save checkpoint file
-JUMPstart=hsp70.msf{*}       jumpstart with specified alignment
-TABle[=mywhp.psitable]       write PSSM to a file as an ASCII table
-NOFRAgments                  suppresses showing list file entries as fragments
-VIEW=0                       selects alignment view type (0-8 allowed)
-NATive                       produces unmodified BLAST2 output
-HTML                         uses HTML for output format
-FILter                       filters low complexity segments out of
                              query sequences using SEG
-LOWercasemask                masks lowercase characters in query sequence
-MATrix=blosum62              assigns the substitution matrix for proteins
-PSEudoconst=9                set relative emphasis given to pseudocounts
-SWAlign                      compute locally optimal Smith-Waterman alignments
-WORdsize=0                   sets word size (0 selects program default)
-HITEXTTHRESHold=0            sets minimum score to extend hits [T]
-HITWindow=40                 sets multiple hits window size [A]
-TRIGger=22.0                 sets number of bits to trigger gapping
-XDRopoff=0                   sets X dropoff value for gapped alignments [X2]
-BESthits=0                   sets number of best hits from a region to keep [K]
-OLDSTATistics                don't use composition-based statistics
-EFFdbsize=0                  sets effective database size (0 for real size)
-APPend="string"              appends "string" to pass-through command line
-BATch                        submits program to batch queue
-DBReport                     lists valid databases then exits

CITING BLAST

The original paper describing BLAST is Altschul, Stephen F., Gish, Warren, Miller, Webb, Myers, Eugene W., and Lipman, David J. (1990). Basic local alignment search tool. J. Mol. Biol. 215; 403-410. PSIBLAST was first described in Altschul, Stephen F., Madden, Thomas L., Schaffer, Alejandro A., Zhang, Jinghui, Zhang, Zheng, Miller, Webb, and Lipman, David J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25(17); 3389-3402.

ACKNOWLEDGEMENT

BLAST was written by Warren Gish, formerly of the National Center for Biotechnology Information (NCBI), in collaboration with Stephen Altschul, Webb Miller, Eugene Myers, David Lipman, and David States. The document you are now reading was written by John Devereux, with modifications by Ted Slater and Eric Cabot.

Blastpgp (NCBI's implementation of PSIBLAST) was written for NCBI by Tom Madden and Alejandro Schaffer. Eric Cabot developed the PSIBLAST client by extensively modifying the BLAST client written by Ted Slater for Version 10.0 of the Wisconsin Package. Some portions were taken from the original GCG Wisconsin Package BLAST client written by Scott Rose. The output post-processor for release 10.0 was written by Ron Stewart.

We are extremely grateful to Stephen Altschul, Tom Madden, Alejandro Schaffer and Warren Gish for their careful and original work on BLAST and PSIBLAST, and for their critical comments on the documentation that you are now reading. We are also very grateful to NCBI for making these programs and services available to the molecular biology community.

LOCAL DATA FILES

The files described below supply auxiliary data to this program. The program automatically reads them from a public data directory unless you either 1) have a data file with exactly the same name in your current working directory; or 2) name a file on the command line with an expression like -DATa1=myfile.dat. For more information see Chapter 4, Using Data Files in the User's Guide.

PSIBLAST reads two files, blast.ldbs (local databases), and blast.sdbs (site-specific databases). These together list the search sets in the menu. We update blast.ldbs when we send database updates to your institution. If you have sequences of local interest that you would like to search with PSIBLAST, read the documentation for GCGToBLAST to see how to create local BLAST-searchable databases, then fetch the file blast.sdbs, and add the name of the local search set so that it appears in the menu.

PARAMETER REFERENCE

You can set the parameters listed below from the command line. For more information, see "Using Program Parameters" in Chapter 3, Using Programs in the User's Guide.

Following some of the optional parameters described below is a letter or short expression in parentheses. These are the names of the corresponding parameters at the bottom of your PSIBLAST output.

-EXPect=10.0

This parameter, for which there is a prompt if you don't set it on the command line, lets you influence the number of hits in your output having scores that would be expected to have occurred by chance alone. There is nothing to prevent many biologically significant but statistically insignificant segment pairs from being screened out, so you may sometimes want to increase this parameter in order to have an opportunity to see them.
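As a rough guide to how an expectation cutoff relates to scores: in the Karlin-Altschul statistics that underlie BLAST, the expected number of chance matches with a normalized (bit) score S' is approximately m * n * 2^(-S'), where m and n are the query and database lengths. The Python sketch below uses that approximation only; it ignores the effective-length corrections and composition-based adjustments that the program applies, so its values will not match PSIBLAST output exactly. The function names are hypothetical.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E = m * n * 2**(-S') for a bit score S'."""
    return query_len * db_len * 2.0 ** (-bit_score)


def passes_expect(bit_score: float, query_len: int, db_len: int,
                  expect: float = 10.0) -> bool:
    """True if the approximate E-value is within the -EXPect cutoff."""
    return expect_value(bit_score, query_len, db_len) <= expect


# A 153-residue query against a database assumed to hold 50 million residues.
print(expect_value(40.0, 153, 50_000_000))
print(passes_expect(40.0, 153, 50_000_000))
```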

-THRESHold=0.005

After each round of searching, matches whose expectation scores are less than or equal to the specified value are used to construct the PSSM for the next round. Sequences with scores that exceed the threshold but not the setting of -EXPect will still be reported. You are prompted to set the threshold value if you do not set it on the command line.
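The relationship between -THRESHold and -EXPect can be summarized in a small Python sketch. The function name and the three labels are hypothetical; the default values are the ones given in this document.

```python
def classify_hit(evalue: float, threshold: float = 0.005,
                 expect: float = 10.0) -> str:
    """Say what happens to a match with the given expectation value."""
    if evalue <= threshold:
        return "pssm"      # reported and used to build the next PSSM
    if evalue <= expect:
        return "reported"  # reported, but not used for the PSSM
    return "ignored"       # worse than -EXPect, so not reported


for e in (0.001, 1.0, 50.0):
    print(e, classify_hit(e))
```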

-ROUNDs=2

This parameter governs the maximum number of times that the search is iterated. The cycle of searching and PSSM building will repeat until the specified number of rounds have occurred or until the search "converges" (i.e. until no more new sequences can be added to the PSSM). Setting -ROUNDs=0 causes the iterations to stop only upon convergence. Failure of a search to converge by 10 rounds suggests that the PSSM may have become "corrupted", meaning that too many unrelated sequences (i.e. false positives) have been included. You can minimize the risk of corruption by using checkpoint files with a series of search runs and low settings of the -ROUNDs parameter. See the descriptions of the -REStorecheckpoint and -SAVecheckpoint parameters for additional details.

Since the first round of searching uses a standard scoring matrix (e.g. BLOSUM62), specifying only a single round is the equivalent of using BLAST to perform a blastp search. It is, however, possible to perform a single-round, PSSM-dependent search by using either -REStorecheckpoint or -JUMPstart.
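A series of short runs linked by checkpoint files might be set up along the following lines. This Python sketch only builds command-line text from parameters documented here; it assumes that -SAVecheckpoint and -REStorecheckpoint can be combined in one run, and the file-naming scheme and function name are hypothetical. Inspect the output of each run before starting the next.

```python
def checkpoint_series(query: str, database: str, runs: int,
                      stem: str = "mywhp", rounds_per_run: int = 2) -> list[str]:
    """Build command lines for successive runs linked by checkpoint files."""
    commands = []
    for i in range(1, runs + 1):
        parts = ["psiblast", f"-INfile1={query}", f"-INfile2={database}",
                 f"-ROUNDs={rounds_per_run}"]
        if i > 1:
            # Start from the PSSM saved by the previous run.
            parts.append(f"-REStorecheckpoint={stem}_{i - 1}.chk")
        parts.append(f"-SAVecheckpoint={stem}_{i}.chk")
        parts.append(f"-OUTfile={stem}_{i}.blastpgp")
        parts.append("-Default")
        commands.append(" ".join(parts))
    return commands


for command in checkpoint_series("pir:mywhp", "pir", 3):
    print("%", command)
```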

-LIStsize=500

By default, the PSIBLAST output list file will contain up to 500 sequences (or fragments thereof, depending upon the state of -FRAgments), even if more than 500 sequences had scores above the cutoff score. The list is sorted in order of increasing probability, that is, with the most significant sequences first. Use -LIStsize to change the number of sequences in your output to any value between 0 (for blastpgp's program defaults) and 1000.

-ALIgnments=250

By default, PSIBLAST displays the alignments of HSPs from the best 250 sequences in the list. Use -ALIgnments to change the number of sequences for which alignments are shown in your output to any value between 0 and 1000.

-PROCessors=2

tells the program to use 2 threads for the database search on a multiprocessor computer. Check with your system manager for the number of processors available at your site. Never set the number of processors greater than what you have available.

-GAPweight=11

sets the penalty for adding a gap to the alignment. See the RESTRICTIONS topic for more information about setting the gap opening penalty.

-LENgthweight=1

sets the penalty for lengthening an existing gap in the alignment. See the RESTRICTIONS topic for more information about setting the gap extension penalty.

-REStorecheckpoint[=mywp.chk]

Read a checkpoint file from an earlier search and use the stored PSSM as the scoring matrix for the first search round. After the first round of searching, PSSMs are built using the normal rules. It is essential to use exactly the same query sequence as was used to construct the checkpoint file, although you do not have to search against the same database.

If you are running PSIBLAST interactively and do not specify the name of a checkpoint file with the -REStorecheckpoint parameter, you are prompted for one. You cannot specify multiple queries when using -REStorecheckpoint.

Checkpoint files have a hardware-dependent binary format. Therefore it is unlikely that you will be able to restore a checkpoint on a platform that is different from the one used when it was created.

-SAVecheckpoint[=mywhp.chk]

Save a representation of the PSSM and other details of the last search round into a file that can be used to initiate another search at a later time, possibly using different databases and parameter settings.

You can specify a filename with -SAVecheckpoint only in the case of a single query sequence. With multiple queries, or if no name is specified, checkpoint file names are based on the names of the query sequences. For example, with a query named "sw:calm_human" the checkpoint file would be named "calm_human.chk".

A checkpoint file generated from an earlier search can be used to provide the PSSM for the first round of the current search. It is essential to use exactly the same query sequence as was used to construct the checkpoint file, although you do not have to search against the same database. With a restored checkpoint file, the first search round uses the PSSM from the file as the scoring matrix. After the first search round, PSSMs are built using the normal rules.

Checkpoint files also provide a mechanism for performing successive single-round, PSSM-dependent searches, giving you an opportunity to examine the results of one search before initiating another.

-JUMPstart=hsp70.msf{*}

This option allows you to specify a group of aligned sequences that will be used to build a PSSM that is then used with the first search round. After the first round, PSSMs are built using the normal rules.

You can use any valid multiple sequence specification (e.g. MSF, RSF, list files, and database and filename wildcards) as long as it represents a set of multiply aligned sequences. If the sequences are not aligned, then PSIBLAST will probably yield incorrect results. Currently, alignments are limited to a minimum of 2 and a maximum of 500 sequences, and the product of the number of sequences times the alignment length, after endgapping, may not exceed 1 million.

The alignment must contain a sequence with the exact same content and length as the query sequence although the two copies of the query sequence may have different names. The alignment may contain gaps since positions corresponding to gaps in the alignment copy of the query sequence are simply ignored when building the PSSM.
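The limits and the query requirement described above can be checked before a run. The Python sketch below is an illustration under stated assumptions: the alignment has already been read into a dictionary of name to aligned row, the gap characters are ".", "-" and "~", the raw row length stands in for the alignment length after endgapping, and the query comparison ignores case. The function name check_jumpstart is hypothetical.

```python
def check_jumpstart(alignment: dict[str, str], query: str,
                    gap_chars: str = ".-~") -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    n = len(alignment)
    if not 2 <= n <= 500:
        problems.append(f"{n} sequences; the limits are 2 to 500")
    lengths = {len(row) for row in alignment.values()}
    if len(lengths) > 1:
        problems.append("rows differ in length; sequences are not aligned")
    width = max(lengths, default=0)
    if n * width > 1_000_000:
        problems.append("sequences times alignment length exceeds 1 million")

    def ungapped(seq: str) -> str:
        # Drop gap characters so the row can be compared with the query.
        return "".join(c for c in seq if c not in gap_chars).upper()

    if ungapped(query) not in {ungapped(row) for row in alignment.values()}:
        problems.append("query sequence is not present in the alignment")
    return problems


print(check_jumpstart({"a": "MK.LV", "b": "MKALV"}, "MKLV"))
```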

The -JUMPstart parameter is ignored if -REStore is also specified. However, in contrast to -REStore, you can use the same jumpstart alignment for searches that use multiple queries. For example, the following command lines all use valid syntax:

% psiblast -INfile1=hsp70.msf{*} -JUMPstart=hsp70.msf{*}

% psiblast -INfile1=hsp70.msf{s*} -JUMPstart=hsp70.msf{*}

% psiblast -INfile1=hsp70.msf{s*} -JUMPstart=hsp70.msf{s*}

The characters within a given column of the jumpstart alignment must all be the same case. Positions corresponding to columns that are represented with upper case characters will be scored using the standard scoring matrix (e.g. BLOSUM62) instead of the PSSM. Note: this is one of the few examples where the case of a sequence character is significant in the Wisconsin Package.
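The requirement that each column use a single case can be checked with a short Python sketch. It assumes the alignment rows are already available as equal-length strings and that non-letter characters (such as gaps) are ignored; the function name mixed_case_columns is hypothetical.

```python
def mixed_case_columns(rows: list[str]) -> list[int]:
    """Return the 1-based columns that mix uppercase and lowercase letters."""
    bad = []
    for col, chars in enumerate(zip(*rows), start=1):
        letters = [c for c in chars if c.isalpha()]
        has_upper = any(c.isupper() for c in letters)
        has_lower = any(c.islower() for c in letters)
        if has_upper and has_lower:
            bad.append(col)
    return bad


# Column 3 mixes "l" and "L", so it is reported.
print(mixed_case_columns(["MKlv", "MKLv"]))
```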

-TABle[=mywhp.psitable]

writes a text file containing a representation of the PSSM used with the final round of searching. If no filename is specified, then filenames are based on the names of the query sequence and have the extension ".psitable".

You cannot read the table into any program in the Wisconsin Package, but examining it may help you decide which regions of the query sequence to mask with -LOWercasemask before re-running the search.

-NOFRAgments

suppresses the appearance of begin and end ranges on each output list file entry based on the alignment between the entry and the query sequence.

-VIEW=0

sets the alignment view type. Acceptable values are 0 through 8, which correspond to the following:

0 = pairwise (the default)

1 = showing identities as dots

2 = showing insertions

3 = showing identities as dots and gapping for insertions

4 = gapping for insertions

5 = with endgaps and showing insertions

6 = with endgaps flat master-slave and gapping for insertions

7 = XML output

8 = tab-delimited summary table

The specification of the XML output is available from NCBI at:

ftp://ftp.ncbi.nlm.nih.gov/toolbox/xml/ncbixml.txt

Here are descriptions of the columns in the tab-delimited format:

1 = Query sequence name

2 = Database sequence name

3 = Percent of positions that are identical

4 = Alignment length

5 = Number of mismatches (alignment length - identities - gapped positions)

6 = Number of gaps of any length

7 = Start of alignment for query sequence

8 = End of alignment for query sequence

9 = Start of alignment for database sequence

10 = End of alignment for database sequence

11 = Expectation

12 = Score (bits)
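The twelve columns listed above can be read into named fields with a short Python sketch. The field names and the sample line are hypothetical and chosen only for illustration; the column order follows the list above.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    length: int
    mismatches: int
    gaps: int
    query_start: int
    query_end: int
    subject_start: int
    subject_end: int
    expect: float
    bit_score: float


def parse_hit(line: str) -> Hit:
    """Parse one line of -VIEW=8 (tab-delimited) output."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError(f"expected 12 columns, found {len(fields)}")
    q, s, pid, length, mm, gaps, qs, qe, ss, se, ev, bits = fields
    return Hit(q, s, float(pid), int(length), int(mm), int(gaps),
               int(qs), int(qe), int(ss), int(se), float(ev), float(bits))


# Made-up line for illustration only.
sample = "calm_human\tcalm_mouse\t98.66\t149\t2\t0\t1\t149\t1\t149\t1e-80\t290\n"
hit = parse_hit(sample)
print(hit.subject, hit.expect, hit.bit_score)
```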

-NATive

produces unmodified BLAST2 output.

-HTML

uses HTML format for output. This parameter has no effect if you use -VIEW=7 (XML output) or -VIEW=8 (tab-delimited output).

-FILter

filters low complexity segments out of sequences using the SEG algorithm. Filtered residues are replaced with the letter X and are ignored when calculating scores. Normally it is not necessary to filter query sequences since database sequences are always filtered. Use this parameter only when you plan on saving a PSSM in order to run a PSI-TBLASTN search with the program BLAST.

-LOWercasemask

masks lowercase characters in the query sequence by replacing them with the letter X during the search. Masked residues are ignored when calculating scores. This is one of the few cases in the Wisconsin Package where the uppercase and lowercase characters in input sequences can produce different results.

-MATrix=BLOSUM62

sets the amino acid substitution matrix to use for the first round and for pseudocounts. PSIBLAST normally uses the BLOSUM62 amino acid substitution matrix from Henikoff and Henikoff. Other valid options are BLOSUM45, BLOSUM80, PAM30, and PAM70.

-PSEudoconst=9

sets the relative emphasis given to pseudocounts derived from a scoring matrix such as BLOSUM62 versus the observed amino acid frequencies of the multiple alignment when constructing PSSMs. The relative emphasis on pseudocounts increases with a value known as the "pseudocount constant" that is set with this parameter.
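To illustrate how the pseudocount constant shifts this balance, the Python sketch below uses the weighted form given in Altschul et al. (1997): Q = (alpha * f + beta * g) / (alpha + beta), where f is the observed frequency in a column, g is the pseudocount frequency derived from the scoring matrix, beta is the pseudocount constant, and alpha reflects the diversity of the aligned sequences in the column. That this exact form is what the program computes is an assumption here, and the function name is hypothetical.

```python
def mix_frequencies(observed: list[float], pseudo: list[float],
                    alpha: float, beta: float = 9.0) -> list[float]:
    """Blend observed and pseudocount frequencies for one alignment column."""
    if alpha + beta <= 0:
        raise ValueError("alpha + beta must be positive")
    # A larger beta pulls the result toward the pseudocount frequencies.
    return [(alpha * f + beta * g) / (alpha + beta)
            for f, g in zip(observed, pseudo)]


# Two-residue toy column: one residue observed exclusively.
print(mix_frequencies([1.0, 0.0], [0.5, 0.5], alpha=9.0, beta=9.0))
```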

-SWAlign

Use the Smith-Waterman algorithm for displayed alignments and for calculating bit scores and expectation values. The default heuristic algorithm is quicker; however, it may completely miss some significant alignments and may even produce non-optimal alignments for some of the sequence similarities that are found. For purposes of speed, the full Smith-Waterman alignment is only used for matching sequences identified using the heuristic algorithm.

-WORdsize=0

sets the size of the short regions of similarity between sequences that PSIBLAST initially searches for. If -WORdsize=0, PSIBLAST uses the default value of 3. Lowering the word size to 2 results in a more sensitive search at the expense of a longer search time.

-HITEXTTHRESHold=0

sets the threshold for extending hits using the two-hit method. Words with scores at least this high can be extended as ungapped alignments.

-HITWindow=40

Sets the maximum distance allowed for two non-overlapping sequence segments on the same diagonal, when looking for matches between the query and a database sequence.

-TRIGger=22.0

sets number of bits that an initial ungapped alignment must score in order for it to be extended as a gapped alignment.

-XDRopoff=0 [X2]

sets the X2 dropoff value for gapped alignments (in bits). Gapped alignments are extended until the score drops below this value. This limits the (computationally expensive) extension of hits. Use -XDRopoff=0 for default behavior.

-BESthits=0

sets the maximum number of hits from a given region of the query sequence. Only the highest scoring hits from the region are kept. With -BESthits=0, the maximum number is set internally. This parameter can be used to counter the tendency of highly abundant, conserved regions to be so prevalent in the output that the detection of other domains would be precluded.

-OLDSTATistics

suppresses composition-based statistics. Note, however, that other subtle differences exist between blastp and PSIBLAST, so it is unlikely that E-values from the first round of a PSIBLAST search will agree with those from a blastp search run using the program BLAST.

-EFFdbsize=0

sets the effective database size. A value of 0 selects the program default.

-APPend="string"

The GCG Wisconsin Package implementation of BLAST is what is known as a "wrapper" program. After collecting your input parameters, the wrapper calls the locally-built implementation of BLAST from NCBI called blastall. If you are familiar with the interface to the blastall program as it was originally written, you can pass parameters to it directly using this parameter. Please call us if there are additional parameters you want to use with BLAST that you would like to look more like GCG parameters.

-BATch

submits the program to the batch queue for processing after prompting you for all required user inputs. Any information that would normally appear on the screen while the program is running is written into a log file. Whether that log file is deleted, printed, or saved to your current directory depends on how your system manager has set up the command that submits this program to the batch queue. All output files are written to your current directory, unless you direct the output to another directory when you specify the output file.

-DBReport

lists valid databases then exits without searching.

The release notes for PSIBLAST and BLAST can be found at

http://www.ncbi.nlm.nih.gov/BLAST/newblast.html

Printed: January 9, 2002 13:45 (1162)

Technical Support: support-us@accelrys.com
or support-eu@accelrys.com

Copyright (c) 1982-2002 Accelrys Inc. A subsidiary of Pharmacopeia, Inc. All rights reserved.

Licenses and Trademarks Wisconsin Package is a trademark and GCG and the GCG logo are registered trademarks of Accelrys Inc.

All other product names mentioned in this documentation may be trademarks, and if so, are trademarks or registered trademarks of their respective holders and are used in this documentation for identification purposes only.

Accelrys Inc.

www.accelrys.com/bio