annotateMSA

The annotateMSA script provides utilities to automatically annotate the sequence headers of a FASTA file with taxonomic information. Currently this can be done in one of two ways:

  1. For Pfam alignments, annotations can be extracted from the file pfamseq.txt (please download from: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamseq.txt.gz)

  2. For BLAST alignments, annotations can be added using the NCBI Entrez utilities provided by BioPython. The GI or accession numbers of the sequences are used to query NCBI for taxonomy information (note that this approach requires a network connection).
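
As an illustration of the second route, the sketch below queries NCBI through Bio.Entrez in batches. It is not the code used by annotateMSA: the function names, the batch size, and the choice of the GenBank XML fields GBSeq_organism and GBSeq_taxonomy are assumptions made for this example, and running fetch_taxonomy requires Biopython and a network connection.

```python
def chunked(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def fetch_taxonomy(ids, email, batch_size=100):
    """Return (accession, organism, lineage) tuples for protein IDs.

    Hypothetical helper: sends one efetch request per batch of IDs.
    """
    from Bio import Entrez  # imported lazily so chunked() works without Biopython

    Entrez.email = email  # NCBI asks callers to identify themselves
    results = []
    for batch in chunked(list(ids), batch_size):
        handle = Entrez.efetch(db="protein", id=",".join(batch),
                               rettype="gb", retmode="xml")
        try:
            records = Entrez.read(handle)
        finally:
            handle.close()
        for rec in records:
            results.append((rec.get("GBSeq_accession-version", ""),
                            rec.get("GBSeq_organism", ""),
                            rec.get("GBSeq_taxonomy", "")))
    return results
```

Batching keeps the number of network round trips low, which matters for large alignments; the batch size of 100 is an arbitrary starting point, not a documented limit.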

To extract GI or accession numbers, use the scripts alnParseGI.py or alnParseAcc.py, respectively.
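
For orientation, here is a minimal sketch of how a GI number might be pulled out of a legacy NCBI-style header of the form gi|<number>|.... It is an assumption about the header layout, not the implementation of alnParseGI.py, and the function name is hypothetical. Headers without a GI number yield "0", matching the placeholder convention described for the ID list below; accession numbers would need a different pattern.

```python
import re

# Matches "gi|<digits>" at the start of the header or after a "|" separator.
_GI_PATTERN = re.compile(r"(?:^|\|)gi\|(\d+)")


def extract_gi(header):
    """Return the GI number in a FASTA header as a string, or "0" if absent."""
    match = _GI_PATTERN.search(header.lstrip(">").strip())
    return match.group(1) if match else "0"
```

For example, extract_gi(">gi|129295|sp|P01013|OVAX_CHICK") returns "129295", while a header with no gi field returns "0".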

For both the Pfam and NCBI utilities, the process of sequence annotation can be slow (on the order of hours, particularly for NCBI Entrez with larger alignments). However, the annotation process only needs to be run once per alignment.

Keyword Arguments
-i, --input

Input sequence alignment. Default: Input_MSA.fasta

-o, --output

Output file. Default: Output_MSA.an

-a, --annot

Annotation method. Options are ‘pfam’ or ‘ncbi’. Default: ‘pfam’

-l, --idList

Required for the ‘ncbi’ method. Specifies a file containing a list of GI numbers, in the same order as the sequences in the alignment; an entry of “0” indicates that no GI number was assigned to that sequence.
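
The pairing between the ID list and the alignment can be sketched as follows. The one-ID-per-line layout, the skipping of blank lines, and both function names are assumptions for illustration; only the ordering requirement and the “0” placeholder come from the description above.

```python
def parse_id_list(lines):
    """Parse ID lines (one per line); "0" becomes None for 'no ID assigned'."""
    ids = []
    for line in lines:
        entry = line.strip()
        if not entry:
            continue  # assumption: blank lines carry no information
        ids.append(None if entry == "0" else entry)
    return ids


def pair_ids_with_headers(headers, ids):
    """Pair each header with its ID, insisting on equal lengths."""
    if len(headers) != len(ids):
        raise ValueError(
            f"{len(headers)} sequences but {len(ids)} IDs; "
            "the list must follow the alignment order one-to-one"
        )
    return list(zip(headers, ids))
```

Checking the lengths up front catches a shifted or truncated list before any slow network lookup starts.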

-g, --giList

Deprecated. Identical to ‘--idList’; retained so that the CLI stays compatible with older versions of pySCA.

-p, --pfam_seq

Location of the pfamseq.txt file. Defaults to path2pfamseq (specified at the top of scaTools.py).

-m, --delimiter

Character(s) used for separating fields in the sequence headers of the annotated output. Default: ‘|’
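
To make the role of the delimiter concrete, the sketch below joins a header and its annotation fields and splits them again. The field order (original header first, then annotations) and the function names are assumptions for this example, not a statement of the exact output format written by annotateMSA.

```python
def annotate_header(header, fields, delimiter="|"):
    """Append annotation fields to a header, separated by `delimiter`."""
    return delimiter.join([header] + list(fields))


def split_header(annotated, delimiter="|"):
    """Recover the individual fields of an annotated header."""
    return annotated.split(delimiter)
```

Splitting is only unambiguous when the delimiter does not already occur inside a field, which is one plausible reason to choose a different delimiter for headers that already contain ‘|’.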

Examples:

annotateMSA -i PF00186_full.txt -o PF00186_full.an -a 'pfam'
annotateMSA -i DHFR_PEPM3.fasta -o DHFR_PEPM3.an -a 'ncbi' -l DHFR_PEPM3.gi
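
The options documented above could be declared with argparse roughly as follows. This is a sketch built from the option names and defaults listed in this section, not the script's actual source; the function names are hypothetical, and None stands in for the path2pfamseq default, whose value is defined in scaTools.py.

```python
import argparse


def build_parser():
    """Declare the documented annotateMSA options."""
    parser = argparse.ArgumentParser(prog="annotateMSA")
    parser.add_argument("-i", "--input", default="Input_MSA.fasta")
    parser.add_argument("-o", "--output", default="Output_MSA.an")
    parser.add_argument("-a", "--annot", choices=["pfam", "ncbi"],
                        default="pfam")
    # -g/--giList is the deprecated alias of -l/--idList.
    parser.add_argument("-l", "--idList", "-g", "--giList",
                        dest="idList", default=None)
    parser.add_argument("-p", "--pfam_seq", default=None)  # placeholder default
    parser.add_argument("-m", "--delimiter", default="|")
    return parser


def parse_cli(argv):
    """Parse arguments and enforce that 'ncbi' comes with an ID list."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.annot == "ncbi" and args.idList is None:
        parser.error("--idList is required when --annot is 'ncbi'")
    return args
```

Calling parse_cli(["-i", "PF00186_full.txt", "-o", "PF00186_full.an", "-a", "pfam"]) mirrors the first example above.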

By

Rama Ranganathan, Kim Reynolds

On

9.22.2014

Copyright (C) 2015 Olivier Rivoire, Rama Ranganathan, Kimberly Reynolds

This program is free software distributed under the BSD 3-clause license; please see the file LICENSE for details.