The annotateMSA script provides utilities to automatically annotate the sequence headers of a FASTA file with taxonomic information. Currently this can be done in one of two ways:
- For Pfam alignments, annotations can be extracted from the file pfamseq.txt (please download from: ftp://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/database_files/pfamseq.txt.gz).
- For Blast alignments, annotations can be added using the NCBI Entrez utilities provided by BioPython. GI or accession numbers are used to query NCBI for taxonomy information (note that this approach requires a network connection).
To extract GI or accession numbers, use the scripts alnParseGI.py or alnParseAcc.py, respectively.
For both the Pfam and NCBI utilities, sequence annotation can be slow (on the order of hours, particularly for NCBI Entrez queries with larger alignments). However, the annotation only needs to be run once per alignment.
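To illustrate what extracting GI numbers from an alignment involves, here is a minimal sketch. It is not the implementation of alnParseGI.py; it assumes legacy NCBI-style headers of the form `>gi|<number>|...`, and the function name `parse_gi_numbers` is hypothetical. Sequences without a recognizable GI number are given "0", matching the placeholder convention described for the ID list.

```python
import re

# Assumption: headers follow the legacy NCBI layout ">gi|<digits>|<db>|<accession>| description".
GI_PATTERN = re.compile(r"gi\|(\d+)")


def parse_gi_numbers(fasta_text):
    """Return one GI string per FASTA header, in alignment order.

    Headers with no recognizable GI number yield "0".
    (Hypothetical helper; not part of pySCA.)
    """
    gis = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = GI_PATTERN.search(line)
            gis.append(match.group(1) if match else "0")
    return gis


example = """>gi|12345|ref|NP_000001.1| dihydrofolate reductase
MVRPLNCIVA
>unlabeled_sequence
MISLIAALAV
"""

print(parse_gi_numbers(example))
```

Writing the result one number per line would give a file with one entry per sequence, in the same order as the alignment.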
- Keyword Arguments
- -i, --input
Input sequence alignment. Default: Input_MSA.fasta
- -o, --output
Output file. Default: Output_MSA.an
- -a, --annot
Annotation method. Options are ‘pfam’ or ‘ncbi’. Default: ‘pfam’
- -l, --idList
This argument is necessary for the ‘ncbi’ method. Specifies a file containing a list of GI numbers corresponding to the sequence order in the alignment; a number of “0” indicates that a GI number wasn’t assigned for a particular sequence.
- -g, --giList
Deprecated. Identical to ‘--idList’; retained so the CLI stays consistent with older versions of pySCA.
- -p, --pfam_seq
Location of the pfamseq.txt file. Defaults to path2pfamseq (specified at the top of scaTools.py)
- -m, --delimiter
Character(s) used for separating fields in the sequence headers of the annotated output. Default: ‘|’
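The interplay between the ID list, the "0" placeholder, and the header delimiter can be sketched as follows. This is an illustration under stated assumptions, not the script's actual logic: `annotate_headers` and the `taxonomy` dictionary are hypothetical, the dictionary stands in for results that would really come from NCBI, and leaving a header unchanged when no annotation is available is an assumed behavior.

```python
def annotate_headers(headers, ids, taxonomy, delimiter="|"):
    """Append taxonomy fields to headers, joined by the delimiter.

    headers  : sequence headers, in alignment order
    ids      : one ID per header; "0" means no ID was assigned
    taxonomy : hypothetical mapping from ID to a taxonomy string
               (stand-in for an NCBI lookup)
    """
    if len(headers) != len(ids):
        raise ValueError("ID list must have one entry per sequence")
    annotated = []
    for header, seq_id in zip(headers, ids):
        if seq_id == "0" or seq_id not in taxonomy:
            # Assumed behavior: leave the header untouched when nothing is known.
            annotated.append(header)
        else:
            annotated.append(delimiter.join([header, taxonomy[seq_id]]))
    return annotated


headers = ["seq1", "seq2"]
ids = ["12345", "0"]
taxonomy = {"12345": "Escherichia coli"}

print(annotate_headers(headers, ids, taxonomy))
```

The length check reflects the requirement that the ID list correspond to the sequence order in the alignment; a mismatch would silently attach annotations to the wrong sequences otherwise.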
annotateMSA -i PF00186_full.txt -o PF00186_full.an -a 'pfam'
annotateMSA -i DHFR_PEPM3.fasta -o DHFR_PEPM3.an -a 'ncbi' -l DHFR_PEPM3.gi
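For readers who want to see how the documented options fit together, the following sketch mirrors them with `argparse`. It is an approximation built from the option list above, not the script's own parser: `build_parser` is a hypothetical name, the `None` default for `--pfam_seq` is a placeholder for the path2pfamseq value, and treating `--giList` as an alias stored under the same destination is an assumption.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented annotateMSA options (sketch only)."""
    parser = argparse.ArgumentParser(prog="annotateMSA")
    parser.add_argument("-i", "--input", default="Input_MSA.fasta")
    parser.add_argument("-o", "--output", default="Output_MSA.an")
    parser.add_argument("-a", "--annot", choices=["pfam", "ncbi"], default="pfam")
    # Assumption: the deprecated -g/--giList is handled as an alias of -l/--idList.
    parser.add_argument("-l", "--idList", "-g", "--giList", dest="idList", default=None)
    # Placeholder default; the documented default is path2pfamseq from scaTools.py.
    parser.add_argument("-p", "--pfam_seq", default=None)
    parser.add_argument("-m", "--delimiter", default="|")
    return parser


def check_args(args):
    """Enforce the documented rule that the 'ncbi' method needs an ID list."""
    if args.annot == "ncbi" and args.idList is None:
        raise ValueError("the 'ncbi' method requires -l/--idList")


# Parse the second example invocation from an explicit argument list.
args = build_parser().parse_args(
    ["-i", "DHFR_PEPM3.fasta", "-o", "DHFR_PEPM3.an", "-a", "ncbi", "-l", "DHFR_PEPM3.gi"]
)
check_args(args)
print(args.annot, args.idList, args.delimiter)
```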
Rama Ranganathan, Kim Reynolds
Copyright (C) 2015 Olivier Rivoire, Rama Ranganathan, Kimberly Reynolds
This program is free software distributed under the BSD 3-clause license; please see the file LICENSE for details.