scaProcessMSA

The scaProcessMSA script performs the basic multiple sequence alignment (MSA) pre-processing steps for SCA and stores the results using the Python pickle module:

  1. Trim the alignment, either by truncating to a reference sequence (specified with the -t flag) or by removing excessively gapped positions (those with more than 40% gaps)

  2. Identify/designate a reference sequence in the alignment, and create a mapping from alignment column numbers to position numbers in the reference sequence. The reference sequence can be specified in one of four ways:

    1. By supplying a PDB file - in this case, the reference sequence is taken from the PDB (see the pdb kwarg)

    2. By supplying a reference sequence directly (as a fasta file - see the refseq kwarg)

    3. By supplying the index of the reference sequence in the alignment (see the refindex kwarg)

    4. If no reference sequence is supplied by the user, one is automatically selected using the scaTools function chooseRef.

    The position numbers (for mapping the alignment) can be specified in one of three ways:

    1. By supplying a PDB file - in this case the alignment positions are mapped to structure positions

    2. By supplying a list of reference positions (see the refpos kwarg)

    3. If no reference positions are supplied by the user, sequential numbering (starting at 1) is assumed.

  3. Filter sequences to remove highly gapped sequences and sequences whose identity to the reference sequence falls below a minimum or above a maximum value (see the parameters kwarg)

  4. Filter positions to remove highly gapped positions (default 20% gaps, can also be set using --parameters)

  5. Calculate sequence weights and write out the final alignment and other variables
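
The two filtering steps in the list (steps 3 and 4) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the scaTools implementation: the helper names gap_fraction, identity, filter_sequences, and filter_positions are hypothetical, gaps are assumed to be written as "-", and identity is assumed to be the fraction of alignment columns at which two sequences carry the same character. The threshold defaults mirror the documented --parameters defaults [0.2, 0.2, 0.2, 0.8].

```python
GAP = "-"  # assumed gap character


def gap_fraction(chars):
    """Fraction of the characters in `chars` that are gaps."""
    return sum(c == GAP for c in chars) / len(chars)


def identity(seq_a, seq_b):
    """Fraction of columns at which two aligned sequences agree
    (an assumed definition of sequence identity)."""
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def filter_sequences(seqs, ref, max_gaps=0.2, min_sid=0.2, max_sid=0.8):
    """Keep sequences that are not too gapped and whose identity to the
    reference lies within [min_sid, max_sid].

    Note: a sequence identical to the reference has identity 1.0 and is
    dropped here; how the script itself treats the reference is not covered.
    """
    return [
        s for s in seqs
        if gap_fraction(s) <= max_gaps and min_sid <= identity(s, ref) <= max_sid
    ]


def filter_positions(seqs, max_gaps=0.2):
    """Return (kept column indices, sequences restricted to those columns)."""
    n_cols = len(seqs[0])
    keep = [
        j for j in range(n_cols)
        if gap_fraction([s[j] for s in seqs]) <= max_gaps
    ]
    return keep, ["".join(s[j] for j in keep) for s in seqs]


# Toy alignment (made-up sequences, all of equal length).
ref = "ACDEFGHIKL"
alignment = [
    "ACDEFGHIKL",  # identity 1.0 to the reference: above max_sid
    "ACDEFGHI-L",  # identity 0.9: above max_sid
    "ACDEYGHW-L",  # identity 0.7, 10% gaps: kept
    "A----GHIKL",  # 40% gaps: too gapped
    "MNPQRSTVWY",  # identity 0.0: below min_sid
    "ACNEFGHV-L",  # identity 0.7, 10% gaps: kept
]

kept = filter_sequences(alignment, ref)
cols, trimmed = filter_positions(kept)
print(kept)
print(cols)
print(trimmed)
```

In this toy case the column that is a gap in every surviving sequence is removed by the position filter, so the two remaining sequences end up one column shorter than the input.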

Key Arguments
--alignment, -a

Input_MSA.fasta (the alignment to be processed, typically the headers contain taxonomic information for the sequences).

--pdb, -s

PDB identifier (e.g., 1RX2)

--pdbdir, -b

directory where PDB files are stored

--chainID, -c

chain ID in the PDB for the reference sequence

--species, -f

species of the reference sequence

--refseq, -r

reference sequence, supplied as a fasta file

--refpos, -o

reference positions, supplied as a text file with one position specified per line

--refindex, -i

reference sequence number in the alignment, COUNTING FROM 0

--parameters, -p

list of parameters for filtering the alignment: [max_frac_gaps for positions, max_frac_gaps for sequences, min SID to reference seq, max SID to reference seq], where SID is the sequence identity; default values: [0.2, 0.2, 0.2, 0.8] (see the filterPos and filterSeq functions for details)

--selectSeqs, -n

subsample the alignment to (1.5 * the number of effective sequences) to reduce computational time, default: False

--truncate, -t

truncate the alignment to the positions in the reference PDB, default: False

--matlab, -m

write out the results of this script to a matlab workspace for further analysis

--dest, -d

destination for output files
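
To illustrate how --refpos relates to the default sequential numbering, the sketch below assigns a position label to every alignment column at which the reference sequence is not a gap. It is a sketch under stated assumptions rather than the script's own code: the function names read_refpos and map_columns_to_refpos and the file name refpos.txt are hypothetical, gaps are assumed to be written as "-", and the file is assumed to hold exactly one label per non-gap reference residue.

```python
import os
import tempfile

GAP = "-"  # assumed gap character


def read_refpos(path):
    """Read position labels from a text file, one per line (blank lines skipped)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def map_columns_to_refpos(aligned_ref, refpos=None):
    """Map alignment column index -> reference position label.

    Columns where the reference has a gap receive no label. Without
    `refpos`, labels are sequential numbers (as strings) starting at 1.
    """
    columns = [j for j, c in enumerate(aligned_ref) if c != GAP]
    if refpos is None:
        refpos = [str(i) for i in range(1, len(columns) + 1)]
    if len(refpos) != len(columns):
        raise ValueError("expected one position label per reference residue")
    return dict(zip(columns, refpos))


# Made-up aligned reference row: residues sit in columns 0, 2, 3, 6 and 7.
aligned_ref = "M-KT--AY"

# Default: sequential numbering starting at 1.
default_map = map_columns_to_refpos(aligned_ref)

# Custom numbering read from a one-label-per-line file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "refpos.txt")  # hypothetical file name
    with open(path, "w") as handle:
        handle.write("10\n11\n12\n15\n16\n")
    custom_map = map_columns_to_refpos(aligned_ref, read_refpos(path))

print(default_map)
print(custom_map)
```

Labels are kept as strings in this sketch so that non-numeric position labels would pass through unchanged.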

Example:

scaProcessMSA -a PF00071_full.an -s 5P21 -c A -f 'Homo sapiens'

By: Rama Ranganathan

On: 8.5.2014

Copyright (C) 2015 Olivier Rivoire, Rama Ranganathan, Kimberly Reynolds

This program is free software distributed under the BSD 3-clause license; please see the file LICENSE for details.