# scaTools¶

pySCA - A SCA toolbox in Python

By

On

August 2014

version

6.1

Copyright (C) 2015 Olivier Rivoire, Rama Ranganathan, Kimberly Reynolds

class scaTools.Annot(descr, species, taxo, seq='')[source]

Bases: object

A class for annotating sequences

Attributes

desc

species

species string

taxo

taxonomy string

seq

sequence

scaTools.AnnotNCBI(alg_in, alg_out, id_list, email='your.email@youruniversity.edu', delimiter='|')[source]

Phylogenetic annotation of a NCBI alignment (in FASTA format). Uses GI numbers or accession numbers embedded in the headers of the multiple sequences alignment. Requires that all the fields are consistent for all sequences.

Arguments

• input NCBI sequence alignment

• output file name for the annotated alignment

• file containing a list of sequences identifiers

• type of sequence identifier used

• email address for querying database

Key Arguments

• alg_in = path to the input alignment file

scaTools.AnnotPfam(pfam_in, pfam_out, pfam_seq='pfamseq.txt', delimiter='|')[source]

Phylogenetic annotation of a Pfam alignment (in FASTA format) using information from pfamseq.txt. The output is a FASTA file containing phylogenetic annotations in the header (to be parsed with ‘|’ as a delimiter).

Note: the headers for the original alignment take the form >AAA/x-y. If two entries have same AAA but correspond to different sequences only one of the two sequences will be represented (twice) in the output - this should however not practically be an issue.

Arguments

• input Pfam sequence alignment

• output file name for the annotated Pfam alignment

Key Arguments

• pfam_seq = path to the file pfamseq.txt

scaTools.AnnotPfamDB(pfam_in, pfam_out, pfam_db='pfamseq.db', delimiter='|')[source]

Phylogenetic annotation of a Pfam alignment (in FASTA format) using information from pfamseq.txt. The output is a FASTA file containing phylogenetic annotations in the header (to be parsed with ‘|’ as a delimiter).

Note: the headers for the original alignment take the form >AAA/x-y. If two entries have same AAA but correspond to different sequences only one of the two sequences will be represented (twice) in the output - this should however not practically be an issue.

Arguments

• input Pfam sequence alignment

• output file name for the annotated Pfam alignment

Key Arguments

• pfam_db = path to the file pfamseq.db

scaTools.MSAsearch(hd, algn, seq, species=None)[source]

Identify the sequence in the alignment that most closely corresponds to the species of the reference sequence, and return its index.

Arguments

• alignment sequences

• selected reference sequence (often from a PDB file)

Key Arguments

species

species of the reference sequence (Used to speed up alignment searching when possible)

Example:

strseqnum = MSASearch(hd, alg0, pdbseq, 'Homo sapiens')

scaTools.MultiBar(x, colors='wbrgymc', width=0.5)[source]

Multiple bar diagram (plots contributions to each bar from different elements in x as different colors). This can be useful if you’d like to inspect how sector positions are distributed among independent components/eigenmodes. The argument x is a tuple, specifying the number of elements in each bar to be each color. The example below makes a graph with four bars, where the first bar has 99 white elements, 1 blue element and 1 red element.

Example:

x = [[99, 1, 1], [6, 13, 2], [0, 0, 13], [1, 7, 5]]
sca.MultiBar(x)

class scaTools.Pair(p, x, d)[source]

Bases: object

A class for a pair of positions.

Attributes

pos

a pair of amino acid positions (ex: [1,3], supplied as argument p)

DI

the direct information between the two positions (argument x)

dist

the physical distance between the positions (argument d)

class scaTools.Secton(positions)[source]

Bases: object

A class for sectons.

Attributes

pos

a list of positions

num

number of positions in the secton

connected(distmat, threshold)[source]

Check the structural connectivity based on the principle that if $$M_{ij}$$ is the adjacency matrix of a graph, $$M^n_{ij}$$ is the number of paths of length $$n$$ between i and j, which must be > 0 for $$n$$ = number of nodes when i and j are in the same connected component.

dist(distmat)[source]

returns the distance between the position pair

class scaTools.Unit[source]

Bases: object

A class for units (sectors, sequence families, etc.)

Attributes

name

string describing the unit (ex: ‘firmicutes’)

items

set of member items (ex: indices for all firmicutes sequence in an alignment)

col

color code associated to the unit (for plotting)

vect

an additional vector describing the member items (ex: a list of sequence weights)

scaTools.alg2bin(alg, N_aa=20)[source]

Translate an alignment of matrix of size M sequences by L positions where the amino acids are represented by numbers between 0 and N_aa (obtained using lett2num) to a binary array of size M x (N_aa x L).

Example:

Abin = alg2bin(alg, N_aa=20)

scaTools.alg2binss(alg, N_aa=20)[source]

Translate an alignment of matrix of size M sequences by L positions where the amino acids are represented by numbers between 0 and N_aa (obtained using lett2num) to a sparse binary array of size M x (N_aa x L).

Example:

Abin = alg2binss(alg, N_aa=20)

scaTools.basicICA(x, r0, Niter, tolerance=1e-15)[source]

Basic ICA algorithm, based on work by Bell & Sejnowski (infomax). The input data should preferentially be sphered, i.e., x.T.dot(x) = 1

Arguments

x

LxM input matrix where L = # features and M = # samples

r

learning rate / relaxation parameter (e.g. r=.0001)

Niter

number of iterations (e.g. 1000)

Returns

w

unmixing matrix

change

record of incremental changes during the iterations.

Note r and Niter should be adjusted to achieve convergence, which should be assessed by visualizing ‘change’ with plot(range(iter), change)

Example:

[w, change] = basicICA(x, r, Niter)

scaTools.chooseKpos(Lsca, Lrand)[source]

Given the eigenvalues of the sca matrix (Lsca), and the eigenvalues for the set of randomized matrices (Lrand), return the number of significant eigenmodes.

scaTools.chooseRefSeq(alg)[source]

This function chooses a default reference sequence if none is given by taking the sequence which has the mean pairwise sequence identity closest to that of the entire alignment.

Example:

i_ref = chooseRefSeq(msa_num)

scaTools.clean_al(alg, code='ACDEFGHIKLMNPQRSTVWY', gap='-')[source]

Replaces any character that is not a valid amino acid by a gap.

Arguments amino acid sequence alignment

Key Arguments

code

list of valid amino acid characters (case sensitive)

gap

gap character for replacement

scaTools.cytoscapeOut(ats, cutoff, Csca, Di, sectors, Vp, outfilename)[source]

Output tab-delimited text that can be read in by cytoscape. The goal is to enable graph representations of the SCA couplings, where residues are nodes, and couplings are edges. Within cytoscape, the graph can be color-coded or weighted by Csca, Di, sector definition or Vp.

Example:

cytoscapeOut(ats, cutoff, Csca, Di, sectors, Vp, outfilename)

scaTools.dirInfoFromJ(i, j, Jmat, frq, Naa=20, epsilon=0.0001)[source]

Direct information from the matrix of couplings $$J_{ij}$$ (called by directInfo). Ref: Marcos et al, PNAS 2011, 108: E1293-E1301

Arguments

i

position 1

j

position 2

Jmat

coupling matrix

frq

frequency

Naa

number of amino acids

Example:

DI = dirInfoFromJ(i, j, Jmat, frq, Naa=20, epsilon=1e-4)

scaTools.directInfo(freq1, freq2, lbda=0.5, freq0=array([0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905]), Naa=20)[source]

Calculate direct information as in the Direct Coupling Analysis (DCA) method proposed by M. Weigt et collaborators (Ref: Marcos et al, PNAS 2011, 108: E1293-E1301).

Example:

DI = directInfo(freq1, freq2, lbda=.5, freq0=np.ones(20)/21, Naa=20)

scaTools.eigenVect(M)[source]

Return the eigenvectors and eigenvalues, ordered by decreasing values of the eigenvalues, for a real symmetric matrix M. The sign of the eigenvectors is fixed so that the mean of its components is non-negative.

Example:

eigenVectors, eigenValues = eigenVect(M)

scaTools.figColors()[source]

Color code for figUnits.

Example:

figColors()

scaTools.figMapping(Csca, tX, kpos, sectors, subfam)[source]

Function that automates finding the top $$k_{pos}$$ independent components, projection, and plotting.

Useful to get a representation of the sectors/subfamilies mapping for a given kpos.

Arguments

Csca

the sca matrix

tX

the projected alignment

kpos

the number of independent components to consider

sectors

list of Unit elements for each sector

subfam

list of Unit elements for each sequence family

Returns

Vpica

the independent components of Csca

Example:

Vpica = figMapping(Csca, tX, kpos, sectors, subfam)

scaTools.figStruct(pdbid, sectors, ics, ats, chainid='A', outfile='output/sectors.pml', quit=1)[source]

Make and display an image of the sectors (within a python notebook). By default quit PyMol after running it, unless the option ‘quit=0’ is given. The default name and location of the output can also be changed.

Example:

figStruct(pdbid, sectors, ats, chainid='A', outfile = settings.path2output+'sectors.pml', quit=1)

scaTools.figUnits(v1, v2, units, marker='o', dotsize=9, notinunits=1)[source]

2d scatter plot specified by ‘units’, which must be a list of elements in the class Unit. See figColors for the color code. Admissible color codes are in [0 1] (light/dark gray can also be obtained by using -1/+1). For instance: 0->red, 1/3->green, 2/3-> blue.

Arguments

v1

xvals

v2

yvals

units

list of elements in units

Key Arguments

marker

plot marker symbol

dotsize

specify marker/dotsize

notinunits
• if set to 1: the elements not in a unit are represented in white

• if set to 0 these elements are not represented

• if set to [w1,w2] : elements with coordinates w1,w2 are represented in white in the background.

Example:

figUnits(v1, v2, units, marker='o', gradcol=0, dotsize=9, notinunits=1)

scaTools.figWeights(U1, U2, weight)[source]

A 2d scatter plot with color indicating weight.

Example:

figWeights(U1, U2, weight)

scaTools.filterPos(alg, seqw=[1], max_fracgaps=0.2)[source]

Truncate the positions of an input alignment to reduce gaps, taking into account sequence weights.

Arguments

alg

An MxL list of sequences

Keyword Arguments
seqw

vector of sequence weights (default is uniform weights)

max_fracgaps

maximum fraction of gaps allowed at a position

Returns:
alg_tr

the truncated alignment

selpos

the index of retained positions (indices start at 0 for the first position)

Example:

alg_tr, selpos = filterPos(alg, seqw, max_fracgaps=.2)

scaTools.filterSeq(alg0, sref=0.5, max_fracgaps=0.2, min_seqid=0.2, max_seqid=0.8)[source]

Take in an alignment (alg0, assumed to be filtered to remove highly gapped positions), a reference sequence, the maximum fraction of gaps allowed per sequence (max_fracgaps), the minimum and maximum sequence identities to the reference sequence (min_seqid and max_seqid), and return (1) alg, the alignment filtered to remove sequences with more than max_fracgaps (i.e. partial seqs), (2) seqw, a vector of weights for each sequence, (3) seqkeep, the indices of the original alignment (alg0) retained in alg:

Note: if sref is set to 0.5, filterSeq calls chooseRefSeq to automatically select a reference sequence.

Example:

alg, seqw, seqkeep = filterSeq(alg0, iref, max_fracgaps=.2, min_seqid=.2, max_seqid=.8)

scaTools.freq(alg, seqw=1, Naa=20, lbda=0, freq0=array([0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905]))[source]

Compute amino acid frequencies for a given alignment.

Arguments

alg

a MxL sequence alignment (converted using lett2num)

Keyword Arguments
seqw

a vector of sequence weights (1xM)

Naa

the number of amino acids

lbda

lambda parameter for setting the frequency of pseudo-counts (0 for no pseudo counts)

freq0

expected average frequency of amino acids at all positions

Returns

freq1

the frequencies of amino acids at each position taken independently (Naa*L)

freq2

the joint frequencies of amino acids at pairs of positions (freq2, Naa*L * Naa*L)

freq0

the average frequency of amino acids at all positions (Naa)

Example:

freq1, freq2, freq0 = freq(alg, seqw, lbda=lbda)

scaTools.icList(Vpica, kpos, Csca, p_cut=0.95)[source]

Produces a list of positions contributing to each independent component (IC) above a defined statistical cutoff (p_cut, the cutoff on the CDF of the t-distribution fit to the histogram of each IC). Any position above the cutoff on more than one IC are assigned to one IC based on which group of positions to which it shows a higher degree of coevolution. Additionally returns the numeric value of the cutoff for each IC, and the pdf fit, which can be used for plotting/evaluation.

Example:

icList, icsize, sortedpos, cutoff, pd = icList(Vsca,Lsca,Lrand)

scaTools.lett2num(msa_lett, code='ACDEFGHIKLMNPQRSTVWY')[source]

Translate an alignment from a representation where the 20 natural amino acids are represented by letters to a representation where they are represented by the numbers 1,…,20, with any symbol not corresponding to an amino acid represented by 0.

Example:

msa_num = lett2num(msa_lett, code='ACDEFGHIKLMNPQRSTVWY')

scaTools.makeATS(sequences, refpos, refseq, iref=0, truncate=False)[source]

If specified, truncate the alignment to the structure (assumes MSAsearch has already been run to identify the reference sequence (iref)) and produce a mapping (ats) between alignment positions and the positions in the reference sequence (refpos).

Arguments

• sequences

• reference positions

• reference sequence

• iref, the index of the sequence in the alignment with the highest identity to the reference

Key Arguments

truncate

(bool) truncate the alignment to positions that align to the reference sequence

Example:

sequences_trun, ats_new = sca.makeATS(sequences_full, ats_pdb, seq_pdb, i_ref)

scaTools.numConnected(Vp, k, distmat, eps_list=array([0.5, 0.49, 0.48, 0.47, 0.46, 0.45, 0.44, 0.43, 0.42, 0.41, 0.4, 0.39, 0.38, 0.37, 0.36, 0.35, 0.34, 0.33, 0.32, 0.31, 0.3, 0.29, 0.28, 0.27, 0.26, 0.25, 0.24, 0.23, 0.22, 0.21, 0.2, 0.19, 0.18, 0.17, 0.16, 0.15, 0.14, 0.13, 0.12, 0.11, 0.1, 0.09, 0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01]), dcontact=5)[source]

Calculates the number of positions in the largest connected component for groups of positions i with $$V_p[i,k] > eps$$ and $$V_p[i,k] > V_p[i,kk]$$, for $$kk != k$$ and eps in eps_list. Useful for looking evaluating the physical connectivity of different sectors or sub-sectors.

Arguments
Vp

A set of eigenvectors or independent components

k

The eigenvector or independent component to consider

distmat

Distance matrix (computed by pdbSeq)

Key Arguments
eps_list

the range of values of eps for which the group is non-empty

dcontact

the distance cutoff for defining contacts

Example:

eps_range, num_co, num_tot = numConnected(Vp, k, distmat, eps_list = np.arange(.5,0,-.01), dcontact=8)

scaTools.parseAlgHeader(header, delimiter='|')[source]

Use this instead of splitting on the “|” delimiter if the “|” character shows up inside one of the fields.

Run this function on a single header, not all the headers returned in the list from readAlg.

scaTools.pdbSeq(pdbid, chain='A', path2pdb='.', calcDist=1)[source]

Extract sequence, position labels and matrix of distances from a PDB file.

Arguments
pdbid

PDB identifier (four letters/numbers)

chain

PDB chain identifier

path2pdb

location of the PDB file

calcDist

calculate a distance matrix between all pairs of positions, default is 1

Example:

sequence, labels, dist = pdbSeq(pdbid, chain='A', path2pdb)

scaTools.posWeights(alg, seqw=1, lbda=0, N_aa=20, freq0=array([0.073, 0.025, 0.05, 0.061, 0.042, 0.072, 0.023, 0.053, 0.064, 0.089, 0.023, 0.043, 0.052, 0.04, 0.052, 0.073, 0.056, 0.063, 0.013, 0.033]), tolerance=1e-12)[source]

Compute single-site measures of conservation, and the sca position weights, $$\frac {\partial {D_i^a}}{\partial {f_i^a}}$$

Arguments
• alg = MSA, dimensions MxL, converted to numerical representation with lett2num

• seqw = a vector of M sequence weights (default is uniform weighting)

• lbda = pseudo-counting frequencies, default is no pseudocounts

• freq0 = background amino acid frequencies $$q_i^a$$

Returns:
• Wia = positional weights from the derivation of a relative entropy, $$\frac {\partial {D_i^a}}{\partial {f_i^a}}$$ (Lx20)

• Dia = the relative entropy per position and amino acid (Lx20)

• Di = the relative entropy per position (L)

Example::

Wia, Dia, Di = posWeights(alg, seqw=1,freq0)

scaTools.projAlg(alg, Proj)[source]

Projection of an alignment (alg) based on a projector (Proj). The input alignment should already be converted to numeric representation using lett2num.

Example:

tX = projAlg(msa_num, Proj)

scaTools.projUica(msa_ann, msa_num, seqw, kica=6)[source]

Compute the projection of an alignment (msa_ann) on the kpos ICA components of the sequence space of another (msa_num, seqw).This is useful to compare the sequence space of one alignment to another.

Example:

Uica_ann, Uica = projUpica(msa_ann, msa_num_ seqw, kica=6)

scaTools.projUpica(msa_ann, msa_num, seqw, kpos)[source]

Compute the projection of an alignment (msa_ann) on the kpos ICA components of the SCA matrix of another (msa_num, seqw). This is useful to compare the sequence space (as projected by the positional correlations) of one alignment to another.

Example:

Upica_ann, Upica = projUpica(msa_ann, msa_num_ seqw, kpos)

scaTools.randAlg(frq, Mseq)[source]

Generate a random alignment with Mseq sequences based on the frequencies frq[i,a] of amino acids with a = 0,1,…,Naa (0 for gaps).

Example:

msa_rand = randAlg(frq, Mseq)

scaTools.randSel(seqw, Mtot, keepSeq=[])[source]

Random selection of Mtot sequences, drawn with weights and without replacement. The seed for the random number generator is fixed to ensure reproducibility.

Arguments
• seqw = the sequence weights

• Mtot = the total number of sequences for selection

Keyword Arguments
• keepSeq = an (optional) list of sequnces to keep. This can be useful if you would like to retain the reference sequence for example.

Example::

selection = randSel(seqw, Mtot, [iref])

scaTools.randomize(msa_num, Ntrials, seqw=1, norm='frob', lbda=0, Naa=20, kmax=6, tolerance=1e-12)[source]

Randomize the alignment while preserving the frequencies of amino acids at each position and compute the resulting spectrum of the SCA matrix.

Arguments

msa_num

a MxL sequence alignment (converted to numerical representation using lett2num)

Ntrials

number of trials

seqw

vector of sequence weights (default is to assume equal weighting)

norm

either ‘frob’ (frobenius) or ‘spec’ (spectral)

lbda

lambda parameter for setting the frequency of pseudo-counts (0 for no pseudo counts)

Naa

number of amino acids

kmax

number of eigenvectors to keep for each randomized trial

tolerance

workaround for roundoff error that occurs when calculating gap probability fr0

Returns

Vrand

eigenvectors for the $$\tilde {C_{ij}^{ab}}$$ matrix of the randomized alignment (dimensions: Ntrials*Npos*kmax)

Lrand

eigenvalues for the $$\tilde {C_{ij}^{ab}}$$ matrix of the randomized alignment (dimensions: Ntrials*Npos)

Example:

Vrand, Lrand, Crand = randomize(msa_num, 10, seqw, Naa=20, kmax=6)

scaTools.readAlg(filename)[source]

Read in a multiple sequence alignment in FASTA format, and return the headers and sequences.

scaTools.rotICA(V, kmax=6, learnrate=0.1, iterations=100000)[source]

ICA rotation (using basicICA) with default parameters and normalization of outputs.

Example::

Vica, W = rotICA(V, kmax=6, learnrate=.0001, iterations=10000)

scaTools.scaMat(alg, seqw=1, norm='frob', lbda=0, freq0=array([0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905, 0.04761905]))[source]

Computes the SCA matrix.

Arguments

alg

A MxL multiple sequence alignment, converted to numeric representation with lett2num

Keyword Arguments

seqw

A vector of sequence weights (default: uniform weights)

norm

The type of matrix norm used for dimension reduction of the SCA correlation tensor to a positional correlation matrix. Use ‘spec’ for spectral norm and ‘frob’ for Frobenius norm. The frobenius norm is the default.

lbda

lambda parameter for setting the frequency of pseudo-counts (0 for no pseudo counts)

freq0

background expectation for amino acid frequencies

Returns

Cp

the LxL SCA positional correlation matrix

tX

the projected MxL alignment

projMat

the projector

Example:

Csca, tX, projMat = scaMat(alg, seqw, norm='frob', lbda=0.03)

scaTools.seqProj(msa_num, seqw, kseq=15, kica=6)[source]

Compute three different projections of the sequences based on eigenvectors of the sequence similarity matrix.

Arguments
• msa_num = sequence alignment (previously converted to numerical representation using lett2num)

• seqw = a vector of sequence weights

Keyword Arguments
• kseq = number of eigenvectors to compute

• kica = number of independent components to compute

Returns:
• Useq[0]/Uica[0] = use no weight

• Useq[1]/Uica[1] = use sequence weights

• Useq[2]/Uica[2] = use sequence weights and positional weights

Example:

Useq, Uica = sca.seqProj(msa_num, seqw, kseq = 30, kica = 15)

scaTools.seqSim(alg)[source]

Take an MxL alignment (converted to numeric representation using lett2num) and compute a MxM matrix of sequence similarities.

Example::

simMat = seqSim(alg)

scaTools.seqWeights(alg, max_seqid=0.8, gaps=1)[source]

Compute sequence weights for an alignment (format: list of sequences) where the weight of a sequence is the inverse of the number of sequences in its neighborhood, defined as the sequences with sequence similarity below max_seqid. The sum of the weights defines an effective number of sequences.

Arguments

alg

list of sequences

Keyword Arguments

max_seqid

gaps

If gaps == 1 (default), considering gaps as a 21st amino acid, if gaps == 0, not considering them.

Example:

seqw = seqWeights(alg)

scaTools.singleBar(x, loc, cols, width=0.5)[source]

Single bar diagram, called by MultiBar.

Example:

singleBar(x, loc, cols, width=.5)

scaTools.sizeLargestCompo(adjMat)[source]

Compute the size of the largest component of a graph given its adjacency matrix. Called by numConnected (Done by actually listing all the components)

Example::

scaTools.svdss(X, k=6)[source]

Singular value decomposition for sparse matrices (top k components). The singular values are ordered by decreasing values, the sign of the singular vectors is fixed, and the convention is that X = u.dot(s).dot(v.T):

Example:

u, s ,v = svdss(X, k=6)

scaTools.truncDiag(M, dmax)[source]

Set to 0 the elements of a matrix M up to a distance dmax from the diagonal.

Example:

Mtr = truncDiag(M, dmax)

scaTools.weighted_rand_list(weights, Nmax, keepList)[source]

Generate a random list of at most Nmax elements with weights (numpy array) but without replacement. Called by randSel.

Example:

selection = weighted_rand_list(weights, Nmax, [iref])

scaTools.weighted_rand_sel(weights)[source]

Generate a random index with probability given by input weights. Called by weighted_rand_list.

Example:

index = weighted_rand_sel(weights)

scaTools.writePymol(pdb, sectors, ics, ats, outfilename, chain='A', inpath='.', quit=1)[source]

Write basic a pymol script for displaying sectors and exporting an image.

Example:

writePymol(pdb, sectors, ics, ats, outfilename, chain='A',inpath=settings.path2structures, quit=1)