ortho_seqs
Miles Woollacott
What is ortho_seqs? What does ortho_seqs do?
Results
ortho_seqs is a command line interface (CLI) written in Python that applies the tensor-based orthogonal polynomial method* to sequence & phenotype data.
ortho_seq orthogonal-polynomial ../data/cov2/big_mabs_ic50s/ic50_seq.txt --pheno_file ../data/cov2/big_mabs_ic50s/ic50_phi.txt --molecule protein --poly_order first --out_dir /data_sm/home/miles/results/CDRH3/cov2_unique_ic50
*See Nafees et al., 2020 for method application and Rice, 2020 for method background
(Image from https://github.com/snafees/ortho_seqs)
Why is ortho_seqs important?
ortho_seqs converts each site in a sequence into a vector, which allows for mathematical computations and analyses.
Regressions of Phenotype Onto the X Order Conditional Polynomial (rFonXD)
Regression onto first order conditional polynomial (rFon1D): measures the impact of having a given amino acid/nucleotide at a given site. This impact can be positive or negative.
Regression onto second order conditional polynomial (rFon2D): measures the impact of having a given nucleotide pair at a given pair of sites (not supported for proteins yet).
Matrix of Covariances
Covariance measures how two variables vary together.
A positive covariance for a pair of sites → values at the first site tend to increase as values at the second increase (and vice versa).
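To make the vector representation and the covariance idea concrete, here is a minimal sketch. It assumes a one-hot encoding over a DNA alphabet and uses a hypothetical toy population; the helper names (one_hot, cov_between) are made up for illustration and this is not the ortho_seqs implementation.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed DNA alphabet for this sketch


def one_hot(seq, alphabet=ALPHABET):
    """Return an array of shape (sites, dimensions) with a single 1 per site."""
    vectors = np.zeros((len(seq), len(alphabet)))
    for site, char in enumerate(seq):
        vectors[site, alphabet.index(char)] = 1.0
    return vectors


# Hypothetical toy population: three sequences with two sites each.
seqs = ["AG", "AG", "TC"]
encoded = np.stack([one_hot(s) for s in seqs])  # shape (pop, sites, dims)

# Flatten each sequence into one row; columns are (site, character) indicators.
flat = encoded.reshape(len(seqs), -1)

# Population covariance between every pair of indicator columns.
cov = np.cov(flat, rowvar=False, bias=True)


def cov_between(site_a, char_a, site_b, char_b):
    """Look up the covariance between two (site, character) indicators."""
    i = site_a * len(ALPHABET) + ALPHABET.index(char_a)
    j = site_b * len(ALPHABET) + ALPHABET.index(char_b)
    return cov[i, j]


# "A" at site 0 always co-occurs with "G" at site 1, so this is positive.
print(cov_between(0, "A", 1, "G"))
# "A" at site 0 never co-occurs with "C" at site 1, so this is negative.
print(cov_between(0, "A", 1, "C"))
```

In this toy population a positive entry means the two characters tend to appear together across sequences, and a negative entry means they tend to exclude each other.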
Application Time!
cov2 Dataset
[Figure (Simplified Explanation): diagram of the immune system. Labels: Immune System (Innate, Adaptive); B Cells; T Cells; BCRs (Antibodies); Pathogen; Heavy (H) Chain; Light (L) Chain; Variable; Constant; CDRH1; CDRH2; CDRH3; CDRL3.]
(Image Adapted From “AIRR-seq data analysis and processing” by Victor Greiff, Slide 9)
CLI Input:
The cov2 dataset is a collection of antibodies that neutralize SARS-CoV-2, along with their sequences and their IC50 and EC50 values (where provided).
ortho_seq orthogonal-polynomial ../data/cov2/big_mabs_ic50s/ic50_seq.txt --pheno_file ../data/cov2/big_mabs_ic50s/ic50_phi.txt --molecule protein --poly_order first --out_dir /data_sm/home/miles/results/CDRH3/cov2_unique_ic50
IC50 = Half Maximal Inhibitory Concentration
EC50 = Half Maximal Effective Concentration
*rFon1D values between -2500 and 2500 were omitted for graph legibility
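The omission described in the footnote can be sketched as a simple magnitude filter. The values below are hypothetical, not results from the cov2 dataset, and treating the endpoints (exactly ±2500) as kept is an assumption.

```python
import numpy as np

# Hypothetical rFon1D values for illustration only (not real results).
rfon1d = np.array([-7200.0, -1800.0, 35.0, 2499.0, 2500.0, 9100.0])

THRESHOLD = 2500  # cutoff quoted in the footnote

# Keep only values whose magnitude reaches the threshold; values strictly
# between -2500 and 2500 are dropped (endpoint handling is an assumption).
keep = np.abs(rfon1d) >= THRESHOLD
plotted = rfon1d[keep]

print(plotted)
```

Only the large-magnitude values remain, which is what keeps the graph readable.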
Updates to ortho_seqs
What I’ve Done:
Sample DNA Sequences: AGGAT, TGA, GGGTA
After Padding: AGGAT, TGAnn, GGGTA
Sites: 5 (5 columns)
Pop Size: 3 (3 Sequences)
5 Initial Dimensions: A C G T n
4 Dimensions: A G T n
“C” is never used in any sequence, and is thus removed to reduce runtime.
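The padding and dimension-trimming steps described above can be sketched as follows. The function names (pad_sequences, used_alphabet) are hypothetical and this is a minimal illustration, not the code in ortho_seqs.

```python
PAD = "n"  # padding character, as in the example sequences


def pad_sequences(seqs, pad=PAD):
    """Right-pad every sequence to the length of the longest one."""
    width = max(len(s) for s in seqs)
    return [s.ljust(width, pad) for s in seqs]


def used_alphabet(seqs, full_alphabet="ACGTn"):
    """Keep only the characters that occur in at least one sequence."""
    seen = set("".join(seqs))
    return [c for c in full_alphabet if c in seen]


seqs = ["AGGAT", "TGA", "GGGTA"]
padded = pad_sequences(seqs)
alphabet = used_alphabet(padded)

print(padded)         # ['AGGAT', 'TGAnn', 'GGGTA']
print(alphabet)       # ['A', 'G', 'T', 'n']
print(len(alphabet))  # 4 dimensions instead of the initial 5
```

Dropping unused characters shrinks every per-site vector, which is where the runtime saving comes from.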
What Will Be Done:
...and more!
Special Thanks:
Contact Me: