1 of 9

ortho_seqs

Miles Woollacott

2 of 9

What is ortho_seqs? What does ortho_seqs do?

Results

ortho_seqs is a command-line interface (CLI) written in Python that applies the tensor-based orthogonal polynomial method* to sequence and phenotype data.

ortho_seq orthogonal-polynomial ../data/cov2/big_mabs_ic50s/ic50_seq.txt --pheno_file ../data/cov2/big_mabs_ic50s/ic50_phi.txt --molecule protein --poly_order first --out_dir /data_sm/home/miles/results/CDRH3/cov2_unique_ic50

*See Nafees et al., 2020 for method application and Rice, 2020 for method background

(Image from https://github.com/snafees/ortho_seqs)

3 of 9

Why is ortho_seqs important?

Regressions of Phenotype onto the Xth-Order Conditional Polynomial (rFonXD)

Matrix of Covariances

ortho_seqs converts each site in a sequence into a vector, which allows for mathematical computations and analyses.
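One common way to turn a site into a vector is one-hot encoding, sketched below. This is an illustration only: the function name `one_hot` and the four-letter alphabet are assumptions made for the example, not ortho_seqs internals.

```python
import numpy as np

ALPHABET = "ACGT"  # hypothetical alphabet chosen for this sketch


def one_hot(sequence, alphabet=ALPHABET):
    """Return an array of shape (sites, len(alphabet)) with a single 1 per row."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    encoded = np.zeros((len(sequence), len(alphabet)), dtype=int)
    for site, letter in enumerate(sequence):
        encoded[site, index[letter]] = 1
    return encoded


vectors = one_hot("AGGAT")
print(vectors)  # one row per site, one column per letter
```

Each row is the vector for one site, so a sequence of 5 sites over 4 letters becomes a 5 x 4 array that can be averaged, centered, and compared across sequences.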

Regression onto first-order conditional polynomial (rFon1D): measures the impact of having a given amino acid or nucleotide at a given site. This impact can be positive or negative.

Regression onto second-order conditional polynomial (rFon2D): measures the impact of having a given pair of nucleotides at a pair of sites (not yet supported for proteins).
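For intuition about the first-order case only, here is a simplified sketch; it is not the package's actual computation. If a site is reduced to a 0/1 indicator for one letter, the least-squares slope of phenotype on that indicator equals the mean phenotype of sequences with the letter minus the mean of those without it. The sequences, phenotype values, and the helper `slope_for` are all made up for this example.

```python
import numpy as np

sequences = ["AG", "AT", "GG", "GT"]         # toy sequences (hypothetical)
phenotypes = np.array([1.0, 3.0, 5.0, 7.0])  # toy phenotype values (hypothetical)


def slope_for(site, letter):
    """Least-squares slope of phenotype on the indicator 'letter at site'.

    Assumes the letter is present in some, but not all, sequences;
    otherwise the denominator would be zero.
    """
    indicator = np.array([1.0 if s[site] == letter else 0.0 for s in sequences])
    centered = indicator - indicator.mean()
    return float(centered @ (phenotypes - phenotypes.mean()) / (centered @ centered))


slope_a = slope_for(0, "A")  # negative: "A" at the first site goes with lower phenotype
slope_g = slope_for(0, "G")  # positive: "G" at the first site goes with higher phenotype
print(slope_a, slope_g)
```

The sign of each slope is what the slide means by an impact that can be positive or negative.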

Covariance measures how two items vary together.

Having a positive covariance for a pair of sites → values of the first item increase with the second (and vice versa).
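A small numeric sketch of the covariance idea, using made-up 0/1 indicators for two sites across five hypothetical sequences. The `covariance` helper below is the plain population covariance (divide by n); it is an assumption for illustration, not the exact quantity ortho_seqs reports.

```python
import numpy as np

# Toy indicators (hypothetical): 1 if the sequence has the letter at that site
site1_is_a = np.array([1, 1, 0, 0, 1])
site2_is_g = np.array([1, 1, 0, 0, 0])


def covariance(x, y):
    """Population covariance: mean of the products of deviations from the means."""
    return float(np.mean((x - x.mean()) * (y - y.mean())))


cov = covariance(site1_is_a, site2_is_g)
print(cov)  # positive: the two letters tend to appear together
```

Swapping one indicator for its complement (`1 - site2_is_g`) flips the sign, which is the "vice versa" case.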

4 of 9

Application Time!

cov2 Dataset

[Figure (simplified explanation) of where antibody sequences come from. Labels: Immune System (Innate, Adaptive); B Cells, T Cells; BCRs (Antibodies); B Cell, BCR, Pathogen; Heavy (H) Chain, Light (L) Chain; Variable, Constant; CDRH1, CDRH2, CDRH3, CDRL3.]

(Image adapted from “AIRR-seq data analysis and processing” by Victor Greiff, Slide 9)

5 of 9

Application Time!

cov2 Dataset

CLI Input:

The cov2 dataset is a collection of antibodies that neutralize SARS-CoV-2, along with their sequences and their IC50 and EC50 values (where provided).

ortho_seq orthogonal-polynomial ../data/cov2/big_mabs_ic50s/ic50_seq.txt --pheno_file ../data/cov2/big_mabs_ic50s/ic50_phi.txt --molecule protein --poly_order first --out_dir /data_sm/home/miles/results/CDRH3/cov2_unique_ic50

IC50 = Half Maximal Inhibitory Concentration

EC50 = Half Maximal Effective Concentration

Source: https://science.sciencemag.org/content/suppl/2020/06/15/science.abc7520.DC1

6 of 9

Application Time!

cov2 Dataset

*rFon1D values between -2500 and 2500 were omitted for graph legibility
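The filtering described in the footnote can be sketched as follows. The rFon1D values here are made up for illustration, and treating the endpoints as omitted is an assumption, since the slide does not say whether -2500 and 2500 themselves were plotted.

```python
import numpy as np

THRESHOLD = 2500  # cutoff taken from the footnote

# Made-up rFon1D values, for illustration only
rfon1d = np.array([-4000.0, -2500.0, -10.0, 0.0, 1800.0, 2500.0, 3200.0])

# Keep only values strictly outside [-THRESHOLD, THRESHOLD] (assumed boundary handling)
plotted = rfon1d[np.abs(rfon1d) > THRESHOLD]
print(plotted)
```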

7 of 9

Updates to ortho_seqs

What I’ve Done:

User-Friendly Updates and Efficiency

Automatic Padding
--dm, --sites, --pop_size flag Removal
Dimension Reducer
No Overwriting Data from --out_dir

Features

Covariance Histogram
Covariance .csv File
--min_pct flag
rFon1D Bar Plot
--alphbt_input flag

Sample DNA Sequences

AGGAT
TGA
GGGTA

After Padding

AGGAT
TGAnn
GGGTA

Dimensions: 4 (A, G, T, n)

Sites: 5 (5 columns)

Pop Size: 3 (3 Sequences)

5 Initial Dimensions: A C G T n

4 Dimensions: A G T n

“C” is never used in any sequence, and is thus removed to reduce runtime.
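The padding and dimension-reduction steps in this example can be sketched in a few lines. The helper names `pad_sequences` and `used_alphabet` are hypothetical and are not taken from the ortho_seqs code; the sketch only reproduces the numbers shown on the slide.

```python
def pad_sequences(sequences, pad_char="n"):
    """Right-pad every sequence with pad_char to the length of the longest one."""
    longest = max(len(s) for s in sequences)
    return [s.ljust(longest, pad_char) for s in sequences]


def used_alphabet(sequences, full_alphabet):
    """Keep only the letters of full_alphabet that occur in at least one sequence."""
    seen = set("".join(sequences))
    return [letter for letter in full_alphabet if letter in seen]


raw = ["AGGAT", "TGA", "GGGTA"]
padded = pad_sequences(raw)               # ['AGGAT', 'TGAnn', 'GGGTA']
alphabet = used_alphabet(padded, "ACGTn")  # ['A', 'G', 'T', 'n']; "C" is dropped

# The three quantities that no longer need to be passed as flags
dimensions, sites, pop_size = len(alphabet), len(padded[0]), len(padded)
print(dimensions, sites, pop_size)         # 4 5 3
```

Because all three values can be read off the padded input, a user no longer has to supply them by hand.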

8 of 9

Updates to ortho_seqs

What Will Be Done:

User-Friendly Updates and Efficiency

GUI
Histogram Improvements
Runtime Efficiency
A single input file for both sequences (seq) and phenotypes (phi)

Features

Third-Order Calculations for DNA
Second-Order Calculations for Proteins

...and more!

9 of 9

Special Thanks:

Mentor: Saba Nafees
Advisors: Eric Waltari, Joan Wong
Code review: Pranathi Vemuri
Server help: Saransh Kaul
CZ Biohub

Contact Me:

miles.woollacott@gmail.com