1 of 9

ortho_seqs

Miles Woollacott

2 of 9

What is ortho_seqs? What does ortho_seqs do?

Results

ortho_seqs is a command-line interface (CLI) written in Python that applies the tensor-based orthogonal polynomial method* to sequence and phenotype data.

ortho_seq orthogonal-polynomial ../data/cov2/big_mabs_ic50s/ic50_seq.txt --pheno_file ../data/cov2/big_mabs_ic50s/ic50_phi.txt --molecule protein --poly_order first --out_dir /data_sm/home/miles/results/CDRH3/cov2_unique_ic50

*See Nafees et al., 2020 for method application and Rice, 2020 for method background

3 of 9

Why is ortho_seqs important?

Regressions of Phenotype Onto the X Order Conditional Polynomial (rFonXD)

Matrix of Covariances

ortho_seqs converts each site in a sequence into a vector, which allows for mathematical computations and analyses.
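As a rough illustration of turning each site into a vector, the sketch below uses indicator (one-hot) vectors over a fixed alphabet. The encoding choice, the function name `one_hot_sites`, and the alphabet ordering are assumptions made for this example and are not taken from the ortho_seqs source.

```python
import numpy as np


def one_hot_sites(sequence, alphabet):
    """Return a (sites x alphabet size) array with one indicator vector per site.

    Hypothetical helper for illustration; not part of ortho_seqs.
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    encoded = np.zeros((len(sequence), len(alphabet)), dtype=int)
    for site, symbol in enumerate(sequence):
        # Mark the column that corresponds to the symbol seen at this site.
        encoded[site, index[symbol]] = 1
    return encoded


DNA_ALPHABET = "ACGT"  # assumed ordering, chosen for the example
vectors = one_hot_sites("AGGAT", DNA_ALPHABET)
print(vectors)
```

Each row is one site and contains exactly one 1, so arithmetic such as means and covariances across many sequences becomes ordinary array arithmetic.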

Regression onto the first-order conditional polynomial (rFon1D): measures the impact of having a given amino acid or nucleotide at a given site. This impact can be positive or negative.

Regression onto the second-order conditional polynomial (rFon2D): measures the impact of having a given pair of nucleotides at a pair of sites (not yet supported for proteins).

Covariance measures the relationship between two items.

A positive covariance for a pair of sites → values of the first item tend to increase with those of the second (and vice versa).
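To make the sign of a covariance concrete, here is a minimal sketch with two indicator columns from a toy population of six sequences. The data are invented for illustration, and using NumPy's `np.cov` is an assumption of this example, not a description of how ortho_seqs computes its covariance matrix.

```python
import numpy as np

# Hypothetical toy population of six sequences (not real data).
# Each entry is 1 if the sequence has the given symbol at the given site.
a_at_site_1 = np.array([1, 1, 0, 0, 1, 0])
g_at_site_2 = np.array([1, 1, 0, 0, 0, 0])

# np.cov returns a 2x2 matrix; the off-diagonal entry is the covariance.
# bias=True divides by the population size rather than size - 1.
cov = np.cov(a_at_site_1, g_at_site_2, bias=True)[0, 1]

print(f"covariance = {cov:.4f}")
print("positive" if cov > 0 else "non-positive")
```

Here "G" at the second site only ever appears together with "A" at the first site, so the two indicators rise together and the covariance comes out positive.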

4 of 9

Application Time!

cov2 Dataset

[Figure: simplified explanation of the immune system background for the dataset. Labels shown: Immune System (Innate, Adaptive); B Cells; T Cells; BCRs (Antibodies); B Cell with BCR and Pathogen; Heavy (H) Chain; Light (L) Chain; Variable; Constant; CDRH1, CDRH2, CDRH3; CDRL3.]

(Image adapted from “AIRR-seq data analysis and processing” by Victor Greiff, Slide 9)

5 of 9

Application Time!

cov2 Dataset

CLI Input:

The cov2 dataset is a collection of antibodies that neutralize SARS-CoV-2, along with their sequences and their IC50 and EC50 values (where provided).

ortho_seq orthogonal-polynomial ../data/cov2/big_mabs_ic50s/ic50_seq.txt --pheno_file ../data/cov2/big_mabs_ic50s/ic50_phi.txt --molecule protein --poly_order first --out_dir /data_sm/home/miles/results/CDRH3/cov2_unique_ic50

IC50 = Half Maximal Inhibitory Concentration

EC50 = Half Maximal Effective Concentration
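The CLI input above can be read as one positional argument followed by four flags. The sketch below only rebuilds that same command as an argument list so its parts are easier to see. The comments describing each argument are a reading of the file and flag names, not documented behavior, and `data_dir` is a variable introduced for the example.

```python
import shlex

# Directory taken from the command on the slide.
data_dir = "../data/cov2/big_mabs_ic50s"

command = [
    "ortho_seq", "orthogonal-polynomial",
    f"{data_dir}/ic50_seq.txt",                  # positional: sequence file (assumed from its name)
    "--pheno_file", f"{data_dir}/ic50_phi.txt",  # phenotype values, here IC50
    "--molecule", "protein",                     # protein rather than nucleotide input
    "--poly_order", "first",                     # first-order calculations only
    "--out_dir", "/data_sm/home/miles/results/CDRH3/cov2_unique_ic50",
]

# Print the command exactly as it would be typed in a shell.
print(shlex.join(command))
```

If ortho_seqs is installed and the files exist, the same list could be passed to `subprocess.run(command, check=True)` instead of being printed.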

6 of 9

Application Time!

cov2 Dataset

*rFon1D values between -2500 and 2500 were omitted for graph legibility
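The omission described in the footnote amounts to a magnitude filter. A minimal sketch follows, assuming the boundary values themselves are also dropped (the slide does not say). The function name `keep_large_effects`, the key format, and the numbers are all hypothetical and are not real results.

```python
def keep_large_effects(values, threshold=2500):
    """Keep only entries whose magnitude is strictly greater than the threshold.

    Hypothetical helper for illustration; not part of ortho_seqs.
    """
    return {label: v for label, v in values.items() if abs(v) > threshold}


# Invented rFon1D values keyed by "site:residue".
rfon1d = {"1:A": 3100.0, "2:R": -120.5, "3:G": -4800.0, "4:Y": 2500.0}

filtered = keep_large_effects(rfon1d)
print(filtered)
```

Filtering on the absolute value keeps both strongly positive and strongly negative effects, which matches the footnote's symmetric range.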

7 of 9

Updates to ortho_seqs

What I’ve Done:

  • User-Friendly Updates and Efficiency
    • Automatic Padding
    • --dm, --sites, --pop_size flag Removal
    • Dimension Reducer
    • No Overwriting Data from --out_dir
  • Features
    • Covariance Histogram
    • Covariance .csv File
      • --min_pct flag
    • rFon1D Bar Plot
    • --alphbt_input flag

Sample DNA Sequences

AGGAT
TGA
GGGTA

After Padding

AGGAT
TGAnn
GGGTA

Dimensions: 4 (A, G, T, n)

Sites: 5 (5 columns)

Pop Size: 3 (3 sequences)

5 Initial Dimensions: A C G T n

4 Dimensions: A G T n

“C” is never used in any sequence, and is thus removed to reduce runtime.
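The padding and dimension-reduction example above can be sketched in a few lines. This is an illustration of the idea using the slide's three sequences, not the actual ortho_seqs implementation. The function names `pad_sequences` and `used_alphabet` are hypothetical.

```python
def pad_sequences(sequences, pad_char="n"):
    """Right-pad every sequence with pad_char up to the longest length."""
    width = max(len(s) for s in sequences)
    return [s.ljust(width, pad_char) for s in sequences]


def used_alphabet(sequences):
    """Return only the symbols that actually occur, in sorted order."""
    return sorted(set("".join(sequences)))


seqs = ["AGGAT", "TGA", "GGGTA"]
padded = pad_sequences(seqs)
alphabet = used_alphabet(padded)

print(padded)                           # padded sequences
print("Dimensions:", len(alphabet), alphabet)
print("Sites:", len(padded[0]))
print("Pop Size:", len(padded))
```

Because dimensions, sites, and population size can all be read off the padded input this way, they do not need to be supplied by the user, which is consistent with the removal of the --dm, --sites, and --pop_size flags listed above.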

8 of 9

Updates to ortho_seqs

What Will Be Done:

  • User-Friendly Updates and Efficiency
    • GUI
    • Histogram Improvements
    • Runtime Efficiency
    • Only one file for both seq and phi
  • Features
    • Third-Order Calculations for DNA
    • Second-Order Calculations for Proteins

...and more!

9 of 9

Special Thanks:

  • Saba Nafees: Mentor
  • Advisors: Eric Waltari, Joan Wong
  • Code review: Pranathi Vemuri
  • Server help: Saransh Kaul
  • CZ Biohub

Contact Me:

  • miles.woollacott@gmail.com