1 of 14

Global gene expression of NSCLC TIL

Students: Bogdanova Irina, Dilman Gleb

Supervisor: Esaulova Ekaterina

2 of 14

Introduction

Tumor-infiltrating lymphocytes (TIL) are white blood cells that have left the bloodstream and migrated towards a tumor.

TIL therapy is a type of cell-based immunotherapy that may be used to treat head and neck squamous cell carcinoma, melanoma, lung cancer, genitourinary cancers and a growing list of other malignancies.

It uses the patient’s own immune cells, taken from the microenvironment of the solid tumor, to kill tumor cells.

3 of 14

Introduction

PD-1 blockade unleashes CD8 T cells, but factors in the tumour microenvironment can inhibit these T cell responses.

Single-cell transcriptomics have revealed global T cell dysfunction programs in TIL. However, the majority of TIL do not recognize tumour antigens, and little is known about the transcriptional programs of neoantigen-specific TIL.

The authors of the article we study identified T cell clones using the functional expansion of specific T cells assay in neoadjuvant anti-PD-1-treated non-small cell lung cancers (NSCLC).

4 of 14

Objectives:

To reproduce the analysis of the transcriptional profile of tumor-infiltrating lymphocytes (TIL) described in the article 'Transcriptional programs of neoantigen-specific TIL in anti-PD-1-treated lung cancers' (doi: 10.1038/s41586-021-03752-4).

5 of 14

Tasks:

1. Download the preprocessed single-cell RNA-seq data and their annotations from GEO, and understand the data structure (3 count matrix files).

2. Select samples.

3. QC: filter by the number of genes, by mitochondrial and ribosomal gene counts, and by the number of cells.

4. Search for variable genes, run PCA, run UMAP.

5. Identify and annotate T-cell clusters by the expression of marker genes; visualize the results.

6. Assess the TIL expression profile.

6 of 14

Methods

We used Scanpy (a Python toolkit for single-cell analysis, analogous to Seurat for R): https://scanpy.readthedocs.io/en/stable/

This tutorial pipeline suits our purposes well:

https://scanpy-tutorials.readthedocs.io/en/latest/pbmc3k.html

We built our own pipeline for the data we analyze:

https://colab.research.google.com/drive/14h3puI9-0yEthxm4DVQOPx4ffzs3ZJh4?usp=sharing

Let’s describe our pipeline step by step.

7 of 14

Methods: what do we start with?

We start with preprocessed data published by the authors on GEO:

https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE173351

The authors obtained these data with Cell Ranger:

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger

8 of 14

Methods: what data structure do we analyze?

The expression profile is a count matrix:

a unique row for every cell and a unique column for every gene. For more about count matrices in single-cell RNA-seq, see: https://hbctraining.github.io/scRNA-seq/lessons/02_SC_generation_of_count_matrix.html

It’s VERY big: about 60000 cells and 30000 genes.

Scanpy uses a dedicated class, AnnData, for working with count matrices:

https://anndata.readthedocs.io/en/latest/

What a count matrix can look like (just an example). Note that in our data the cells are in rows, not in columns!
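To make the orientation concrete, here is a minimal pure-Python sketch of a cells-by-genes count matrix. The barcodes, gene names and counts are made up for illustration, and the helper functions are hypothetical; this is not the AnnData API, only the same row/column layout.

```python
# Toy count matrix: one row per cell, one column per gene.
# All barcodes, gene names and counts below are invented for illustration.
cell_barcodes = ["AAAC-1", "AAAG-1", "AACC-1"]
gene_names = ["CD3E", "CD8A", "GZMK", "MT-CO1"]
counts = [
    [5, 0, 2, 10],  # cell AAAC-1
    [3, 4, 0, 7],   # cell AAAG-1
    [0, 0, 0, 1],   # cell AACC-1
]


def total_counts_per_cell(matrix):
    """Sum of all counts in each row (each cell)."""
    return [sum(row) for row in matrix]


def n_genes_per_cell(matrix):
    """Number of genes with a non-zero count in each cell."""
    return [sum(1 for c in row if c > 0) for row in matrix]


def gene_column(matrix, genes, gene):
    """Counts of a single gene across all cells (one column)."""
    j = genes.index(gene)
    return [row[j] for row in matrix]
```

Per-cell summaries such as `total_counts_per_cell(counts)` and `n_genes_per_cell(counts)` are the kind of quantities that QC filtering is usually based on.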

9 of 14

Methods: QC

Some genes and cells should be removed because they make the data noisy. Besides, the data table is too big.

So we removed mitochondrial genes, high-abundance lincRNA genes, genes linked with poorly supported transcriptional models, and TCR (TR) genes (TRA/TRB/TRD/TRG, to avoid clonotype bias).
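The gene-removal step above can be sketched as name-based filtering. This is a minimal illustration, not the pipeline's actual code: the `MT-` prefix and the TCR pattern are assumptions about human gene symbols, and `extra_excluded` is a hypothetical user-supplied set standing in for the lincRNA and poorly supported gene lists, which the slides do not enumerate.

```python
import re

# Assumption: human mitochondrial gene symbols start with "MT-".
MITO_PATTERN = re.compile(r"^MT-")
# Assumption: TCR segment genes look like TRAV1-2, TRBJ2-1, TRBC1, TRAC, ...
# Requiring a digit after V/D/J avoids unrelated genes such as TRADD.
TCR_PATTERN = re.compile(r"^TR[ABDG](?:[VDJ]\d|C\d?$)")


def keep_gene(name, extra_excluded=frozenset()):
    """Return True if the gene should stay in the count matrix."""
    if MITO_PATTERN.match(name) or TCR_PATTERN.match(name):
        return False
    return name not in extra_excluded


def filter_genes(matrix, genes, extra_excluded=frozenset()):
    """Drop excluded gene columns from a cells-by-genes matrix."""
    kept = [j for j, g in enumerate(genes) if keep_gene(g, extra_excluded)]
    kept_genes = [genes[j] for j in kept]
    kept_matrix = [[row[j] for j in kept] for row in matrix]
    return kept_matrix, kept_genes
```

Pattern matching on gene symbols is only a heuristic; an annotation-based gene list would be more reliable where one is available.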

10 of 14

Methods: dimension reduction with PCA

Two main axes after PCA

Our data is a set of points in a space whose dimension equals the number of genes. It is hard to extract anything useful from such a structure directly. So we apply principal component analysis (PCA): finding linear combinations of the basis vectors along which our sample varies the most.

Thanks to Scanpy, it can be computed with a couple of lines of code.

Variance ratio by PCA axes (logarithmic scale)
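For intuition, PCA and the variance ratio per axis can be sketched with NumPy via a singular value decomposition of the centered matrix. This is an illustrative sketch on random toy data, not Scanpy's implementation; in Scanpy the corresponding calls are `sc.tl.pca` and `sc.pl.pca_variance_ratio`. The function name `pca` and the toy matrix are assumptions made for the example.

```python
import numpy as np


def pca(X, n_components=2):
    """PCA of a cells-by-genes matrix via SVD of the centered data.

    Returns (scores, components, variance_ratio) for the first
    n_components principal axes.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # center every gene (column)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_variance = S**2 / (X.shape[0] - 1)
    variance_ratio = explained_variance / explained_variance.sum()
    scores = U[:, :n_components] * S[:n_components]  # cell coordinates
    return scores, Vt[:n_components], variance_ratio[:n_components]


# Toy data: 100 "cells", 5 "genes" with very different spreads.
rng = np.random.default_rng(0)
toy = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
scores, components, ratio = pca(toy, n_components=3)
```

Plotting `ratio` on a logarithmic scale gives the same kind of curve as the variance-ratio figure, and the first two columns of `scores` are the two main axes.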

11 of 14

Methods: visualization and clustering with UMAP

UMAP is a powerful algorithm for visualizing high-dimensional data and revealing clusters in it. The main idea is to represent the data in two-dimensional space so that points that are close stay close and points that are far apart stay far apart.

See more: https://umap-learn.readthedocs.io/en/latest/

Now we have clusters of cells with similar transcriptional profiles.

12 of 14

Methods: annotating clusters with cell types using marker genes

The final step of our work is annotation. UMAP gave us the expression profiles of several genes, which we use to determine which cluster corresponds to which cell type. Importantly, different clusters can belong to the same cell type. We found that the number of clusters after UMAP depends strongly on the sample.

Each panel shows the expression profile of one gene

Correlation between marker genes and clusters
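The annotation step described above can be sketched as follows: compute the mean expression of each gene per cluster, then give each cluster the cell type whose markers are most expressed there. This is a simplified illustration under stated assumptions; the function names, the scoring rule (mean of marker means) and any marker lists passed in are hypothetical and not the panel or procedure used in the slides.

```python
from collections import defaultdict


def mean_expression_by_cluster(matrix, genes, cluster_labels):
    """Mean expression of every gene within each cluster.

    matrix is cells-by-genes; cluster_labels has one label per cell.
    """
    sums = defaultdict(lambda: [0.0] * len(genes))
    sizes = defaultdict(int)
    for row, cluster in zip(matrix, cluster_labels):
        sizes[cluster] += 1
        for j, value in enumerate(row):
            sums[cluster][j] += value
    return {
        cluster: dict(zip(genes, [s / sizes[cluster] for s in sums[cluster]]))
        for cluster in sums
    }


def annotate_clusters(cluster_means, markers):
    """Assign each cluster the cell type with the highest mean marker level.

    markers maps a cell-type name to a list of marker gene names.
    """
    annotation = {}
    for cluster, means in cluster_means.items():
        scores = {
            cell_type: sum(means.get(g, 0.0) for g in gene_list) / len(gene_list)
            for cell_type, gene_list in markers.items()
        }
        annotation[cluster] = max(scores, key=scores.get)
    return annotation
```

Because each cluster is scored independently, several clusters can receive the same label, which matches the observation that different clusters can be the same cell type.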

13 of 14

Results

The cells are clustered and the clusters are annotated with cell types. This stage of the overall study is complete, and our aim was accomplished.

Expression profile of marker genes

14 of 14

Github