1 of 21

Cross-species project: pilot work

Jon Manning, BBSRC-NSF update meeting

May 2021

2 of 21

Background

  • Single-cell Expression Atlas (SCXA) has curated (i.e. author-provided) and calculated cell groups across a wide variety of species.
    • Calculated markers for each cell group wrt all other groups combined.
  • We would like to relate cell-groups across experiments without species being a barrier
  • Queries like: “Show me all cell groups you have that are similar to this one”

3 of 21

Two approaches

  1. Marker overlap
    • Take top 100 differential genes
    • Map between species using orthology mappings
    • Rank thresholded cluster similarity between experiments by marker gene overlap

  2. SAMap (Alec Tarashansky)
    • Rank thresholded SAMap scores
    • https://github.com/atarashansky/SAMap
    • Workflow with parallelized BLAST: https://github.com/ebi-gene-expression-group/samap-workflow

4 of 21

Marker overlap

  • Follows a standard Scanpy workflow
  • Multiple parameter combinations tried for:
    • p value threshold
    • number of genes
    • minimum cluster overlap
  • Best: top 100 genes irrespective of p value, 5% overlap threshold.
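
A minimal sketch of the overlap scoring described above, in plain Python rather than Scanpy. It assumes each cell group already has a gene list ranked by differential expression and that orthologs are available as a dictionary; the function names, the toy gene lists, and the choice of denominator (mapped query markers) are illustrative assumptions, not taken from the project code.

```python
from typing import Dict, List, Set, Tuple


def top_markers(ranked_genes: List[str], n: int = 100) -> List[str]:
    """Keep the first n genes of a list already ranked by differential expression."""
    return ranked_genes[:n]


def map_orthologs(genes: List[str], orthologs: Dict[str, Set[str]]) -> Set[str]:
    """Translate genes to the other species; genes with no ortholog are dropped."""
    mapped: Set[str] = set()
    for gene in genes:
        mapped |= orthologs.get(gene, set())
    return mapped


def rank_matches(
    query_markers: List[str],
    target_markers: Dict[str, List[str]],
    orthologs: Dict[str, Set[str]],
    n: int = 100,
    min_overlap: float = 0.05,
) -> List[Tuple[str, float]]:
    """Rank target cell groups by the proportion of mapped query markers they share."""
    query = map_orthologs(top_markers(query_markers, n), orthologs)
    if not query:
        return []
    scores = []
    for group, ranked in target_markers.items():
        overlap = len(query & set(top_markers(ranked, n))) / len(query)
        if overlap >= min_overlap:  # assumed reading of the 5% overlap threshold
            scores.append((group, overlap))
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy input: "XYZ" is a made-up gene with no ortholog.
orthologs = {"INS": {"Ins1", "Ins2"}, "GCG": {"Gcg"}, "SST": {"Sst"}}
mouse_markers = {
    "B cell": ["Ins1", "Ins2", "Iapp"],
    "A cell": ["Gcg", "Ttr"],
    "Acinar": ["Cpa1"],
}
matches = rank_matches(["INS", "GCG", "XYZ"], mouse_markers, orthologs)
print([group for group, _ in matches])  # ['B cell', 'A cell']
```

Taking the top genes irrespective of p value keeps the list length constant across cell groups, which makes overlap proportions comparable between them.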

5 of 21

More complex approach: SAMap

https://doi.org/10.1101/2020.09.28.317784

https://github.com/atarashansky/

  1. Generate a gene/gene bipartite graph across species
  2. Merge to single PC space
  3. Compute expression correlations in PC space with k-nearest cross-species neighbours
  4. Reweight homology graph
  5. Re-assign cross-species cell neighbours
  6. Recalculate gene/gene correlations, update homology graph again.
  7. Refine mutual connectivity of cells between species.
  8. Recalculate gene expression correlations, reweight homology graph, produce final atlas alignments.

6 of 21

Evaluation approach

  • Compare mouse Tabula Muris to human experiments with annotated cell types.
  • Rank methods by largest proportion of known intersecting cell types detected.

[Figure: mouse (M) compared with human (H)]

7 of 21

1. Marker overlap method

8 of 21

Difficulty: standard SCXA marker sets are mismatched by context

Markers are only markers relative to something else

9 of 21

Importance of context for marker comparison: e.g. what’s the signature for a pancreatic B cell?

E-MTAB-5061: Human pancreas

Markers = “What genes make a B cell look different to average expression across the rest of the pancreas?”

OR

“What genes define the un-pancreasiness of the B cell?”

Pancreas
  • B cell
  • Acinar
  • A cell
  • D cell

E-ENAD-15: whole mouse

Markers = “What genes make a B cell look different to average expression across the rest of the mouse?”

OR

“What genes define the un-mousiness of the B cell?”

Mouse
  • Pancreas
    • B cell
    • Acinar
    • A cell
    • D cell
  • Blood
  • Kidney
  • Heart

10 of 21

What happens if we try?

Markers for human B cells from the pancreas context overlap with marker genes for a variety of pancreas cell types from the whole-mouse context.

Markers for pancreatic cell types in the mouse dataset are less specific to those cell types within the pancreas, since the background contained a relatively low number of pancreatic cells.

The same applies to other hierarchical differences, e.g. pancreas vs sub-parts endocrine/exocrine pancreas.
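
A toy numerical illustration of the context effect, using made-up expression values for a single hypothetical gene. For simplicity the background is an unweighted mean over cell types, which is an assumption; real marker tests work on cells and depend on how many cells of each type are present.

```python
from statistics import mean
from typing import Dict, List

# Made-up mean expression of one hypothetical gene per cell type (arbitrary units).
expression: Dict[str, float] = {
    "B cell": 9.0, "A cell": 8.0, "D cell": 8.5, "Acinar": 7.5,  # pancreas
    "Blood": 1.0, "Kidney": 1.5, "Heart": 0.5,                    # rest of mouse
}
pancreas: List[str] = ["B cell", "A cell", "D cell", "Acinar"]
whole_mouse: List[str] = list(expression)


def difference_vs_background(target: str, context: List[str]) -> float:
    """Target expression minus the mean over all other cell types in the context."""
    background = [expression[c] for c in context if c != target]
    return expression[target] - mean(background)


print(difference_vs_background("B cell", pancreas))     # 1.0
print(difference_vs_background("B cell", whole_mouse))  # 4.5
```

The same gene looks like a strong B cell marker against the whole mouse but a weak one within the pancreas, because it is high in every pancreatic cell type.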

[Figure panels: common markers with B cells wrt mouse; common markers with B cells wrt mouse pancreas]

11 of 21

Easy, I’ll just split experiments by organism part and re-calculate markers before comparison

12 of 21

Not quite: granularity can differ between experiments, so we need to merge more specific terms

E-ENAD-15 granularity

E-MTAB-5061 granularity

So we need to merge these categories for comparison

13 of 21

Solution

  • Only compare marker genes generated for the same context.
  • Recipe:
    1. For two experiments sharing at least some organism parts, find exactly matching terms.
    2. For terms that don’t match, compare lists of ancestor terms and look for matches.
    3. Assign to both experiments the most specific shared terms.
    4. Generate marker genes for cell types within matching contexts (organism parts).
    5. Compare marker lists to derive cross-species relationships between cell types/clusters.
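
Steps 1 to 3 of the recipe can be sketched as follows. This is not the Snakemake implementation; the `PARENT` table is a small hypothetical stand-in for a real anatomy ontology, and it assumes each term has a single parent, whereas real ontologies allow several.

```python
from typing import Dict, List, Optional, Set, Tuple

# Hypothetical child -> parent relationships (single parent per term for simplicity).
PARENT: Dict[str, str] = {
    "ascending colon": "colon",
    "colon": "large intestine",
    "islet of Langerhans": "endocrine pancreas",
    "endocrine pancreas": "pancreas",
    "exocrine pancreas": "pancreas",
}


def lineage(term: str) -> List[str]:
    """Return the term followed by its ancestors, most specific first."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def shared_context(term_a: str, term_b: str) -> Optional[str]:
    """Most specific term present in both lineages, or None if they never meet."""
    lineage_b = set(lineage(term_b))
    for candidate in lineage(term_a):
        if candidate in lineage_b:
            return candidate
    return None


def match_contexts(parts_a: Set[str], parts_b: Set[str]) -> Dict[Tuple[str, str], str]:
    """Map each pair of organism parts to the shared term both should be assigned."""
    matches: Dict[Tuple[str, str], str] = {}
    for a in sorted(parts_a):
        for b in sorted(parts_b):
            shared = shared_context(a, b)
            if shared is not None:
                matches[(a, b)] = shared
    return matches


mouse_parts = {"colon", "pancreas"}
human_parts = {"ascending colon", "islet of Langerhans"}
print(match_contexts(mouse_parts, human_parts))
```

Exact matches fall out of the same function, since a term is the first entry in its own lineage. With a full ontology, very general ancestors would match almost everything, so a real implementation would need to cap how far up the hierarchy it climbs.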

Snakemake implementation: https://github.com/ebi-gene-expression-group/cross-species-cellgroup-comparison

14 of 21

Basic validation

  • Use experiments with cell type annotations.
  • Determine overlapping cell types between experiment pairs.
  • Match cell types between experiments by markers:
    • p value threshold: 0.05
    • Minimum proportion overlap: 0.05
    • Rank matched types by proportion overlap
  • If markers are a bad way of matching, we’ll pick the wrong matches in the other dataset.
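
A minimal sketch of how the top-rank check might be scored, assuming cell type labels have already been harmonised between the two experiments. The function name and the toy scores are illustrative only.

```python
from typing import Dict, List, Set, Tuple


def top_rank_hits(
    ranked_matches: Dict[str, List[Tuple[str, float]]],
    intersecting: Set[str],
) -> int:
    """Count intersecting cell types whose best-scoring match is the same cell type."""
    hits = 0
    for cell_type in intersecting:
        candidates = ranked_matches.get(cell_type, [])
        if candidates and max(candidates, key=lambda c: c[1])[0] == cell_type:
            hits += 1
    return hits


# Toy scores: (matched cell type in the other experiment, proportion overlap).
toy_matches = {
    "B cell": [("B cell", 0.4), ("A cell", 0.1)],  # correct top match
    "A cell": [("D cell", 0.3), ("A cell", 0.2)],  # wrong top match
    "D cell": [],                                   # nothing passed the thresholds
}
toy_intersecting = {"B cell", "A cell", "D cell"}
print(f"{top_rank_hits(toy_matches, toy_intersecting)} of {len(toy_intersecting)}")  # 1 of 3
```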

15 of 21

Results

16 of 21

Pilot results: Between human and mouse, SAMap performs similarly to marker overlap

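| Experiment 1 (mouse) | Experiment 2 (human) | Common organism part | Intersecting cell types | Predicted intersecting (top rank, markers) | Predicted intersecting (top rank, SAMap) |
|----------------------|----------------------|----------------------|-------------------------|--------------------------------------------|------------------------------------------|
| E-ENAD-15            | E-GEOD-125970        | colon                | 2                       | 2                                          | 2                                        |
| E-ENAD-15            | E-GEOD-83139         | pancreas             | 5                       | 5                                          | 4                                        |
| E-ENAD-15            | E-HCAD-10            | kidney               | 4                       | 4                                          | 4                                        |
| E-ENAD-15            | E-HCAD-1             | lung                 | 4                       | 3                                          | 3                                        |
| E-ENAD-15            | E-HCAD-1             | spleen               | 3                       | 1                                          | 1                                        |
| E-ENAD-15            | E-MTAB-5061          | pancreas             | 7                       | 7                                          | 7                                        |
| E-ENAD-15            | E-MTAB-8410          | ascending colon      | 2                       | 2                                          | 2                                        |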
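
As a rough summary, the counts above can be totalled per method. The pooled proportions below are simple arithmetic on the table values, not a statistic reported in the slides.

```python
# (organism part, intersecting, top rank by markers, top rank by SAMap), copied from the table.
rows = [
    ("colon", 2, 2, 2),
    ("pancreas", 5, 5, 4),
    ("kidney", 4, 4, 4),
    ("lung", 4, 3, 3),
    ("spleen", 3, 1, 1),
    ("pancreas", 7, 7, 7),
    ("ascending colon", 2, 2, 2),
]

total = sum(row[1] for row in rows)
markers = sum(row[2] for row in rows)
samap = sum(row[3] for row in rows)

print(f"markers: {markers}/{total} = {markers / total:.2f}")  # markers: 24/27 = 0.89
print(f"SAMap:   {samap}/{total} = {samap / total:.2f}")      # SAMap:   23/27 = 0.85
```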

17 of 21

Example: pancreas

18 of 21

Example: kidney

19 of 21

Interim conclusions

  • At close evolutionary distance of human vs mouse, a simple marker overlap performs as well as more complex methods.
  • Results from marker overlap are subjectively quite noisy.
  • Expect more distant relationships to be more difficult, and to benefit more from complex methodology.
  • Important to have context-matched markers

20 of 21

Next steps

  • Examine annotation issues
    • e.g. what if only A and B cells had been extracted from the pancreas? How useful are B cell markers then?
  • Human/mouse is an easy case. Extend to fly as curation proceeds
    • Curation ongoing in SCXA, external datasets could be used in the meantime
  • Try with ‘unknowns’, e.g. comparing clusters with known cell types to make predictions.

[Figure: fly (F), mouse (M), human (H)]

21 of 21

Acknowledgements

  • Supervision: Irene Papatheodorou, Pablo Moreno
  • Gene Expression Team
  • Funders