1 of 21

Cross-species project: pilot work

Jon Manning, BBSRC-NSF update meeting

May 2021

2 of 21

Background

Single-cell Expression Atlas (SCXA) has curated (i.e. author-provided) and calculated cell groups across a wide variety of species.

Markers are calculated for each cell group with respect to all other groups combined.

We would like to relate cell groups across experiments without species being a barrier.
Queries like: “Show me all cell groups you have that are similar to this one”

3 of 21

Two approaches

Marker overlap

Take top 100 differential genes
Map between species using orthology mappings
Rank thresholded cluster similarity between experiments by marker gene overlap

SAMap (Alec Tarashansky)

Rank thresholded SAMap scores
https://github.com/atarashansky/SAMap
Workflow with parallelized BLAST: https://github.com/ebi-gene-expression-group/samap-workflow

4 of 21

Marker overlap

Follows a standard Scanpy workflow
Multiple parameter combinations tried for:

p value threshold
number of genes
minimum cluster overlap

Best: top 100 genes irrespective of p value, 5% overlap threshold.
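
A minimal sketch of the overlap score described above, assuming markers are already ranked per cell group and that orthology is available as a simple one-to-one dictionary (real orthology mappings can be one-to-many). The function names, the toy gene lists, and the choice of denominator (the smaller of the two marker sets) are illustrative assumptions, not the pipeline's actual code.

```python
def map_orthologs(genes, orthologs):
    """Translate gene IDs to the other species, dropping genes with no ortholog."""
    return [orthologs[g] for g in genes if g in orthologs]


def overlap_proportion(markers_a, markers_b, top_n=100):
    """Share of top-N markers common to both groups (assumed denominator: smaller set)."""
    a, b = set(markers_a[:top_n]), set(markers_b[:top_n])
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def rank_matches(query_markers, candidates, orthologs, top_n=100, min_overlap=0.05):
    """Rank candidate cell groups by overlap with the query, keeping those above threshold."""
    mapped = map_orthologs(query_markers[:top_n], orthologs)
    scores = {name: overlap_proportion(mapped, m, top_n) for name, m in candidates.items()}
    kept = [(name, s) for name, s in scores.items() if s >= min_overlap]
    return sorted(kept, key=lambda x: x[1], reverse=True)


# Toy data (hypothetical): mouse query markers and human candidate groups.
orthologs = {"Ins1": "INS", "Gcg": "GCG", "Sst": "SST", "Iapp": "IAPP"}
mouse_beta = ["Ins1", "Iapp", "Sst", "Xyz1"]  # "Xyz1" stands in for a gene with no ortholog
human_groups = {
    "beta cell": ["INS", "IAPP", "MAFA"],
    "alpha cell": ["GCG", "TTR", "ARX"],
    "delta cell": ["SST", "RBP4", "HHEX"],
}
ranked = rank_matches(mouse_beta, human_groups, orthologs)
```

With this toy input the beta cell group ranks first, the delta cell group second, and the alpha cell group falls below the 5% threshold and is dropped.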

5 of 21

More complex approach: SAMap

https://doi.org/10.1101/2020.09.28.317784

https://github.com/atarashansky/

Generate a gene/gene bipartite graph across species
Merge to single PC space
Compute expression correlations in PC space with k-nearest cross-species neighbours
Reweight homology graph
Re-assign cross-species cell neighbours
Re-calculate gene/gene correlations, update homology graph again.
Refine mutual connectivity of cells between species.
Recalculate gene expression correlations, reweight homology graph, produce final atlas alignments.

6 of 21

Evaluation approach

Compare mouse Tabula Muris to human experiments with annotated cell types.
Rank methods by largest proportion of known intersecting cell types detected.


7 of 21

1. Marker overlap method

8 of 21

Difficulty: standard SCXA marker sets are mismatched by context

Markers are only markers relative to something else

9 of 21

Importance of context for marker comparison: e.g. what’s the signature for a pancreatic B cell?

E-MTAB-5061: Human pancreas

Markers = “What genes make a B cell look different to average expression across the rest of the pancreas?”

OR

“What genes define the un-pancreasiness of the B cell?”

[Diagram: Pancreas, containing B cell, Acinar, A cell, D cell]

E-ENAD-15: whole mouse

Markers = “What genes make a B cell look different to average expression across the rest of the mouse?”

OR

“What genes define the un-mousiness of the B cell?”

[Diagram: Mouse, containing Pancreas (B cell, Acinar, A cell, D cell), Blood, Kidney, Heart]

10 of 21

What happens if we try?

Markers for human B cells from the pancreas context overlap with marker genes for a variety of pancreas cell types from the whole-mouse context.

Markers for pancreatic cell types in the mouse dataset are less specific to those cell types within the pancreas, since the background contained a relatively low number of pancreatic cells.

The same applies to other hierarchical differences, e.g. pancreas vs sub-parts endocrine/exocrine pancreas.
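
The effect of background can be illustrated with a toy calculation. This is only a sketch: the expression values are invented, the gene labels `beta_specific` and `pancreas_wide` are hypothetical, and the score (mean in the B cell group minus the unweighted mean of the other groups in the context) is a simplification of a real differential expression test.

```python
from statistics import mean

# Invented mean expression per cell group (hypothetical values).
expression = {
    "beta_specific": {"B cell": 9.0, "A cell": 1.0, "Acinar": 1.0,
                      "Blood": 1.0, "Kidney": 1.0, "Heart": 1.0},
    "pancreas_wide": {"B cell": 8.0, "A cell": 8.0, "Acinar": 8.0,
                      "Blood": 1.0, "Kidney": 1.0, "Heart": 1.0},
}


def marker_score(gene, target, context):
    """Expression in the target group minus the mean over the other groups in the context."""
    background = [expression[gene][g] for g in context if g != target]
    return expression[gene][target] - mean(background)


pancreas = ["B cell", "A cell", "Acinar"]
whole_mouse = pancreas + ["Blood", "Kidney", "Heart"]

for gene in expression:
    print(gene,
          "| pancreas context:", marker_score(gene, "B cell", pancreas),
          "| whole-mouse context:", marker_score(gene, "B cell", whole_mouse))
```

In this toy setup the gene expressed throughout the pancreas scores zero for B cells within the pancreas context but clearly positive in the whole-mouse context, so it would appear as a "B cell marker" only against the whole-mouse background.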

[Figure panels: common markers with B cells wrt mouse; common markers with B cells wrt mouse pancreas]

11 of 21

Easy, I’ll just split experiments by organism part and re-calculate markers before comparison

12 of 21

Not quite: granularity can differ, so we need to merge more specific terms

E-ENAD-15 granularity

E-MTAB-5061 granularity

So we need to merge these categories for comparison

13 of 21

Solution

Only compare marker genes generated for the same context.
Recipe:

For two experiments sharing at least some organism parts, find exactly matching terms.
For terms that don’t match, compare lists of ancestor terms and look for matches.
Assign to both experiments the most specific shared terms.
Generate marker genes for cell types within matching contexts (organism parts).
Compare marker lists to derive cross-species relationships between cell types/clusters.

Snakemake implementation: https://github.com/ebi-gene-expression-group/cross-species-cellgroup-comparison
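
The term-matching steps of the recipe can be sketched as follows. This is not the code from the linked repository: it assumes each organism-part term comes with a list of ancestors ordered from most specific to most general, and the helper name and toy ontology fragment are hypothetical.

```python
def most_specific_shared_term(term_a, term_b, ancestors):
    """Return the most specific term shared by the lineages of two terms, or None.

    `ancestors` maps a term to its ancestors, ordered from most specific
    to most general (an assumed input format).
    """
    lineage_a = [term_a] + ancestors.get(term_a, [])
    lineage_b = set([term_b] + ancestors.get(term_b, []))
    for term in lineage_a:  # walk upwards from the most specific term
        if term in lineage_b:
            return term
    return None


# Hypothetical toy ontology fragment.
ancestors = {
    "ascending colon": ["colon", "large intestine", "intestine"],
    "colon": ["large intestine", "intestine"],
    "islet of Langerhans": ["endocrine pancreas", "pancreas"],
    "pancreas": [],
}

shared_colon = most_specific_shared_term("ascending colon", "colon", ancestors)
shared_pancreas = most_specific_shared_term("islet of Langerhans", "pancreas", ancestors)
unrelated = most_specific_shared_term("colon", "pancreas", ancestors)
```

Here "ascending colon" and "colon" resolve to "colon", the islet term and "pancreas" resolve to "pancreas", and terms with no common ancestor return `None`. A real ontology is a graph in which terms can have several parents, so a linear ancestor list is a simplification.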

14 of 21

Basic validation

Use experiments with cell type annotations.
Determine overlapping cell types between experiment pairs.
Match cell types between experiments by markers:

p value threshold: 0.05
Minimum proportion overlap: 0.05
Rank matched types by proportion overlap

If markers are a bad way of matching, we’ll pick the wrong matches in the other dataset.
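
The check above can be sketched as a count of top-rank hits. This assumes ranked matches are already available per cell type and that equivalent cell types carry identical labels in both experiments; the function name and the toy scores are hypothetical.

```python
def top_rank_hits(ranked_matches, shared_types):
    """Count shared cell types whose top-ranked match carries the same label."""
    hits = 0
    for cell_type in shared_types:
        matches = ranked_matches.get(cell_type, [])
        if matches and matches[0][0] == cell_type:
            hits += 1
    return hits


# Toy ranked matches (hypothetical): query type -> [(matched type, proportion overlap), ...]
ranked_matches = {
    "beta cell": [("beta cell", 0.42), ("delta cell", 0.11)],
    "alpha cell": [("alpha cell", 0.35)],
    "acinar cell": [("ductal cell", 0.20), ("acinar cell", 0.18)],
}
shared_types = ["beta cell", "alpha cell", "acinar cell"]

hits = top_rank_hits(ranked_matches, shared_types)
print(f"{hits} of {len(shared_types)} shared cell types recovered at top rank")
```

In the toy data the acinar cell query picks the wrong top match, which is the kind of failure this validation is meant to expose.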

15 of 21

Results

16 of 21

Pilot results: Between human and mouse, SAMap performs similarly to marker overlap

Experiment 1 (mouse)	Experiment 2 (human)	Common organism part	Intersecting cell types	Predicted intersecting (top rank, markers)	Predicted intersecting (top rank, SAMap)
E-ENAD-15	E-GEOD-125970	colon	2	2	2
E-ENAD-15	E-GEOD-83139	pancreas	5	5	4
E-ENAD-15	E-HCAD-10	kidney	4	4	4
E-ENAD-15	E-HCAD-1	lung	4	3	3
E-ENAD-15	E-HCAD-1	spleen	3	1	1
E-ENAD-15	E-MTAB-5061	pancreas	7	7	7
E-ENAD-15	E-MTAB-8410	ascending colon	2	2	2
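
As a worked summary of the table above, the per-method totals can be computed directly from its rows. The snippet only re-adds the numbers shown in the table; the variable names are illustrative.

```python
# (organism part, experiment 2, intersecting, top-rank markers, top-rank SAMap),
# copied from the table above.
rows = [
    ("colon", "E-GEOD-125970", 2, 2, 2),
    ("pancreas", "E-GEOD-83139", 5, 5, 4),
    ("kidney", "E-HCAD-10", 4, 4, 4),
    ("lung", "E-HCAD-1", 4, 3, 3),
    ("spleen", "E-HCAD-1", 3, 1, 1),
    ("pancreas", "E-MTAB-5061", 7, 7, 7),
    ("ascending colon", "E-MTAB-8410", 2, 2, 2),
]

total = sum(r[2] for r in rows)
markers_found = sum(r[3] for r in rows)
samap_found = sum(r[4] for r in rows)

print(f"markers: {markers_found}/{total} = {markers_found / total:.2f}")  # markers: 24/27 = 0.89
print(f"SAMap:   {samap_found}/{total} = {samap_found / total:.2f}")      # SAMap:   23/27 = 0.85
```

Across the seven comparisons the two methods differ by a single cell type (the E-GEOD-83139 pancreas comparison).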

17 of 21

Example: pancreas

18 of 21

Example: kidney

19 of 21

Interim conclusions

At close evolutionary distance of human vs mouse, a simple marker overlap performs as well as more complex methods.
Results from marker overlap are subjectively quite noisy.
Expect more distant relationships to be more difficult, and benefit more from complex methodology.
Important to have context-matched markers.

20 of 21

Next steps

Examine annotation issues

e.g. what if only A and B cells had been extracted from pancreas? How useful are B cell markers then?

Human/mouse is an easy case. Extend to fly as curation proceeds.

Curation is ongoing in SCXA; external datasets could be used in the meantime.

Try with ‘unknowns’, e.g. comparing clusters with known cell types to make predictions.


21 of 21

Acknowledgements

Supervision: Irene Papatheodorou, Pablo Moreno
Gene Expression Team
Funders