1 of 26

Surveying the landscape of effector gene prediction

Laura Harris, EMBL-EBI

Maria Costanzo, Broad Institute

Knowledge Portal Network

GWAS Catalog

2 of 26

Effector gene prediction is a key output of GWAS

  • Important output of GWA studies
  • Link to drug discovery targets
  • Key to defining gold standard (solved) loci in complex disease

  • Different methods may not produce the same outcome
  • Differences can be impossible to assess due to missing data
  • No standards exist for reporting or methods

3 of 26

Current resource landscape

  • GWAS Catalog only includes mapped genes
  • Predicted effector genes (PEGs) are not curated due to lack of clear provenance & poor reporting standards
  • PEGs are curated by the A2F Knowledge Portal & Open Targets
  • Curation is excessively manual & suffers from missing data
  • Output is not findable, accessible, interoperable, or reusable

Goal: a FAIR PEG data standard that works for all 3 resources & the community, enabling:

  • meta-analysis to define gold standard lists of genes involved in traits
  • data integration and reuse: computational ingest, submission to knowledgebases, supporting AI/ML/KG use
  • hypothesis generation, confirming independent findings, drug target identification

4 of 26

Predicting effector genes for complex diseases and traits

5 of 26

Gene prioritizations and predicted effector gene (PEG) lists

6 of 26

First list of predicted effector genes (PEGs) in the T2D Knowledge Portal

7 of 26

Heuristic to combine evidence types into a categorization of evidence strength

8 of 26

Investigating trends in gene prioritization

5,140 papers loaded by GWAS Catalog from 2012-2022

Scan titles and abstracts: mention of gene prioritization

Scan full text:

  • Multiple evidence types aggregated
  • All GWAS significant loci investigated

169 papers with systematic gene prioritization, across 157 traits

9 of 26

Investigating trends in gene prioritization

Number of papers incorporated into the GWAS Catalog that include systematic gene prioritization (blue bars, left vertical axis) and percent of total papers added to the GWAS Catalog (red bars, right vertical axis), by year of publication.

10 of 26

Variant-centric evidence

  • Nearest gene (including coding variant impact)

  • Chromatin conformation: does the causal variant contact a gene-specific regulatory element?

  • Epigenomic annotations: does the causal variant lie within an annotated regulatory region?

  • QTLs: does the causal variant impact levels of a gene’s transcript or protein product, splice forms of the transcript, or other molecular properties of a gene?

Are any of these specific to disease-relevant tissues?

Start by identifying the causal variant, then find evidence about its impact

11 of 26

Gene-centric evidence

  • “Guilt by association”: co-regulation, gene set membership, or protein-protein interactions with known disease-associated genes/proteins; differential expression in disease

  • Perturbation: does mutation of a model organism ortholog or KO in a cell line confer a disease-related phenotype? Are there Mendelian disease mutations in the gene?

  • Gene burden: is there a significant burden of common or rare variants in the gene associated with disease risk?

  • Literature/online resources

Start with genes in GWAS loci, then find evidence about their function

12 of 26

Pipelines

  • Combined SNP2Gene
  • DEPICT
  • Ei (Effector index)
  • FUMA SNP2GENE and GENE2FUNC
  • Gene Priority Score (GPS)
  • Open Targets L2G (Locus to Gene)
  • PoPS (Polygenic Priority Score)
  • ProGeM (Prioritization of candidate causal genes at molecular QTLs)

13 of 26

How many evidence types are used per study?

14 of 26

Are there trends in usage of specific evidence types?

15 of 26

Gene prioritizations vs. PEG lists

Mouse mutant phenotype evidence for genes at GWAS loci

eQTL evidence for genes at GWAS loci

DEPICT gene prioritization scores for genes at GWAS loci

Gene set enrichment analysis for genes at GWAS loci

Tissue and cell type annotation enrichment for genes at GWAS loci

Tissue-specific expression evidence for genes at GWAS loci

25% of papers included gene prioritization only

75% of papers integrated all evidence in a PEG list

16 of 26

Some PEG lists are presented only as images

Table in image format

Graphics presented without their underlying data

17 of 26

A major difference in information content: all genes per locus vs. top gene only

All genes per locus (71% of papers)

18 of 26

A major difference in information content: all genes per locus vs. top gene only

Top gene per locus (29% of papers)

19 of 26

Scoring system vs. no scoring

Scoring system (29% of papers)

20 of 26

Scoring system vs. no scoring

No scoring system (71% of papers)

21 of 26

Comparing predictions for the same trait

  • Manually convert into a table with one row per gene
  • Add a column with sum of number of evidence types
  • Compare with a supplementary table to find the locus for each gene

PEG list 1

PEG list 2
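The conversion steps above can be sketched in code. This is a minimal illustration, assuming the evidence has already been transcribed into one record per gene and evidence type; the field names, locus labels, and gene names are hypothetical placeholders, not data from either PEG list.

```python
# Hypothetical input: one record per (gene, evidence type) pair, as might be
# transcribed by hand from a paper's figure or supplementary table.
records = [
    {"locus": "locus_1", "gene": "GENE_A", "evidence": "eQTL"},
    {"locus": "locus_1", "gene": "GENE_A", "evidence": "coding_variant"},
    {"locus": "locus_1", "gene": "GENE_B", "evidence": "eQTL"},
    {"locus": "locus_2", "gene": "GENE_C", "evidence": "chromatin_conformation"},
]


def to_gene_table(records):
    """Collapse records into one row per (locus, gene) with an evidence count."""
    evidence_by_gene = {}
    for rec in records:
        key = (rec["locus"], rec["gene"])
        # A set ensures a repeated evidence type is counted only once per gene.
        evidence_by_gene.setdefault(key, set()).add(rec["evidence"])

    rows = []
    for (locus, gene), evidence in sorted(evidence_by_gene.items()):
        rows.append({
            "locus": locus,
            "gene": gene,
            "evidence_types": ";".join(sorted(evidence)),
            "n_evidence_types": len(evidence),
        })
    return rows


gene_table = to_gene_table(records)
for row in gene_table:
    print(row)
```

The evidence count here is an unweighted sum, matching the "sum of number of evidence types" column; a published scoring system could weight evidence types differently.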

22 of 26

Comparing predictions for the same trait

Find shared loci between lists 1 and 2

    • 9 loci in common

How often does the highest-ranked gene at a locus from list 1 match the top-ranked gene from list 2?

    • Concordance at 6/9 loci

PEG list 1: multiple genes per locus

PEG list 2: top gene only
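The comparison above can be sketched as follows. This is an illustrative sketch on toy data, assuming list 1 is a table with one row per gene and an evidence count, and list 2 is a mapping from locus to its single top gene; all names and values are hypothetical and do not reproduce the 6/9 result.

```python
def top_gene_per_locus(rows):
    """Return {locus: gene}, keeping the gene with the most evidence types.

    Ties are resolved in favor of the first gene encountered (an assumption).
    """
    best = {}
    for row in rows:
        current = best.get(row["locus"])
        if current is None or row["n_evidence_types"] > current["n_evidence_types"]:
            best[row["locus"]] = row
    return {locus: row["gene"] for locus, row in best.items()}


def concordance(list1_rows, list2_top):
    """Count shared loci and how many have the same top gene in both lists."""
    top1 = top_gene_per_locus(list1_rows)
    shared = sorted(set(top1) & set(list2_top))
    matches = [locus for locus in shared if top1[locus] == list2_top[locus]]
    return len(matches), len(shared)


# Hypothetical list 1: multiple genes per locus, with evidence counts.
list1 = [
    {"locus": "locus_1", "gene": "GENE_A", "n_evidence_types": 3},
    {"locus": "locus_1", "gene": "GENE_B", "n_evidence_types": 1},
    {"locus": "locus_2", "gene": "GENE_C", "n_evidence_types": 2},
    {"locus": "locus_2", "gene": "GENE_D", "n_evidence_types": 4},
    {"locus": "locus_3", "gene": "GENE_E", "n_evidence_types": 1},
]
# Hypothetical list 2: top gene only.
list2_top = {"locus_1": "GENE_A", "locus_2": "GENE_C", "locus_4": "GENE_F"}

n_match, n_shared = concordance(list1, list2_top)
print(f"Concordance at {n_match}/{n_shared} shared loci")
```

Matching loci by label assumes both lists use the same locus definitions; in practice, loci would need to be reconciled by coordinates or sentinel variants first.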

23 of 26

24 of 26

Minimal standards for PEG list metadata

  • Document the GWAS from which the significant loci are derived
    • Phenotype definition and ontology term mapping
    • Sample number, with numbers of cases/controls if applicable
    • Genome build
    • Ancestry
    • Publication

  • Specify the boundary coordinates of loci
  • Describe how each evidence type was generated, including references/links to input data and bioinformatic methods
  • Describe criteria for significance of each evidence type
  • Document the scoring system or heuristic used to categorize gene priority
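One way to picture these metadata requirements is as a structured record that can be checked for gaps. The sketch below is an assumption about how such a record might look, not the draft standard itself; the class names, field names, and example values are hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class SourceGwas:
    phenotype: str
    ontology_term: str
    sample_size: int
    genome_build: str
    ancestry: str
    publication: str
    n_cases: Optional[int] = None      # only if applicable
    n_controls: Optional[int] = None   # only if applicable


@dataclass
class EvidenceType:
    name: str
    method: str                  # bioinformatic method used
    input_data: str              # reference or link to input data
    significance_criterion: str


@dataclass
class PegListMetadata:
    source_gwas: SourceGwas
    locus_definition: str        # how locus boundary coordinates were set
    scoring_system: str          # scoring system or heuristic description
    evidence_types: list = field(default_factory=list)


def missing_fields(obj, prefix=""):
    """Return dotted names of string fields that were left empty."""
    missing = []
    if isinstance(obj, dict):
        for key, value in obj.items():
            missing.extend(missing_fields(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            missing.extend(missing_fields(value, f"{prefix}{i}."))
    elif obj == "":
        missing.append(prefix.rstrip("."))
    return missing


# Placeholder values only; two fields are deliberately left empty.
metadata = PegListMetadata(
    source_gwas=SourceGwas(
        phenotype="example trait",
        ontology_term="EXAMPLE:0000001",
        sample_size=10000,
        genome_build="GRCh38",
        ancestry="example ancestry label",
        publication="example citation",
        n_cases=4000,
        n_controls=6000,
    ),
    locus_definition="sentinel variant +/- 500 kb",
    scoring_system="",
    evidence_types=[
        EvidenceType(
            name="eQTL",
            method="colocalization",
            input_data="example eQTL dataset",
            significance_criterion="",
        )
    ],
)

as_dict = asdict(metadata)
gaps = missing_fields(as_dict)
print(json.dumps(as_dict, indent=2))
print("Missing:", gaps)
```

A record like this serializes directly to JSON, which would support the computational ingest and knowledgebase submission described earlier.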

25 of 26

Minimal standards for PEG list content and format

  • Present a single PEG list that includes all evidence types
  • Format the list as a downloadable text or spreadsheet file without any graphical elements
  • Present evidence for all the genes at each locus (rather than only for the highest-priority gene)
  • Identify the sentinel variant (rsID; chromosome:coordinate:alleles; genome build) for each GWAS locus considered
  • Use standard HGNC nomenclature for gene names and include Ensembl GeneIDs
  • Summarize the total weight of evidence for each gene
  • When using automated pipelines, take care not to count duplicate evidence types as independent observations (e.g., DEPICT incorporates Gene Ontology and mouse mutant phenotype annotations)
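As an illustration of the content and format points above, the sketch below writes a tab-delimited PEG list and avoids double-counting evidence that a pipeline already incorporates. The column names, identifiers, and the pipeline-to-component mapping are assumptions for illustration (the DEPICT entry follows the example in the list above); the identifiers shown are placeholders, not real rsIDs or Ensembl IDs.

```python
import csv
import io

# Assumed mapping from a pipeline to the evidence types it already incorporates.
PIPELINE_COMPONENTS = {
    "DEPICT": {"gene_ontology", "mouse_mutant_phenotype"},
}

# Hypothetical column layout: one row per gene at each locus.
COLUMNS = [
    "locus", "sentinel_rsid", "sentinel_variant", "genome_build",
    "gene_symbol", "ensembl_gene_id", "evidence_types", "n_independent_evidence",
]


def independent_evidence(evidence):
    """Drop evidence types already counted inside a pipeline that is also present."""
    evidence = set(evidence)
    covered = set()
    for pipeline, components in PIPELINE_COMPONENTS.items():
        if pipeline in evidence:
            covered |= components
    return evidence - covered


def write_peg_tsv(rows, handle):
    """Write rows as plain tab-delimited text with a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    for row in rows:
        independent = independent_evidence(row["evidence"])
        out = {name: row[name] for name in COLUMNS[:6]}
        out["evidence_types"] = ";".join(sorted(independent))
        out["n_independent_evidence"] = len(independent)
        writer.writerow(out)


# Placeholder rows; every gene at the locus is listed, not only the top gene.
rows = [
    {"locus": "locus_1", "sentinel_rsid": "rs0000000",
     "sentinel_variant": "1:1000000:A:G", "genome_build": "GRCh38",
     "gene_symbol": "GENE_A", "ensembl_gene_id": "ENSG00000000000",
     "evidence": ["DEPICT", "gene_ontology", "eQTL"]},
    {"locus": "locus_1", "sentinel_rsid": "rs0000000",
     "sentinel_variant": "1:1000000:A:G", "genome_build": "GRCh38",
     "gene_symbol": "GENE_B", "ensembl_gene_id": "ENSG00000000001",
     "evidence": ["gene_ontology"]},
]

buffer = io.StringIO()
write_peg_tsv(rows, buffer)
tsv_text = buffer.getvalue()
print(tsv_text)
```

For GENE_A, the Gene Ontology annotation is dropped from the count because DEPICT is also present; for GENE_B it is kept, since no pipeline in that row covers it.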

26 of 26

PEG standard development timeline

Initial community workshop

Sept 2024

Submit landscape manuscript

Refinement of draft standard

Convening working group

Landscape article published in NG

1st WG meeting

Recap & feedback on draft standard

2nd WG meeting

Focus on data matrix

3rd WG meeting

Focus on metadata

Kickoff benchmarking activity

4th WG meeting

Benchmarking results

PEG list format

Dec 2024

Spring 2025

April 2025

May 2025

June 2025

July 2025

Sept 2025

Oct 2025

5th WG meeting

PEG list format

Summary & run-through

Ancillary session