1 of 45

Annotating MAVE results with the Ensembl Variant Effect Predictor

Sarah Hunt

Variation Resources Coordination, EMBL-EBI

Multiplex assays of variant effects (MAVEs): Approaches, Analysis and Interpretation - 2025/11

2 of 45

MAVE data assessment

You have scores for large numbers of sequence changes from your MAVE assay: how do your results correlate with the available knowledge about the bases assayed?

There are useful reference datasets available for results checking

Here: describe available resources, and how to integrate them using the Ensembl Variant Effect Predictor

The Ensembl Variant Effect Predictor. McLaren W. et al. Genome Biology 17:122(2016)

3 of 45

Outline

  • Using Ensembl VEP
  • Predicting the effect of sequence changes on functional genomic elements
  • Using population data to assess sensitivity to change
  • Known variant impact
  • Variant scoring tools
  • How it works
  • Example command line

4 of 45

Ensembl VEP - web interface

https://www.ensembl.org/Tools/VEP

  • Multiple input formats supported
  • Maximum job size: 50 MB file
    • equivalent to ~2 million lines in VCF
  • Analysis jobs managed by scheduling system
  • Simple to use
  • Least configurable option

5 of 45

Ensembl VEP - web interface

Filtering options

Summary information

Results per variant/ genomic feature

6 of 45

Ensembl VEP - REST interface

https://rest.ensembl.org/

  • language-agnostic programmatic access
  • ideal for:
    • integrating information on a single variant into a webpage
    • integration into small scale annotation pipelines
  • limits: 55,000 requests per hour (~15 per second)
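
A minimal sketch of batch access from Python, assuming the REST server's `POST /vep/human/hgvs` endpoint with a JSON body keyed by `hgvs_notations`. The helper names and the example HGVS string are illustrative only, and the request is built but not sent.

```python
import json
import urllib.request

REST_SERVER = "https://rest.ensembl.org"
REQUESTS_PER_HOUR = 55_000  # limit quoted on this slide


def min_interval_seconds(requests_per_hour: int = REQUESTS_PER_HOUR) -> float:
    """Smallest average gap between requests that stays under the hourly limit."""
    return 3600 / requests_per_hour


def build_vep_hgvs_request(hgvs_notations: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request annotating a batch of HGVS strings."""
    body = json.dumps({"hgvs_notations": hgvs_notations}).encode("utf-8")
    return urllib.request.Request(
        url=f"{REST_SERVER}/vep/human/hgvs",
        data=body,
        headers={"Content-Type": "application/json", "Accept": "application/json"},
        method="POST",
    )


# Illustrative input; replace with the HGVS descriptions of your assayed variants.
request = build_vep_hgvs_request(["ENST00000366667:c.803C>T"])
# To send: urllib.request.urlopen(request), then json.load() the response,
# sleeping at least min_interval_seconds() between calls.
```

Grouping variants into one POST keeps the request count well below the hourly limit; for whole MAVE datasets the command line interface is likely the better fit.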

7 of 45

Ensembl VEP - command line interface

  • Fastest option
  • Total data privacy
  • Reference data downloadable in compact data files (caches)
  • Docker image + VM supported
  • Nextflow workflow available

  • Highly customisable in both reference data used and functionality applied (via plugins)

8 of 45

Variant input formats supported

Ensembl VEP is optimized to annotate variants described in Variant Call Format (VCF) and ordered by genomic location

Identifiers from a range of popular resources can be used as input for on-line analysis:

HGVS at genomic, transcript and protein level is supported, though protein level descriptions may give ambiguous results

NCBI SPDI format is supported at genomic level

A simple position-based tab delimited format is supported

9 of 45

Variant filtering

Ensembl VEP includes utilities to filter results using customisable criteria

filter_vep

--ontology --filter "Consequence is coding_sequence_variant"

--filter "AF < 0.01 or not AF"

--filter "CADD > 24"

Available as a command line utility or via the Ensembl VEP web interface
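
The logic of the frequency filter can be illustrated in Python. This is a sketch of the same rule applied to already-parsed rows, not a replacement for filter_vep; it assumes missing values appear as `-` (as in VEP tab output), and the variant names and frequencies are toy data.

```python
from typing import Optional


def passes_frequency_filter(af: Optional[str], threshold: float = 0.01) -> bool:
    """Mirror 'AF < 0.01 or not AF': keep rare alleles and alleles with no frequency."""
    if af is None or af in ("", "-"):
        return True  # no frequency reported for this allele
    return float(af) < threshold


# Toy rows standing in for parsed annotation output.
rows = [
    {"variant": "var1", "AF": "0.25"},
    {"variant": "var2", "AF": "0.0004"},
    {"variant": "var3", "AF": "-"},
]
kept = [row["variant"] for row in rows if passes_frequency_filter(row["AF"])]
print(kept)
```

The `or not AF` clause matters for MAVE data: most assayed changes have never been observed in a population, and a bare `AF < 0.01` test would discard them.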

10 of 45

Outline

  • Using Ensembl VEP
  • Predicting the effect of sequence changes on functional genomic elements
  • Using population data to assess sensitivity to change
  • Known variant impact
  • Variant scoring tools
  • How it works
  • Example command line

11 of 45

Molecular consequence prediction

Variants are mapped to transcript structures

The location type is identified

The expected molecular consequence of the change is predicted
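
The last step can be sketched for the simplest case, a single codon change in a coding region. This is a toy illustration using the standard genetic code and Sequence Ontology term names; it ignores start codons, splicing, strand and transcript structure, all of which a real annotator has to handle.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_consequence(ref_codon: str, alt_codon: str) -> str:
    """Return a Sequence Ontology term for a change from ref_codon to alt_codon."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "stop_retained_variant" if ref_aa == "*" else "synonymous_variant"
    if alt_aa == "*":
        return "stop_gained"
    if ref_aa == "*":
        return "stop_lost"
    return "missense_variant"


for ref, alt in [("GAA", "GAG"), ("GAA", "GCA"), ("GAA", "TAA"), ("TAA", "CAA")]:
    print(ref, alt, codon_consequence(ref, alt))
```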

12 of 45

Sequence Ontology

Standards enable understanding and comparisons across different resources

The use of Sequence Ontology terms to describe a variant's predicted molecular consequence has been agreed by many of the major variant databases and tool providers, including Ensembl, UCSC Genome Browser, dbSNP and SnpEff

13 of 45

More detailed terms (*) added due to Lord et al 2019 - Pathogenicity and selective constraint on variation near splice sites (10.1101/gr.238444.118)

Variants in/near splice sites

splice_donor_5th_base_variant *

splice_polypyrimidine_tract_variant *

splice_donor_variant

splice_acceptor_variant

splice_region_variant

splice_donor_region_variant *

'Splice site variant' is not reported; more specific terms are used

14 of 45

Gene annotation – transcript choice

Two groups create gene annotation for multiple species

  • Both use a variety of evidence to determine transcript structures
  • Tissue specific expression → many different splice isoforms

Many species have community-derived gene annotation

Transcript choice has a large impact on variant annotation

  • Both groups provide default transcripts but considering these alone will miss information
  • Best approach – annotate variants against a comprehensive set of transcripts and consider transcript quality information when determining variant impact.

McCarthy DJ, et al. Choice of transcripts and software has a large effect on variant annotation. Genome Med. 2014;6:26.

15 of 45

Human transcript sets

MANE - Matched Annotation from NCBI and EMBL-EBI:

    • A set of standard default transcripts for variant reporting in protein coding genes
    • Matches the GRCh38 assembly
    • Identical between NCBI RefSeq and Ensembl/GENCODE

Two tiers:

  • MANE Select - one per gene, representative of biology at each locus; well-supported, expressed, conserved
  • MANE Plus Clinical - alternate transcripts to capture additional clinically reportable variants

A joint NCBI and EMBL-EBI transcript set for clinical genomics and research.

Morales J, et al. Nature. 2022 Apr;604(7905):310-315. DOI: 10.1038/s41586-022-04558-8.

GENCODE Primary:

  • Contains nearly all human exons in the minimum number of transcripts, limiting the number of transcripts to be considered without losing information

16 of 45

Transcript data in the human Ensembl VEP cache

Protein mappings

  • UniProt
  • PDBe
  • protein families, domains and sites from InterProScan

Transcript quality information

    • Transcript Support Level (TSL)
    • APPRIS

InterProScan results for an Ensembl protein sequence.

17 of 45

Regulatory regions

Open chromatin

Histone modification assays

Transcription factor binding assays and motifs

Ensembl Regulatory Build

  • promoters
  • enhancers
  • transcription factor binding sites

http://www.ensembl.org/info/genome/funcgen/index.html

Consider potential impact on gene expression

18 of 45

Non-canonical open reading frame analysis

There is increasing interest in open reading frames within long noncoding RNAs and in the presumed untranslated region of protein coding genes

Efforts are underway to create catalogues of translated non-canonical ORFs, assayed using ribosome profiling

19 of 45

Non-canonical open reading frame analysis

RiboseqORFs plugin - Predicts the molecular consequences of a short variant with respect to open reading frames discovered in noncoding transcripts/regions

UTRAnnotator plugin - annotates variants in 5' untranslated regions (5' UTRs) that create or disrupt upstream open reading frames (Zhang et al. 2021)

20 of 45

Outline

  • Using Ensembl VEP
  • Predicting the effect of sequence changes on functional genomic elements
  • Using population data to assess sensitivity to change
  • Known variant impact
  • Variant scoring tools
  • How it works
  • Example command line

21 of 45

Allele frequency reference sets

Set                                                 Individuals
1000 Genomes Project                                2,504
TOPMed (heart, lung, blood, and sleep disorders)    62,784
gnomAD genomes                                      76,215 (v4)
gnomAD exomes                                       730,947 (v4)
ALFA (dbGaP - includes cases)                       204,108 (v3)
AllofUs                                             633,540 (v7.1)

Comparing the MAVE score for an allele with its population frequency is useful: if an allele is common in a large set of healthy individuals, a severely deleterious MAVE score is less likely

  • Be aware that some allele frequency data sets include disease cases
  • If the gene you are assaying is associated with a fairly common disease, consider this when applying frequency checks
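
A minimal sketch of such a frequency check. It assumes lower MAVE scores mean more deleterious, which is assay-specific, and both cutoffs are hypothetical values that would need choosing per assay and per gene. The variants are toy data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredVariant:
    name: str
    mave_score: float                  # assay score; scale and direction are assay-specific
    allele_frequency: Optional[float]  # None when absent from the reference set


def flag_discordant(variants, score_cutoff: float, af_cutoff: float) -> list:
    """Names of variants scored as deleterious yet common in the population."""
    return [
        v.name
        for v in variants
        if v.mave_score <= score_cutoff
        and v.allele_frequency is not None
        and v.allele_frequency >= af_cutoff
    ]


# Toy data: only v1 is both low-scoring and common.
variants = [
    ScoredVariant("v1", -2.5, 0.12),
    ScoredVariant("v2", -2.8, None),
    ScoredVariant("v3", 0.1, 0.20),
    ScoredVariant("v4", -3.0, 0.00001),
]
print(flag_discordant(variants, score_cutoff=-2.0, af_cutoff=0.01))
```

Flagged variants are candidates for a closer look at the assay or the variant mapping rather than automatic exclusions, particularly where the frequency set includes cases.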

Ensembl VEP human caches contain:

Other datasets can be added as custom annotations

22 of 45

Allele frequency reference sets

https://gnomad.broadinstitute.org/stats#diversity

https://www.researchallofus.org/data-tools/data-snapshots

Current large population frequency datasets are biased towards people of European ancestry.

This causes problems for:

  • personalised medicine in other populations
  • filtering ‘common’ variants in rare disease investigations
  • bias in pathogenicity predictors trained on population frequencies

23 of 45

Tolerance to change - conservation

Conservation measures reveal regions which are highly similar across species.

  • calculated from the genome wide alignment of the sequenced genomes of many different species
  • measures include GERP, phyloP

1 230710048 A -> G

GERP score: -3.69 (neutral)

restores ‘ancestral allele’

Available as Conservation plugin or custom annotation

24 of 45

Outline

  • Using Ensembl VEP
  • Predicting the effect of sequence changes on functional genomic elements
  • Using population data to assess sensitivity to change
  • Known variant impact
  • Variant scoring tools
  • How it works
  • Example command line

25 of 45

Variant disease and phenotype associations

Multiple resources provide phenotype/disease associations to variants

  • Data is highly heterogeneous
  • Important to consider by allele

NHGRI-EBI GWAS Catalog

Details available via Ensembl VEP Phenotype plugin

Experimental: ClinVar annotation mapping via Paralogs plugin

Assertions of likely pathogenicity for germline variants

Assertions of likely oncogenicity for somatic variants

Confidence ratings for assertions

Updated monthly

Statistical measures of association with a trait/disease from large scale genome wide case-control studies

26 of 45

Functional data - interactions

IntAct shares molecular interaction data, curated from reports of assays in the scientific literature.

>72,000 records describing the effect of protein changes in human on interactions

Information described in standard formats with links to publication providing detail

https://www.ebi.ac.uk/intact/home

IntAct plugin and genome-mapped data available

Molecular Interactions Controlled Vocabulary

27 of 45

Outline

  • Using Ensembl VEP
  • Predicting the effect of sequence changes on functional genomic elements
  • Using population data to assess sensitivity to change
  • Known variant impact
  • Variant scoring tools
  • How it works
  • Example command line

28 of 45

Variant scoring methods

Figure: variant types (frameshift variant, splice region variant, start lost variant, missense variant, regulatory region variant, intronic variant) matched to scoring tools (AlphaMissense, SpliceAI, MaxEntScan, Enformer, REVEL)

In silico scoring of likely variant pathogenicity is an active area of research and different tools focus on different variant or function types

29 of 45

Variant scoring methods

=> Agreement on pathogenicity rating may be due to similarity of method or noise in shared input/training data

  • There are many computational methods to predict if a missense variant is likely to be deleterious
  • Performance depends on data available and varies by gene/ region

  • Some use protein homology information (e.g. SIFT, PolyPhen-2, LRT)
  • Some use the same training data (e.g. FATHMM, MutationTaster, VEST3 use HGMD)
  • Some incorporate other callers (e.g. DANN, REVEL)
  • Some use deep sequencing across species (e.g. PrimateAI, AlphaMissense)

30 of 45

Variant impact on splicing

Genomic changes disrupting normal splice sites or creating novel splice sites can impact transcript structure and protein function

Ensembl VEP highlights when variants lie in regions important for splicing

Multiple tools exist to predict the disruption of existing or creation of novel (cryptic) splice sites

Image from Rowlands et al. 2019, Machine Learning Approaches for the Prioritization of Genomic Variants Impacting Pre-mRNA Splicing.

31 of 45

Splicing predictors

SpliceAI

  • Deep learning method
  • Predicts gain or loss of acceptor and donor splice sites caused by a variant (Δ score ≥ 0.8 confidently considered deleterious)
  • Ensembl precalculated scores
    • calculated for all possible SNV for MANE transcripts
    • maximum distance between the variant and gained/lost splice site: 50bp
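
A sketch of applying the Δ score threshold, assuming the pipe-delimited SpliceAI annotation layout `ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL`. The gene symbol and scores in the example string are invented, and the function name is illustrative.

```python
# Delta scores: acceptor gain, acceptor loss, donor gain, donor loss.
DELTA_SCORE_NAMES = ("DS_AG", "DS_AL", "DS_DG", "DS_DL")
HIGH_CONFIDENCE_CUTOFF = 0.8  # threshold quoted on this slide


def max_delta_score(spliceai_value: str) -> tuple[str, float]:
    """Return the name and value of the largest of the four delta scores."""
    fields = spliceai_value.split("|")
    scores = dict(zip(DELTA_SCORE_NAMES, (float(x) for x in fields[2:6])))
    name = max(scores, key=scores.get)
    return name, scores[name]


# Invented annotation string for a hypothetical gene.
name, score = max_delta_score("T|GENE1|0.01|0.85|0.00|0.02|-3|5|10|-20")
print(name, score, score >= HIGH_CONFIDENCE_CUTOFF)
```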

Also available in Ensembl VEP

  • SpliceVault - precalculated predictions of exon-skipping events and activated cryptic splice sites based on the most common mis-splicing events around a splice site
  • MaxEntScan - calculates on the fly

32 of 45

Outline

  • Using Ensembl VEP
  • Predicting the effect of sequence changes on functional genomic elements
  • Using population data to assess sensitivity to change
  • Known variant impact
  • Variant scoring tools
  • How it works
  • Example command line

33 of 45

Ensembl VEP data caches

Diagram labels: RefSeq genes; e! genes; variation; regulation; + BAM; + VCF

Designed for quick reference data retrieval

Transcript caches

  • Contain serialised Ensembl transcript objects
  • Use NCBI alignments to RefSeq transcripts
  • Created for Ensembl, RefSeq or both

Variant caches

  • Simple tabix’ed files with ordered data

Regulation caches

  • Contain serialised Ensembl regulatory feature objects

Info file – variation data available

Synonym file – reference sequence names

Updated with the latest data each Ensembl release

Diagram labels (cache): e! transcripts + protein info; e! + RefSeq transcripts; RefSeq transcripts; Variants + Frequencies + ClinVar + PMIDs; Regulatory elements + motifs

34 of 45

Extending Ensembl VEP - custom data sources

The Ensembl VEP API can extract reference data from standard bioinformatics file formats to use in annotation:

  • Transcript annotation in GFF3 or GTF
  • Variant frequencies, other INFO from VCF
  • Genomic features in BED

Example use cases:

  • Additional/new variant frequency sets such as AllofUs
  • Specific gnomAD panels
  • Private data sets
  • MAVE scores associated with genomic coordinates & single base changes

http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
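
For the MAVE use case, scores first need to be written to a coordinate-sorted VCF. A minimal sketch, assuming one score per single-base change; the INFO key `MAVE_SCORE`, the function names and the example scores are all hypothetical.

```python
def chrom_sort_key(chrom: str):
    """Order numeric chromosomes numerically, then others (X, Y, MT) by name."""
    name = chrom.removeprefix("chr")
    return (0, int(name), "") if name.isdigit() else (1, 0, name)


def mave_scores_to_vcf(records, info_key: str = "MAVE_SCORE") -> str:
    """records: iterable of (chrom, pos, ref, alt, score) tuples."""
    header = [
        "##fileformat=VCFv4.2",
        f'##INFO=<ID={info_key},Number=1,Type=Float,Description="MAVE functional score">',
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    ]
    ordered = sorted(records, key=lambda r: (chrom_sort_key(r[0]), r[1]))
    body = [
        f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t{info_key}={score}"
        for chrom, pos, ref, alt, score in ordered
    ]
    return "\n".join(header + body) + "\n"


# Toy records; the scores are invented.
records = [
    ("2", 500, "A", "G", -1.2),
    ("1", 230710048, "A", "G", 0.9),
    ("X", 10, "C", "T", 0.1),
]
vcf_text = mave_scores_to_vcf(records)
print(vcf_text)
```

The resulting file would then need bgzip compression and a tabix index before being supplied as a custom annotation source; the documentation linked above gives the exact option syntax.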

35 of 45

Extending Ensembl VEP - Plugins

  • Discrete modules to add specific functionality
  • Optionally added to the Ensembl VEP command line
  • Can be written quickly when a new tool or data set is available

  • Common actions:
    • Run additional algorithm
    • Look up and manipulate pre-computed values from a file
    • Look up additional data via the Ensembl API

http://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#plugins_how

36 of 45

Ensembl VEP process

Diagram: Runner; InputBuffer; Parser (input variants); AnnotationSource (Cache, Fasta, GFF3/GTF, VCF, BED, e! databases); Plugins; OutputFactory (tab, JSON, VCF); Interface

InputBuffer reads variants 5000 at a time (set: BufferSize)

Annotation is read in 1M chunks from the cache
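
The buffering idea can be sketched in a few lines. This illustrates batching in general and is not Ensembl VEP's own implementation; the function name is hypothetical and the default batch size simply mirrors the figure above.

```python
from itertools import islice
from typing import Iterable, Iterator


def read_in_batches(lines: Iterable[str], batch_size: int = 5000) -> Iterator[list[str]]:
    """Yield successive lists of at most batch_size input lines."""
    iterator = iter(lines)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 12,001 dummy variant lines are split into batches of 5000, 5000 and 2001.
batch_sizes = [len(b) for b in read_in_batches(str(i) for i in range(12001))]
print(batch_sizes)
```

Processing a batch at a time bounds memory use, and when the input is ordered by genomic location each batch tends to fall within a small number of cached annotation chunks.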

37 of 45

Outline

  • Using Ensembl VEP
  • Predicting the effect of sequence changes on functional genomic elements
  • Using population data to assess sensitivity to change
  • Known variant impact
  • Variant scoring tools
  • How it works
  • Example command line

38 of 45

Using command line Ensembl VEP

./vep --input_file [input_data] --output_file [output_file] --cache --fasta Homo_sapiens.GRCh38.fa.gz

  • --input_file: file of variants to annotate
  • --output_file: define a file to write results to
  • --cache: use Ensembl VEP cache to look up reference data
  • --fasta: supply genomic sequence for quick look up

39 of 45

Using command line Ensembl VEP

./vep --input_file [input_data] --output_file [output_file] --cache --fasta Homo_sapiens.GRCh38.fa.gz

--af_gnomade --af_gnomadg

--gencode_primary --mane

  • --af_gnomade --af_gnomadg: turn on frequencies from the gnomAD exomes & genomes collections
  • --gencode_primary: select transcript set (restrict to GENCODE Primary)
  • --mane: tag MANE transcript in the output (for filtering later)

40 of 45

Using command line Ensembl VEP

./vep --input_file [input_data] --output_file [output_file] --cache --fasta Homo_sapiens.GRCh38.fa.gz

--af_gnomade --af_gnomadg

--gencode_primary --mane

--plugin AlphaMissense,file=/full/path/to/file.tsv.gz

--plugin REVEL,file=[path_to]/new_tabbed_revel_grch38.tsv.gz

--plugin SpliceAI,snv=[path_to]/spliceai_scores.raw.snv.ensembl_mane.grch38.110.vcf.gz

include plugins to integrate precalculated results from variant scoring tools

41 of 45

Using command line Ensembl VEP

./vep --input_file [input_data] --output_file [output_file] --cache --fasta Homo_sapiens.GRCh38.fa.gz

--af_gnomade --af_gnomadg

--gencode_primary --mane

--plugin AlphaMissense,file=/full/path/to/file.tsv.gz

--plugin REVEL,file=[path_to]/new_tabbed_revel_grch38.tsv.gz

--plugin SpliceAI,snv=[path_to]/spliceai_scores.raw.snv.ensembl_mane.grch38.110.vcf.gz

--plugin IntAct,mapping_file=[path_to]/mutation_map.txt.gz,mutation_file=[path_to]/mutations.tsv,all=1

Add a plugin to highlight sites known to be important in interactions

42 of 45

Using command line Ensembl VEP

./vep --input_file [input_data] --output_file [output_file] --cache --fasta Homo_sapiens.GRCh38.fa.gz

--af_gnomade --af_gnomadg

--gencode_primary --mane

--plugin AlphaMissense,file=/full/path/to/file.tsv.gz

--plugin REVEL,file=[path_to]/new_tabbed_revel_grch38.tsv.gz

--plugin SpliceAI,snv=[path_to]/spliceai_scores.raw.snv.ensembl_mane.grch38.110.vcf.gz

--plugin IntAct,mapping_file=[path_to]/mutation_map.txt.gz,mutation_file=[path_to]/mutations.tsv,all=1

--plugin Conservation,/path/to/bigwigfile.bw

Add a plugin to report conservation scores

43 of 45

Summary

There are increasing numbers of datasets and tools to aid in the assessment of likely variant impact

  • they have a variety of formats & are regularly updated

Ensembl VEP simplifies the application of these resources

  • It has bundled data caches
  • It is simple to extend and configure

See MaveDB plugin for the annotation of human variants with results from MAVE assays

Arbesfeld et al 2025 doi:10.1186/s13059-025-03647-x

44 of 45

Acknowledgements

Funded by the European Union

National Human Genome Research Institute (NHGRI)

Funded by Wellcome

Aine Fairbrother-Browne

Nakib Hossain

Diana Lemos

Likhitha Surapaneni

Jamie Allen

Mallory Freeberg

The Genome Assembly and Annotation team

EMBL-EBI IT Services

45 of 45

Variant normalisation

Be wary of different allele representations when mapping your variants to reference sets

Alleles must be normalised before frequency look-up to ensure the correct allele is matched
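
One part of normalisation, trimming bases shared by the reference and alternate alleles, can be sketched as below. This is a partial illustration only: full normalisation also left-aligns insertions and deletions against the reference sequence, which is not attempted here. The function name is hypothetical.

```python
def trim_alleles(pos: int, ref: str, alt: str) -> tuple[int, str, str]:
    """Trim bases shared by ref and alt, keeping at least one base in each allele."""
    # Trim shared trailing bases first; the position is unchanged.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim shared leading bases, moving the position right each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Two descriptions of the same substitution reduce to the same representation.
print(trim_alleles(100, "CAG", "CAT"))
print(trim_alleles(102, "G", "T"))
```

Applying the same trimming to both the assayed variants and the reference set before matching avoids missing a frequency because the same change was written with different padding bases.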