Annotating MAVE results with the Ensembl Variant Effect Predictor
Sarah Hunt
Variation Resources Coordination, EMBL-EBI
Multiplex assays of variant effects (MAVEs):
Approaches, Analysis and Interpretation - 2025/11
MAVE data assessment
You have scores for large numbers of sequence changes from your MAVE assay - how do your results correlate with the available knowledge about the bases assayed?
There are useful reference datasets available for results checking
Here: describe available resources, and how to integrate them using the Ensembl Variant Effect Predictor
The Ensembl Variant Effect Predictor. McLaren W. et al. Genome Biology 17:122(2016)
Outline
Ensembl VEP - web interface
https://www.ensembl.org/Tools/VEP
Ensembl VEP - web interface
Filtering options
Summary information
Results per variant/ genomic feature
Ensembl VEP - REST interface
https://rest.ensembl.org/
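A minimal sketch of calling the REST interface from Python, using only the standard library. It assumes the `vep/:species/hgvs/:hgvs_notation` GET endpoint; the transcript and change in the usage note are illustrative placeholders, not results from this talk.

```python
import json
import urllib.parse
import urllib.request

SERVER = "https://rest.ensembl.org"  # server address from the slide


def vep_hgvs_url(hgvs, species="human"):
    """Build the GET URL for annotating a single HGVS description."""
    # Percent-encode characters such as '>' so the URL is valid.
    encoded = urllib.parse.quote(hgvs, safe=":")
    return f"{SERVER}/vep/{species}/hgvs/{encoded}"


def fetch_vep(hgvs, species="human"):
    """Query the service and return the decoded JSON (needs network access)."""
    request = urllib.request.Request(
        vep_hgvs_url(hgvs, species),
        headers={"Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Calling `fetch_vep("ENST00000366667:c.803C>T")` should return a list of annotation records, one per input variant. For large MAVE datasets the command line interface is likely to be the better fit.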
Ensembl VEP - command line interface
Variant input formats supported
Ensembl VEP is optimized to annotate variants described in Variant Call Format (VCF) and ordered by genomic location
Identifiers from a range of popular resources can be used as input for on-line analysis:
HGVS at genomic, transcript and protein level is supported, though protein level descriptions may give ambiguous results
NCBI SPDI format is supported at genomic level
A simple position-based tab delimited format is supported
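As an illustration of how the supported formats relate, the sketch below writes one substitution (the chromosome 1 A>G change used later in the conservation slide) in four input styles. The helper name and the single-entry accession table are assumptions made for this example; note that SPDI positions are 0-based while VCF and HGVS are 1-based.

```python
# RefSeq accession for GRCh38 chromosome 1; other chromosomes need their own entry.
REFSEQ_GRCH38 = {"1": "NC_000001.11"}


def snv_representations(chrom, pos, ref, alt):
    """Return one SNV written in four input styles (pos is 1-based)."""
    accession = REFSEQ_GRCH38[chrom]
    return {
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
        "vcf": f"{chrom}\t{pos}\t.\t{ref}\t{alt}\t.\t.\t.",
        # Position-based format: chromosome, start, end, ref/alt, strand
        "position_based": f"{chrom}\t{pos}\t{pos}\t{ref}/{alt}\t+",
        "hgvs_genomic": f"{accession}:g.{pos}{ref}>{alt}",
        # SPDI uses 0-based positions, so subtract one
        "spdi": f"{accession}:{pos - 1}:{ref}:{alt}",
    }


reps = snv_representations("1", 230710048, "A", "G")
for name, text in reps.items():
    print(f"{name}: {text}")
```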
Variant filtering
Ensembl VEP includes utilities to filter results using customisable criteria
filter_vep
--ontology --filter "Consequence is coding_sequence_variant"
--filter "AF < 0.01 or not AF"
--filter "CADD > 24"
Available as a command line utility or via the Ensembl VEP web interface
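The logic of the frequency and CADD filters above can be mirrored in a few lines of Python when post-processing results in a script. This is a sketch only, not how filter_vep is implemented: it combines the two filters into one test, and the field names `AF` and `CADD` are taken from the slide, so the keys in real output may differ.

```python
def passes(record, max_af=0.01, min_cadd=24.0):
    """Mimic '(AF < 0.01 or not AF)' combined with 'CADD > 24' for one record."""
    af = record.get("AF")
    cadd = record.get("CADD")
    # 'not AF' keeps variants with no frequency reported at all.
    rare_or_unseen = af is None or float(af) < max_af
    high_cadd = cadd is not None and float(cadd) > min_cadd
    return rare_or_unseen and high_cadd


# Hypothetical annotation records for illustration.
records = [
    {"id": "v1", "AF": "0.2", "CADD": "30"},
    {"id": "v2", "AF": "0.001", "CADD": "25.1"},
    {"id": "v3", "CADD": "28"},
    {"id": "v4", "AF": "0.0005", "CADD": "12"},
]
kept = [r["id"] for r in records if passes(r)]
print(kept)
```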
Molecular consequence prediction
Variants are mapped to transcript structures
The location type is identified
The expected molecular consequence of the change is predicted
Sequence Ontology
Standards enable understanding and comparisons across different resources
The use of Sequence Ontology terms to describe a variant's predicted molecular consequence has been agreed by many of the major variant databases and tool providers, including Ensembl, UCSC Genome Browser, dbSNP and SnpEff
More detailed terms (*) added due to Lord et al 2019 - Pathogenicity and selective constraint on variation near splice sites (10.1101/gr.238444.118)
Variants in/near splice sites
splice_donor_5th_base_variant *
splice_polypyrimidine_tract_variant *
splice_donor_variant
splice_acceptor_variant
splice_region_variant
splice_donor_region_variant *
'Splice site variant' is not reported – more specific terms are used
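To make the positional meaning of these terms concrete, here is a rough sketch that assigns splice-related terms to an intronic position on a forward-strand transcript. The distance windows reflect my reading of the Sequence Ontology definitions (donor: intron bases 1-2; donor region: 3-6; polypyrimidine tract: 3-17 from the acceptor end; splice region: 3-8 into the intron) and should be checked against the ontology. The function is hypothetical, ignores exonic splice-region bases, and does not reproduce how Ensembl VEP combines or prioritises terms.

```python
def splice_terms(pos, intron_start, intron_end):
    """List splice-related SO terms for a position inside a forward-strand intron.

    Coordinates are 1-based and inclusive; an empty list means the position
    is intronic but outside every splice window considered here.
    """
    donor_dist = pos - intron_start + 1    # 1 = first base of the intron
    acceptor_dist = intron_end - pos + 1   # 1 = last base of the intron
    if donor_dist < 1 or acceptor_dist < 1:
        raise ValueError("position is outside the intron")

    terms = set()
    if donor_dist <= 2:
        terms.add("splice_donor_variant")
    if 3 <= donor_dist <= 6:
        terms.add("splice_donor_region_variant")
    if donor_dist == 5:
        terms.add("splice_donor_5th_base_variant")
    if acceptor_dist <= 2:
        terms.add("splice_acceptor_variant")
    if 3 <= acceptor_dist <= 17:
        terms.add("splice_polypyrimidine_tract_variant")
    if 3 <= donor_dist <= 8 or 3 <= acceptor_dist <= 8:
        terms.add("splice_region_variant")
    return sorted(terms)


print(splice_terms(1004, 1000, 2000))
```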
Gene annotation – transcript choice
Two groups create gene annotation for multiple species
Many species have community-derived gene annotation
Transcript choice has a large impact on variant annotation
McCarthy DJ, et al. Choice of transcripts and software has a large effect on variant annotation. Genome Med. 2014;6:26.
Human transcript sets
MANE - Matched Annotation from NCBI and EMBL-EBI:
Two tiers :
A joint NCBI and EMBL-EBI transcript set for clinical genomics and research.
Morales J, et al. Nature. 2022 Apr;604(7905):310-315. DOI: 10.1038/s41586-022-04558-8.
GENCODE Primary:
Transcript data in the human Ensembl VEP cache
Protein mappings
Transcript quality information
InterProScan results for Ensembl protein sequences
Regulatory regions
Open chromatin
Histone modification assays
Transcription factor binding assays and motifs
Ensembl Regulatory Build
http://www.ensembl.org/info/genome/funcgen/index.html
Consider potential impact on gene expression
Non-canonical open reading frame analysis
There is increasing interest in open reading frames within long noncoding RNAs and in the presumed untranslated region of protein coding genes
Efforts are underway to create catalogues of translated non-canonical ORFs, assayed using ribosome profiling
Non-canonical open reading frame analysis
RiboseqORFs plugin - Predicts the molecular consequences of a short variant with respect to open reading frames discovered in noncoding transcripts/regions
UTRAnnotator plugin - annotates variants in 5′ untranslated regions (5′UTR) that create or disrupt upstream open reading frames. (Zhang et al 2021)
Allele frequency reference sets
Set | Individuals |
1000 Genomes Project | 2,504 |
TOPMed (heart, lung, blood, and sleep disorders) | 62,784 |
gnomAD genomes | 76,215 (v4) |
gnomAD exomes | 730,947 (v4) |
ALFA (dbGaP - includes cases) | 204,108 (v3) |
All of Us | 633,540 (v7.1) |
Comparing the MAVE score for an allele to its population frequency is useful - if an allele is common in a large set of healthy individuals, a severely deleterious MAVE score is less likely
Ensembl VEP human caches contain:
Other datasets can be added as custom annotations
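One way to apply the frequency check described above is to flag alleles that are common in a reference set but score as damaging in the assay. The sketch below is illustrative only: the variant names, the thresholds, the `gnomADe_AF` key and the convention that a lower MAVE score means more deleterious are all assumptions to adapt to your own assay.

```python
def flag_discordant(variants, common_af=0.01, deleterious_below=0.2):
    """Return ids of variants that are common yet score as deleterious."""
    flagged = []
    for variant in variants:
        af = variant.get("gnomADe_AF")
        if af is None:
            continue  # no frequency available, nothing to compare against
        if af >= common_af and variant["mave_score"] < deleterious_below:
            flagged.append(variant["id"])
    return flagged


# Hypothetical assay results joined to allele frequencies.
variants = [
    {"id": "p.Arg10Gln", "mave_score": 0.05, "gnomADe_AF": 0.12},
    {"id": "p.Gly20Asp", "mave_score": 0.02, "gnomADe_AF": None},
    {"id": "p.Ala30Thr", "mave_score": 0.95, "gnomADe_AF": 0.30},
    {"id": "p.Leu40Pro", "mave_score": 0.10, "gnomADe_AF": 0.00001},
]
flagged = flag_discordant(variants)
print(flagged)
```

Flagged variants are candidates for a closer look at the assay or the annotation rather than proof of an error.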
Allele frequency reference sets
https://gnomad.broadinstitute.org/stats#diversity
https://www.researchallofus.org/data-tools/data-snapshots
Current large population frequency datasets are biased towards people of European ancestry.
This causes problems for :
Tolerance to change - conservation
Conservation measures reveal regions which are highly similar across species.
1 230710048 A -> G
GERP score: -3.69 (neutral)
restores ‘ancestral allele’
Available as Conservation plugin or custom annotation
Variant disease and phenotype associations
Multiple resources provide phenotype/ disease associations to variants
NHGRI-EBI GWAS Catalog
Details available via Ensembl VEP Phenotype plugin
Experimental: ClinVar annotation mapping via Paralogs plugin
Assertions of likely pathogenicity for germline variants
Assertions of likely oncogenicity for somatic variants
Confidence ratings for assertions
Updated monthly
Statistical measures of association with a trait/disease from large-scale genome-wide case-control studies
Functional data - interactions
IntAct shares molecular interaction data, curated from reports of assays in the scientific literature.
>72,000 records describing the effect of protein changes in human on interactions
Information described in standard formats with links to publication providing detail
https://www.ebi.ac.uk/intact/home
IntAct plugin and genome-mapped data available
Molecular Interactions Controlled Vocabulary
Variant scoring methods
[Figure: variant consequence types (frameshift, splice region, start lost, missense, regulatory region, intronic) alongside example scoring tools (AlphaMissense, REVEL, SpliceAI, MaxEntScan, Enformer)]
In silico scoring of likely variant pathogenicity is an active area of research and different tools focus on different variant or function types
Variant scoring methods
=> Agreement on pathogenicity rating may be due to similarity of method or noise in shared input/training data
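A simple way to quantify agreement, either between two predictors or between a predictor and your MAVE scores, is a rank correlation. The sketch below implements Spearman's coefficient in plain Python (average ranks for ties, then Pearson correlation of the ranks); the score lists are made-up values for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values: low MAVE score = damaging, high predictor score = damaging.
mave_scores = [0.1, 0.4, 0.35, 0.9]
predictor_scores = [0.9, 0.5, 0.6, 0.05]
rho = spearman(mave_scores, predictor_scores)
print(round(rho, 3))
```

As the slide notes, a high correlation between two predictors may reflect shared methods or training data rather than independent confirmation.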
Genomic changes disrupting normal splice sites or creating novel splice sites can impact transcript structure and protein function
Ensembl VEP highlights when variants lie in regions important for splicing
Multiple tools exist to predict the disruption of existing or creation of novel (cryptic) splice sites
Variant impact on splicing
Image from Rowlands et al 2019
Machine Learning Approaches for the Prioritization of Genomic Variants Impacting Pre-mRNA Splicing.
Splicing predictors
SpliceAI
Also available in Ensembl VEP
Ensembl VEP data caches
[Diagram: cache inputs - RefSeq genes, e! genes, variation, regulation, + BAM, + VCF]
Designed for quick reference data retrieval
Transcript caches
Variant caches
Regulation caches
Info file – variation data available
Synonym file – reference sequence names
Updated with the latest data each Ensembl release
[Diagram: cache contents - e! transcripts + protein info; e! + RefSeq transcripts; RefSeq transcripts; variants + frequencies + ClinVar + PMIDs; regulatory elements + motifs]
Extending Ensembl VEP - custom data sources
The Ensembl VEP API can extract reference data from standard bioinformatics file formats to use in annotation:
Example use cases:
http://www.ensembl.org/info/docs/tools/vep/script/vep_custom.html
Extending Ensembl VEP - Plugins
http://www.ensembl.org/info/docs/tools/vep/script/vep_plugins.html#plugins_how
Ensembl VEP process
[Diagram: Runner, Parser, InputBuffer, AnnotationSource, OutputFactory, Interface]
Input variants are read into the InputBuffer 5000 at a time (set: BufferSize)
Annotation is read in 1M chunks from the cache
Annotation sources: Cache, Fasta, GFF3/GTF, VCF, BED, e! databases, Plugins
Output formats: tab, JSON, VCF
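The buffering idea, reading a fixed number of variants at a time so that reference data can be fetched per batch rather than per variant, can be sketched generically as below. This is not Ensembl VEP's code; the function name is hypothetical and only the default batch size of 5000 comes from the slide.

```python
from itertools import islice


def batches(variants, size=5000):
    """Yield successive lists of up to `size` items from any iterable."""
    iterator = iter(variants)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# 12,001 dummy variants are split into three batches.
sizes = [len(batch) for batch in batches(range(12001))]
print(sizes)
```

Sorting input by genomic location keeps each batch within a small region, which is why ordered input annotates faster.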
Using command line Ensembl VEP
./vep --input_file [input_data] --output_file [output_file] --cache --fasta Homo_sapiens.GRCh38.fa.gz
--input_file: file of variants to annotate
--output_file: define a file to write results to
--cache: use Ensembl VEP cache to look up reference data
--fasta: supply genomic sequence for quick look up
Using command line Ensembl VEP
./vep --input_file [input_data] --output_file [output_file] --cache --fasta Homo_sapiens.GRCh38.fa.gz \
--af_gnomade --af_gnomadg \
--gencode_primary --mane
--af_gnomade --af_gnomadg: turn on frequencies from the gnomAD exomes & genomes collections
--gencode_primary: select transcript set (restrict to GENCODE Primary)
--mane: tag MANE transcript in the output (for filtering later)
Using command line Ensembl VEP
./vep --input_file [input_data] --output_file [output_file] --cache --fasta Homo_sapiens.GRCh38.fa.gz \
--af_gnomade --af_gnomadg \
--gencode_primary --mane \
--plugin AlphaMissense,file=/full/path/to/file.tsv.gz \
--plugin REVEL,file=[path_to]/new_tabbed_revel_grch38.tsv.gz \
--plugin SpliceAI,snv=[path_to]/spliceai_scores.raw.snv.ensembl_mane.grch38.110.vcf.gz
Include plugins to integrate precalculated results from variant scoring tools
Using command line Ensembl VEP
./vep --input_file [input_data] --output_file [output_file] --cache --fasta Homo_sapiens.GRCh38.fa.gz \
--af_gnomade --af_gnomadg \
--gencode_primary --mane \
--plugin AlphaMissense,file=/full/path/to/file.tsv.gz \
--plugin REVEL,file=[path_to]/new_tabbed_revel_grch38.tsv.gz \
--plugin SpliceAI,snv=[path_to]/spliceai_scores.raw.snv.ensembl_mane.grch38.110.vcf.gz \
--plugin IntAct,mapping_file=[path_to]/mutation_map.txt.gz,mutation_file=[path_to]/mutations.tsv,all=1
Add a plugin to highlight sites known to be important in interactions
Using command line Ensembl VEP
./vep --input_file [input_data] --output_file [output_file] --cache --fasta Homo_sapiens.GRCh38.fa.gz \
--af_gnomade --af_gnomadg \
--gencode_primary --mane \
--plugin AlphaMissense,file=/full/path/to/file.tsv.gz \
--plugin REVEL,file=[path_to]/new_tabbed_revel_grch38.tsv.gz \
--plugin SpliceAI,snv=[path_to]/spliceai_scores.raw.snv.ensembl_mane.grch38.110.vcf.gz \
--plugin IntAct,mapping_file=[path_to]/mutation_map.txt.gz,mutation_file=[path_to]/mutations.tsv,all=1 \
--plugin Conservation,/path/to/bigwigfile.bw
Add a plugin to report conservation scores
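Once the command has run, the results usually need to be joined back to the MAVE scores. The sketch below parses Ensembl VEP's default tab-delimited output, in which added annotations are packed into the final `Extra` column as `key=value` pairs separated by semicolons. The example rows are mock data, and the keys shown (`am_pathogenicity`, `REVEL`, `gnomADe_AF`) are assumptions about what the flags and plugins above write; check the header of your own output file.

```python
import io

COLUMNS = [
    "Uploaded_variation", "Location", "Allele", "Gene", "Feature",
    "Feature_type", "Consequence", "cDNA_position", "CDS_position",
    "Protein_position", "Amino_acids", "Codons", "Existing_variation", "Extra",
]


def parse_vep_tab(handle):
    """Yield one dict per result line, with Extra key=value pairs expanded."""
    header = None
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and metadata
        if line.startswith("#"):
            header = line[1:].split("\t")  # column names follow a single '#'
            continue
        row = dict(zip(header, line.split("\t")))
        for pair in row.pop("Extra", "").split(";"):
            if "=" in pair:
                key, value = pair.split("=", 1)
                row[key] = value
        yield row


# Mock output for illustration only.
mock_lines = [
    "## mock metadata line",
    "#" + "\t".join(COLUMNS),
    "\t".join(["var1", "1:1000", "G", "GENE1", "TX1", "Transcript",
               "missense_variant", "10", "10", "4", "M/T", "aTg/aCg", "-",
               "am_pathogenicity=0.91;REVEL=0.80;gnomADe_AF=0.0001"]),
    "\t".join(["var2", "1:2000", "A", "GENE1", "TX1", "Transcript",
               "synonymous_variant", "20", "20", "7", "L", "ctG/ctA", "-", "-"]),
]
rows = list(parse_vep_tab(io.StringIO("\n".join(mock_lines) + "\n")))
for row in rows:
    print(row["Uploaded_variation"], row["Consequence"], row.get("am_pathogenicity"))
```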
Summary
There are increasing numbers of datasets and tools to aid in the assessment of likely variant impact
Ensembl VEP simplifies the application of these resources
See MaveDB plugin for the annotation of human variants with results from MAVE assays
Arbesfeld et al 2025 doi:10.1186/s13059-025-03647-x
Acknowledgements
Funded by the European Union
National Human Genome Research Institute (NHGRI)
Funded by Wellcome
Aine Fairbrother-Browne
Nakib Hossain
Diana Lemos
Likhitha Surapaneni
Jamie Allen
Mallory Freeberg
The Genome Assembly and Annotation team
EMBL-EBI IT Services
Variant normalisation
Be wary of different allele representations when mapping your variants to reference sets
Normalisation must be applied during frequency look-up to ensure the correct allele is matched
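To illustrate why representation matters, the sketch below trims bases shared by the reference and alternate alleles so that two descriptions of the same change compare as equal. It is a simplified, assumed approach: it keeps at least one base in each allele (VCF style) and does not left-align indels within repeats, which requires the reference sequence.

```python
def trim_alleles(pos, ref, alt):
    """Reduce a (pos, ref, alt) description to a minimal VCF-style form."""
    # Remove shared trailing bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Remove shared leading bases, moving the position forward each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# The same substitution described with and without flanking bases.
padded = trim_alleles(100, "GCAT", "GTAT")
minimal = trim_alleles(101, "C", "T")
print(padded, minimal)
```

Without trimming, a look-up keyed on position 100 and alleles GCAT/GTAT would miss a frequency stored for the equivalent change at position 101, C/T.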