1 of 6

Allele normalisations

in the European Variation Archive

2 of 6

The basics

  • Independent from the VCF representation
  • Try to be as aligned with Ensembl Variation coordinates as possible: http://www.ensembl.org/info/docs/tools/vep/vep_formats.html
    • But we don’t “swap” insertion coordinates
  • Same alignment rules as dbSNP
  • Start and end represent exactly the range where the variation occurs
  • The reference or alternate allele could end up being “empty”
  • Variants in the source VCF with ALT “.” are removed because no alternate allele is specified
  • Variants from a single VCF that turn out to be duplicates after normalisation are removed due to authenticity concerns
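
The two filtering rules above can be sketched as follows. This is a minimal illustration, not the EVA implementation: the function names, the tuple layouts and the `keep_one` flag are all hypothetical. The slide does not say whether every copy of a duplicated variant is dropped or whether one copy is kept, so the sketch exposes both behaviours and defaults to dropping all copies as an assumption.

```python
from collections import Counter


def drop_unspecified_alt(rows):
    """Remove source rows whose ALT is ".".

    rows: list of (chrom, pos, ref, alt) tuples (hypothetical layout).
    """
    return [row for row in rows if row[3] != "."]


def drop_duplicates(variants, keep_one=False):
    """Remove variants that are identical after normalisation.

    variants: list of (chrom, start, end, ref, alt) tuples (hypothetical layout).
    keep_one: assumption flag; False drops every copy of a duplicated
              variant, True keeps the first copy only.
    """
    if not keep_one:
        counts = Counter(variants)
        return [v for v in variants if counts[v] == 1]
    seen = set()
    unique = []
    for v in variants:
        if v not in seen:
            seen.add(v)
            unique.append(v)
    return unique
```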

3 of 6

Simple variations

The most basic case, a Single Nucleotide Variation, stays the same

1 1000 A T → 1 1000-1000 A T

Multiple Nucleotide Variation coordinates represent the whole change

1 1000 ATC GGT → 1 1000-1002 ATC GGT

1 1000 ACTC AGCT → 1 1001-1003 CTC GCT

4 of 6

Insertions and deletions

Insertion coordinates represent the inserted nucleotides

1 1000 A ATC → 1 1001-1002 - TC

1 1000 AG ATC → 1 1001-1002 G TC

Deletion coordinates represent the deleted nucleotides

1 1000 ATC A → 1 1001-1002 TC -

1 1000 ATC GC → 1 1000-1001 AT G
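
In all four examples above, the reported range spans as many positions as the longer of the two trimmed alleles. The sketch below encodes that observation. The rule is inferred from the examples on this slide rather than stated in it, and the function name `variant_range` is hypothetical. Empty alleles are passed as empty strings.

```python
def variant_range(start, ref, alt):
    """Return (start, end) for alleles that have already been trimmed.

    Assumption inferred from the slide's examples: the range covers
    max(len(ref), len(alt)) positions beginning at `start`.
    """
    length = max(len(ref), len(alt))
    return start, start + length - 1


# The four examples from this slide, after trimming:
print(variant_range(1001, "", "TC"))   # insertion  - TC
print(variant_range(1001, "G", "TC"))  # G -> TC
print(variant_range(1001, "TC", ""))   # deletion   TC -
print(variant_range(1000, "AT", "G"))  # AT -> G
```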

5 of 6

Allele alignment

Following dbSNP rules, try to generate alleles as left-aligned as possible

  • First remove nucleotides from the right of the string

1 1000 AGTTC AGCC → 1 1000-1003 AGTT AGC

  • Then from the left

1 1000 AGTT AGC → 1 1002-1003 TT C
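
The two trimming steps can be sketched as a single function. This is an illustrative sketch of the order described on this slide (right first, then left), not the EVA source code. The name `normalise` and the choice of "-" to print an empty allele are assumptions taken from the examples in this deck, as is the end-coordinate rule.

```python
def normalise(start, ref, alt):
    """Trim shared bases from the right, then from the left.

    Returns (start, end, ref, alt); an empty allele is shown as "-".
    """
    # Step 1: remove nucleotides shared at the right end of both alleles.
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]

    # Step 2: remove nucleotides shared at the left end, moving the start
    # position forward by one for each nucleotide removed.
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1

    # Assumption inferred from the examples: the range covers the longer allele.
    end = start + max(len(ref), len(alt)) - 1
    return start, end, ref or "-", alt or "-"


print(normalise(1000, "AGTTC", "AGCC"))  # both steps applied in sequence
```

Applying both steps to the first example on this slide gives the same result as the second example, since the second example starts from the output of the right-trim step.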

6 of 6

Co-located variants

1 1000 TGACGTAACGATT T,TGACGTAACGGTT,TGACGTAATAC

Results in 3 different variants

1 1000-1011 TGACGTAACGAT - → last T removed

1 1010-1010 A G → last TT removed, then the common prefix

1 1008-1012 CGATT TAC → common prefix removed
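
A multi-allelic record like the one above can be handled by normalising each alternate allele against the reference independently. The sketch below assumes that approach. The function names `trim` and `split_and_normalise` are hypothetical, and the trimming order (right, then left) and end-coordinate rule follow the examples in this deck.

```python
def trim(start, ref, alt):
    """Right-trim, then left-trim, one REF/ALT pair (sketch, see assumptions)."""
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while ref and alt and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        start += 1
    end = start + max(len(ref), len(alt)) - 1
    return start, end, ref or "-", alt or "-"


def split_and_normalise(chrom, pos, ref, alts):
    """Produce one normalised variant per comma-separated ALT allele."""
    return [(chrom,) + trim(pos, ref, alt) for alt in alts.split(",")]


for variant in split_and_normalise(
    "1", 1000, "TGACGTAACGATT", "T,TGACGTAACGGTT,TGACGTAATAC"
):
    print(*variant)
```

Because each pair is trimmed on its own, the three resulting variants start at different positions even though they come from one VCF line.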