Variant Workshop

Level 2

Nadia BESSOLTANE - INRAe

Vivien DESHAIES - APHP

Bioinformatics School, Level 2, AVIESAN-IFB-INSERM 2023

https://www.genome.gov/Health/Genomics-and-Medicine/Polygenic-risk-scores#one

Definition

There are several types of variation:

SNV: Single Nucleotide Variant
INDEL: INsertion or DELetion
SV: Structural Variant
CNV: Copy Number Variation
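
To make the distinction between the small-variant types concrete, here is a minimal base R sketch that labels a variant from the lengths of its REF and ALT alleles. The helper name `classify_variant` is hypothetical and only covers the simple single-ALT case; SV and CNV calls generally come from dedicated callers and cannot be recognised from allele length alone.

```r
# Hypothetical helper, for illustration only: classify a small variant
# from its REF and ALT alleles (single ALT allele assumed).
classify_variant <- function(ref, alt) {
  if (nchar(ref) == 1 && nchar(alt) == 1) {
    "SNV"
  } else if (nchar(ref) > nchar(alt)) {
    "INDEL (deletion)"
  } else if (nchar(ref) < nchar(alt)) {
    "INDEL (insertion)"
  } else {
    "other (e.g. multi-nucleotide substitution)"
  }
}

classify_variant("T", "A")    # "SNV"
classify_variant("GT", "G")   # "INDEL (deletion)"
classify_variant("C", "CA")   # "INDEL (insertion)"
```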

From FASTQ to VCFs (ebaiin1)

Inputs: Reads (FASTQ), Reference genome (FASTA)

FASTQ quality control -- FastQC --

Mapping -- BWA --

Post-alignment processing -- GATK/Picard --

Variant calling: small variants -- GATK --, structural variants -- Delly --

Annotation -- SnpEff --

Output: VCF

UNIX

VCF (Variant Call Format)

From VCFs to markers (ebaiin2)

VCF

QC & filters

Descriptive statistics

Marker/QTL search

Genome-Wide Association Study (GWAS)

Phylogenetic analysis

...

The choice of analysis depends on the biological question

UNIX
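
As an illustration of the "QC & filters" and "descriptive statistics" steps, here is a minimal base R sketch on a toy table. The QUAL and per-sample DP values are those of the three example records shown in the VCF slide; the thresholds `min_qual` and `min_dp` are arbitrary assumptions chosen for the example, not recommended cut-offs.

```r
# Toy variant table (QUAL and sample DP of the three example records).
variants <- data.frame(
  CHROM = c("6", "6", "6"),
  POS   = c(2L, 4L, 9L),
  QUAL  = c(67.64, 58.60, 55.60),
  DP    = c(5L, 3L, 9L),
  stringsAsFactors = FALSE
)

# Descriptive statistics on the quality and depth columns.
summary(variants$QUAL)
summary(variants$DP)

# Hypothetical thresholds, for illustration only.
min_qual <- 58
min_dp   <- 4

# Keep variants that pass both filters.
keep     <- variants$QUAL >= min_qual & variants$DP >= min_dp
filtered <- variants[keep, ]
filtered
```

In a real analysis the thresholds would be chosen after inspecting the distributions, which is why the descriptive statistics come first.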

##fileformat=VCFv4.2

##FILTER=<ID=LowQual,Description="Low quality">

##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtere

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VC

##GATKCommandLine=<ID=HaplotypeCaller,CommandLine="HaplotypeCaller --min-base-quality-score 18 --emit-ref-confidence NONE

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">

##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">

##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt Vs. Ref base qualities

...

##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z-score from Wilcoxon rank sum test of Alt vs. Ref read positio

##INFO=<ID=SOR,Number=1,Type=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">

##contig=<ID=6,length=119458736>

##source=HaplotypeCaller

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SRR1262731 SRR1262732

6 2 . T A 67.64 . AC=1;AF=0.500;... GT:AD:DP:GQ:PL 0/1:3,2:5:75:75,0,105 0/1:3,2:5:75:75,

6 4 . GT G 58.60 . AC=1;AF=0.500;... GT:AD:DP:GQ:PL 0/1:1,2:3:28:66,0,28 0/1:1,2:3:28:66,

6 9 . C CA 55.60 . AC=1;AF=0.500;... GT:AD:DP:GQ:PL 0/1:7,2:9:63:63,0,279 0/1:7,2:9:63:63,

VCF header (METADATA): the lines starting with "##"

Body: the "#CHROM" column line followed by one line per variant, carrying the INFO and GENOTYPE fields
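
To show how one line of the body is read, here is a minimal base R sketch that splits a record into its columns and decodes the genotype field of one sample. The record is the first example line restricted to its first sample, with INFO shortened to the two keys visible on the slide. The conversion of PL to relative likelihoods assumes a diploid, biallelic site, where the three PL values correspond to genotypes 0/0, 0/1 and 1/1.

```r
# One VCF body line (tab-separated), single sample, shortened INFO.
record <- paste(
  "6", "2", ".", "T", "A", "67.64", ".", "AC=1;AF=0.500",
  "GT:AD:DP:GQ:PL", "0/1:3,2:5:75:75,0,105",
  sep = "\t"
)

# Split into the fixed columns plus FORMAT and the sample column.
fields <- strsplit(record, "\t", fixed = TRUE)[[1]]
names(fields) <- c("CHROM", "POS", "ID", "REF", "ALT", "QUAL",
                   "FILTER", "INFO", "FORMAT", "SAMPLE")

# INFO is a list of key=value pairs separated by ';'.
info_pairs <- strsplit(strsplit(fields[["INFO"]], ";", fixed = TRUE)[[1]],
                       "=", fixed = TRUE)
info <- setNames(sapply(info_pairs, `[`, 2), sapply(info_pairs, `[`, 1))

# FORMAT gives the keys, the sample column gives the matching values.
keys     <- strsplit(fields[["FORMAT"]], ":", fixed = TRUE)[[1]]
values   <- strsplit(fields[["SAMPLE"]], ":", fixed = TRUE)[[1]]
genotype <- setNames(values, keys)

# PL is Phred-scaled: likelihood = 10^(-PL / 10), then normalise.
pl  <- as.numeric(strsplit(genotype[["PL"]], ",", fixed = TRUE)[[1]])
lik <- 10^(-pl / 10)
rel <- setNames(lik / sum(lik), c("0/0", "0/1", "1/1"))

genotype[["GT"]]        # called genotype
names(which.max(rel))   # most likely genotype according to PL
```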

Reading a VCF in R

vcfR package: Manipulate and Visualize VCF Data

vcfR documentation: https://knausb.github.io/vcfR_documentation/index.html

> ## install the package
> install.packages("vcfR")

> ## load the package
> library(vcfR)

> # read the VCF file
> my.vcf <- read.vcfR("pool.vcf")

> # which class does the vcf object belong to?
> is(my.vcf)
[1] "vcfR"

> # list the slots (sections)
> slotNames(my.vcf)
[1] "meta" "fix" "gt"

Object of class vcfR

Three sections:

Meta-information: the VCF header
Fixed information: per-variant information shared by all samples (position, alleles, quality...)
Genotype information: genotyping information for each sample
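
The three sections can be pictured with a minimal base R sketch that splits the text of a small VCF the same way. This is an illustration of the layout, not vcfR's implementation; the lines are shortened versions of the example records (one sample, reduced INFO). On an actual vcfR object the corresponding slots are reached with `my.vcf@meta`, `my.vcf@fix` and `my.vcf@gt`.

```r
# A tiny VCF held as a character vector, one element per line.
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "##source=HaplotypeCaller",
  paste("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO",
        "FORMAT", "SRR1262731", sep = "\t"),
  paste("6", "2", ".", "T", "A", "67.64", ".", "AC=1;AF=0.500",
        "GT:AD:DP:GQ:PL", "0/1:3,2:5:75:75,0,105", sep = "\t"),
  paste("6", "4", ".", "GT", "G", "58.60", ".", "AC=1;AF=0.500",
        "GT:AD:DP:GQ:PL", "0/1:1,2:3:28:66,0,28", sep = "\t")
)

# meta: the header lines, which start with "##".
meta <- vcf_lines[startsWith(vcf_lines, "##")]

# Everything else: the "#CHROM" column line, then one line per variant.
body <- vcf_lines[!startsWith(vcf_lines, "##")]
tab  <- do.call(rbind, strsplit(body, "\t", fixed = TRUE))
colnames(tab) <- sub("^#", "", tab[1, ])
tab  <- tab[-1, , drop = FALSE]

# fix: the first eight columns, shared by all samples.
fix <- tab[, 1:8, drop = FALSE]

# gt: the FORMAT column plus one column per sample.
gt <- tab[, 9:ncol(tab), drop = FALSE]

meta
fix
gt
```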
