1 of 41

MAPpoly training section�

Marcelo Mollinari

mmollin@ncsu.edu

SCRI Workshop – January 14, 2021

This work is licensed under CC BY 4.0

2 of 41

Outline

Installation
Reading/import datasets
Filtering
Two-point analysis
Grouping
Ordering
Phasing
Modeling errors/genotype probabilities
Final checking
Map summary/results
Genotype probabilities
Preferential pairing/haplotype probabilities
Hexaploid scripts/results

3 of 41

Initial notes about MAPpoly

Primary intention was implementing a program to construct genetic maps in polyploids with high ploidy levels (i.e. > 4: sugarcane, sweetpotato, etc.)
Linkage Analysis and Haplotype Phasing in Experimental Autopolyploid Populations with High Ploidy Level Using Hidden Markov Models (Mollinari and Garcia, 2019 )
Obtain conditional probabilities for QTL analysis
It does not model double reduction: double reduced markers/individuals are treated as missing data

4 of 41

Installation

From CRAN (stable version)

install.packages("mappoly")

From GitHub (development version)

install.packages("devtools")

devtools::install_github("mmollina/mappoly", dependencies=TRUE)

5 of 41

Reading/importing data sets

Supported datasets

MAPpoly files

Discrete dosage
Dosage probability

CSV files
fitPoly files

Dosage Probability

VCF files

Discrete dosage
Dosage probability

Supported R objects

updog objects
polyRAD objects
polymapR

datasets
maps

data set

mappoly.data

6 of 41

Reading/importing datasets

MAPpoly data reading functions

1. MAPpoly

dat.mappoly <- read_geno(file.in = 'path_to_input_file')

2. MAPpoly with genotype probabilities

dat.mappoly.prob <- read_geno_prob(file.in = 'path_to_input_file)

3. CSV

dat.csv <- read_geno_csv(file.in = 'path_to_input_file', ploidy = 6)

4. fitPoly, supports genotype probabilities

dat.fitpoly <- read_fitpoly(file.in = 'path_to_input_file', ploidy = 6,

parent1 = 'P1', parent1 = 'P2')

5. VCF files, supports genotype probabilities (PL field)

dat.vcf <- read_vcf(file.in = 'path_to_input_file', ploidy = 6,

parent1 = 'P1', parent1 = 'P2',

min.gt.depth = 0, min.av.depth = 0)

7 of 41

Reading/importing datasets

1. updog

dat.updog <- import_from_updog(object = input_multidog_object)

2. polyRAD

dat.polyRAD <- polyRAD::Export_MAPpoly(object = 'polyrad_object')

3. polymapR

dat.polymapr <- import_data_from_polymapR(input.data = polymapR_screened_data,

ploidy = 6, parent1 = 'P1',

parent1 = 'P2')

map.polymapr <- import_phased_maplist_from_polymapR(maplist = phased_maplist,

mappoly.data = mappoly.data)

Importing data

Importing maps

MAPpoly data importing functions

8 of 41

Reading/importing datasets

## Downloading data from GitHub

library("mappoly")

setwd("SCRI/MAPpoly/tetra")

address <- "https://github.com/mmollina/SCRI/raw/main/data/fitpoly_tetra_call/B2721_scores.zip"

download.file(url = address, destfile = "B2721_scores.zip")

unzip(tempfl, files = "B2721_scores.dat")

## Reading fitPoly data

dat <- read_fitpoly(file.in = "B2721_scores.dat",

ploidy = 4,

parent1 = "Atlantic",

parent2 = "B1829",

verbose = TRUE)

source("get_solcap_snp_pos.R")

plot(dat)

print(dat, detailed = TRUE)

Example: reading fitPoly files

Genotype calling using fitPoly:

https://github.com/mmollina/SCRI/blob/main/data/fitpoly_tetra_call/B2721_fitpoly_call.R

Adding Solanum tuberosum genome v4.03 information:

https://github.com/mmollina/SCRI/blob/main/MAPpoly/get_solcap_snp_pos.R

9 of 41

Inspecting data loaded from fitPoly files

10 of 41

Inspecting specific marker

plot_mrk_info(dat, 738)

print_mrk(dat, 738)

plot_mrk_info(dat, "solcap_snp_c1_3722")

print_mrk(dat, "solcap_snp_c1_3722")

11 of 41

Filtering by missing data

dat <- filter_missing(input.data = dat,

type = "marker",

filter.thres = 0.05)

dat <- filter_missing(input.data = dat,

type = "individual",

filter.thres = 0.025)

12 of 41

Filtering by segregation distortion

s <- filter_segregation(input.data = dat, chisq.pval.thres = 0.05/dat$n.mrk)

s <- make_seq_mappoly(s)

plot(s)

13 of 41

Making a sequence of markers

mappoly.sequence

mappoly.data

mappoly.group

mappoly.map

mappoly.chitest.seq

mappoly.pcmap

mappoly.geno.ord

14 of 41

Two-point analysis

## Two-point analysis

nc <- parallel::detectCores() - 1

tpt10 <- est_pairwise_rf(s10, ncpus = nc, est.type = "disc")

tpt10$pairwise[90:92]

round(tpt10$pairwise[[91]], 2)

plot(tpt10, first.mrk = 1, second.mrk = 2074)

Atlantic

B1829-5

1

2074

1

2074

15 of 41

Recombination fraction matrix

m <- rf_list_to_matrix(tpt, ncpus = nc)

gen.ord <- get_genomic_order(s)

s.gen.ord <- make_seq_mappoly(gen.ord)

plot(m, ord = s.gen.ord$seq.mrk.names, fact = 10)

16 of 41

Filtering by pairwise recombination fraction

gr <- group_mappoly(m, expected.groups = 12,

comp.mat = TRUE)

gr

17 of 41

Ordering within a specific linkage group

s1 <- make_seq_mappoly(gr, arg = 1,

genomic.info = 1)

tpt1 <- make_pairs_mappoly(tpt, s1)

m1 <- make_mat_mappoly(m, s1)

##Genomic order

g.o1 <- get_genomic_order(s1)

s1.g <- make_seq_mappoly(g.o1)

plot(g.o1)

plot(m1, ord = s1.g$seq.mrk.names,

fact = 3)

Genome order

18 of 41

Ordering within a specific linkage group

mds.o1 <- mds_mappoly(m1, n = c(201,47))

plot(mds.o1)

s1.mds <- make_seq_mappoly(mds.o1)

plot(m1, ord = s1.mds$seq.mrk.names)

MDS order

19 of 41

Phasing and HMM estimation of distance

lg1.geno.map<-est_rf_hmm_sequential(input.seq = s1.g,

start.set = 3,

thres.twopt = 10,

thres.hmm = 50,

extend.tail = 30,

twopt = tpt1,

verbose = TRUE,

tol = 10e-2,

tol.final = 10e-4,

phase.number.limit = 20,

sub.map.size.diff.limit = 5,

info.tail = TRUE)

20 of 41

Phasing and HMM estimation of distance

Argument	Value
start.set	3 – 10
thres.twopt	10
thres.hmm	10 – 50
extend.tail	30 – 100
phase.number.limit	20
sub.map.size.diff.limit	2 (thousands per LG) … 10 (hundred(s) per LG)

Argument	Value
start.set	3 – 5
thres.twopt	10
thres.hmm	10 – 50
extend.tail	50 – 200
phase.number.limit	20
sub.map.size.diff.limit	2 (thousands per LG) … 10 (hundred(s) per LG)

Tetraploid

Hexaploid

21 of 41

Resulting map

plot(lg1.geno.map, mrk.names = TRUE, cex = 0.5, P = "Atlantic", Q = "B1829")

22 of 41

Resulting map – error modeling

lg1.geno.map.err <- est_full_hmm_with_global_error(lg1.geno.map, error = 0.05, tol = 10e-4)

plot(lg1.geno.map.err, mrk.names = TRUE, cex = 0.5, P = "Atlantic", Q = "B1829")

23 of 41

Resulting map: 20-30 cM segment

plot(lg1.geno.map.err,

mrk.names = TRUE,

left.lim = 19,

right.lim = 20,

P = "Atlantic",

Q = "B1829")

24 of 41

Probabilistic haplotype reconstruction

g.lg10 <- calc_genoprob(lg10.geno.map, step = 1)

h.lg10 <- calc_homoprob(g.lg10)

plot(h.lg10, ind = "B2721.110")

25 of 41

Probabilistic haplotype reconstruction

g.lg10.err <- calc_genoprob_error(lg10.geno.map.err, step = 1, error = 0.05)

h.lg10.err <- calc_homoprob(g.lg10.err)

plot(h.lg10.err, ind = "B2721.110")

26 of 41

Preferential pairing profiles

p1 <- calc_prefpair_profiles(g1)

plot(p1, min.y.prof = 0.25, max.y.prof = 0.40)

27 of 41

Parallel map construction – genomic order

#### Parallel map construction

cl <- parallel::makeCluster(12)

parallel::clusterEvalQ(cl, require(mappoly))

parallel::clusterExport(cl, "dat")

# ~12.5 minutes

MAPs.geno <- parallel::parLapply(cl, LGS, phasing_and_hmm_rf)

# ~2.5 minutes

MAPs.geno.err <- parallel::parLapply(cl, MAPs.geno,

error_model)

# ~22 seconds

genoprob <- parallel::parLapply(cl,

MAPs.geno.err,

calc_genoprob_error,

step = 1,

error = 0.05)

parallel::stopCluster(cl)

#### Functions ####

phasing_and_hmm_rf <- function(X){

dir.create("map_output", showWarnings = FALSE)

fl <- paste0("output_map_ch_", X$seq$sequence[1], ".txt")

fl <- file.path("map_output", fl)

sink(fl)

map <- est_rf_hmm_sequential(input.seq = X$seq,

start.set = 3,

thres.twopt = 10,

thres.hmm = 50,

extend.tail = 30,

twopt = X$tpt,

verbose = TRUE,

phase.number.limit = 20,

sub.map.size.diff.limit = 5)

sink()

return(map)

}

error_model <- function(X, error = 0.05, tol = 10e-4){

x <- est_full_hmm_with_global_error(input.map = X,

error = error,

tol = tol,

verbose = FALSE)

return(x)

}

#### Map results

map.out <- plot_map_list(MAPs.geno.err, col = "ggstyle")

map.out

plot_genome_vs_map(MAPs.geno.err, same.ch.lg = TRUE)

summary_maps(MAPs.geno.err)

export_map_list(MAPs.geno.err, file = "output_map.csv")

#### Preferential pairing

pp <- calc_prefpair_profiles(genoprob)

print(pp)

head(pp$prefpair.psi)

plot(pp, P = "Atlantic", Q = "B1829")

#### Haplotype probabilities

hp <- calc_homoprob(genoprob)

print(hp)

plot(hp, ind = 2, lg = 1)

plot(hp, ind = 2, lg = 1:12, use.plotly = FALSE)

#### Correspondence with genome

z<-as.numeric(colnames(gr$seq.vs.grouped.snp)[1:12])

#### Assembling linkage groups (order based on genome)

LGS<-vector("list", 12)

for(ch in 1:12){

cat("\n ~~~~~~ ch:", ch, "...\n")

lg <- which(z==ch)

s.temp<-make_seq_mappoly(gr, lg, genomic.info = 1)

tpt.temp <- make_pairs_mappoly(tpt, s.temp)

s.temp.filt <- rf_snp_filter(tpt.temp, 5, 5, 0.15, c(0.05, 1))

m.temp <- make_mat_mappoly(m, s.temp.filt)

g.o <- get_genomic_order(s.temp)

s.g <- make_seq_mappoly(g.o)

tpt.temp <- make_pairs_mappoly(tpt, input.seq = s.g)

LGS[[ch]] <- list(seq = s.g, tpt = tpt.temp)

}

28 of 41

Results

29 of 41

Results – phased map

30 of 41

Results

31 of 41

On-line resources

MAPpoly web page
GitHub page for the topics I presented in this workshop
Tetraploid: B2721 potato biparental population (Atlantic x B1829-5)
Hexaploid: BT sweetpotato population (Beauregard x Tanzania)
Detailed analytic procedures for Pereira et al. (2020)

32 of 41

�APPENDIX��Hexaploid map – Results��

33 of 41

Hexaploid map – Results

Script for hexaploid

https://github.com/mmollina/SCRI/tree/main/MAPpoly/hexa

Complete map

Unraveling the Hexaploid Sweetpotato Inheritance Using Ultra-Dense Multilocus Mapping (Mollinari et al., 2020).

Shiny applications

34 of 41

Biparental Population - BT

Beauregard

Tanzania

X

Beauregard x Tanzania
315 individuals
GBS – GBSpoly protocol (Bode Olukolu – U Tennessee)
Two reference genomes I. trifida and I. triloba (Zhangjun Fei’s group – BTI Cornell). In the available dataset we used I. trifida.
In this example we used three chromosomes: ch3, ch9 and ch12

35 of 41

Biparental Population - BT

36 of 41

Results for hexaploid sweetpotato BT population

37 of 41

Recombination fraction matrix and grouping

38 of 41

Ordered recombination fraction matrix

39 of 41

Map results

40 of 41

Homologs probability

41 of 41

Preferential pairing profiles