JCBioinformatics

http://www.tinyurl.com/JCBioinformatics

Information provided by Dr. Umer Zeeshan Ijaz

For the next set of tutorials, please check http://www.tinyurl.com/JCBioinformatics2

Tutorials (Click to go to the relevant tutorial):

09/10/2015: Info: Predicting functional profiles from 16S rRNA metagenomic datasets

04/10/2015: Info: Obtaining KEGG modules for CONCOCT clusters

04/10/2015: Info: Extracting 16S rRNA genes from metagenomics datasets using REAGO

07/09/2015: Info: Paired-end assembler and primer mismatch counting

06,13/07/2015: Tutorial: Benchmarking WGS sequencing datasets

23/07/2015: Tutorial: Collating replicates from abundance tables using 1-way subject ANOVA - within subjects

25/06/2015: Tutorial: Correlation plot of significantly up/down regulated OTUs against the environmental data

18/06/2015: Tutorial: ANOVA with ggplot2 Part-2 (Diversity Data) [Modified: 19/11/2015]

11/06/2015: Tutorial: ANOVA with ggplot2 Part-1 (Environmental Data) [Modified: 19/11/2015]

04/06/2015:Tutorial: Phyloseq I - plot_richness, plot_tree

28/05/2015: Tutorial: R Regression Diagnostics (Pitlatrine Longitudinal Dataset)

14,21/05/2015: Tutorial: Finding OTUs that are significantly up or down regulated between different conditions

Based on the DESeq {DESeq2} package, which allows negative binomial GLM fitting and Wald statistics for abundance data

Based on Kruskal-Wallis Test with FDR

23/04/2015: Tutorial: Multivariate Statistical Analysis of Microbial Communities in an Environmental Context

09,16/04/2015: Tutorial: Illumina Amplicons OTU Construction with Noise Removal

02/04/2015: Tutorial: Linux Scripting

26/03/2015: Tutorial: Bayesian Concordance Analysis

19/03/2015: Tutorial: Comparing isolates of a genome based on their gene content

12/03/2015: Tutorial: Assembling and annotating single genomes

Linux command line exercises for NGS data processing

Perl one-liners

Extracting information from GBK files

Identifying duplicates in two FASTA files (awk)

Converting "Sample[TAB]Feature[TAB]Abundance" list to tab-delimited abundance table

Dereplicating Reads

Paired-end assembler

Generating abundance tables and trees from CREST and RDP classifiers

Spatial comparison of read qualities between different sequencing runs (Illumina paired-end reads)

Subsampling FASTA and FASTQ files

Identifying low-coverage genomes in metagenomic shotgun sequencing samples (mock communities)

Variant Calling

Getting linkage information (reads that map to multiple contigs/genomes) from SAM files

Does subsampling improve assembly quality for very high coverage datasets when using Velvet without removing noise from reads?

Extracting 16S rRNA sequences from NCBI's locally installed nt database using blastdbcmd

Resolving NCBI Taxonomy using BioSQL

Illumina Amplicons Processing Workflow - Deprecated

Whole-genome Shotgun Metagenomics Sequencing Data Analysis

Binning metagenomic contigs by coverage and composition using CONCOCT

Metaproteomics Data Analysis Workflow

perbase_quality_FASTQ.sh: Per-base quality score for FASTQ file

average_quality_hist_FASTQ.sh: Average quality distribution for FASTQ file

duplication_hist_FAST[Q/A].sh: Duplication distribution for FASTQ/FASTA file

length_distribution_FAST[Q/A].sh: Length distribution for FASTQ/FASTA file        

average_GC_hist_FAST[Q/A].sh: Average GC distribution for FASTQ/FASTA file

perbase_seqcontent_FAST[Q/A].sh: Per-base sequence content for FASTQ/FASTA file

perbase_GCcontent_FAST[Q/A].sh: Per-base GC content for FASTQ/FASTA file

perbase_ATcontent_FAST[Q/A].sh: Per-base AT content for FASTQ/FASTA file

perbase_Ncontent_FAST[Q/A].sh: Per-base N content for FASTQ/FASTA file

top_kmer_FAST[Q/A].sh: Top N Kmers for FASTQ/FASTA file

compseq.sh: calculate the composition of unique words in sequences

dan.sh: calculate nucleic acid melting temperature

density.sh: calculate nucleic acid density

cpgreport.sh: identify and report CpG-rich regions in nucleotide sequence

newcpgreport.sh: identify CpG islands in nucleotide sequence

fuzznuc.sh: search for patterns in nucleotide sequences

fuzztran.sh: search for patterns in protein sequences (translated)

freak.sh: generate residue/base frequency table

etandem.sh: find tandem repeats in a nucleotide sequence

tcode.sh: identify protein-coding regions using Fickett TESTCODE statistic

getorf.sh: find and extract open reading frames (ORFs)

Bash one-liners for extracting enzyme information from annotated GBK files

Script for extracting/drawing contigs from the GBK file. These contigs contain the enzymes we are interested in

Script for extracting all known enzyme sequences from Uniprot databases

Script for finding motifs within enzyme sequences

Scripts for searching CDS regions extracted from PROKKA against NCBI's CDD databases

Extracting subset of records from FASTA/FASTQ files based on exact/pattern matches of IDs (ONE-LINERS)

R code for ecological data analysis

Making command-line scripts in python

Biopython tutorial

SNP Calling Workflow

09/10/2015: Info: Predicting functional profiles from 16S rRNA metagenomic datasets

Acknowledgement: Caitlin Jukes for her patience while I figured this out

Required R Package: http://tax4fun.gobics.de/

Associated Paper: http://dx.doi.org/10.1093/bioinformatics/btv287

If you have used my OTU construction tutorial (09,16/04/2015: Tutorial: Illumina Amplicons OTU Construction with Noise Removal) before, then here are the steps to follow to generate functional profiles from your OTU data.

You require the following two files:

otus.fa

otu_table.csv

Generating a biom file (this can be done on quince-srv2):

STEP 0: Getting the right version of QIIME

[MScBioinf@quince-srv2 ~/uzi/UP]$ bash

[MScBioinf@quince-srv2 ~/uzi/UP]$ export PYENV_ROOT="/home/opt/.pyenv"

[MScBioinf@quince-srv2 ~/uzi/UP]$ export PATH="$PYENV_ROOT/bin:$PATH"

[MScBioinf@quince-srv2 ~/uzi/UP]$ eval "$(pyenv init -)"

[MScBioinf@quince-srv2 ~/uzi/UP]$ export PATH=/home/opt/MLTreeMap_package_2_061/install/mltreemap_2_061/sub_binaries:$PATH

STEP 1: Assign taxonomy from SILVA

We are going to generate a rep_set_taxonomy folder in the current directory that will contain the taxonomic assignments:

[MScBioinf@quince-srv2 ~/uzi/UP]$ assign_taxonomy.py -i otus.fa -b /home/opt/Tax4Fun_SilvaSSURef_115_NR/SSURef_NR99_115_tax_silva_split.fasta -t /home/opt/Tax4Fun_SilvaSSURef_115_NR/SSURef_NR99_115_tax_silva_split.taxonomy -m blast -o rep_set_taxonomy

STEP 2: Generating a mapping file for your OTUs

We can’t use our OTU table with QIIME as is, so we convert it into a format that make_otu_table.py understands. Here is my workaround:

[MScBioinf@quince-srv2 ~/uzi/UP]$ awk -F"," 'NR==1{for(i=2;i<=NF;i++){conv[i]=$i}}NR>1{printf $1;for(i=2;i<=NF;i++){for(j=1;j<=$i;j++){printf "\t"conv[i];}}printf "\n";}' otu_table.csv | tr -d "\r" > map.txt
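To see what the one-liner produces, here is a minimal sketch on a hypothetical two-sample, two-OTU table (file names and counts are made up): each OTU ends up on one line followed by a sample name repeated once per read counted, which (as I understand it) stands in for the read IDs in a QIIME OTU map.

```shell
# Hypothetical miniature otu_table.csv (two samples, two OTUs)
printf 'OTUs,Sample-A,Sample-B\nOTU-1,2,0\nOTU-2,1,3\n' > mini_otu_table.csv

# Same conversion as above: repeat each sample name once per count
awk -F"," 'NR==1{for(i=2;i<=NF;i++){conv[i]=$i}}NR>1{printf $1;for(i=2;i<=NF;i++){for(j=1;j<=$i;j++){printf "\t"conv[i];}}printf "\n";}' mini_otu_table.csv | tr -d "\r" > mini_map.txt

# OTU-1 is followed by Sample-A twice (count 2); OTU-2 by Sample-A once and Sample-B three times
cat mini_map.txt
```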

STEP 3: Generate the biom file

[MScBioinf@quince-srv2 ~/uzi/UP]$ make_otu_table.py -i map.txt -t rep_set_taxonomy/otus_tax_assignments.txt  -o otu_table.biom

[MScBioinf@quince-srv2 ~/uzi/UP]$ exit

Now download the otu_table.biom file to your local computer

Please note: if for some reason assign_taxonomy.py doesn’t work, make sure you have a ~/.ncbirc file in your home directory with the following content:

[BLAST]

BLASTDB=/home/opt/ncbi-blast-2.2.28+/db

RED-ALERT! MAKE SURE THAT BEFORE FOLLOWING THE ABOVE STEPS YOU REPLACE THE UNDERSCORES “_” IN THE SAMPLE NAMES IN otu_table.csv WITH HYPHENS “-” OR YOU’LL BE SORRY!
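One way to do that replacement (a sketch only; it assumes the sample names appear solely in the header row of otu_table.csv, so check the result afterwards) is a sed edit of line 1, demonstrated here on a hypothetical file:

```shell
# Hypothetical table whose header contains underscored sample names
printf 'OTUs,Sample_A,Sample_B\nOTU-1,2,0\n' > demo_otu_table.csv

# Replace "_" with "-" in the header (sample name) row only
sed -i '1s/_/-/g' demo_otu_table.csv

head -1 demo_otu_table.csv   # OTUs,Sample-A,Sample-B
```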

Now download the following two files, unzip them, and place them in a packages folder in the current directory that your RStudio session points to, i.e., the path returned by the getwd() command:

http://tax4fun.gobics.de/Tax4Fun/Tax4Fun_0.3.1.tar.gz

http://tax4fun.gobics.de/Tax4Fun/ReferenceData/SILVA115.zip

To install the package, run the following code in R:

#We are going to put Tax4Fun and the associated SILVA115 in the packages subfolder
#Use the following for one-time installation in RStudio
library(devtools)
install('packages/Tax4Fun')

Now generate the abund_table from the otu_table.biom file that you downloaded to a data subfolder. The abundance table will contain the KEGG KO numbers:

library(Tax4Fun)
input_biom<-importQIIMEBiomData("data/otu_table.biom")
Tax4FunData<-Tax4Fun(input_biom, "packages/SILVA115/", fctProfiling = T,
                     refProfile = "UProC", shortReadMode = T, normCopyNo = T)
abund_table<-Tax4FunData$Tax4FunProfile
#We need to convert sample names such that "-" are replaced by "_" as the meta_table
#associated with our OTU table still has samples with "_" in them.
rownames(abund_table)<-gsub("-","_",rownames(abund_table))

Now load your meta_table.csv, which should contain a column called Grouping with meaningful categorical data reflecting the hypothesis under which you generated your data. In the statistical analysis that follows, I have two conditions, “Category 1” and “Category 2”.

#Load meta_data
meta_table<-read.csv("data/meta_data.csv",row.names=1,check.names=FALSE)
#A check to ensure both tables have the same samples/ordering
meta_table<-meta_table[rownames(abund_table),]

Please make sure you have read the dissertation https://ediss.uni-goettingen.de/handle/11858/00-1735-0000-0022-5FBD-0?locale-attribute=en and have familiarized yourself with the key concepts, such as FTU, which according to the author is:

“For Tax4Fun, the corresponding measure is termed fraction of taxonomic units unexplained (FTU) which reflects the amount of sequences assigned to a taxonomic unit and not transferable to KEGG reference organisms.”

Now start the analysis. It uses the Kruskal-Wallis test to plot the pathways that are significantly up/down regulated between your conditions, and then feeds those pathways into a random forest classifier to rank the pathways that best separate your conditions (in the Mean Decrease Gini and Mean Decrease Accuracy plots, importance decreases from left to right):

library(ggplot2)
library(gridExtra)
library(randomForest)
library(reshape2)

#Parameters section start RED-ALERT! THIS IS THE ONLY BIT OF CODE YOU NEED TO CHANGE! DON'T YOU DARE TOUCH ANYTHING ELSE!

#The problem with adjusted p-values for multiple comparisons is that they are also a
#function of the total number of comparisons, so you may not find anything significant
#in terms of adjusted p-values if you have thousands of features among which those that
#stick out vary only slightly. In such cases, limit the data to the most abundant features
#or increase kruskal.wallis.adjp.value.threshold.
#For example, here I am extracting the 100 most abundant pathways
abund_table<-abund_table[,order(colSums(abund_table),decreasing=TRUE)][,1:100]
data<-as.data.frame(abund_table)
groups<-meta_table$Grouping
kruskal.wallis.adjp.value.threshold=0.05

#/Parameters section end

#We are going to use the Kruskal-Wallis test, a rank-based nonparametric test that
#can be used to determine whether there are statistically significant differences between
#two or more groups of a categorical independent variable (groups) on a continuous or
#ordinal dependent variable (data[,i])
kruskal.wallis.table <- data.frame()
for (i in 1:dim(data)[2]) {
 ks.test <- kruskal.test(data[,i], g=groups)
 # Store the result in the data frame
 kruskal.wallis.table <- rbind(kruskal.wallis.table,
                               data.frame(id=names(data)[i],
                                          p.value=ks.test$p.value))
 # Report progress for each feature tested
 cat(paste("Kruskal-Wallis test for ",names(data)[i]," ", i, "/",
           dim(data)[2], "; p-value=", ks.test$p.value,"\n", sep=""))
}
#Adjust p-values for multiple comparisons by applying Benjamini & Hochberg (1995)
#Other choices include "holm", "hochberg", "hommel", "bonferroni", "BY" and "fdr"
kruskal.wallis.table$adjp.value <- p.adjust(kruskal.wallis.table$p.value,"BH")

#Now rearrange the table in order of increasing adjp.value
kruskal.wallis.table<-kruskal.wallis.table[order(kruskal.wallis.table$adjp.value),]
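As an aside for readers unfamiliar with p.adjust, the Benjamini-Hochberg procedure can be sketched in Python (illustrative only, not part of the R pipeline): scale each p-value by n/rank, then take a running minimum from the largest p-value downwards.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, mirroring R's p.adjust(p, "BH")."""
    n = len(pvals)
    # indices of p-values, largest first
    order = sorted(range(n), key=lambda i: pvals[i], reverse=True)
    adj = [0.0] * n
    running_min = 1.0
    for rank, i in zip(range(n, 0, -1), order):
        # scale by n/rank and enforce monotonicity with a running minimum
        running_min = min(running_min, pvals[i] * n / rank)
        adj[i] = running_min
    return adj

print(bh_adjust([0.005, 0.02, 0.5]))  # approximately [0.015, 0.03, 0.5]
```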

rm("last.significant.element")
last.significant.element <- max(which(kruskal.wallis.table$adjp.value <= kruskal.wallis.adjp.value.threshold))

if(!(exists("last.significant.element") && is.infinite(last.significant.element))){
 #If you don't find anything significant, the rest of the analysis won't be done
 selected <- 1:last.significant.element
 diff.cat.factor <- kruskal.wallis.table$id[selected]

 
 #Extract the list of features that pass the kruskal.wallis.adjp.value.threshold
 diff.cat <- as.vector(diff.cat.factor)

 #Print the first 20 records of kruskal.wallis.table
 print(kruskal.wallis.table[1:20,])

 
 #Now we plot features that are significantly different between the categories
 df_kruskal_wallis<-NULL
 for(i in diff.cat){
   tmp<-data.frame(data[,i],groups,
                   rep(paste(i,
                             "\np = ",
                             sprintf("%.5g",kruskal.wallis.table[kruskal.wallis.table$id==i,"p.value"]),
                             " ",
                             cut(kruskal.wallis.table[kruskal.wallis.table$id==i,"p.value"],
                                 breaks=c(-Inf, 0.001, 0.01, 0.05, Inf),
                                 label=c("***", "**", "*", "")),
                             " ",
                             " padj = ",
                             sprintf("%.5g",kruskal.wallis.table[kruskal.wallis.table$id==i,"adjp.value"]),
                             " ",
                             cut(kruskal.wallis.table[kruskal.wallis.table$id==i,"adjp.value"],
                                 breaks=c(-Inf, 0.001, 0.01, 0.05, Inf),
                                 label=c("***", "**", "*", "")),
                             sep=""),
                       dim(data)[1]))
   if(is.null(df_kruskal_wallis)){df_kruskal_wallis<-tmp} else {df_kruskal_wallis<-rbind(df_kruskal_wallis,tmp)}
 }
 colnames(df_kruskal_wallis)<-c("Value","Type","Features")

 p<-ggplot(df_kruskal_wallis,aes(Type,Value,colour=Type))
 p<-p+ylab("")
 p<-p+geom_boxplot()
 p<-p+geom_jitter(position = position_jitter(height = 0, width=0))
 p<-p+geom_point(size=5,alpha=0.2)
 p<-p+theme_bw()
 p<-p+facet_wrap( ~ Features , scales="free_x",nrow=1)
 p<-p+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
 p<-p+theme(strip.text.x = element_text(size = 16, colour = "black", angle = 90))
 p<-p+theme(strip.background = element_rect(fill="white"))+theme(panel.margin = unit(0, "lines"))
 p<-p+theme(axis.title.x=element_blank())
 pdf("Tax4Fun_KW.pdf",width=13,height=20)
 print(p)
 dev.off()


 df_FTU<-data.frame(FTU=Tax4FunData$FTU[gsub("_","-",rownames(meta_table))],Type=meta_table$Grouping)

 q<-ggplot(df_FTU,aes(Type,FTU,colour=Type))+ylab("FTU")
 q<-q+geom_boxplot()+geom_jitter(position = position_jitter(height = 0, width=0))
 q<-q+geom_point(size=5,alpha=0.2)
 q<-q+theme_bw()
 q<-q+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))
 q<-q+ggtitle(paste("p =",sprintf("%.5g",kruskal.test(df_FTU$FTU,df_FTU$Type)$p.value),cut(kruskal.test(df_FTU$FTU,df_FTU$Type)$p.value, breaks=c(-Inf, 0.001, 0.01, 0.05, Inf), label=c("***", "**", "*", ""))))
 pdf("Tax4Fun_FTU.pdf",width=4,height=4)
 print(q)
 dev.off()
 
 #Now we will use Breiman's random forest algorithm, which can be used in unsupervised
 #mode for assessing proximities among data points
 subset.data<-data[,diff.cat.factor]
 IDs_map<-data.frame(row.names=gsub(";.*$","",colnames(subset.data)),To=colnames(subset.data))
 colnames(subset.data)<-rownames(IDs_map)
 df.groups<-data.frame(row.names=rownames(subset.data),groups)
 val<-randomForest( df.groups$groups ~ ., data=subset.data, importance=T, proximity=T,ntree=1500,keep.forest=F)

 
 #We then extract the importance measures as produced by randomForest()
 #From the manual of importance():
 #"The first measure is computed from permuting OOB data:
 #For each tree, the prediction error on the out-of-bag portion of the data is recorded
 #(error rate for classification, MSE for regression). Then the same is done after
 #permuting each predictor variable. The difference between the two are then averaged
 #over all trees, and normalized by the standard deviation of the differences. If the
 #standard deviation of the differences is equal to 0 for a variable, the division is
 #not done (but the average is almost always equal to 0 in that case).
 #The second measure is the total decrease in node impurities from splitting on the
 #variable, averaged over all trees. For classification, the node impurity is measured
 #by the Gini index. For regression, it is measured by residual sum of squares."
 imp<-importance(val)

 df_accuracy<-data.frame(row.names=NULL,Sample=rownames(imp),Value=abs(as.numeric(imp[,"MeanDecreaseAccuracy"])),Index=rep("Mean Decrease Accuracy",dim(imp)[1]))
 df_gini<-data.frame(row.names=NULL,Sample=rownames(imp),Value=as.numeric(imp[,"MeanDecreaseGini"]),Index=rep("Mean Decrease Gini",dim(imp)[1]))

 
 #Rearrange the features in terms of importance for ggplot2 by changing factor levels
 df_accuracy$Sample<-IDs_map[as.character(df_accuracy$Sample),"To"]
 df_accuracy_order<-as.character(IDs_map[rownames(imp),"To"][order(abs(as.numeric(imp[,"MeanDecreaseAccuracy"])),decreasing=T)])
 df_accuracy$Sample<-factor(as.character(df_accuracy$Sample),levels=df_accuracy_order)

 df_gini$Sample<-IDs_map[as.character(df_gini$Sample),"To"]
 df_gini_order<-as.character(IDs_map[rownames(imp),"To"][order(abs(as.numeric(imp[,"MeanDecreaseGini"])),decreasing=T)])
 df_gini$Sample<-factor(as.character(df_gini$Sample),levels=df_gini_order)


 r<-ggplot(data=df_accuracy,aes(Sample,Value))
 r<-r+geom_bar(fill=I("red"),stat="identity")+theme_bw()
 r<-r+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+ylab("Mean Decrease Accuracy")
 r<-r+theme(axis.title.x=element_blank())

 pdf("Tax4Fun_MDA.pdf",width=5,height=9)
 print(r)
 dev.off()
 s<-ggplot(data=df_gini,aes(Sample,Value))
 s<-s+geom_bar(fill=I("red"),stat="identity")+theme_bw()
 s<-s+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+ylab("Mean Decrease Gini")
 s<-s+theme(axis.title.x=element_blank())

 pdf("Tax4Fun_MDG.pdf",width=5,height=9)
 print(s)
 dev.off()
 df_confusion_matrix<-melt(as.matrix(val$confusion[,-3]))
 #reshape2's melt() names the matrix dimensions Var1/Var2
 t<-ggplot(df_confusion_matrix, aes(Var1, Var2, group=Var2))
 t<-t+geom_tile(aes(fill = value))
 t<-t+geom_text(aes(fill = value, label = value)) + theme_bw()+scale_fill_gradient(low = "white", high = "red")
 t<-t+theme(axis.title.x=element_blank(),axis.title.y=element_blank())
 t<-t+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))

 pdf("Tax4Fun_CM.pdf",width=5,height=4)
 print(t)
 dev.off()
}


04/10/2015: Info: Obtaining KEGG modules for CONCOCT clusters

Required Script: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/KO2MODULEclusters2.py

I have implemented a workflow for obtaining KEGG modules for CONCOCT clusters, based on http://www.genome.jp/kegg/tool/map_module.html , which lets you upload a tab-delimited list of KEGG K numbers for a given cluster and reports the reconstructed KEGG modules, including those that are missing 1 or 2 blocks. I believe this is a competitive strategy to the alternatives, i.e., HUMAnN ( http://huttenhower.sph.harvard.edu/humann ) or MinPath ( http://omics.informatics.indiana.edu/MinPath/ ), which I have used in the past ( http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics.html ). However, using map_module.html is challenging because, to the best of my knowledge, there is no REST service that reconstructs modules from K numbers automatically. My proposed solution is to construct HTTP POST queries and upload the K numbers automatically using python's urllib2 package, then parse the resulting HTML page with the BeautifulSoup package to extract the modules and how many blocks are missing from each (completeness). Thus it is fairly simple (albeit tedious) to obtain the KEGG modules for clusters along with the missing-blocks information.

Getting the total number of blocks for a given module is not an easy task, as there is no REST service that provides it. The reason is the logical expressions used in module definitions: an M number entry is defined by a logical expression of K numbers (and other M numbers), allowing automatic evaluation of whether the gene set is complete, i.e., whether the module is present in a given genome. A space or a plus sign represents an AND operation, and a comma represents an OR operation. A plus sign is used for a molecular complex, and a minus sign designates an optional item in the complex. Take M00001, for instance ( http://togows.dbcls.jp/entry/orthology/M00001/definition ):

 

(K00844,K12407,K00845,K00886,K08074,K00918) (K01810,K06859,K13810,K15916) (K00850,K16370,K00918) (K01623,K01624,K11645,K16305,K16306) K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406)

If you go to http://www.genome.jp/kegg-bin/show_module?map=M00001 , you will notice in the lower left corner that it has 9 blocks in total. So I came up with a regular expression to calculate the total number of blocks for a given module definition in the python code (KO2MODULEclusters2.py):

>>> import re

>>> a="(K00844,K12407,K00845,K00886,K08074,K00918) (K01810,K06859,K13810,K15916) (K00850,K16370,K00918) (K01623,K01624,K11645,K16305,K16306) K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406)"

>>> len(re.sub(r"\([\w\+\-\,\s]+?\)","AND",re.sub(r" \(.*?\)"," AND",re.sub(r"\(.*?\) ","AND ",re.sub(r" \(.*?\) "," AND ",re.sub(r"\([\w\+\-\,\s]+?\)","AND",re.sub(r" \([\w\+\-\,]+?\)"," AND",re.sub(r"\([\w\+\-\,]+?\) ","AND ",re.sub(r" \([\w\-\+\,]+?\) "," AND ",a)))))))).split(" "))

9

>>>

Similarly, for M00002 there are 5 blocks ( http://www.genome.jp/kegg-bin/show_module?map=M00002 ):

>>> a="K01803 ((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406)"

>>> len(re.sub(r"\([\w\+\-\,\s]+?\)","AND",re.sub(r" \(.*?\)"," AND",re.sub(r"\(.*?\) ","AND ",re.sub(r" \(.*?\) "," AND ",re.sub(r"\([\w\+\-\,\s]+?\)","AND",re.sub(r" \([\w\+\-\,]+?\)"," AND",re.sub(r"\([\w\+\-\,]+?\) ","AND ",re.sub(r" \([\w\-\+\,]+?\) "," AND ",a)))))))).split(" "))

5

>>>
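The nested re.sub chain is hard to audit, so here is an equivalent, easier-to-read way of counting blocks (my own sketch, not the code inside KO2MODULEclusters2.py): scan the definition and count top-level space-separated groups, treating anything inside parentheses as part of the same block.

```python
def count_blocks(definition):
    """Count top-level space-separated blocks in a KEGG module definition."""
    blocks, depth, in_block = 0, 0, False
    for ch in definition:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        if ch == ' ' and depth == 0:
            in_block = False          # a top-level space ends the current block
        elif not in_block:
            blocks += 1               # first character of a new block
            in_block = True
    return blocks

m00001 = ("(K00844,K12407,K00845,K00886,K08074,K00918) (K01810,K06859,K13810,K15916) "
          "(K00850,K16370,K00918) (K01623,K01624,K11645,K16305,K16306) K01803 "
          "((K00134,K00150) K00927,K11389) (K01834,K15633,K15634,K15635) K01689 (K00873,K12406)")
print(count_blocks(m00001))  # 9, matching the regular-expression result above
```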

Thus, using KO2MODULEclusters2.py, we obtain a CSV file with dimensions MODULES x CLUSTERS whose values lie between 0 and 1, i.e., how complete the module is for a given cluster.
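A plausible reading of these values (my assumption about the arithmetic, not something I have verified against KO2MODULEclusters2.py) is completeness = (total blocks - missing blocks) / total blocks:

```python
def module_completeness(total_blocks, missing_blocks):
    # Fraction of the module's blocks covered by the cluster's K numbers
    return (total_blocks - missing_blocks) / float(total_blocks)

# e.g. a 3-block module with 2 blocks missing is one third complete
print(module_completeness(3, 2))  # 0.3333...
```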

The contig-based metagenomic workflow is then:

-Assemble metagenomic reads into contigs

-Bin contigs into clusters using CONCOCT (http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/CONCOCT.html)

-Annotate clusters using PROKKA to identify CDS regions

-Search CDS regions against NCBI's CDD using RPSBLAST to obtain COG information

-Concatenate multiple COG sequences from both clusters and references, align them using Muscle/MAFFT, generate the tree in Newick format using FastTree, and then use the R script given in 26/03/2015: Tutorial: Bayesian Concordance Analysis to visualise the newly discovered genomes and assign them taxonomy based on LCA

-Use rapsearch to match CDS regions against KEGG database to obtain K numbers

-Construct an abundance table of K numbers for metagenomic samples

-Run KO2MODULEclusters2.py to obtain M numbers which you can use in your statistical analysis or upload to iPATH to obtain metabolic networks (as discussed in http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Metaproteomics.html )

Note: It will require very little effort to extend this functionality to RNA-Seq and Proteomics datasets.

Input test.csv

[uzi@quince-srv2 ~/test]$ cat test.csv

Clusters,C0,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10

ko:K00001,1,0,0,2,2,0,0,0,1,0,0

ko:K00002,0,0,0,0,0,0,0,0,0,0,0

ko:K00003,1,0,1,0,0,0,2,0,0,0,0

ko:K00004,0,0,0,0,0,0,0,0,0,0,0

ko:K00005,0,0,0,0,0,0,1,0,0,0,0

ko:K00006,0,0,0,0,0,0,0,0,0,0,0

ko:K00007,0,0,0,0,0,0,0,0,0,0,0

ko:K00008,0,2,0,3,0,3,2,0,0,0,6

ko:K00009,0,0,0,0,0,0,0,0,0,0,0

ko:K00010,0,19,0,8,0,22,50,2,0,2,20

ko:K00011,0,0,0,0,0,0,0,0,0,0,0

ko:K00012,1,1,0,2,2,0,1,2,1,0,2

ko:K00013,0,1,1,1,0,0,2,1,1,0,2

ko:K00014,0,1,1,1,1,0,1,1,0,0,2

ko:K00015,0,0,0,0,0,0,0,0,0,1,0

ko:K00016,0,0,1,1,0,0,1,0,0,0,0

ko:K00018,0,0,0,0,0,0,0,0,0,0,1

ko:K00019,0,0,0,0,3,0,0,0,1,0,0

ko:K00020,0,0,0,0,1,0,0,0,0,0,0

ko:K00021,0,0,1,0,0,0,0,0,0,0,0

ko:K00023,0,0,0,0,0,0,0,0,1,0,0

ko:K00024,0,0,1,0,0,1,1,0,1,0,1

ko:K00026,0,0,0,0,0,0,0,0,0,0,0

ko:K00027,1,0,1,1,0,0,2,1,0,1,1

ko:K00029,0,0,0,0,1,0,0,0,0,0,2

ko:K00030,1,2,1,2,0,0,1,0,0,1,0

ko:K00031,0,0,0,0,0,0,2,0,0,0,4

ko:K00032,0,0,0,0,0,0,0,0,0,0,0

ko:K00033,2,1,0,1,0,0,1,0,0,0,2

ko:K00034,0,0,0,0,0,0,0,0,0,0,0

ko:K00035,0,0,0,0,0,1,0,0,0,0,0

ko:K00036,0,1,0,0,0,0,0,0,0,0,1

ko:K00038,1,0,0,0,1,0,0,0,0,0,0

ko:K00039,0,0,0,0,0,0,0,0,0,0,0

ko:K00040,0,0,0,0,0,0,1,0,0,0,0

ko:K00041,0,0,0,0,0,0,0,0,0,0,3

ko:K00042,0,0,0,1,1,0,1,0,0,0,0

ko:K00045,0,0,0,0,0,0,0,0,0,0,0

ko:K00046,0,3,0,1,1,0,1,0,2,0,2

ko:K00048,0,0,0,0,0,0,0,0,0,0,0

ko:K00049,0,0,0,0,0,0,0,0,0,0,0

ko:K00050,1,0,0,0,2,0,0,1,1,0,0

ko:K00052,0,1,1,1,0,1,2,0,0,0,3

ko:K00053,0,1,1,0,1,0,1,0,1,0,1

ko:K00054,1,0,0,0,0,0,0,0,0,0,2

ko:K00055,0,0,0,0,0,0,0,0,0,0,0

ko:K00057,0,1,0,1,3,0,1,1,1,3,2

ko:K00058,3,3,3,3,2,4,4,1,1,1,6

ko:K00059,4,6,1,7,18,11,9,1,4,1,7

ko:K00060,1,3,0,1,0,5,2,0,0,1,1

ko:K00062,0,0,0,0,0,0,0,0,0,0,0

ko:K00063,0,0,0,0,0,0,0,0,0,0,0

ko:K00064,0,2,0,0,0,3,1,0,0,0,0

ko:K00065,2,0,0,0,0,0,0,0,0,0,2

ko:K00066,0,0,0,0,0,0,0,0,0,0,0

ko:K00067,2,2,1,2,1,1,1,4,0,1,1

ko:K00068,0,0,0,0,0,0,1,0,0,0,3

ko:K00070,0,0,0,0,0,0,0,0,0,0,0

ko:K00071,0,0,0,0,0,0,0,0,0,0,0

ko:K00072,0,0,0,0,0,0,0,0,0,0,0

ko:K00073,0,1,0,0,1,0,0,0,0,0,0

ko:K00074,0,0,0,0,2,0,0,0,0,1,1

ko:K00075,1,0,0,0,1,0,1,1,1,1,2

ko:K00076,0,0,0,0,1,1,0,0,0,0,0

ko:K00077,1,0,0,0,1,0,1,1,0,0,3

ko:K00078,1,2,0,0,0,0,1,0,0,0,1

ko:K00079,0,0,0,0,0,0,0,0,0,0,0

ko:K00082,0,0,0,0,0,0,0,0,0,0,0

ko:K00086,0,0,0,0,0,0,0,0,0,0,0

ko:K00087,6,0,0,1,1,1,0,0,0,0,0

ko:K00088,2,1,5,2,1,1,2,2,2,1,3

ko:K00090,1,0,0,0,0,0,0,0,0,0,0

ko:K00091,1,0,0,1,1,0,0,0,0,1,3

ko:K00094,0,0,0,0,0,0,0,0,0,0,0

ko:K00096,0,1,1,0,0,0,0,0,0,1,0

ko:K00097,1,1,0,1,1,0,1,0,1,1,1

ko:K00098,0,0,0,0,0,0,0,0,0,0,0

ko:K00099,0,1,0,2,2,0,1,1,1,1,0

ko:K00100,2,3,0,1,2,0,0,0,1,0,5

ko:K00101,0,0,0,0,2,0,0,0,0,0,0

Run the program as:

[uzi@quince-srv2 ~/Chris]$ python KO2MODULEclusters2.py -i test.csv -o testMODULES.csv

[2015-10-04 15:01:03] Loading test.csv

[2015-10-04 15:01:03] Uploading KOs from C0 to http://www.genome.jp/kegg-bin/find_module_object

[2015-10-04 15:01:04] Extracted Modules for C0

[2015-10-04 15:01:04] Uploading KOs from C1 to http://www.genome.jp/kegg-bin/find_module_object

[2015-10-04 15:01:06] Extracted Modules for C1

[2015-10-04 15:01:06] Uploading KOs from C2 to http://www.genome.jp/kegg-bin/find_module_object

[2015-10-04 15:01:07] Extracted Modules for C2

[2015-10-04 15:01:07] Uploading KOs from C3 to http://www.genome.jp/kegg-bin/find_module_object

[2015-10-04 15:01:09] Extracted Modules for C3

[2015-10-04 15:01:09] Uploading KOs from C4 to http://www.genome.jp/kegg-bin/find_module_object

[2015-10-04 15:01:10] Extracted Modules for C4

[2015-10-04 15:01:10] Uploading KOs from C5 to http://www.genome.jp/kegg-bin/find_module_object

[2015-10-04 15:01:12] Extracted Modules for C5

[2015-10-04 15:01:12] Uploading KOs from C6 to http://www.genome.jp/kegg-bin/find_module_object

[2015-10-04 15:01:13] Extracted Modules for C6

[2015-10-04 15:01:13] Uploading KOs from C7 to http://www.genome.jp/kegg-bin/find_module_object

[2015-10-04 15:01:15] Extracted Modules for C7

[2015-10-04 15:01:15] Uploading KOs from C8 to http://www.genome.jp/kegg-bin/find_module_object

[2015-10-04 15:01:17] Extracted Modules for C8

[2015-10-04 15:01:17] Uploading KOs from C9 to http://www.genome.jp/kegg-bin/find_module_object

[2015-10-04 15:01:18] Extracted Modules for C9

[2015-10-04 15:01:18] Uploading KOs from C10 to http://www.genome.jp/kegg-bin/find_module_object

[2015-10-04 15:01:20] Extracted Modules for C10

[2015-10-04 15:01:20] Resolving definitions for returned modules

[2015-10-04 15:01:47] Saving testMODULES.csv

The following output is generated:

[uzi@quince-srv2 ~/test]$ cat testMODULES.csv

Clusters,C0,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10

M00432,0.0,0.333333333333,0.333333333333,0.333333333333,0.0,0.333333333333,0.333333333333,0.0,0.0,0.0,0.333333333333

M00169,0.0,0.0,0.0,0.0,0.5,0.0,0.0,0.0,0.0,0.0,0.5

M00006,0.5,0.5,0.0,0.5,0.0,0.0,0.5,0.0,0.0,0.0,0.5

M00083,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0

M00020,0.333333333333,0.333333333333,0.333333333333,0.333333333333,0.333333333333,0.333333333333,0.333333333333,0.333333333333,0.333333333333,0.333333333333,0.333333333333

M00010,0.333333333333,0.333333333333,0.333333333333,0.333333333333,0.0,0.0,0.333333333333,0.0,0.0,0.333333333333,0.333333333333

M00535,0.0,0.333333333333,0.333333333333,0.333333333333,0.0,0.333333333333,0.333333333333,0.0,0.0,0.0,0.333333333333

M00168,0.0,0.0,0.5,0.0,0.0,0.5,0.5,0.0,0.5,0.0,0.5

04/10/2015: Info: Extracting 16S rRNA genes from metagenomics datasets using REAGO

Software: https://github.com/chengyuan/reago-1.1

Step 1: Create a test folder and change to python 2.7.8:

[uzi@quince-srv2 ~/test]$ bash

[uzi@quince-srv2 ~/test]$ export PYENV_ROOT="/home/opt/.pyenv"

[uzi@quince-srv2 ~/test]$ export PATH="$PYENV_ROOT/bin:$PATH"

[uzi@quince-srv2 ~/test]$ eval "$(pyenv init -)"

[uzi@quince-srv2 ~/test]$ export PATH=/home/opt/gt-1.5.2-Linux_x86_64-32bit/bin:$PATH

Step 2: Run reago (you must ensure the sequence names of a read pair are XXXX.1 and XXXX.2):

[uzi@quince-srv2 ~/test]$ python /home/opt/reago-1.1/filter_input.py /home/opt/reago-1.1/sample_1.fasta /home/opt/reago-1.1/sample_2.fasta filter_out /home/opt/reago-1.1/cm ba 10

Indentifying 16S reads

[uzi@quince-srv2 ~/test]$ python /home/opt/reago-1.1/reago.py filter_out/filtered.fasta testing -l 101

Sun Oct  4 13:25:37 2015 REAGO (v1.10) started...

Input file: filter_out/filtered.fasta

Parameters:

-e 0.05

-f 1350

-b 10

-l 101.0

-o 0.7

-t 30

Sun Oct  4 13:25:37 2015 Reading input file...

Sun Oct  4 13:25:37 2015 Initializing overlap graph...

Sun Oct  4 13:25:37 2015 Recovering 16S rRNAs...

Sun Oct  4 13:25:40 2015 Scaffolding on short 16S rRNA segments...

Sun Oct  4 13:25:40 2015 Write to Files...

Sun Oct  4 13:25:40 2015 Done.

- Number of 16S rRNAs: 2

- Full genes: testing/full_genes.fasta

- Gene fragments: testing/fragments.fasta
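If your read names do not already follow the XXXX.1/XXXX.2 convention required in Step 2, a sketch like the following (file names hypothetical) strips the description from each FASTA header and appends the required suffix; use ".1" for the forward-read file and ".2" for the reverse-read file.

```shell
# Hypothetical forward-read file; substitute your own R1/R2 FASTA files in practice
printf '>readA some description\nACGT\n>readB\nTTGA\n' > demo_R1.fasta

# Drop everything after the first space in each header and append ".1"
awk '/^>/{sub(/ .*/,""); print $0".1"; next}{print}' demo_R1.fasta > demo_R1.renamed.fasta
head -4 demo_R1.renamed.fasta
```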

Step 3: Look at your exported 16S genes:

[uzi@quince-srv2 ~/test]$ head testing/full_genes.fasta

>gene_1_len=1641

AAAGTCAATTTCTTTGGGTCTAACGACTCAAAGTATTTTTTAGCCGGATCAAACAGATTAAACTCTACAACGGAGAGTTTGATCCTGGCTCAGGACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCAAAGGAAAGCAGCTTCGGCTGGGAGTACTTGGCGCAAGGGTGAGTAACGTATAGGTAATCTGCCCTTTGGACTGGAATAACCCCGAGAAATCGGGGACAATACCAGATGAAGCAGCGACAATCGCATGGTTGTTCTGCCAAAGATTTATCGCCAAAGGATGAACCTATATCCCATCAGGTAGTTGGTAAGGTAACGGCTTACCAAGCCTACGACGGGTAGCTGGTCTGAGAGGATGATCAGCCACATTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGAGGAATATTGCGCAATGGGCGAAAGCCTGACGCAGCAACGCCGCGTGGATGATGAAGTTCTTCGGAATGTAAAGTCCTTTTGTAGAGGAAGAATATCCCGGTTTACCGGGACTGACGGTACTCTGCGAATAAGCCACGGCTAACTCTGTGCCAGCAGCCGCGGTGATACAGGGGTGGCAAGCGTTGTCCGGATTTACTGGGTGTAAAGGGTGCGCAGGCGGGCCGATAAGTCGGGGGTTAAATCCATGTGCTTAACACATGCATGGCTTCCGATACTGTCGGCCTAGAGTCTCGAAGAGGAAGATGGAATTTCCGGTGTAACGGTGGAATGTGTAGATATCGGAAAGAACACCAGTGGCGAAGGCAGTCTTCTGGTCGAGCACTGACGCTCAGGCACGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGAATACTAGATGTTGGTCATATTGATCAGTGTCGCAGCTAACGCGTTAAGTATTCCACCTGGGAAGTACGCCCGCAAGGGTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGATCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGGCTTGATATGGCGACTAAACTCATTGAAAGATGAGGTGCTTCGGCGAGTCGTCACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCCTACAATTAGTTACTAACAGGTTAAGCTGAGGACTCTAATTGAACTGCCTACGCAAGTAGTGAGGAAGGAGGGGATGACGTCAAGTCCTCATGGCCCTTACGCCCAGGGCCACACACGTGATACAATGGTAGCTACAGAGGGCAAAGCCGCGAGGCAGAGTTAATCCCTTAAAAGCTATCTCAGTCCGGATCGGAGTCTGCAACTCGACTCCGTGAAGTTGGAATCGCTAGTAATCGCAGATCAGCATGCTGCGGTGAATGTGTTCCCGGGCCTTGTACACACCGCCCGTCAAGTCATGGAAGTCAGGAGTACCCAAAGACACTCGCGTGTTTAAGGTAAGACTGGTAACTGGGACTAAGTCGTAACAAGGTAGCCGTACCGGAAGGTGCGGCTGGATCACCTCCTTTCAATGGAGATTGGCTGACAGCAATGTCGGTGCAAACTCAAAAAGTACCGATCCGACTAAGAAATATGACTTTG

>gene_2_len=1636

CTTTCGAGCGCTGTGAGGCTGGTTCCTCTGTTGACCTCCGTCAACAGATGGTAACCCTTCAGGTTTCAAACGAGAGTTTGATCCTGGCTCAGAATCAACGCTGGCGGCGTGCCTAACACATGCAAGTCGAACAAGAAAGGGACTTCGGTCCTGAGTACAGTGGCGCACGGGTGAGTAACACGTGACTAACCTACCCTCGAGTGGGGAATAACTTCGGGAAACCGAGGCTAATACCGCATAATACCCACGGGTCAAAGGAGCAATTCGCTTGAGGAGGGGGTCGCGGCCGATTAGCTAGTTGGCGGGGTAATGGCCCACCAAGGCAGTGATCGGTATCCGGCCTGAGAGGGCGCACGGACACACTGGAACTGAAACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATTTTGCGCAATGGGGGAAACCCTGACGCAGCAACGCCGCGTGGAGGATGAAGTCTCTTGGGACGTAAACTCCTTTCGATCGGAACGATTATGACGGTACCGGAAGAAGAAGCCCCGGCTAACTTCGTGCCAGCAGCCGCGGTAATACGAGGGGGGCGAGCGTTGTTCGGAATTATTGGGCGTAAAGGGTGCGTAGGCGGTTCGGTAAGTTTGATGTGAAATCTTCGGGCTCAACTCGAAGTCTGCATCGAAAACTGCCGGGCTTGAGTGTGGGAGAGGTGAGTGGAATTTCCGGTGTAGCGGTGAAATGCGTAGATATCGGAAGGAACACCTGTGGCGAAAGCGGCTCACTGGACCACAACTGACGCTGATGCACGAAAGCTAGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCTAGCCCTAAACGATGATCGCTTGGTGTGGCGGGTACCCAATCCCGTCGTGCCGTAGCTAACGCGTTAAGCGATCCGCCTGGGGAGTACGGTCGCAAGGCTGAAACTCAAAGGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGACGCAACGCGAAGAACCTTACCTGGGCTCGAAATGTAGTGGACCGGGGTAGAAATATCCCTTCCCCGCAAGGGGCTGCTATATAGGTGCTGCATGGCTGTCGTCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATTGCCAGTTGCTACCATTTAGTTGAGCACTCTGGTGAGACCGCCTCGGATAACGGGGAGGAAGGTGGGGATGACGTCAAGTCCTCATGGCCTTTATGTCCAGGGCTACACACGTGCTACAATGGCCGGTACAAACCGCCGCAAACCCGCGAGGGGGAGCTAATCGGAAAAAGCCGGCCTCAGTTCGGATTGTAGTCTGCAACTCGACTACATGAAGCTGGAATCGCTAGTAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACATCACGAAAGTGGGTCGTACTAGAAGCGGGTGAGCCAACCGTAAGGAGGCAGCCTTCCAAGGTGTGATTCATGATTGGGGTGAAGTCGTAACAAGGTAGCCGTAGGAGAACCTGCGGCTGGATCACCTCCTTTCTAAGAGAGAACGTCACGGCACTCTTATACTTCCGCGATAGCGGACTGCGAAGGCAGCGTATAAGGGATGT
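As a quick sanity check, the reported gene count and the len= fields in the headers can be verified with standard shell tools. A sketch on a toy FASTA (the real file would be testing/full_genes.fasta; toy_genes.fasta is a hypothetical stand-in):

```shell
# Toy FASTA standing in for testing/full_genes.fasta
printf '>gene_1_len=8\nACGTACGT\n>gene_2_len=4\nACGT\n' > toy_genes.fasta

# Number of genes = number of header lines
grep -c ">" toy_genes.fasta
# prints: 2

# Report each header together with the actual sequence length
awk '/^>/{name=substr($0,2);next}{print name"\t"length($0)}' toy_genes.fasta
# prints: gene_1_len=8  8
#         gene_2_len=4  4
```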

Go back to python 2.6.6

[uzi@quince-srv2 ~/test]$ exit

07/09/2015: Info: Paired-end assembler and primer mismatch counting

To assemble paired-end reads, copy the following code, save it as align.sh, and run chmod +x align.sh to make it executable:

#!/bin/bash

#Paired-end assembler by Umer Zeeshan Ijaz http://userweb.eng.gla.ac.uk/umer.ijaz

#Dependencies: bioawk & water from EMBOSS utilities

#Usage: ./align.sh forward.fasta reverse.fasta FLAG_FORWARD FLAG_REVERSE GAPOPEN GAPEXTEND VERBOSE

#           ./align.sh forward.fasta reverse.fasta 4 1 5 0.5 1

#           ./align.sh reference.fasta <(for i in $(seq 1 128); do echo -e ">454_27YMF\nAGAGTTTGATYMTGGCTCAG"; done) 4 4 10 0.5 3

#Flags description:

#             FLAG_FORWARD/FLAG_REVERSE == 1 reverse complement the reads in forward.fasta/reverse.fasta

#             FLAG_FORWARD/FLAG_REVERSE == 2 reverse the reads in forward.fasta/reverse.fasta

#             FLAG_FORWARD/FLAG_REVERSE == 3 complement the reads in forward.fasta/reverse.fasta

#             FLAG_FORWARD/FLAG_REVERSE == 4/OTHERS use the reads as they are

#             VERBOSE==1 print alignment

#             VERBOSE==2 print match difference

#             VERBOSE==3 print length of primer, length of match, mismatch count, indel count

#             VERBOSE==4 print matches

paste <(bioawk -cfastx '{print ">"$1" "$4"\n"$2}' $1) <(bioawk -cfastx '{print ">"$1" "$4"\n"$2}' $2) | \

        perl -nse 'chomp($_);push @a, $_; @a = @a[@a-2..$#a];

        if ($. % 2 == 0){

                @q=split("\t",$a[1]);

                if ($fl == 1) {

                        $q[0]=join("",map{$_ =~ tr/ACGTYRKMBVDH/TGCARYMKVBHD/; $_} reverse split("",$q[0]))}

                elsif ($fl == 2) {

                        $q[0]=reverse split("",$q[0])}

                elsif ($fl == 3) {

                        $q[0]=join("",map{$_ =~ tr/ACGTYRKMBVDH/TGCARYMKVBHD/; $_} split("",$q[0]))};

                if ($rl == 1) {

                        $q[1]=join("",map{$_ =~ tr/ACGTYRKMBVDH/TGCARYMKVBHD/; $_} reverse split("",$q[1]))}

                elsif ($rl == 2) {

                        $q[1]=reverse split("",$q[1])}

                elsif ($rl == 3) {

                        $q[1]=join("",map{$_ =~ tr/ACGTYRKMBVDH/TGCARYMKVBHD/; $_} split("",$q[1]))};

                $r=qx/water -asequence=asis:$q[0] -bsequence=asis:$q[1] -gapopen=$go -gapextend=$ge -stdout -auto -aformat3 markx3/;

                $r=~s/#.*\n//g; #Remove all the comments

                $r=~s/^\s*\n//g; #Remove all empty lines with spaces

                $r=~s/\n//g; #Remove all enters

                $r=~/>asis \.\.(.*)>asis \.\.(.*)/;

            #Define IUPAC degenerate basis

            my %A = ("R"=>1, "W"=>1, "M"=>1, "D"=>1, "H"=>1, "V"=>1, "N"=>1);

            my %C = ("Y"=>1, "S"=>1, "M"=>1, "B"=>1, "H"=>1, "V"=>1, "N"=>1);

            my %G = ("R"=>1, "S"=>1, "K"=>1, "B"=>1, "D"=>1, "V"=>1, "N"=>1);

            my %T = ("Y"=>1, "W"=>1, "K"=>1, "B"=>1, "D"=>1, "H"=>1, "N"=>1);

                if ($1 ne ""){

                        @h=split(" ",$a[0]);

                        $m_f=$1;$m_r=$2;

                    $m_f_g=$m_f;

                    $m_f_g=~s/-//g; #Remove gaps before searching    

                        $q[0]=~/(.*)$m_f_g(.*)/;

                        $f_l=$1;$f_r=$2;

                    $m_r_g=$m_r;

                    $m_r_g=~s/-//g; #Remove gaps before searching

                        $q[1]=~/(.*)$m_r_g(.*)/;

                        $r_l=$1;$r_r=$2;

                        if($v==1){

                                        print ">".substr($h[0],1)."\n";

                                        print "F.L:".$f_l."\n";

                                        print "F.Match:".$m_f."\n";

                                        print "F.R:".$f_r."\n";

                                        print "R.L:".$r_l."\n";

                                        print "R.Match:".$m_r."\n";

                                        print "R.R:".$r_r."\n";

                        }

                    elsif(($v==2)||($v==3)){

                    print substr($h[0],1)."\t";

                    @mm_f=split("",$m_f);

                    @mm_r=split("",$m_r);

                    $icount=0;

                    $mcount=0;

                    for($k=0;$k<=$#mm_f;$k++){

                                         if($mm_f[$k] eq $mm_r[$k]){

                                            if($v==2){print "___";}

                                    }

                            elsif(($mm_f[$k] eq "-") || ($mm_r[$k] eq "-")){

                                    if($v==2){print $mm_f[$k]."/".$mm_r[$k];}

                                    $icount++;

                                    }

                            else

                                    {

                                                if ($mm_r[$k]=~/[RYSWKBDHVNM]/){

                                                            if ((($mm_f[$k] eq "A") && exists ($A{$mm_r[$k]}))

                                                            || (($mm_f[$k] eq "C") && exists($C{$mm_r[$k]}))

                                                            || (($mm_f[$k] eq "T") && exists($T{$mm_r[$k]}))

                                                            || (($mm_f[$k] eq "G") && exists($G{$mm_r[$k]}))){

                                                            if($v==2){print "___";}

                                                    }

                                                    else {

                                                            if($v==2){print $mm_f[$k]."/".$mm_r[$k];}

                                                            $mcount++;

                                                            }

                                                    } 

                                    }

                                         }

                    if($v==3){print length($q[1])."\t".length($m_f)."\t".$mcount."\t".$icount;}   

                    print "\n";

                                   

                    }

                        elsif($v==4){

                            print substr($h[0],1)."\n";

                    print "\tF.Match:".$m_f."\n\tR.Match:".$m_r."\n";

                            }     

                        elsif((length($f_r)==0) && (length($r_l)==0)){ #Case where you have left overhang

                                print ">".substr($h[0],1)."\n".$f_l.$m_f.$r_r."\n";

                                }

                        elsif((length($f_l)==0) && (length($r_r)==0)){ #Case where you have right overhang

                                print ">".substr($h[0],1)."\n".$r_l.$m_f.$f_r."\n";                      

                                }

                        else { #Case where one read is included in the other read

                                print ">".substr($h[0],1)."\n";

                                if(length($f_l)>=length($r_l)){

                                        print $f_l;

                                }

                                else {

                                        print $r_l;

                                }

                                print $m_f;

                                        if(length($f_r)>=length($r_r)){

                                                print $f_r;

                                        }

                                        else {

                                                print $r_r;

                                        }

                                        print "\n";

                        }  

                }

                }' -- -fl=$3 -rl=$4 -go=$5 -ge=$6 -v=$7
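Internally, the FLAG options amount to reversing and/or complementing each read before alignment. The same transformations can be sketched with rev and tr (using the 27YMF primer from the usage example; the extended tr alphabets cover IUPAC degenerate bases such as Y and M):

```shell
seq="AGAGTTTGATYMTGGCTCAG"

echo "$seq" | rev | tr 'ACGTYRKMBVDH' 'TGCARYMKVBHD'   # FLAG==1: reverse complement -> CTGAGCCAKRATCAAACTCT
echo "$seq" | rev                                       # FLAG==2: reverse            -> GACTCGGTMYTAGTTTGAGA
echo "$seq" | tr 'ACGTYRKMBVDH' 'TGCARYMKVBHD'          # FLAG==3: complement         -> TCTCAAACTARKACCGAGTC
```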

Although you can use it to merge paired-end files, here is an example of matching a forward primer against a reference database comprising 128 sequences.

First, I will use the VERBOSE switch as 4 to check whether the primer is aligning:

./align.sh reference.fasta <(for i in $(seq 1 128); do echo -e ">454_27YMF\nAGAGTTTGATYMTGGCTCAG"; done) 4 4 10 0.5 4

Acidobacterium_capsulatum_ATCC_51196_R_074106.1

        F.Match:AGAGTTTGATCCTGGCTCAG

        R.Match:AGAGTTTGATYMTGGCTCAG

Akkermansia_muciniphila_ATCC_BAA-835_NR_074436.1

        F.Match:AGAGTTTGATTCTGGCTCAG

        R.Match:AGAGTTTGATYMTGGCTCAG

Akkermansia_muciniphila_ATCC_BAA-835_NR_042817.1

        F.Match:AGTTGGTGAGGTAACGGCTCA

        R.Match:AGTT--TGA--TYMTGGCTCA

Archaeoglobus_fulgidus_DSM_4304_NR_074334.1

        F.Match:AGAGGTGGTGCATGGCCGCCGTCAG

        R.Match:AGAGTTTGATYMTGGC-----TCAG

Bacteroides_thetaiotaomicron_VPI-5482

        F.Match:TTGATACTGGCTGTCTTGAGTACAG

        R.Match:TTGATYMTGGCT----------CAG

Bacteroides_thetaiotaomicron_VPI-5482_2

        F.Match:TTGATACTGGCTGTCTTGAGTACAG

        R.Match:TTGATYMTGGCT----------CAG

If I am satisfied that the primer is in the right orientation, I use the VERBOSE switch as 3 to generate a tab-delimited file with one record per reference: [REF_NAME, PRIMER_LENGTH, MATCH_LENGTH, MISMATCHES_IN_MATCH_LENGTH, INDELS_IN_MATCH_LENGTH]

./align.sh reference.fasta <(for i in $(seq 1 128); do echo -e ">454_27YMF\nAGAGTTTGATYMTGGCTCAG"; done) 4 4 10 0.5 3

Acidobacterium_capsulatum_ATCC_51196_R_074106.1        20        20        0        0

Akkermansia_muciniphila_ATCC_BAA-835_NR_074436.1        20        20        0        0

Akkermansia_muciniphila_ATCC_BAA-835_NR_042817.1        20        21        1        4

Archaeoglobus_fulgidus_DSM_4304_NR_074334.1        20        25        0        5

Bacteroides_thetaiotaomicron_VPI-5482        20        25        1        10

Bacteroides_thetaiotaomicron_VPI-5482_2        20        25        1        10

Bacteroides_thetaiotaomicron_VPI-5482_4        20        25        1        10

Bacteroides_vulgatus_ATCC_8482        20        19        0        4

Bacteroides_vulgatus_ATCC_8482_2        20        19        0        4

Bacteroides_vulgatus_ATCC_8482_3        20        19        0        4

Bacteroides_vulgatus_ATCC_8482_4        20        19        0        4

Bacteroides_vulgatus_ATCC_8482_5        20        19        0        4

Bordetella_bronchiseptica_strain_RB50        20        20        0        0

Burkholderia1_xenovorans_LB400_chromosome_1_complete_sequence        20        20        0        0

Caldicellulosiruptor_bescii_DSM_6725_chromosome_NC_012034.1_completegenome_2        20        26        1        9

Caldicellulosiruptor_bescii_DSM_6725_NR_074788.1        20        20        0        0

Caldisaccharolyticus_DSM_8903        20        28        0        11

Caldisaccharolyticus_DSM_8903_2        20        28        0        11

Caldisaccharolyticus_DSM_8903_3        20        28        0        11

Chlorobiumlimicola_DSM_245        20        29        0        9

Chlorobiumphaeobacteroides_DSM_266        20        29        0        9

Chlorobiumphaeovibrioides_DSM_265_2        20        29        0        9

Chlorobiumtepidum_TLS        20        29        0        9

Chlorobiumtepidum_TLS_hap1        20        29        0        9

Chloroflexus_aurantiacus_J-10-fl        20        20        0        5

Chloroflexus_aurantiacus_J-10-fl_2        20        20        0        5

Chloroflexus_aurantiacus_J-10-fl_3        20        20        0        5

Chloroflexus_aurantiacus_J-10-fl_NR_043411.1        20        20        0        0

Chloroflexus_aurantiacus_J-10-fl_D38365.1        20        13        1        0

Clostridium_thermocellum_ATCC_27405        20        21        1        4
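The VERBOSE==3 table is straightforward to post-process with awk. A sketch, assuming the output was saved tab-delimited as primer_stats.tsv (a hypothetical file name, here stubbed with three toy rows), that keeps only references whose primer match contains a mismatch or indel:

```shell
# Toy tab-delimited rows standing in for the real VERBOSE==3 output
printf 'RefA\t20\t20\t0\t0\nRefB\t20\t21\t1\t4\nRefC\t20\t25\t0\t5\n' > primer_stats.tsv

# Columns: REF_NAME PRIMER_LENGTH MATCH_LENGTH MISMATCHES INDELS;
# keep only references whose primer match has any mismatch or indel
awk -F"\t" '$4+$5 > 0 {print $1"\t"$4"\t"$5}' primer_stats.tsv
# prints: RefB  1  4
#         RefC  0  5
```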

06,13/07/2015: Tutorial: Benchmarking WGS sequencing datasets

Go through the following links:

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/linux.html 

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/oneliners.html#LOWCOVWGS

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/oneliners.html#LINKAGE

You will also need: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/GENERATEtable.sh 

Here is the tentative workflow

Step 1: Run METAmock

for i in $(ls * -d); do cd $i; /home/opt/METAmock_v0.2/METAmock -q Raw/*_R1_*.fastq -q Raw/*_R2_*.fastq -d /PATH_TO/references.fasta -l /PATH_TO/IDs.csv -o $i -t 10; cd ..; done

Step 2: Get the alignment Stats from each folder

for i in $(ls * -d); do cd $i; samtools sort -m 1000000000 ${i}.bam ${i}.sorted; java -jar $(which CollectAlignmentSummaryMetrics.jar) INPUT=${i}.sorted.bam OUTPUT=${i}_mapped_sorted_alignment_stats.txt REFERENCE_SEQUENCE=/PATH_TO/references.fasta; grep -vi -e "^#" -e "^$" *_mapped_sorted_alignment_stats.txt | awk -F"\t" '{ for (i=1; i<=NF; i++)  {a[NR,i] = $i}}NF>p{p=NF}END{for(j=1;j<=p;j++){str=a[1,j];for(i=2; i<=NR; i++){str=str"\t"a[i,j];} print str}}' | awk 'NR>1{print $0}' > ${i}_ALIGNMENT-STATS.tsv ; cd ..; done

Step 3: Get the WGS Stats from each folder

for i in $(ls * -d); do cd $i; java -Xmx4g -jar /home/opt/picard/dist/picard.jar CollectWgsMetrics INPUT=${i}.sorted.bam REFERENCE_SEQUENCE=/PATH_TO/references.fasta OUTPUT=${i}_mapped_sorted_wgs_stats.txt; awk '/## METRICS CLASS/ {flag=1;next} /^$/{flag=0} flag {print}' *_mapped_sorted_wgs_stats.txt | awk -F"\t" '{ for (i=1; i<=NF; i++)  {a[NR,i] = $i}}NF>p{p=NF}END{for(j=1;j<=p;j++){str=a[1,j];for(i=2; i<=NR; i++){str=str"\t"a[i,j];} print str}}' > ${i}_WGS-STATS.tsv ; cd ..; done
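The long awk one-liner in Steps 2 and 3 simply transposes a tab-delimited table (Picard writes metrics as one header row plus one value row, which is easier to read as key/value pairs). A toy demonstration of the transpose:

```shell
# Toy input: 2 rows x 3 columns, tab-delimited; the transpose yields 3 rows x 2 columns
printf 'h1\th2\th3\nv1\tv2\tv3\n' | \
awk -F"\t" '{for(i=1;i<=NF;i++){a[NR,i]=$i}}NF>p{p=NF}END{for(j=1;j<=p;j++){str=a[1,j];for(i=2;i<=NR;i++){str=str"\t"a[i,j]}print str}}'
# prints: h1  v1
#         h2  v2
#         h3  v3
```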

Step 4: Collate data from each folder:

(for i in $(ls * -d); do cd $i; awk -v k=$(basename ${i}) '{print k"\t"$0}' *_MEAN-GENOME-COVERAGE.tsv; cd ..; done) | ~/bin/GENERATEtable.sh > ../Results/COLLATED_MEAN-GENOME-COVERAGE.tsv

(for i in $(ls * -d); do cd $i; awk -v k=$(basename ${i}) '{print k"\t"$0}' *_PROPORTION-GENOME-COVERED.tsv; cd ..; done) | ~/bin/GENERATEtable.sh > ../Results/COLLATED_PROPORTION-GENOME-COVERED.tsv

(for i in $(ls * -d); do cd $i; awk -v k=$(basename ${i}) '{print k"\t"$0}' *_PROPORTION-GENOME-READS.tsv; cd ..; done) | ~/bin/GENERATEtable.sh > ../Results/COLLATED_PROPORTION-GENOME-READS.tsv

(for i in $(ls * -d); do cd $i; awk -F"\t" -v l=2 -v k=$(basename ${i})  'length($NF)>0{print k"\t"$1"\t"$l}' *_ALIGNMENT-STATS.tsv; cd ..; done) | ~/bin/GENERATEtable.sh > ../Results/COLLATED_STATS-FORWARD.tsv

(for i in $(ls * -d); do cd $i; awk -F"\t" -v l=3 -v k=$(basename ${i})  'length($NF)>0{print k"\t"$1"\t"$l}' *_ALIGNMENT-STATS.tsv; cd ..; done) | ~/bin/GENERATEtable.sh > ../Results/COLLATED_STATS-REVERSE.tsv

(for i in $(ls * -d); do cd $i; awk -F"\t" -v l=4 -v k=$(basename ${i})  'length($NF)>0{print k"\t"$1"\t"$l}' *_ALIGNMENT-STATS.tsv; cd ..; done) | ~/bin/GENERATEtable.sh > ../Results/COLLATED_STATS-PAIRS.tsv

(for i in $(ls * -d); do cd $i; awk -v k=$(basename ${i}) '{print k"\t"$0}' *_WGS-STATS.tsv; cd ..; done) | ~/bin/GENERATEtable.sh > ../Results/COLLATED_WGS-STATS.tsv
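GENERATEtable.sh pivots "Sample[TAB]Feature[TAB]Value" records into a table. A minimal awk sketch of that conversion (not the actual script, which may differ in details) on three toy records:

```shell
# Minimal sketch (not the real GENERATEtable.sh): pivot Sample<TAB>Feature<TAB>Value
# triplets into a features-by-samples table, preserving first-seen order;
# missing (sample,feature) combinations are filled with 0
printf 'S1\tF1\t5\nS1\tF2\t3\nS2\tF1\t7\n' | \
awk -F"\t" '
  !($1 in seen_s){s[++ns]=$1; seen_s[$1]=1}
  !($2 in seen_f){f[++nf]=$2; seen_f[$2]=1}
  {v[$1,$2]=$3}
  END{
    line="Feature"; for(i=1;i<=ns;i++) line=line"\t"s[i]; print line
    for(j=1;j<=nf;j++){
      line=f[j]
      for(i=1;i<=ns;i++) line=line"\t"(((s[i],f[j]) in v) ? v[s[i],f[j]] : 0)
      print line
    }
  }'
# prints: Feature  S1  S2
#         F1       5   7
#         F2       3   0
```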

23/07/2015: Tutorial: Collating replicates from abundance tables using 1 way subject ANOVA - within subjects

We can use a one-way within-subjects ANOVA to collate replicates together:

http://ww2.coastal.edu/kingw/statistics/R-tutorials/repeated.html

http://www.personality-project.org/r/r.anova.html

The species term tests the null hypothesis that there is no species effect across the replicates. If the P-value < 0.05 (the replicates show a consistent species structure), I collate all the replicates together; otherwise I take the replicate with the highest number of reads. Should you desire, you may change the P-value threshold in the script; just set the input and output filenames and change nothing else, as the script is automated.


Test data:

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/data/phylum.csv

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/data/otu_table.csv

Here is the code:


data <- read.csv("phylum.csv", row.names=1, check.names=FALSE)
data <- t(data)
saved_original_data <- data
data_names <- rownames(data)
pattern <- 'S([0-9]+[a-c]*)_([ACGT]+)_([a-d]+)'
site_numbers <- unique(gsub(pattern,'\\1',rownames(data)))

site_names_filtered <- NULL
site_names_singletons <- NULL
nSingletsMCount <- 0
nMcount <- 0
nNotMcount <- 0
ind <- 0
ind2 <- 0
collated_data <- NULL
for (i in 1:length(site_numbers)){
  site_name <- sort(rownames(data)[grep(paste("^S",site_numbers[i],'_',sep=''),rownames(data))])
  if(length(site_name)>1){
    site_data <- t(data[site_name,])
    site_data <- site_data[rowSums(site_data)>0,]
    stacked_variable <- NULL
    for (j in 1:dim(site_data)[2]){
      if (j==1){
        stacked_variable <- data.frame(data.frame(rownames(site_data)),site_data[,j],rep(colnames(site_data)[j],length(site_data[,j])))
      } else {
        stacked_variable <- rbind(stacked_variable,data.frame(data.frame(rownames(site_data)),site_data[,j],rep(colnames(site_data)[j],length(site_data[,j]))))
      }
    }
    rownames(stacked_variable) <- seq(1,dim(stacked_variable)[1])
    colnames(stacked_variable) <- c("species","abundance","replicate")
    aov.out <- aov(abundance ~ species + Error(replicate/species), data=stacked_variable)
    Pvalue <- summary(aov.out)$"Error: replicate:species"[[1]]$"Pr(>F)"[1]
    if(Pvalue<=0.05){
      tmp <- data.frame(colSums(saved_original_data[site_name,]))
      colnames(tmp) <- paste("S",site_numbers[i],sep="")
      if (ind==0){
        collated_data <- data.frame(tmp)
        ind <- ind+1
      } else {
        collated_data <- cbind(collated_data,tmp)
        ind <- ind+1
      }
      #print(site_data)
      cat(paste("Agreeing:",paste(site_name,collapse=' '),",","P-value(AOV):",Pvalue,"\n",sep=""))
      nMcount <- nMcount+1
    } else {
      #case when p-value is greater than the threshold, choose the sample with the biggest colSums
      tmp <- t(saved_original_data[site_name,])
      tmp2 <- colnames(tmp)[colSums(tmp)==max(colSums(tmp))]
      tmp <- t(saved_original_data[colnames(tmp)[colSums(tmp)==max(colSums(tmp))],,drop=F])
      colnames(tmp) <- paste("S",site_numbers[i],sep="")
      if (ind==0){
        collated_data <- data.frame(tmp)
        ind <- ind+1
      } else {
        collated_data <- cbind(collated_data,tmp)
        ind <- ind+1
      }
      #print(site_data)
      cat(paste("Not agreeing:",paste(site_name,collapse=' '),",","P-value(AOV):",Pvalue,"\n",sep=""))
      cat(paste("Choosing:",tmp2,"\n",sep=""))
      nNotMcount <- nNotMcount+1
    }
    #uncomment to get detailed data
    #print(summary(aov.out))
    #print(friedman.test(abundance ~ species | replicate, data=stacked_variable))
  } else {
    tmp <- data.frame(saved_original_data[site_name,])
    colnames(tmp) <- paste("S",site_numbers[i],sep="")
    nSingletsMCount <- nSingletsMCount+1
    if (ind==0){
      site_names_filtered <- data.frame(site_name)
      collated_data <- data.frame(tmp)
      ind <- ind+1
    } else {
      site_names_filtered <- rbind(site_names_filtered,data.frame(site_name))
      collated_data <- cbind(collated_data,tmp)
      ind <- ind+1
    }
    if (ind2==0){
      site_names_singletons <- data.frame(site_name)
      ind2 <- ind2+1
    } else {
      site_names_singletons <- rbind(site_names_singletons,data.frame(site_name))
      ind2 <- ind2+1
    }
  }
}

collated_data <- t(collated_data)

cat("-----SUMMARY STATISTICS------\n")
cat(paste("Total agreeing:",nMcount,"\n",sep=""))
cat(paste("Total not agreeing:",nNotMcount,"\n",sep=""))
cat(paste("Total singletons:",nSingletsMCount,"\n",sep=""))
cat(paste("Samples with singletons:",paste(as.matrix(site_names_singletons),collapse=","),"\n",sep=""))
cat(paste("Total useful samples:",nSingletsMCount+nMcount+nNotMcount,"\n",sep=""))
cat("Saving file\n")
write.csv(file="phylum_collated.csv",t(collated_data))

sumOfSamples <- rowSums(collated_data)[order(rowSums(collated_data))]
filteredSumOfSamples <- sumOfSamples[sumOfSamples<=10000]
cat("Samples<=10000:\n")
print(as.data.frame(filteredSumOfSamples))

Here is the output:

Agreeing:S114_ATCTAGTGGCAA_a S114_ATCTAGTGGCAA_b,P-value(AOV):1.10363444490979e-28

Agreeing:S105_ACACCAACACCA_a S105_ACACCAACACCA_b,P-value(AOV):1.69472703297879e-27

Agreeing:S62_CCAGGGACTTCT_a S62_CCAGGGACTTCT_b,P-value(AOV):4.85584231114742e-23

Agreeing:S340_GGCGTTGCATTC_a S340_GGCGTTGCATTC_b,P-value(AOV):2.17586880516634e-23

Agreeing:S7_CAATCGGCTTGC_a S7_CAATCGGCTTGC_b S7_CAATCGGCTTGC_c,P-value(AOV):1.35017303441747e-18

Agreeing:S267_GCTGTCGTCAAC_a S267_GCTGTCGTCAAC_b,P-value(AOV):8.77346818913736e-23

Agreeing:S338_GAAGACAGCGAC_a S338_GAAGACAGCGAC_b,P-value(AOV):5.42021879153873e-23

Agreeing:S263_ACGGATGTTATG_a S263_ACGGATGTTATG_b,P-value(AOV):1.78763891531268e-09

Agreeing:S146_ATCAGAGCCCAT_a S146_ATCAGAGCCCAT_b,P-value(AOV):1.10170351877425e-26

Agreeing:S120_GGAGAGATCACG_a S120_GGAGAGATCACG_b,P-value(AOV):7.72945501050171e-21

Agreeing:S209_GCACCTGTTGAA_a S209_GCACCTGTTGAA_b,P-value(AOV):4.93682369418985e-22

Agreeing:S109_ATGCCGGTAATA_a S109_ATGCCGGTAATA_b,P-value(AOV):1.24703128226125e-21

Agreeing:S17_TACAGTTACGCG_a S17_TACAGTTACGCG_b,P-value(AOV):3.4701814083317e-18

Agreeing:S158_TGTGTTACTCCT_a S158_TGTGTTACTCCT_b,P-value(AOV):6.52401269086258e-21

Agreeing:S164_CTTGCGGCAATC_a S164_CTTGCGGCAATC_b,P-value(AOV):7.41996268852818e-17

Agreeing:S335_TACCTGTGTCTT_a S335_TACCTGTGTCTT_b,P-value(AOV):1.62623721046612e-20

Agreeing:S159_GGTACCTGCAAT_a S159_GGTACCTGCAAT_b,P-value(AOV):8.31529544011585e-28

Agreeing:S331_TGGCGATACGTT_a S331_TGGCGATACGTT_b,P-value(AOV):3.72603460710159e-16

Agreeing:S24_CCTCGATGCAGT_a S24_CCTCGATGCAGT_b,P-value(AOV):2.60394803672866e-20

Agreeing:S98_ACAGGGTTTGTA_a S98_ACAGGGTTTGTA_b,P-value(AOV):1.15321890244556e-21

Agreeing:S20_AGTCAGTCAG_c S20_GTCATAAGAACC_a S20_GTCATAAGAACC_b,P-value(AOV):4.39035597596391e-23

Agreeing:S69_GTTTGGCCACAC_a S69_GTTTGGCCACAC_b,P-value(AOV):4.82527212623034e-19

Agreeing:S276_AGCGGCCTATTA_a S276_AGCGGCCTATTA_b,P-value(AOV):6.7846566275245e-26

Agreeing:S144_CGAATGAGTCAT_a S144_CGAATGAGTCAT_b,P-value(AOV):1.01142389432462e-25

Agreeing:S148_CCGACTCTAGGT_a S148_CCGACTCTAGGT_b,P-value(AOV):9.71366055497358e-20

Agreeing:S179_CTGCATACTGAG_a S179_CTGCATACTGAG_b,P-value(AOV):1.08141581297951e-18

Agreeing:S18_CAAGCCCTAGTA_a S18_CAAGCCCTAGTA_b,P-value(AOV):1.09514570440322e-20

Agreeing:S270_ATAGCTTCGTGG_a S270_ATAGCTTCGTGG_b,P-value(AOV):7.84435472307769e-21

Not agreeing:S275_GCAAATCAGCCT_a S275_GCAAATCAGCCT_b S275_TTGCCCTTTGAT_c,P-value(AOV):0.48323844949482

Choosing:S275_TTGCCCTTTGAT_c

Agreeing:S598_CTTTCGTTCAAC_a S598_CTTTCGTTCAAC_b S598_CTTTCGTTCAAC_c,P-value(AOV):1.28909318308998e-17

Agreeing:S16_GTAGACATGTGT_a S16_GTAGACATGTGT_b,P-value(AOV):1.8703806434498e-19

Agreeing:S93_TATGGTACCCAG_a S93_TATGGTACCCAG_b,P-value(AOV):4.16070250510679e-22

Agreeing:S160_TCGCCTATAAGG_a S160_TCGCCTATAAGG_b,P-value(AOV):1.22187584866151e-19

Agreeing:S150_GACAACGAATCT_a S150_GACAACGAATCT_b,P-value(AOV):4.68517948134122e-24

Agreeing:S110_GAACAGCTCTAC_a S110_GAACAGCTCTAC_b,P-value(AOV):1.07674176086192e-22

Agreeing:S278_TGGAATTCGGCT_a S278_TGGAATTCGGCT_b,P-value(AOV):2.81125472314319e-22

Agreeing:S113_TAGAGCTGCCAT_a S113_TAGAGCTGCCAT_b S113_TAGAGCTGCCAT_c,P-value(AOV):1.40370103798004e-10

Agreeing:S19_TAGTGTCGGATC_a S19_TAGTGTCGGATC_b,P-value(AOV):3.36123866707943e-26

Not agreeing:S310_ACTAGTTGGACC_d S310_TAGGCTCGTGCT_a S310_TAGGCTCGTGCT_b S310_TAGGCTCGTGCT_c,P-value(AOV):0.404561150051891

Choosing:S310_ACTAGTTGGACC_d

Agreeing:S222_TCAGCGCCGTTA_a S222_TCAGCGCCGTTA_b,P-value(AOV):1.29835699624235e-19

Agreeing:S145_CAACGCTAGAAT_a S145_CAACGCTAGAAT_b,P-value(AOV):1.41795211014616e-23

Agreeing:S277_TCTTCAACTACC_a S277_TCTTCAACTACC_b,P-value(AOV):1.4943411097191e-23

Agreeing:S596_GTGTCCGGATTC_a S596_GTGTCCGGATTC_b,P-value(AOV):3.43929526342876e-16

Agreeing:S128_AATGCAATGCGT_a S128_AATGCAATGCGT_b,P-value(AOV):8.70191905422214e-19

Agreeing:S44_TGGTTATGGCAC_a S44_TGGTTATGGCAC_b,P-value(AOV):1.41496472054771e-20

Agreeing:S25_GCGGACTATTCA_a S25_GCGGACTATTCA_b,P-value(AOV):1.1968364800017e-22

Agreeing:S157_CATAAGGGAGGC_a S157_CATAAGGGAGGC_b,P-value(AOV):8.26501120940695e-18

Agreeing:S238_GGCCCAATATAA_a S238_GGCCCAATATAA_b,P-value(AOV):4.34402817685489e-43

Agreeing:S214_TACGGCAGTTCA_a S214_TACGGCAGTTCA_b,P-value(AOV):1.8917485539528e-23

Agreeing:S112_TGGCCGTTACTG_a S112_TGGCCGTTACTG_b,P-value(AOV):3.97704504411945e-19

Agreeing:S66_GTTTCACGCGAA_a S66_GTTTCACGCGAA_b,P-value(AOV):8.37580993778414e-23

Agreeing:S111_GTGAGTCATACC_a S111_GTGAGTCATACC_b,P-value(AOV):8.17167372593148e-19

Not agreeing:S264_CCACGGTACTTG_d S264_TTGAGGCTACAA_a S264_TTGAGGCTACAA_b S264_TTGAGGCTACAA_c,P-value(AOV):0.37389035048679

Choosing:S264_CCACGGTACTTG_d

Agreeing:S182_AGTACGCAGTCT_a S182_AGTACGCAGTCT_b,P-value(AOV):6.2676379898768e-24

Not agreeing:S8_AACACTCGATCG_a S8_AACACTCGATCG_b S8_AACACTCGATCG_c S8_GGCACACCCTTA_d,P-value(AOV):0.237216235444762

Choosing:S8_GGCACACCCTTA_d

Agreeing:S154_GCCGGTACTCTA_a S154_GCCGGTACTCTA_b,P-value(AOV):3.90721811896393e-23

Agreeing:S208_AGATGTCCGTCA_a S208_AGATGTCCGTCA_b,P-value(AOV):3.63934941379156e-26

Agreeing:S51_CATACACGCACC_a S51_CATACACGCACC_b,P-value(AOV):1.42499369436263e-22

Agreeing:S595_CTATCCAAGTGG_a S595_CTATCCAAGTGG_b,P-value(AOV):1.07529012379713e-17

Agreeing:S205_CTCGATGTAAGC_a S205_CTCGATGTAAGC_b,P-value(AOV):1.8675631444141e-22

Agreeing:S50_TAGCGCGAACTT_a S50_TAGCGCGAACTT_b,P-value(AOV):8.31912255923976e-19

Agreeing:S599_CCGAAGATTCTG_a S599_CCGAAGATTCTG_b S599_CCGAAGATTCTG_c S599_CCGAAGATTCTG_d,P-value(AOV):1.63957083347195e-09

Agreeing:S172_GCACAAGGCAAG_a S172_GCACAAGGCAAG_b,P-value(AOV):5.69932441798279e-21

Agreeing:S67_ACAAGAACCTTG_a S67_ACAAGAACCTTG_b,P-value(AOV):2.49931279228635e-19

Agreeing:S177_CGTTCTGGTGGT_a S177_CGTTCTGGTGGT_b,P-value(AOV):1.25315446313891e-22

Agreeing:S207_ATACGCATCAAG_a S207_ATACGCATCAAG_b,P-value(AOV):1.37520363329658e-19

Agreeing:S211_GAGGTTCTTGAC_a S211_GAGGTTCTTGAC_b,P-value(AOV):5.5958738662613e-22

Agreeing:S610_TGTCCGTGGATC_a S610_TGTCCGTGGATC_b,P-value(AOV):4.77420080709932e-20

Agreeing:S101_ATCGCTTAAGGC_a S101_ATCGCTTAAGGC_b,P-value(AOV):1.10581730679513e-16

Agreeing:S96_ACGTGGTTCCAC_a S96_ACGTGGTTCCAC_b,P-value(AOV):2.99124383967741e-22

Agreeing:S339_ACACCTGCGATC_a S339_ACACCTGCGATC_b,P-value(AOV):3.52978139775741e-21

Agreeing:S60_TCGAGCCGATCT_a S60_TCGAGCCGATCT_b S60_TCGAGCCGATCT_c,P-value(AOV):1.01712874065783e-18

Agreeing:S165_TGAGGTTTGATG_a S165_TGAGGTTTGATG_b,P-value(AOV):1.53558615617393e-17

Agreeing:S75_AGAGTGCTAATC_a S75_AGAGTGCTAATC_b,P-value(AOV):4.47408930598287e-23

Agreeing:S23_CCTCTGAGAGCT_a S23_CCTCTGAGAGCT_b S23_CCTCTGAGAGCT_c S23_CCTCTGAGAGCT_d,P-value(AOV):1.50704548702783e-12

Agreeing:S611_ACTCGGCCAACT_a S611_ACTCGGCCAACT_b S611_ACTCGGCCAACT_c,P-value(AOV):2.97011944765426e-18

Agreeing:S119_AAGAGCAGAGCC_a S119_AAGAGCAGAGCC_b,P-value(AOV):6.60332878212137e-24

Agreeing:S13_AACCCAGATGAT_a S13_AACCCAGATGAT_b,P-value(AOV):1.93641743842558e-21

Agreeing:S316_CAATTCTGCTTC_a S316_CAATTCTGCTTC_b,P-value(AOV):1.60346866019032e-26

Agreeing:S54_CGCCGGTAATCT_a S54_CGCCGGTAATCT_b,P-value(AOV):6.19027789530214e-22

Agreeing:S52_ACCTCAGTCAAG_a S52_ACCTCAGTCAAG_b,P-value(AOV):3.61454970667176e-22

Agreeing:S228_ACCTTGACAAGA_a S228_ACCTTGACAAGA_b,P-value(AOV):1.36496697237766e-26

Agreeing:S49_TGAGTGGTCTGT_a S49_TGAGTGGTCTGT_b,P-value(AOV):2.82816972057331e-19

Agreeing:S333_GCCTTACGATAG_a S333_GCCTTACGATAG_b,P-value(AOV):5.3537911399645e-20

Agreeing:S314_CTCCTTAAGGCG_a S314_CTCCTTAAGGCG_b,P-value(AOV):3.49025161642979e-26

Agreeing:S53_ATATCGCGATGA_a S53_ATATCGCGATGA_b,P-value(AOV):2.25947052885169e-21

Not agreeing:S601_ATGCGAGACTTC_d S601_GAAGTAGCGAGC_a S601_GAAGTAGCGAGC_b S601_GAAGTAGCGAGC_c,P-value(AOV):0.313338785879526

Choosing:S601_ATGCGAGACTTC_d

Agreeing:S221_TAGAGGCGTAGG_a S221_TAGAGGCGTAGG_b S221_TAGAGGCGTAGG_c,P-value(AOV):1.15974384440273e-18

Agreeing:S609_CATCTGGGCAAT_a S609_CATCTGGGCAAT_b,P-value(AOV):3.65611520799012e-28

Agreeing:S115_CCTTCAATGGGA_a S115_CCTTCAATGGGA_b,P-value(AOV):3.71307499310189e-25

Agreeing:S593_GTACTACCTCGG_a S593_GTACTACCTCGG_b,P-value(AOV):1.48285490836204e-19

Agreeing:S268_AAGCTTGAAACC_a S268_AAGCTTGAAACC_b,P-value(AOV):6.17177881697906e-20

Agreeing:S594_TTCCTGTTAACC_a S594_TTCCTGTTAACC_b,P-value(AOV):2.63790269366263e-17

Agreeing:S317_ACTGGCAAACCT_a S317_ACTGGCAAACCT_b,P-value(AOV):2.37247880581094e-22

Agreeing:S305_GTGTGTGCCATA_a S305_GTGTGTGCCATA_b,P-value(AOV):1.35115352301762e-27

Not agreeing:S152_TGAGAAGAAAGG_a S152_TGAGAAGAAAGG_b S152_TGAGAAGAAAGG_c S152_TGGTTCATCCTT_d,P-value(AOV):0.384174667227471

Choosing:S152_TGGTTCATCCTT_d

Agreeing:S91_CACTAACAAACG_a S91_CACTAACAAACG_b,P-value(AOV):7.69808767340533e-20

Agreeing:S97_GACGCTTTGCTG_a S97_GACGCTTTGCTG_b,P-value(AOV):1.28445128929349e-16

Agreeing:S5_GATGACCCAAAT_a S5_GATGACCCAAAT_b,P-value(AOV):7.89588918663985e-25

Agreeing:S262_TTCTAGAGTGCG_a S262_TTCTAGAGTGCG_b,P-value(AOV):2.74081895557288e-21

Agreeing:S307_TGACAACCGAAT_a S307_TGACAACCGAAT_b,P-value(AOV):7.80377135703389e-26

Agreeing:S174_GCGAGTTCCTGT_a S174_GCGAGTTCCTGT_b,P-value(AOV):6.76644360742895e-20

Agreeing:S279_TAAGATGCAGTC_a S279_TAAGATGCAGTC_b,P-value(AOV):7.11766462068138e-24

Agreeing:S200_TTCCCGAAACGA_a S200_TTCCCGAAACGA_b,P-value(AOV):2.65611374799444e-26

Agreeing:S218_TGATAGGTACAC_a S218_TGATAGGTACAC_b S218_TGATAGGTACAC_c,P-value(AOV):1.05587388646907e-17

Agreeing:S166_ATTGCTGGTCGA_a S166_ATTGCTGGTCGA_b,P-value(AOV):1.0521991886246e-22

Agreeing:S48_ACGCACATACAA_a S48_ACGCACATACAA_b,P-value(AOV):2.24928460079523e-25

Agreeing:S4_CTAGGATCACTG_a S4_CTAGGATCACTG_b,P-value(AOV):8.59655742356576e-20

Agreeing:S153_TCGGATCTGTGA_a S153_TCGGATCTGTGA_b S153_TCGGATCTGTGA_c,P-value(AOV):3.28416521942955e-19

Agreeing:S337_AACGAGGCAACG_a S337_AACGAGGCAACG_b,P-value(AOV):7.66619842942931e-21

Agreeing:S199_CTCGGATAGATC_a S199_CTCGGATAGATC_b,P-value(AOV):2.25653276938018e-19

Agreeing:S99_GCCTATGAGATC_a S99_GCCTATGAGATC_b,P-value(AOV):1.69377578902865e-20

Agreeing:S57_TACGCAGCACTA_a S57_TACGCAGCACTA_b,P-value(AOV):1.27640815322238e-20

Agreeing:S108_GAACCTATGACA_a S108_GAACCTATGACA_b,P-value(AOV):4.05591368514948e-24

Agreeing:S74_AGGGTACAGGGT_a S74_AGGGTACAGGGT_b S74_AGGGTACAGGGT_c,P-value(AOV):1.85695010702823e-15

Agreeing:S9_TGACCGGCTGTT_a S9_TGACCGGCTGTT_b,P-value(AOV):1.97650519492946e-20

Agreeing:S233_CATAGCTCGGTC_a S233_CATAGCTCGGTC_b,P-value(AOV):9.74367870041069e-22

Agreeing:S68_CGAAGCATCTAC_a S68_CGAAGCATCTAC_b,P-value(AOV):4.74483805590154e-22

Agreeing:S203_GATGGACTTCAA_a S203_GATGGACTTCAA_b,P-value(AOV):1.83243825677876e-27

Agreeing:S100_CAAACCTATGGC_a S100_CAAACCTATGGC_b,P-value(AOV):1.59117924553245e-13

Agreeing:S163_GTGTGCTAACGT_a S163_GTGTGCTAACGT_b S163_GTGTGCTAACGT_c,P-value(AOV):4.98743306630963e-21

Not agreeing:S21_GTCCGCAAGTTA_a S21_GTCCGCAAGTTA_b S21_GTCCGCAAGTTA_c S21_TCGCGCAACTGT_d,P-value(AOV):0.328846893134768

Choosing:S21_TCGCGCAACTGT_d

Agreeing:S210_CCTAGAGAAACT_a S210_CCTAGAGAAACT_b,P-value(AOV):2.91894897004346e-24

Agreeing:S71_CTATGCCGGCTA_a S71_CTATGCCGGCTA_b S71_CTATGCCGGCTA_c,P-value(AOV):4.82240421341684e-22

Agreeing:S201_GAACTTTAGCGC_a S201_GAACTTTAGCGC_b,P-value(AOV):8.36056151205511e-16

Agreeing:S183_AGCAGCTATTGC_a S183_AGCAGCTATTGC_b,P-value(AOV):4.35220557793536e-20

Agreeing:S202_TCCTTAGAAGGC_a S202_TCCTTAGAAGGC_b,P-value(AOV):2.15336260197834e-22

Agreeing:S603_GCGGAAACATGG_a S603_GCGGAAACATGG_b,P-value(AOV):2.26130610077941e-18

Agreeing:S239_TTGTATGACAGG_a S239_TTGTATGACAGG_b,P-value(AOV):5.2105587320453e-23

Agreeing:S121_TCAACCCGTGAA_a S121_TCAACCCGTGAA_b,P-value(AOV):4.54247436348508e-24

Agreeing:S167_AAGAAGCCGGAC_a S167_AAGAAGCCGGAC_b,P-value(AOV):7.33662539991922e-18

Agreeing:S124_TCGCCAGTGCAT_a S124_TCGCCAGTGCAT_b,P-value(AOV):1.04401490401026e-23

Agreeing:S213_TGAGTCATTGAG_a S213_TGAGTCATTGAG_b,P-value(AOV):6.02540874572956e-16

Agreeing:S2_CACGTGACATGT_a S2_CACGTGACATGT_b,P-value(AOV):8.93646776419316e-21

Agreeing:S118_GGCTAAACTATG_a S118_GGCTAAACTATG_b,P-value(AOV):2.03372315705647e-23

Agreeing:S606_ACGTAACCACGT_a S606_ACGTAACCACGT_b,P-value(AOV):3.06388835246342e-19

Agreeing:S117_ACATACTGAGCA_a S117_ACATACTGAGCA_b,P-value(AOV):1.36745991924725e-25

Agreeing:S215_TGCACAGTCGCT_a S215_TGCACAGTCGCT_b,P-value(AOV):5.51380115430898e-20

Agreeing:S236_ACTACCTCTTCA_a S236_ACTACCTCTTCA_b,P-value(AOV):3.20773289478923e-25

Agreeing:S26_CGTGCACAATTG_a S26_CGTGCACAATTG_b,P-value(AOV):2.80251036938628e-20

Agreeing:S58_CGCTTAGTGCTG_a S58_CGCTTAGTGCTG_b,P-value(AOV):7.92227294549183e-21

Agreeing:S171_AGATCTATGCAG_a S171_AGATCTATGCAG_b,P-value(AOV):1.77959107551784e-22

Agreeing:S15_AACAAACTGCCA_a S15_AACAAACTGCCA_b,P-value(AOV):4.21267438307255e-25

Agreeing:S46_TGCTACAGACGT_a S46_TGCTACAGACGT_b,P-value(AOV):7.43398346787267e-18

Agreeing:S272_AGTCATCGAATG_a S272_AGTCATCGAATG_b,P-value(AOV):7.56803693395758e-21

Agreeing:S318_AATCAGAGCTTG_a S318_AATCAGAGCTTG_b,P-value(AOV):1.75042559978794e-20

Not agreeing:S600_GATCAACCCACA_d S600_GTTGGCGTTACA_a S600_GTTGGCGTTACA_b S600_GTTGGCGTTACA_c,P-value(AOV):0.239166243974164

Choosing:S600_GATCAACCCACA_d

Agreeing:S169_AAGAGTCTCTAG_a S169_AAGAGTCTCTAG_b,P-value(AOV):9.14127467756196e-19

Agreeing:S14_GATATACCAGTG_a S14_GATATACCAGTG_b,P-value(AOV):2.36575450796279e-19

Agreeing:S176_TACCTAGTGAGA_a S176_TACCTAGTGAGA_b,P-value(AOV):1.44909292991283e-25

Agreeing:S155_CACAGGATTACC_a S155_CACAGGATTACC_b,P-value(AOV):1.48607061506847e-24

Agreeing:S608_TCTAACGAGTGC_a S608_TCTAACGAGTGC_b,P-value(AOV):3.2070141860643e-22

Agreeing:S273_ATCTTGGAGTCG_a S273_ATCTTGGAGTCG_b,P-value(AOV):4.94538431379508e-24

Agreeing:S65_TCGGCGATCATC_a S65_TCGGCGATCATC_b,P-value(AOV):3.46472033220322e-24

Agreeing:S235_TATGGAGCTAGT_a S235_TATGGAGCTAGT_b,P-value(AOV):3.56874509172999e-18

Agreeing:S274_AGCACCGGTCTT_a S274_AGCACCGGTCTT_b,P-value(AOV):5.144204146857e-21

Agreeing:S212_CTGTAAAGGTTG_a S212_CTGTAAAGGTTG_b,P-value(AOV):1.64538716903115e-21

Agreeing:S607_GTCGGAAATTGT_a S607_GTCGGAAATTGT_b,P-value(AOV):4.42067524549004e-16

Agreeing:S237_GATGATAACCCA_a S237_GATGATAACCCA_b,P-value(AOV):3.16165505648597e-25

Agreeing:S181_GTCAATTAGTGG_a S181_GTCAATTAGTGG_b,P-value(AOV):2.36440463167437e-20

Agreeing:S645_GTAGCACTCATG_a S645_GTAGCACTCATG_b,P-value(AOV):2.28921728328103e-21

Agreeing:S59_CAAAGTTTGCGA_a S59_CAAAGTTTGCGA_b S59_CAAAGTTTGCGA_c,P-value(AOV):6.68811907584755e-17

Agreeing:S73_TGTACCAACCGA_a S73_TGTACCAACCGA_b S73_TGTACCAACCGA_c,P-value(AOV):8.05146411798181e-08

Agreeing:S230_GTAACCACCACC_a S230_GTAACCACCACC_b,P-value(AOV):2.64503415479883e-20

Agreeing:S55_CCGATGCCTTGA_a S55_CCGATGCCTTGA_b,P-value(AOV):2.4665833459163e-19

Agreeing:S95_CTTGGAGGCTTA_a S95_CTTGGAGGCTTA_b,P-value(AOV):5.93103444077784e-23

Agreeing:S257_CTACCACGGTAC_a S257_CTACCACGGTAC_b,P-value(AOV):3.99477585919988e-20

Agreeing:S123_AGAGAGACAGGT_a S123_AGAGAGACAGGT_b,P-value(AOV):3.72942037083891e-20

Agreeing:S92_TTCCAGGCAGAT_a S92_TTCCAGGCAGAT_b,P-value(AOV):2.18201434298289e-20

Agreeing:S302_TGCAAGCTAAGT_a S302_TGCAAGCTAAGT_b,P-value(AOV):3.07002513280741e-20

Agreeing:S151_TGCGGTTGACTC_a S151_TGCGGTTGACTC_b,P-value(AOV):2.22754044750144e-18

Agreeing:S43_AGCGCTCACATC_a S43_AGCGCTCACATC_b,P-value(AOV):1.69084184328418e-21

Agreeing:S147_TCTGTAGAGCCA_a S147_TCTGTAGAGCCA_b,P-value(AOV):1.35139798807719e-24

Not agreeing:S11_GTCCAGCTATGA_d S11_TGGAAGAACGGC_a S11_TGGAAGAACGGC_b S11_TGGAAGAACGGC_c,P-value(AOV):0.361232106017254

Choosing:S11_GTCCAGCTATGA_d

Agreeing:S180_CGATGAATATCG_a S180_CGATGAATATCG_b,P-value(AOV):5.73679005877957e-19

Agreeing:S178_TTGGTCTCCTCT_a S178_TTGGTCTCCTCT_b,P-value(AOV):7.4780450025441e-26

Agreeing:S70_TGACGTAGAACT_a S70_TGACGTAGAACT_b S70_TGACGTAGAACT_c,P-value(AOV):2.00793745874956e-32

Agreeing:S206_AGCTTCGACAGT_a S206_AGCTTCGACAGT_b,P-value(AOV):2.04627489976823e-20

Agreeing:S122_GTTTGAAACACG_a S122_GTTTGAAACACG_b,P-value(AOV):2.56591805799559e-21

Agreeing:S220_AAGCAGATTGTC_a S220_AAGCAGATTGTC_b S220_AAGCAGATTGTC_c S220_AAGCAGATTGTC_d,P-value(AOV):1.80847730693186e-08

Agreeing:S156_CGATATCAGTAG_a S156_CGATATCAGTAG_b,P-value(AOV):8.22632294920611e-19

Agreeing:S590_TGCGAGTATATG_a S590_TGCGAGTATATG_b,P-value(AOV):2.97268435111152e-21

Agreeing:S94_CACGACTTGACA_a S94_CACGACTTGACA_b,P-value(AOV):7.92436344033422e-23

Agreeing:S27_CGGCCTAAGTTC_a S27_CGGCCTAAGTTC_b S27_CGGCCTAAGTTC_c,P-value(AOV):1.46309621452722e-15

Agreeing:S271_CGGGATCAAATT_a S271_CGGGATCAAATT_b,P-value(AOV):1.68297364026065e-34

Agreeing:S170_TCCGTCATGGGT_a S170_TCCGTCATGGGT_b,P-value(AOV):1.59288723013609e-17

Agreeing:S125_GCTCAGGACTCT_a S125_GCTCAGGACTCT_b S125_GCTCAGGACTCT_c S125_GCTCAGGACTCT_d,P-value(AOV):1.83943996614536e-10

Agreeing:S63_GGCCTATAAGTC_a S63_GGCCTATAAGTC_b,P-value(AOV):4.66017997367674e-21

Agreeing:S315_TTGCCTGGGTCA_a S315_TTGCCTGGGTCA_b,P-value(AOV):6.59883718060443e-21

Agreeing:S61_CTCATCATGTTC_a S61_CTCATCATGTTC_b,P-value(AOV):1.32235943812833e-18

Agreeing:S76_TTGGCGGGTTAT_a S76_TTGGCGGGTTAT_b,P-value(AOV):9.85847103129569e-23

Agreeing:S106_CCATCACATAGG_a S106_CCATCACATAGG_b,P-value(AOV):3.83490347399855e-19

Agreeing:S107_CGACACGGAGAA_a S107_CGACACGGAGAA_b,P-value(AOV):5.31905332799461e-22

Agreeing:S224_GTCAACGCTGTC_a S224_GTCAACGCTGTC_b,P-value(AOV):9.85160195235721e-17

Agreeing:S217_TGCTCCGTAGAA_a S217_TGCTCCGTAGAA_b,P-value(AOV):4.22592979783409e-23

Agreeing:S12_GCTAGACACTAC_a S12_GCTAGACACTAC_b S12_GCTAGACACTAC_c,P-value(AOV):3.69698247099181e-19

Agreeing:S265_GTAGGAACCGGA_a S265_GTAGGAACCGGA_b,P-value(AOV):2.05748796000816e-21

Agreeing:S591_TACCACAACGAA_a S591_TACCACAACGAA_b,P-value(AOV):2.13958237364966e-27

Agreeing:S225_TGCCGAGTAATC_a S225_TGCCGAGTAATC_b,P-value(AOV):2.65703388268098e-24

Agreeing:S3_CACAGTTGAAGT_a S3_CACAGTTGAAGT_b,P-value(AOV):1.92935641266634e-16

Not agreeing:S635_TGAGACCCTACA_c S635_TTGCGACAAAGT_a S635_TTGCGACAAAGT_b,P-value(AOV):0.468564208680798

Choosing:S635_TGAGACCCTACA_c

Agreeing:S602_TTGCGGACCCTA_a S602_TTGCGGACCCTA_b,P-value(AOV):1.66828461706143e-23

Agreeing:S47_ATGGCCTGACTA_a S47_ATGGCCTGACTA_b,P-value(AOV):2.94544922383646e-25

Agreeing:S173_CGGCAAACACTT_a S173_CGGCAAACACTT_b,P-value(AOV):3.03558501049155e-23

Agreeing:S261_GTACATGTCGCC_a S261_GTACATGTCGCC_b,P-value(AOV):1.20899516014437e-22

Agreeing:S204_TACTGAGCCTCG_a S204_TACTGAGCCTCG_b,P-value(AOV):2.21193886546125e-23

Agreeing:S45_TTGCACCGTCGA_a S45_TTGCACCGTCGA_b,P-value(AOV):3.08485883299605e-22

Agreeing:S6_TGAGGACTACCT_a S6_TGAGGACTACCT_b,P-value(AOV):6.07616313172193e-19

Agreeing:S612_GTTGGTTGGCAT_a S612_GTTGGTTGGCAT_b S612_GTTGGTTGGCAT_c,P-value(AOV):5.10586446980732e-11

Agreeing:S149_ATCCTACGAGCA_a S149_ATCCTACGAGCA_b,P-value(AOV):3.8296794744166e-17

Agreeing:S259_CGGTCTGTCTGA_a S259_CGGTCTGTCTGA_b,P-value(AOV):2.67591283555001e-21

Agreeing:S90_GTCACCAATCCG_a S90_GTCACCAATCCG_b,P-value(AOV):1.08944480173919e-19

Not agreeing:S22_ATTCCTCTCCAC_d S22_CGTAGAGCTCTC_a S22_CGTAGAGCTCTC_b S22_CGTAGAGCTCTC_c,P-value(AOV):0.385904787734997

Choosing:S22_ATTCCTCTCCAC_d

Not agreeing:S126_CACTTTGGGTGC_a S126_CACTTTGGGTGC_b S126_CCTGGAATTAAG_c,P-value(AOV):0.483595723285709

Choosing:S126_CCTGGAATTAAG_c

Agreeing:S613_TTCCACACGTGG_a S613_TTCCACACGTGG_b S613_TTCCACACGTGG_c,P-value(AOV):7.83293108132882e-22

Agreeing:S1_ATTGCAAGCAAC_a S1_ATTGCAAGCAAC_b,P-value(AOV):1.73938898633643e-24

Agreeing:S56_AGCAGGCACGAA_a S56_AGCAGGCACGAA_b,P-value(AOV):6.92041664175777e-22

Agreeing:S77_CACGATGGTCAT_a S77_CACGATGGTCAT_b,P-value(AOV):4.34346163724682e-28

Agreeing:S127_TCTAGCCTGGCA_a S127_TCTAGCCTGGCA_b,P-value(AOV):2.70744473545247e-23

Agreeing:S168_ACGGGATACAGG_a S168_ACGGGATACAGG_b,P-value(AOV):3.3521133536774e-19

Agreeing:S161_AGTGGCACTATC_a S161_AGTGGCACTATC_b S161_AGTGGCACTATC_c S161_AGTGGCACTATC_d,P-value(AOV):5.14544139494881e-09

Agreeing:S72_GTGGTATGGGAG_a S72_GTGGTATGGGAG_b S72_GTGGTATGGGAG_c,P-value(AOV):1.4443144544277e-15

Agreeing:S592_TCTGGAACGGTT_a S592_TCTGGAACGGTT_b,P-value(AOV):6.1343215207246e-22

Agreeing:S604_AACGTTAGTGTG_a S604_AACGTTAGTGTG_b S604_AACGTTAGTGTG_c S604_AACGTTAGTGTG_d,P-value(AOV):2.55896383208538e-09

Agreeing:S597_TGTGGTGATGTA_a S597_TGTGGTGATGTA_b S597_TGTGGTGATGTA_c,P-value(AOV):9.57565664673033e-09

Agreeing:S102_ACCATCCAACGA_a S102_ACCATCCAACGA_b,P-value(AOV):3.84470657591711e-22

Not agreeing:S162_AGCACTTTGAGA_d S162_TAACCCGATAGA_a S162_TAACCCGATAGA_b S162_TAACCCGATAGA_c,P-value(AOV):0.300561856124944

Choosing:S162_AGCACTTTGAGA_d

Agreeing:S223_TAGACCGACTCC_a S223_TAGACCGACTCC_b,P-value(AOV):3.91323739073669e-15

Agreeing:S255_GGTAAGTTTGAC_a S255_GGTAAGTTTGAC_b,P-value(AOV):6.58578172725711e-22

Agreeing:S175_TTCCGAATCGGC_a S175_TTCCGAATCGGC_b,P-value(AOV):9.90068451002601e-24

Agreeing:S219_CGAGTTCATCGA_a S219_CGAGTTCATCGA_b S219_CGAGTTCATCGA_c S219_CGAGTTCATCGA_d,P-value(AOV):7.15397821506948e-10

Agreeing:S605_TGCATGACAGTC_a S605_TGCATGACAGTC_b,P-value(AOV):1.59347833314137e-21

Agreeing:S103_GCAATAGGAGGA_a S103_GCAATAGGAGGA_b,P-value(AOV):8.1062036341085e-21

Agreeing:S329_CAATGTAGACAC_a S329_CAATGTAGACAC_b,P-value(AOV):7.36769130134158e-25

Agreeing:S104_CCGAACGTCACT_a S104_CCGAACGTCACT_b,P-value(AOV):3.48665616092094e-21

Agreeing:S216_CATGCGGATCCT_a S216_CATGCGGATCCT_b S216_CATGCGGATCCT_c,P-value(AOV):1.87214915620237e-30

Agreeing:S266_ACAGGAGGGTGT_a S266_ACAGGAGGGTGT_b,P-value(AOV):2.0608320833006e-20

Agreeing:S269_TAAGCGTCTCGA_a S269_TAAGCGTCTCGA_b,P-value(AOV):5.38115747056129e-21

Agreeing:S64_TCCATTTCATGC_a S64_TCCATTTCATGC_b,P-value(AOV):7.21388423103848e-22

Agreeing:S10_CTTCCCTAACTC_a S10_CTTCCCTAACTC_b S10_CTTCCCTAACTC_c,P-value(AOV):1.16382520010055e-12

Not agreeing:S287_TAAACGCGACTC_c S287_TAAACGCGACTC_d,P-value(AOV):0.166769509999676

Choosing:S287_TAAACGCGACTC_c

Not agreeing:S291_CCTCGGGTACTA_c S291_CCTCGGGTACTA_d,P-value(AOV):0.169313029972582

Choosing:S291_CCTCGGGTACTA_c

Not agreeing:S293_TTCACCTGTATC_c S293_TTCACCTGTATC_d,P-value(AOV):0.388169216672022

Choosing:S293_TTCACCTGTATC_c

Not agreeing:S359_TCAGACCAACTG_c S359_TCAGACCAACTG_d,P-value(AOV):0.294005424938758

Choosing:S359_TCAGACCAACTG_c

Not agreeing:S360_AGTGATGTGACT_c S360_AGTGATGTGACT_d,P-value(AOV):0.210725124555939

Choosing:S360_AGTGATGTGACT_c

Not agreeing:S362_CGCTTGTGTAGC_d S362_CTTAGCTACTCT_c,P-value(AOV):0.243888605641639

Choosing:S362_CGCTTGTGTAGC_d

Not agreeing:S363_ATGAATGCGTCC_d S363_TCGGTCCATAGC_c,P-value(AOV):0.437591391624287

Choosing:S363_ATGAATGCGTCC_d

Not agreeing:S367_CACGTTTATTCC_c S367_GACTCTGCTCAG_d,P-value(AOV):0.48679746745624

Choosing:S367_GACTCTGCTCAG_d

Not agreeing:S368_CACGTACACGTA_d S368_GAAACGGAAACG_c,P-value(AOV):0.487324033097249

Choosing:S368_CACGTACACGTA_d

Agreeing:S405_GCAAGTGTGAGG_c S405_GCAAGTGTGAGG_d,P-value(AOV):1.07635601878665e-05

Not agreeing:S453_TCTGGGCATTGA_c S453_TCTGGGCATTGA_d,P-value(AOV):0.0563108726594252

Choosing:S453_TCTGGGCATTGA_c

Not agreeing:S511_AATATCGGGATC_c S511_AATATCGGGATC_d,P-value(AOV):0.257891025858545

Choosing:S511_AATATCGGGATC_c

Not agreeing:S513_TCAATGACCGCA_c S513_TCAATGACCGCA_d,P-value(AOV):0.18351878090285

Choosing:S513_TCAATGACCGCA_c

Not agreeing:S568_GTCTCTGAAAGA_c S568_TTATCCAGTCCT_d,P-value(AOV):0.328006040860443

Choosing:S568_TTATCCAGTCCT_d

Not agreeing:S589_CCAGTATCGCGT_c S589_CCAGTATCGCGT_d,P-value(AOV):0.299776516237944

Choosing:S589_CCAGTATCGCGT_c

Not agreeing:S646_TCGTTTCTTCAG_c S646_TCGTTTCTTCAG_d,P-value(AOV):0.139124688005728

Choosing:S646_TCGTTTCTTCAG_c

Not agreeing:S648_CAGAGCTAATTG_d S648_GGATGCAGGATG_c,P-value(AOV):0.404921278625513

Choosing:S648_CAGAGCTAATTG_d

Agreeing:S649_CCACTTGAGAGT_c S649_CCACTTGAGAGT_d,P-value(AOV):0.0210024373085415

Not agreeing:S657_ATGGGACCTTCA_c S657_TTATCCAGTCCT_d,P-value(AOV):0.246808690444265

Choosing:S657_TTATCCAGTCCT_d

Not agreeing:S84_GCGTAGAGAGAC_c S84_GCGTAGAGAGAC_d,P-value(AOV):0.383075242424429

Choosing:S84_GCGTAGAGAGAC_c

Not agreeing:S86_GAAACTCCTAGA_c S86_GAAACTCCTAGA_d,P-value(AOV):0.365529430945035

Choosing:S86_GAAACTCCTAGA_c

Not agreeing:S88_ATCGGGCTTAAC_c S88_ATCGGGCTTAAC_d,P-value(AOV):0.385312275513615

Choosing:S88_ATCGGGCTTAAC_c

-----SUMMARY STATISTICS------

Total agreeing:229

Total not agreeing:33

Total singletons:311

Samples with singletons:S129_TACGCCCATCAG_c,S132_AAGATCGTACTG_c,S134_ACTCATCTTCCA_c,S137_GAGATACAGTTC_c,S139_GCATGCATCCCA_c,S142_GATCTAATCGAG_c,S184_AATCTTGCGCCG_c,S187_GGAAATCCCATC_c,S189_GACCGTCAATAC_c,S192_TTGGAACGGCTT_c,S194_TCCTAGGTCCGA_c,S197_TCCTCACTATCA_c,S240_GCCTGCAGTACT_c,S243_GCCCAAGTTCAC_c,S245_ATAAAGAGGAGG_c,S248_GCGCCGAATCTT_c,S250_ATCCCAGCATGC_c,S253_GCTTCCAGACAA_c,S280_ACACAGTCCTGA_c,S283_ATTATACGGCGC_c,S285_ATTCAGATGGCA_c,S28_CACCTGTAGTAG_c,S294_CTCCAGGTCATG_c,S299_CAGGATTCGTAC_c,S301_CGCATACGACCT_c,S31_CACGAGCTACTC_c,S320_GCCTCGTACTGA_c,S322_ACCAACAGATTG_c,S323_GTGGCCTACTAC_c,S324_TTCCCTTCTCCG_c,S325_CATTTGACGACG_c,S327_AAGTGAAGCGAG_c,S33_TCTCGATAAGCG_c,S341_TGCCGCCGTAAT_c,S344_AACCTCGGATAA_c,S346_GTGCTTGTGTAG_c,S347_CAACTAGACTCG_c,S348_AGTGCCCTTGGT_c,S349_GGAACGACGTGA_c,S350_TGTCAGCTGTCG_c,S351_CTGGTGCTGAAT_c,S352_CGCGTCAAACTA_c,S354_GACAGAGGTGCA_c,S358_CTATCGGAAGAT_c,S361_CGGATTGCTGTA_c,S364_GGTACTGTACCA_c,S366_ATCGAATCGAGT_c,S369_GGTCGTGTCTTG_c,S36_AGACAAGCTTCC_c,S372_CGTCGTCTAAGA_c,S374_CAAGCGTTGTCC_c,S375_GACTTATGCCCG_c,S376_GTGACGTTAGTC_c,S377_GAGTCTTGGTAA_c,S378_TCGTCGCCAAAC_c,S379_AACATGCATGCC_c,S382_GTCTGTTGAGTG_c,S383_TGAGTTCGGTCC_c,S386_TTACGTGGCGAT_c,S388_CAATGCCTCACG_c,S389_TGTACGGATAAC_c,S38_TCCGCAACCTGA_c,S390_AATCAACTAGGC_c,S391_GTGAGGGCAAGT_c,S392_CGTGGGCTCATT_c,S393_CGTACCAGATCC_c,S394_ATGTTTAGACGG_c,S397_ACATGTCACGTG_c,S399_CTTTAGCGCTGG_c,S402_CTGGTCTTACGG_c,S404_CAAGTCGAATAC_c,S406_CTCGGTCAACCA_c,S407_ACCCTATTGCGG_c,S408_TCCGTTCGTTTA_c,S409_ACCACCGTAACC_c,S410_CATTTCGCACTT_c,S413_TTAAGCGCCTGA_c,S415_TGCGGGATTCAT_c,S418_CAAACTGCGTTG_c,S41_TCACTTGGTGCG_c,S420_TTAGACTCGGAA_c,S421_GACCGATAGGGA_c,S422_GGCGAACTGAAG_c,S423_CGGCACTATCAC_c,S424_AGGTGGTGGAGT_c,S425_ATTCCCAGAACG_c,S429_AGACGTTGCTAC_c,S430_AGAATAGCGCTT_c,S433_AAGCGTACATTG_c,S435_GTTATGACGGAT_c,S436_AGCCTCATGATG_c,S437_GTGTATCGCCAC_c,S438_CCAAACTCGTCG_c,S439_ACGTGAGGAACG_c,S440_TGAATCGAAGCT_c,S443_CTGCAGTAAGTA_c,S444_TATAGGCTCCGC_c,S445_ATCGTGTGTTGG_c,S449_CTTCCGCAGACA_c,S450_GCACTATACGCA_c,S454_CCAATGATAAGC_c,S499_TTAAACCGCGCC_c,S500_CTTGCATACCGG_c,S501_GTGCACGATAAT_c,S505_GGTCTAGGTCTA_c,S506_TCAGGACGTATC_c,S507_GAAAGGTGAGAA_c,S508_GAATATACCTGG_c,S509_GTCGCTTGCACA_c,S510_TCTACCACGAAG_c,S512_TAGTGCATTCGG_c,S520_CTAGCAGTATGA_c,S738_CGTGATCCGCTA_d,S522_GTATGGAGCTAT_c,S523_CCTTCTGTATAC_c,S524_ACGCTGTCGGTT_c,S525_CTCGTTTCAGTT_c,S559_GCTATTCCTCAT_c,S560_GCGAACCTATAC_c,S585_CTCTCATATGCT_c,S647_AGTACCTAAGTG_c,S650_GCACTTCATTTC_c,S651_AGAATCCACCAC_c,S652_CTCAAGTCAAAG_c,S653_GTACCTAGCCTG_c,S654_CACTGAGTACGT_c,S655_TCAAGCAATACG_c,S656_CATGTTGGAACA_c,S78_TTATGTACGGCG_c,S80_TTGGACGTCCAC_c,S82_TCCAGGGCTATA_c,S521_GTTAATGGCAGT_c,S701_TGTGCGATAACA_d,S702_GATTATCGACGA_d,S703_GCCTAGCCCAAT_d,S704_GATGTATGTGGT_d,S705_ACTCCTTGTGTT_d,S706_GTCACGGACATT_d,S707_GCGAGCGAAGTA_d,S708_ATCTACCGAAGC_d,S709_ACTTGGTGTAAG_d,S710_GCACACCTGATA_d,S711_GCGACAATTACA_d,S712_TCATGCTCCATT_d,S713_AGCTGTCAAGCT_d,S714_GAGAGCAACAGA_d,S715_TACTCGGGAACT_d,S716_CGTGCTTAGGCT_d,S717_TACCGAAGGTAT_d,S718_CACTCATCATTC_d,S719_GTATTTCGGACG_d,S720_TATCTATCCTGC_d,S721_AGTAGCGGAAGA_d,S722_GCAATTAGGTAC_d,S723_CATACCGTGAGT_d,S724_ATGTGTGTAGAC_d,S725_CCTGCGAAGTAT_d,S726_TTCTCTCGACAT_d,S727_GCTCTCCGTAGA_d,S728_GTTAAGCTGACC_d,S729_ATGCCATGCCGT_d,S730_GACATTGTCACG_d,S731_GCCAACAACCAT_d,S732_ACCCAAGCGTTA_d,S733_TGCAGCAAGATT_d,S734_AGCAACATTGCA_d,S735_GATGTGGTGTTA_d,S736_CAGAAATGTGTC_d,S737_GTAGAGGTAGAG_d,S739_GGTTATTTGGCG_d,S740_GGATCGTAATAC_d,S741_GCATAGCATCAA_d,S742_TGAACCCTATGG_d,S743_AGAGTCTTGCCA_d,S744_ACAACACTCCGA_d,S745_CGATGCTGTTGA_d,S746_ACGACTGCATAA_d,S747_ACGCGAACTAAT_d,S748_AGCTATGTATGG_d,S749_ACGGGTCATCAT_d,S750_GAAACATCCCAC_d,S751_CGTACTCTCGAG_d,S752_GATCACGAGAGG_d,S753_GTAAATTCAGGC_d,S754_AGTGTTTCGGAC_d,S755_ACACGCGGTTTA_d,S756_TGGCAAATCTAG_d,S757_CACCTTACCTTA_d,S758_TTAACCTTCCTG_d,S759_TGCCGTATGCCA_d,S760_CGTGACAATAGT_d,S761_TTAAGACAGTCG_d,S762_TCTGCACTGAGC_d,S763_CGCAGATTAGTA_d,S764_TGGGTCCCACAT_d,S765_CACTGGTGCATA_d,S766_AACGTAGGCTCT_d,S767_AGTTGTAGTCCG_d,S768_TAATCGGTGCCA_d,S769_TTGATCCGGTAG_d,S770_CGGGTGTTTGCT_d,S771_TTGACCGCGGTT_d,S772_GCTTGAGCTTGA_d,S773_CGCTGTGGATTA_d,S774_CTGTCAGTGACC_d,S775_ACGATTCGAGTC_d,S776_GGTTCGGTCCAT_d,S777_CTGATCCATCTT_d,S778_TATGTGCCGGCT_d,S779_TGGTCGCATCGT_d,S780_TGTAAGACTTGG_d,S781_CGGATCTAGTGT_d,S782_ACTGATGGCCTC_d,S783_TTCGATGCCGCA_d,S784_TGTGGCTCGTGT_d,S785_AACTTTCAGGAG_d,S786_TGCACGTGATAA_d,S787_AAGACAGCTATC_d,S788_CGTAGGTAGAGG_d,S789_ATTTAGGACGAC_d,S790_GGATAGCCAAGG_d,S791_TGGTTGGTTACG_d,S792_GTCGTCCAAATG_d,S793_CAACGTGCTCCA_d,S794_TACACAAGTCGC_d,S795_GCGTCCATGAAT_d,S796_GTAATGCGTAAC_d,S797_GTCGCCGTACAT_d,S798_GGAATCCGATTA_d,S799_TTCTGAGAGGTA_d,S800_ATCCCTACGGAA_d,S801_GGTTCCATTAGG_d,S802_GTGTTCCCAGAA_d,S803_CCGAGGTATAAT_d,S804_AGCGTAATTAGC_d,S805_CTCGTGAATGAC_d,S806_AGGTGAGTTCTA_d,S807_CCTGTCCTATCT_d,S808_GGTTTAACACGC_d,S809_AGACAGTAGGAG_d,S810_GCCGTAAACTTG_d,S811_GCAGATTTCCAG_d,S812_AGATGATCAGTC_d,S813_GAGACGTGTTCT_d,S814_TATCACCGGCAC_d,S815_TATGCCAGAGAT_d,S816_AGGTCCAAATCA_d,S817_ACCGTGCTCACA_d,S818_CTCCCTTTGTGT_d,S819_AGCTGCACCTAA_d,S820_GTTCCTCCATTA_d,S821_ACTCTAGCCGGT_d,S822_CGATAGGCCTTA_d,S823_AATGACCTCGTG_d,S824_CTTAGGCATGTG_d,S825_CCAGATATAGCA_d,S826_GAGAGTCCACTT_d,S827a_ACGTCTCAGTGC_d,S827b_GAATGACGTTTG_d,S827_TCAGTCAGATGA_d,S828_AAGTCACACACA_d,S828a_TAGTAGCACCTG_d,S828b_ACTTACGCCACG_d,S829a_AGGTCATCTTGG_d,S829b_ACGCCTTTCTTA_d,S829_GCTGTGATTCGA_d,S830a_TGCTGTGACCAC_d,S830b_TTGGTGCCTGTG_d,S830_CTAGCTATGGAC_d,S831a_ACACTTCGGCAA_d,S831b_CATCGGATCTGA_d,S831_CTTGACGAGGTT_d,S832a_ACCTCCCGGATA_d,S832_ACCTGGGAATAT_d,S832b_CATGTCTTCCAT_d,S833a_AGTAGACTTACG_d,S833b_GTTACAGTTGGC_d,S833_CTCTGCCTAATT_d,S834_ATATGACCCAGC_d,S834a_TGGAAACCATTG_d,S834b_CGGACTCGTTAC_d,S835a_AGTCCGAGTTGT_d,S835b_TCTCGCACTGGA_d,S835_CTCTATTCCACC_d,S836a_CCGCGATTTCGA_d,S836_ATTGAGTGAGTC_d,S836b_TTCTGGTCTTGT_d,S837a_ACACACCCTGAC_d,S837b_ATACGGGTTCGT_d,S837_TTATGGTACGGA_d,S838a_TCACGAGTCACA_d,S838b_GATTTAGAGGCT_d,S838_GCTAGTTATGGA_d,S839a_CACAAAGCGATT_d,S839b_GTCAGCCGTTAA_d,S839_CAGATTAACCAG_d,S840a_CACCGTGACACT_d,S840b_CCTTTCACCTGT_d,S840_GGCTGCATACTC_d,S841a_GAAGATCTATCG_d,S841b_GCAGCCATATTG_d,S841_TTGGTAAAGTGC_d,S842_AAGTGGCTATCC_d,S842a_GACGGAACAGAC_d,S842b_ATAGGTGTGCTA_d,S843_AACCGATGTACC_d,S843a_GGACCGCTTTCA_d,S843b_ACCTAGCTAGTG_d,S844a_CACGGTCCTATG_d,S844b_GTCCTGACACTG_d,S844_TCGATTGGCCGT_d

Total useful samples:573

Saving file

Samples<=10000:

          filteredSumOfSamples

S841b                       10

S769                        11

S750                       152

S751                       186

S749                       297

S747                       415

S752                       567

S708                       610

S707                       622

S746                      1360

S726                      2366

S709                      2937

S287                      2953

S646                      3073

S360                      3357

S453                      3907

S125                      3922

S23                       3992

S86                       4050

S293                      4105

S649                      4171

S513                      4232

S359                      4288

S219                      4387

S161                      4523

S88                       4550

S589                      4743

S220                      4805

S84                       4810

S511                      4837

S599                      4884

S291                      4963

S218                      5084

S604                      5146

S153                      5329

S216                      5426

S20                       5519

S60                       5545

S285                      5621

S71                       5666

S655                      5690

S113                      5721

S12                       5858

S151                      5912

S598                      5998

S62                       6080

S61                       6219

S10                       6292

S200                      6565

S263                      6796

S6                        6812

S510                      6950

S100                      6970

S388                      6981

S222                      6984

S70                       7033

S27                       7042

S613                      7055

S780                      7172

S57                       7187

S405                      7193

S43                       7353

S167                      7381

S221                      7450

S18                       7609

S454                      7614

S101                      7731

S596                      7753

S823                      7802

S280                      7819

S450                      7897

S611                      7901

S72                       7914

S59                       8007

S69                       8015

S512                      8038

S44                       8146

S612                      8217

S163                      8246

S74                       8461

S150                      8596

S821                      8869

S164                      8880

S73                       8881

S5                        8912

S585                      9105

S609                      9139

S443                      9185

S651                      9490

S82                       9595

S206                      9646

S223                      9801

S48                       9848

S383                      9904

25/06/2015:Tutorial: Correlation plot of significantly up/down regulated OTUs against the environmental data

Download the data from http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/ecological.html

Please also have a look at https://en.wikipedia.org/wiki/Multiple_comparisons_problem
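The multiple comparisons problem matters here because the script below runs one test per OTU and, later, one correlation test per taxon/variable pair. Base R's p.adjust() implements the Benjamini & Hochberg correction that this tutorial applies throughout; a minimal sketch with made-up p-values:

```r
# Six hypothetical raw p-values from independent tests
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.06)

# Benjamini & Hochberg (1995) step-up procedure: controls the false
# discovery rate rather than the family-wise error rate
p_bh <- p.adjust(p, method="BH")

print(p_bh)
sum(p < 0.05)    # 5 raw p-values below 0.05
sum(p_bh < 0.05) # only 2 adjusted p-values below 0.05
```

Notice how borderline raw p-values (0.039-0.042) no longer pass the 0.05 threshold once adjusted.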

In the past, we wrote a script that uses the DESeq package to find significant OTUs between different conditions. We now modify that script to extract the list of significant OTUs and find their correlations with the environmental data.

abund_table<-read.csv("All_Good_P2_C03.csv",row.names=1,check.names=FALSE)
abund_table<-t(abund_table)

meta_table<-read.csv("ENV_pitlatrine.csv",row.names=1,check.names=FALSE)
abund_table<-abund_table[rownames(abund_table) %in% rownames(meta_table),]
abund_table<-abund_table[,colSums(abund_table)>0]

grouping_info<-t(as.data.frame(strsplit(rownames(abund_table),"_")))
grouping_info<-as.data.frame(grouping_info)
rownames(grouping_info)<-rownames(abund_table)
colnames(grouping_info)<-c("Country","Latrine","Depth")

#Load taxonomy information
OTU_taxonomy<-read.csv("All_Good_P2_C03_Taxonomy.csv",row.names=1,check.names=FALSE)

library(DESeq2)

#We will convert our table to a DESeqDataSet object
countData = round(as(abund_table, "matrix"), digits = 0)
# We will add 1 to the countData, otherwise DESeq will fail with the error:
# estimating size factors
# Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  :
# every gene contains at least one zero, cannot compute log geometric means
countData<-(t(countData+1))

dds <- DESeqDataSetFromMatrix(countData, grouping_info, as.formula(~ Country))

#Reference:https://github.com/MadsAlbertsen/ampvis/blob/master/R/amp_test_species.R
#Differential expression analysis based on the Negative Binomial (a.k.a. Gamma-Poisson) distribution
#For some reason this doesn't work: data_deseq_test = DESeq(dds, test="wald", fitType="parametric")
data_deseq_test = DESeq(dds)

## Extract the results
res = results(data_deseq_test, cooksCutoff = FALSE)
res_tax = cbind(as.data.frame(res), as.matrix(countData[rownames(res), ]), OTU = rownames(res))

sig = 0.00001
fold = 0
plot.point.size = 2
label = F
tax.display = NULL
tax.aggregate = "OTU"

res_tax_sig = subset(res_tax, padj < sig & fold < abs(log2FoldChange))

res_tax_sig <- res_tax_sig[order(res_tax_sig$padj),]

## Plot the data
### MA plot
res_tax$Significant <- ifelse(rownames(res_tax) %in% rownames(res_tax_sig), "Yes", "No")
res_tax$Significant[is.na(res_tax$Significant)] <- "No"
p1 <- ggplot(data = res_tax, aes(x = baseMean, y = log2FoldChange, color = Significant)) +
  geom_point(size = plot.point.size) +
  scale_x_log10() +
  scale_color_manual(values=c("black", "red")) +
  labs(x = "Mean abundance", y = "Log2 fold change")+theme_bw()
if(label == T){
  if (!is.null(tax.display)){
    rlab <- data.frame(res_tax, Display = apply(res_tax[,c(tax.display, tax.aggregate)], 1, paste, collapse="; "))
  } else {
    rlab <- data.frame(res_tax, Display = res_tax[,tax.aggregate])
  }
  p1 <- p1 + geom_text(data = subset(rlab, Significant == "Yes"), aes(label = Display), size = 4, vjust = 1)
}
pdf("NB_MA.pdf")
print(p1)
dev.off()

res_tax_sig_abund = cbind(as.data.frame(countData[rownames(res_tax_sig), ]), OTU = rownames(res_tax_sig), padj = res_tax[rownames(res_tax_sig),"padj"])

#Apply normalisation (either use relative or log-relative transformation)
#data<-abund_table/rowSums(abund_table)
data<-log((abund_table+1)/(rowSums(abund_table)+dim(abund_table)[2]))
data<-as.data.frame(data)

#Now we plot taxa significantly different between the categories
df<-NULL
for(i in res_tax[rownames(res_tax_sig),"OTU"]){
  tmp<-data.frame(data[,i],grouping_info$Country,rep(paste(paste(i,gsub(".*;","",gsub(";+$","",paste(sapply(OTU_taxonomy[i,],as.character),collapse=";"))))," padj = ",sprintf("%.5g",res_tax[i,"padj"]),sep=""),dim(data)[1]))
  if(is.null(df)){df<-tmp} else {df<-rbind(df,tmp)}
}
colnames(df)<-c("Value","Type","Taxa")

library(ggplot2)

p<-ggplot(df,aes(Type,Value,colour=Type))+ylab("Log-relative normalised")
p<-p+geom_boxplot()+geom_jitter()+theme_bw()+
  facet_wrap( ~ Taxa , scales="free_x",nrow=1)
p<-p+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+theme(strip.text.x = element_text(size = 16, colour = "black", angle = 90))
pdf("NB_significant.pdf",width=160,height=10)
print(p)
dev.off()


#res_tax[rownames(res_tax_sig),"OTU"] contains the list of OTUs that were found to be significant,
#and we will now try to correlate them with the environmental data.
#To do this we construct the sel_env vector, which contains the names of the variables we are interested in,
#and the sel_env_label dataframe, which holds the new labels to appear in the correlation plot
#and is used by the labeller function in facet_grid of ggplot2.

sel_env<-c("pH","Temp","TS","VS","VFA","CODt","CODs","perCODsbyt","NH4","Prot","Carbo")
sel_env_label <- list(
  'pH'="PH",
  'Temp'="Temperature",
  'TS'="TS",
  'VS'="VS",
  'VFA'="VFA",
  'CODt'="CODt",
  'CODs'="CODs",
  'perCODsbyt'="%CODs/t",
  'NH4'="NH4",
  'Prot'="Protein",
  'Carbo'="Carbon"
)

sel_env_label<-t(as.data.frame(sel_env_label))
sel_env_label<-as.data.frame(sel_env_label)
colnames(sel_env_label)<-c("Trans")
sel_env_label$Trans<-as.character(sel_env_label$Trans)


#Apply normalisation (either use relative or log-relative transformation)
#x<-abund_table/rowSums(abund_table)
x<-
log((abund_table+1)/(rowSums(abund_table)+dim(abund_table)[2]))

#Extract the data for those OTUs that were found to be significant
x<-x
[,as.character(res_tax[rownames(res_tax_sig),"OTU"])]

#Change the column names of tables to reflect taxonomy information using OTU_taxonomy dataframe
colnames(x)<-paste(colnames(x),sapply(colnames(x),
 
 function (i) gsub(".*;","",gsub(";+$","",
 
 paste(sapply(OTU_taxonomy[i,],as.character),collapse=";"))) ))

#Now get a filtered table based on sel_env
y<-meta_table
[,sel_env]

#We can get the correlation values for each group separately
groups<-grouping_info$Country

#method specifies the type of correlation to use: choose "kendall", "spearman", or "pearson" below
method<-"kendall"

#Now calculate the correlation between individual Taxa and the environmental data
df<-NULL
for(i in colnames(x)){
  for(j in colnames(y)){
    for(k in unique(groups)){
      a<-x[groups==k,i,drop=F]
      b<-y[groups==k,j,drop=F]
      tmp<-c(i,j,cor(a[complete.cases(b),],b[complete.cases(b),],use="everything",method=method),cor.test(a[complete.cases(b),],b[complete.cases(b),],method=method)$p.value,k)
      if(is.null(df)){
        df<-tmp
      } else {
        df<-rbind(df,tmp)
      }
    }
  }
}

df<-data.frame(row.names=NULL,df)
colnames(df)<-c("Taxa","Env","Correlation","Pvalue","Type")

#We need to convert Pvalue and Correlation to numerical values and introduce a new column
#called AdjPvalue to store the adjusted p-values based on multiple-testing correction
df$Pvalue<-as.numeric(as.character(df$Pvalue))
df$AdjPvalue<-rep(0,dim(df)[1])
df$Correlation<-as.numeric(as.character(df$Correlation))

#You can adjust the p-values for multiple comparisons using Benjamini & Hochberg (1995):
# 1 -> do not adjust
# 2 -> adjust Env + Type (column on the correlation plot)
# 3 -> adjust Taxa + Type (row on the correlation plot for each type)
# 4 -> adjust Taxa (row on the correlation plot)
# 5 -> adjust Env (panel on the correlation plot)
adjustment_label<-c("NoAdj","AdjEnvAndType","AdjTaxaAndType","AdjTaxa","AdjEnv")
adjustment<-3

if(adjustment==1){
  df$AdjPvalue<-df$Pvalue
} else if (adjustment==2){
  for(i in unique(df$Env)){
    for(j in unique(df$Type)){
      sel<-df$Env==i & df$Type==j
      df$AdjPvalue[sel]<-p.adjust(df$Pvalue[sel],method="BH")
    }
  }
} else if (adjustment==3){
  for(i in unique(df$Taxa)){
    for(j in unique(df$Type)){
      sel<-df$Taxa==i & df$Type==j
      df$AdjPvalue[sel]<-p.adjust(df$Pvalue[sel],method="BH")
    }
  }
} else if (adjustment==4){
  for(i in unique(df$Taxa)){
    sel<-df$Taxa==i
    df$AdjPvalue[sel]<-p.adjust(df$Pvalue[sel],method="BH")
  }
} else if (adjustment==5){
  for(i in unique(df$Env)){
    sel<-df$Env==i
    df$AdjPvalue[sel]<-p.adjust(df$Pvalue[sel],method="BH")
  }
}

#Now we generate the labels for significant values
df$Significance<-cut(df$AdjPvalue, breaks=c(-Inf, 0.001, 0.01, 0.05, Inf), label=c("***", "**", "*", ""))

#We ignore NAs
df<-df[complete.cases(df),]

#We want to reorganise the Env data based on the order in which it appears
df$Env<-factor(df$Env,as.character(df$Env))

#We use this function to change the labels for facet_grid in ggplot2
Env_labeller <- function(variable,value){
  return(sel_env_label[as.character(value),"Trans"])
}

p <- ggplot(aes(x=Type, y=Taxa, fill=Correlation), data=df)
p <- p + geom_tile() + scale_fill_gradient2(low="#2C7BB6", mid="white", high="#D7191C")
p<-p+theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust=0.5))
p<-p+geom_text(aes(label=Significance), color="black", size=3)+labs(y=NULL, x=NULL, fill=method)
p<-p+facet_grid(. ~ Env, drop=TRUE,scale="free",space="free_x",labeller=Env_labeller)
p<-p+theme(strip.text.x = element_text(size = 12, colour = "black", angle = 90))
pdf(paste("Correlation_",adjustment_label[adjustment],".pdf",sep=""),height=60,width=10)
print(p)
dev.off()

You have already seen the first two plots before, so I am not including them here. The third plot is new: it is the correlation plot of the ~350 significant OTUs that we found before (note that I have rotated the plot so that it takes less space):

Correlation_AdjTaxaAndType.jpg

18/06/2015:Tutorial: ANOVA with ggplot2 Part-2 (Diversity Data)[Modified:19/11/2015]

Download the data from http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/ecological.html

Please also have a look at http://www.r-bloggers.com/analysis-of-variance-anova-for-multiple-comparisons/

Please also read through Vegan’s tutorial: http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf

########## THE ONLY BIT OF CODE YOU NEED TO CHANGE ##########

#Load abundance table
abund_table<-read.csv("SPE_pitlatrine.csv",row.names=1,check.names=FALSE)
abund_table<-t(abund_table)
#Extract categorical data from site names and save them in grouping_info
grouping_info<-data.frame(row.names=rownames(abund_table),t(as.data.frame(strsplit(rownames(abund_table),"_"))))
colnames(grouping_info)<-c("Country","Latrine","Depth")
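The strsplit trick above simply splits each sample name on underscores. Assuming names of the form Country_Latrine_Depth (e.g. a hypothetical "T_2_9"), the same parsing looks like this in Python (function name mine):

```python
def parse_sample_name(name):
    """Split a sample name like 'T_2_9' into the three grouping columns,
    mirroring strsplit(rownames(abund_table), "_") in the R code."""
    country, latrine, depth = name.split("_")
    return {"Country": country, "Latrine": latrine, "Depth": depth}
```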
#We are going to specify the category in grouping_column
grouping_column<-"Depth"

###################################################

library(vegan)

#Calculate Richness
R<-rarefy(abund_table,min(rowSums(abund_table)))
df_R<-data.frame(sample=names(R),value=R,measure=rep("Richness",length(R)))

#Calculate Shannon entropy
H<-diversity(abund_table)
df_H<-data.frame(sample=names(H),value=H,measure=rep("Shannon",length(H)))

#Calculate Simpson diversity index
simp <- diversity(abund_table, "simpson")
df_simp<-data.frame(sample=names(simp),value=simp,measure=rep("Simpson",length(simp)))

#Calculate Fisher alpha
alpha <- fisher.alpha(abund_table)
df_alpha<-data.frame(sample=names(alpha),value=alpha,measure=rep("Fisher alpha",length(alpha)))

#Calculate Pielou's evenness
S <- specnumber(abund_table)
J <- H/log(S)
df_J<-data.frame(sample=names(J),value=J,measure=rep("Pielou's evenness",length(J)))

df<-rbind(df_R,df_H,df_simp,df_alpha,df_J)
rownames(df)<-NULL
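For readers who want to check the formulas behind vegan's diversity() and the Pielou calculation above, here is a Python sketch of three of the indices on a plain count vector (function names are mine):

```python
import math

def shannon(counts):
    """Shannon entropy H' = -sum p_i * ln p_i (vegan's diversity() default)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson(counts):
    """Simpson index 1 - sum p_i^2 (vegan's diversity(x, "simpson"))."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def pielou(counts):
    """Pielou's evenness J = H' / ln(S), with S the number of observed species,
    matching J <- H/log(S) above."""
    s = sum(1 for c in counts if c > 0)
    return shannon(counts) / math.log(s)
```

A perfectly even community of four species gives H' = ln 4, Simpson = 0.75 and J = 1.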

#Incorporate categorical data in df
df<-data.frame(df,grouping_info[as.character(df$sample),])

#To do anova, we will convert our data.frame to data.table
library(data.table)

#Since we can't pass a formula to data.table, I am creating
#a dummy column .group. so that I don't change names in the formula
dt<-data.table(data.frame(df,.group.=df[,grouping_column]))

#I am also specifying a p-value cutoff for the ggplot2 strips
pValueCutoff<-0.05
pval<-dt[, list(pvalue = sprintf("%.2g", tryCatch(summary(aov(value ~ .group.))[[1]][["Pr(>F)"]][1],error=function(e) NULL))), by=list(measure)]

#Filter out pvals that we don't want
pval<-pval[!pval$pvalue=="",]
pval<-pval[as.numeric(pval$pvalue)<=pValueCutoff,]

#I am using sapply to generate significances for pval$pvalue using the cut function.
pval$pvalue<-sapply(as.numeric(pval$pvalue),function(x){as.character(cut(x,breaks=c(-Inf, 0.001, 0.01, 0.05, Inf),label=c("***", "**", "*", "")))})

#Update df$measure to change the measure names if the grouping_column has more than two classes
if(length(unique(as.character(grouping_info[,grouping_column])))>2){
 df$measure<-as.character(df$measure)
 if(dim(pval)[1]>0){
  for(i in seq(1:dim(pval)[1])){
   df[df$measure==as.character(pval[i,measure]),"measure"]=paste(as.character(pval[i,measure]),as.character(pval[i,pvalue]))
  }
 }
 df$measure<-as.factor(df$measure)
}

#Get all possible combination of values in the grouping_column
s<-combn(unique(as.character(df[,grouping_column])),2)
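combn(x, 2) enumerates every unordered pair of group levels; in Python the equivalent is itertools.combinations (function name mine):

```python
from itertools import combinations

def pairwise_groups(levels):
    """All unordered pairs of group levels, like combn(levels, 2) in R."""
    return list(combinations(levels, 2))
```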

#df_pw will store the pair-wise p-values
df_pw<-NULL
for(k in unique(as.character(df$measure))){
 #We need to calculate the coordinate to draw pair-wise significance lines
 #for this we calculate bas as the maximum value
 bas<-max(df[(df$measure==k),"value"])

 #Calculate increments as 10% of the maximum values
 inc<-0.1*bas

 #Give an initial increment
 bas<-bas+inc

 for(l in 1:dim(s)[2]){
  #Do a pair-wise anova
  tmp<-c(k,s[1,l],s[2,l],bas,paste(sprintf("%.2g",tryCatch(summary(aov(as.formula(paste("value ~",grouping_column)),data=df[(df$measure==k) & (df[,grouping_column]==s[1,l] | df[,grouping_column]==s[2,l]),] ))[[1]][["Pr(>F)"]][1],error=function(e) NULL)),"",sep=""))

  #Ignore if anova fails
  if(!is.na(as.numeric(tmp[length(tmp)]))){
   #Only retain those pairs where the p-values are significant
   if(as.numeric(tmp[length(tmp)])<0.05){
    if(is.null(df_pw)){df_pw<-tmp}else{df_pw<-rbind(df_pw,tmp)}

    #Generate the next position
    bas<-bas+inc
   }
  }
 }
}
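The bas/inc bookkeeping above stacks the significance lines 10% of the panel maximum apart, one step per significant pair. A simplified Python sketch of the resulting y-positions (function name mine, and it assumes one line per significant pair):

```python
def significance_line_positions(max_value, n_lines, frac=0.1):
    """y-coordinates for stacked pairwise-significance lines:
    start one increment above the panel maximum and step up by
    frac * max_value per line, as the bas<-bas+inc updates do."""
    inc = frac * max_value
    return [max_value + inc * (i + 1) for i in range(n_lines)]
```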
df_pw<-data.frame(row.names=NULL,df_pw)
names(df_pw)<-c("measure","from","to","y","p")

library(ggplot2)

#We need grid to draw the arrows
library(grid)

#Draw the boxplots
p<-ggplot(aes_string(x=grouping_column,y="value",color=grouping_column),data=df)
p<-p+geom_boxplot()+geom_jitter(position = position_jitter(height = 0, width=0))
p<-p+theme_bw()
p<-p+geom_point(size=5,alpha=0.2)
p<-p+theme(axis.text.x = element_text(angle = 90, hjust = 1))

p<-p+facet_wrap(~measure,scales="free_y",nrow=1)+ylab("Observed Values")+xlab("Samples")

#This loop will generate the lines and significances
for(i in 1:dim(df_pw)[1]){
 p<-p+geom_path(inherit.aes=F,aes(x,y),data = data.frame(x = c(which(levels(df[,grouping_column])==as.character(df_pw[i,"from"])),which(levels(df[,grouping_column])==as.character(df_pw[i,"to"]))), y = c(as.numeric(as.character(df_pw[i,"y"])),as.numeric(as.character(df_pw[i,"y"]))), measure=c(as.character(df_pw[i,"measure"]),as.character(df_pw[i,"measure"]))), color="black",lineend = "butt",arrow = arrow(angle = 90, ends = "both", length = unit(0.1, "inches")))
 p<-p+geom_text(inherit.aes=F,aes(x=x,y=y,label=label),data=data.frame(x=(which(levels(df[,grouping_column])==as.character(df_pw[i,"from"]))+which(levels(df[,grouping_column])==as.character(df_pw[i,"to"])))/2,y=as.numeric(as.character(df_pw[i,"y"])),measure=as.character(df_pw[i,"measure"]),label=as.character(cut(as.numeric(as.character(df_pw[i,"p"])),breaks=c(-Inf, 0.001, 0.01, 0.05, Inf),label=c("***", "**", "*", "")))))
}

#We are going to use our favourite palette of 21 orthogonal colours, followed by greyscale values should we run out of colour assignments
colours <- c("#F0A3FF", "#0075DC", "#993F00","#4C005C","#2BCE48","#FFCC99","#808080","#94FFB5","#8F7C00","#9DCC00","#C20088","#003380","#FFA405","#FFA8BB","#426600","#FF0010","#5EF1F2","#00998F","#740AFF","#990000","#FFFF00",grey.colors(1000));
p<-p+scale_color_manual(grouping_column,values=colours)

p<-p+theme(strip.background = element_rect(fill="white"))+theme(panel.margin = unit(0, "lines"))

pdf("ANOVA_diversity.pdf",height=7,width=40)
print(p)
dev.off()

You will get the following image if you use grouping_column<-"Country" in the above code:

ANOVA_diversity_Country.jpg

grouping_column<-"Latrine" gives:

ANOVA_diversity_Latrine.jpg

grouping_column<-"Depth" gives:

ANOVA_diversity_Depth.jpg

11/06/2015:Tutorial: ANOVA with ggplot2 Part-1 (Environmental Data) [Modified:19/11/2015]

Download the data from http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/ecological.html

Please also have a look at http://www.r-bloggers.com/analysis-of-variance-anova-for-multiple-comparisons/

########## THE ONLY BIT OF CODE YOU NEED TO CHANGE ##########

#Load meta_table
meta_table<-read.csv("ENV_pitlatrine.csv",row.names=1,check.names=FALSE)
grouping_info<-data.frame(row.names=rownames(meta_table),t(as.data.frame(strsplit(rownames(meta_table),"_"))))
colnames(grouping_info)<-c("Country","Latrine","Depth")
#We are going to specify the category in grouping_column
grouping_column<-"Country"

###################################################

#Ensure that all columns of meta_table are numeric and not factors
meta_table[] <- lapply(meta_table, function(x) as.numeric(as.character(x)))

library(reshape)
#We linearize meta_table because it is easier to do statistical analysis
#as well as visualisation with ggplot2 on long-format data. One trick here:
#we first convert to a matrix before applying melt() so that the sample names come out as well
df<-melt(as.matrix(meta_table))
names(df)<-c("sample","env","value")
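melt() on a matrix turns the wide samples-by-variables table into one (sample, env, value) row per cell. A Python sketch of the same reshape (function name mine; this version emits rows in row-major order, so the row ordering may differ from R's column-major melt, but the content is the same):

```python
def melt_matrix(rownames, colnames, values):
    """Wide-to-long reshape, like melt(as.matrix(meta_table)) in R:
    one (sample, env, value) tuple per cell of the matrix."""
    return [
        (r, c, values[i][j])
        for i, r in enumerate(rownames)
        for j, c in enumerate(colnames)
    ]
```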

#Incorporate categorical data in df
df<-data.frame(df,grouping_info[as.character(df$sample),])

#To do anova, we will convert our data.frame to data.table
library(data.table)

#Since we can't pass a formula to data.table, I am creating
#a dummy column .group. so that I don't change names in the formula
dt<-data.table(data.frame(df,.group.=df[,grouping_column]))

#I am also specifying a p-value cutoff for the ggplot2 strips
pValueCutoff<-0.05
pval<-dt[, list(pvalue = sprintf("%.2g", tryCatch(summary(aov(value ~ .group.))[[1]][["Pr(>F)"]][1],error=function(e) NULL))), by=list(env)]

#Filter out pvals that we don't want
pval<-pval[!pval$pvalue=="",]
pval<-pval[as.numeric(pval$pvalue)<=pValueCutoff,]

#I am using sapply to generate significances for pval$pvalue using the cut function.
pval$pvalue<-sapply(as.numeric(pval$pvalue),function(x){as.character(cut(x,breaks=c(-Inf, 0.001, 0.01, 0.05, Inf),label=c("***", "**", "*", "")))})

#Update df$env to change environmental variable names if the grouping_column has more than two classes
if(length(unique(as.character(grouping_info[,grouping_column])))>2){
 df$env<-as.character(df$env)
 if(dim(pval)[1]>0){
  for(i in seq(1:dim(pval)[1])){
   df[df$env==as.character(pval[i,env]),"env"]=paste(as.character(pval[i,env]),as.character(pval[i,pvalue]))
  }
 }
 df$env<-as.factor(df$env)
}

#Get all possible combination of values in the grouping_column
s<-combn(unique(as.character(df[,grouping_column])),2)

#df_pw will store the pair-wise p-values
df_pw<-NULL
for(k in unique(as.character(df$env))){
 #We need to calculate the coordinate to draw pair-wise significance lines
 #for this we calculate bas as the maximum value
 bas<-max(df[(df$env==k),"value"])

 #Calculate increments as 10% of the maximum values
 inc<-0.1*bas

 #Give an initial increment
 bas<-bas+inc

 for(l in 1:dim(s)[2]){
  #Do a pair-wise anova
  tmp<-c(k,s[1,l],s[2,l],bas,paste(sprintf("%.2g",tryCatch(summary(aov(as.formula(paste("value ~",grouping_column)),data=df[(df$env==k) & (df[,grouping_column]==s[1,l] | df[,grouping_column]==s[2,l]),] ))[[1]][["Pr(>F)"]][1],error=function(e) NULL)),"",sep=""))

  #Ignore if anova fails
  if(!is.na(as.numeric(tmp[length(tmp)]))){
   #Only retain those pairs where the p-values are significant
   if(as.numeric(tmp[length(tmp)])<0.05){
    if(is.null(df_pw)){df_pw<-tmp}else{df_pw<-rbind(df_pw,tmp)}

    #Generate the next position
    bas<-bas+inc
   }
  }
 }
}
df_pw<-data.frame(row.names=NULL,df_pw)
names(df_pw)<-c("env","from","to","y","p")

library(ggplot2)

#We need grid to draw the arrows
library(grid)

#Draw the boxplots
p<-ggplot(aes_string(x=grouping_column,y="value",color=grouping_column),data=df)
p<-p+geom_boxplot()+geom_jitter(position = position_jitter(height = 0, width=0))
p<-p+theme_bw()
p<-p+geom_point(size=5,alpha=0.2)
p<-p+theme(axis.text.x = element_text(angle = 90, hjust = 1))

p<-p+facet_wrap(~env,scales="free_y",nrow=1)+ylab("Observed Values")+xlab("Samples")

#This loop will generate the lines and significances
for(i in 1:dim(df_pw)[1]){
 p<-p+geom_path(inherit.aes=F,aes(x,y),data = data.frame(x = c(which(levels(df[,grouping_column])==as.character(df_pw[i,"from"])),which(levels(df[,grouping_column])==as.character(df_pw[i,"to"]))), y = c(as.numeric(as.character(df_pw[i,"y"])),as.numeric(as.character(df_pw[i,"y"]))), env=c(as.character(df_pw[i,"env"]),as.character(df_pw[i,"env"]))), color="black",lineend = "butt",arrow = arrow(angle = 90, ends = "both", length = unit(0.1, "inches")))
 p<-p+geom_text(inherit.aes=F,aes(x=x,y=y,label=label),data=data.frame(x=(which(levels(df[,grouping_column])==as.character(df_pw[i,"from"]))+which(levels(df[,grouping_column])==as.character(df_pw[i,"to"])))/2,y=as.numeric(as.character(df_pw[i,"y"])),env=as.character(df_pw[i,"env"]),label=as.character(cut(as.numeric(as.character(df_pw[i,"p"])),breaks=c(-Inf, 0.001, 0.01, 0.05, Inf),label=c("***", "**", "*", "")))))
}

#We are going to use our favourite palette of 21 orthogonal colours, followed by greyscale values should we run out of colour assignments
colours <- c("#F0A3FF", "#0075DC", "#993F00","#4C005C","#2BCE48","#FFCC99","#808080","#94FFB5","#8F7C00","#9DCC00","#C20088","#003380","#FFA405","#FFA8BB","#426600","#FF0010","#5EF1F2","#00998F","#740AFF","#990000","#FFFF00",grey.colors(1000));
p<-p+scale_color_manual(grouping_column,values=colours)

p<-p+theme(strip.background = element_rect(fill="white"))+theme(panel.margin = unit(0, "lines"))

pdf("ANOVA_env.pdf",height=7,width=40)
print(p)
dev.off()

You will get the following image if you use grouping_column<-"Country" in the above code:

ANOVA_env_Country.jpg

grouping_column<-"Latrine" gives:

ANOVA_env_Latrine.jpg

grouping_column<-"Depth" gives:

ANOVA_env_Depth.jpg

04/06/2015:Tutorial: Phyloseq I - plot_richness, plot_tree

Download the data from http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/ecological.html

Please also have a look at phyloseq tutorials at http://joey711.github.io/phyloseq/tutorials-index

abund_table<-read.csv("All_Good_P2_C03.csv",row.names=1,check.names=FALSE)
abund_table<-t(abund_table)

meta_table<-read.csv("ENV_pitlatrine.csv",row.names=1,check.names=FALSE)
abund_table<-abund_table[rownames(abund_table) %in% rownames(meta_table),]
abund_table<-abund_table[,colSums(abund_table)>0]
meta_table<-meta_table[rownames(abund_table),]

grouping_info<-t(as.data.frame(strsplit(rownames(abund_table),"_")))
grouping_info<-as.data.frame(grouping_info)
rownames(grouping_info)<-rownames(abund_table)
colnames(grouping_info)<-c("Country","Latrine","Depth")

#Load taxonomy information
OTU_taxonomy<-read.csv("All_Good_P2_C03_Taxonomy.csv",row.names=1,check.names=FALSE)

#Load tree using ape package
library(ape)
OTU_tree <- read.tree("All_Good_P2_C03.tre")

library(phyloseq)
#Convert the data to phyloseq format
OTU = otu_table(as.matrix(abund_table), taxa_are_rows = FALSE)
TAX = tax_table(as.matrix(OTU_taxonomy))

#Combine quantitative (meta_table) and categorical (grouping_info) together
meta_table<-data.frame(meta_table,grouping_info)

#We will use the str_pad function to pad 0s to our Depth variable

library(stringr)
meta_table$Depth<-as.factor(str_pad(as.character(meta_table$Depth), 2, pad = "0"))
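str_pad here simply zero-pads the depth labels to width 2 so that they sort correctly as text; in Python the equivalent is zfill (function name mine):

```python
def pad_depth(depth):
    """Zero-pad a depth label to width 2 so it sorts correctly as text,
    like str_pad(x, 2, pad = "0") in R."""
    return str(depth).zfill(2)
```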
SAM = sample_data(meta_table)

#Uncomment the following if you want your tree to be ultrametricised
#OTU_tree<-compute.brlen(OTU_tree,method="Grafen")
physeq<-merge_phyloseq(phyloseq(OTU, TAX),SAM,OTU_tree)

#Alpha diversity analysis
colours <- c("#F0A3FF", "#0075DC", "#993F00","#4C005C","#2BCE48","#FFCC99","#808080","#94FFB5","#8F7C00","#9DCC00","#C20088","#003380","#FFA405","#FFA8BB","#426600","#FF0010","#5EF1F2","#00998F","#740AFF","#990000","#FFFF00");

library(ggplot2)
p<-plot_richness(physeq, x = "Depth", color = "Country")
p<-p+theme_bw()
p<-p+geom_boxplot(aes(group=interaction(Country,Depth)),alpha=0.6,position="identity")
p<-p+geom_point(size=5,alpha=0.5)
p<-p+scale_colour_manual(values=colours)
p<-p+theme(strip.text.x = element_text(size = 12, colour = "black", angle = 90))
pdf("AlphaDiversity.pdf",width=25)
print(p)
dev.off()

#We can use subset_taxa to select the taxonomic groups we are interested in plotting
physeq_subset<-subset_taxa(physeq,Family=="Synergistaceae")
p <- plot_tree(physeq_subset,size = "abundance", color = "Depth", label.tips = "Genus", text.size=2, ladderize = TRUE)
pdf("Synergistaceae.pdf")
print(p)
dev.off()

The above code will produce the following plots:

AlphaDiversity.jpg

Synergistaceae.jpg

28/05/2015: Tutorial: R Regression Diagnostics (Pitlatrine Longitudinal Dataset)

Here is a useful link for regression diagnostics in R:

http://www.statmethods.net/stats/rdiagnostics.html

Cook’s distance:

http://en.wikipedia.org/wiki/Cook%27s_distance

Here is my script to remove collinear terms using Variance Inflation Factor:

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/remove_colinear_terms.R

Perhaps this link may also be useful:

http://polisci.msu.edu/jacoby/icpsr/regress3/lectures/week3/11.Outliers.pdf

14,21/05/2015: Tutorial: Finding OTUs that are significantly up or down regulated between different conditions

Download the data from http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/ecological.html

I have modified NB.R and KW.R scripts from above to include taxonomy information for the OTUs in the figure panels. The scripts should work with the data generated in the previous tutorials.

Based on DESeq {DESeq2} package that allows negative binomial GLM fitting and Wald statistics for abundance data

abund_table<-read.csv("All_Good_P2_C03.csv",row.names=1,check.names=FALSE)
abund_table<-t(abund_table)

meta_table<-read.csv("ENV_pitlatrine.csv",row.names=1,check.names=FALSE)
abund_table<-abund_table[rownames(abund_table) %in% rownames(meta_table),]
abund_table<-abund_table[,colSums(abund_table)>0]
meta_table<-meta_table[rownames(abund_table),]

grouping_info<-t(as.data.frame(strsplit(rownames(abund_table),"_")))
grouping_info<-as.data.frame(grouping_info)
rownames(grouping_info)<-rownames(abund_table)
colnames(grouping_info)<-c("Country","Latrine","Depth")

#Load taxonomy information
OTU_taxonomy<-read.csv("All_Good_P2_C03_Taxonomy.csv",row.names=1,check.names=FALSE)

library(DESeq2)

#We will convert our table to DESeqDataSet object
countData = round(as(abund_table, "matrix"), digits = 0)
# We will add 1 to the countData otherwise DESeq will fail with the error:
# estimating size factors
# Error in estimateSizeFactorsForMatrix(counts(object), locfunc = locfunc,  :
# every gene contains at least one zero, cannot compute log geometric means
countData<-(t(countData+1))
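To see why the +1 pseudocount is needed, recall that the size-factor step quoted in the error message works with per-gene log geometric means, which are undefined as soon as any count is zero. A Python sketch (function name mine):

```python
import math

def log_geometric_mean(counts):
    """Per-gene log geometric mean, the quantity DESeq's size-factor
    estimation needs; math.log(0) is undefined, which is why the
    tutorial adds 1 to every entry of countData first."""
    return sum(math.log(c) for c in counts) / len(counts)
```

With the pseudocount, a gene with counts (0, 1, 3) becomes (1, 2, 4), whose log geometric mean is ln 2 rather than being undefined.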

dds <- DESeqDataSetFromMatrix(countData, grouping_info, as.formula(~ Country))

#Reference:https://github.com/MadsAlbertsen/ampvis/blob/master/R/amp_test_species.R

#Differential expression analysis based on the Negative Binomial (a.k.a. Gamma-Poisson) distribution
#For some reason this doesn't work: data_deseq_test = DESeq(dds, test="wald", fitType="parametric")
data_deseq_test = DESeq(dds)

## Extract the results
res = results(data_deseq_test, cooksCutoff = FALSE)
res_tax = cbind(as.data.frame(res), as.matrix(countData[rownames(res), ]), OTU = rownames(res))

sig = 0.00001
fold = 0
plot.point.size = 2
label=F
tax.display = NULL
tax.aggregate = "OTU"

res_tax_sig = subset(res_tax, padj < sig & fold < abs(log2FoldChange))

res_tax_sig <- res_tax_sig[order(res_tax_sig$padj),]

## Plot the data
### MA plot
res_tax$Significant <- ifelse(rownames(res_tax) %in% rownames(res_tax_sig) , "Yes", "No")
res_tax$Significant[is.na(res_tax$Significant)] <- "No"
p1 <- ggplot(data = res_tax, aes(x = baseMean, y = log2FoldChange, color = Significant)) +
 geom_point(size = plot.point.size) +
 scale_x_log10() +
 scale_color_manual(values=c("black", "red")) +
 labs(x = "Mean abundance", y = "Log2 fold change")+theme_bw()
if(label == T){
 if (!is.null(tax.display)){
  rlab <- data.frame(res_tax, Display = apply(res_tax[,c(tax.display, tax.aggregate)], 1, paste, collapse="; "))
 } else {
  rlab <- data.frame(res_tax, Display = res_tax[,tax.aggregate])
 }
 p1 <- p1 + geom_text(data = subset(rlab, Significant == "Yes"), aes(label = Display), size = 4, vjust = 1)
}
pdf("NB_MA.pdf")
print(p1)
dev.off()

res_tax_sig_abund = cbind(as.data.frame(countData[rownames(res_tax_sig), ]), OTU = rownames(res_tax_sig), padj = res_tax[rownames(res_tax_sig),"padj"])

#Apply normalisation (either use relative or log-relative transformation)
#data<-abund_table/rowSums(abund_table)
data<-log((abund_table+1)/(rowSums(abund_table)+dim(abund_table)[2]))
data<-as.data.frame(data)
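The normalisation above is a log of additively smoothed proportions: each count gets +1 and each row sum gets one pseudocount per taxon, so the smoothed proportions still sum to 1 within a sample. A Python sketch of one row of the transform (function name mine):

```python
import math

def log_relative(row_counts, n_taxa):
    """Log-relative transform used in the tutorial:
    log((x + 1) / (rowSum + n_taxa)) per sample, i.e. a log of
    additively smoothed proportions."""
    denom = sum(row_counts) + n_taxa
    return [math.log((c + 1) / denom) for c in row_counts]
```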

#Now we plot taxa significantly different between the categories
df<-NULL
for(i in res_tax[rownames(res_tax_sig),"OTU"]){
 tmp<-data.frame(data[,i],grouping_info$Country,rep(paste(paste(i,gsub(".*;","",gsub(";+$","",paste(sapply(OTU_taxonomy[i,],as.character),collapse=";"))))," padj = ",sprintf("%.5g",res_tax[i,"padj"]),sep=""),dim(data)[1]))
 if(is.null(df)){df<-tmp} else { df<-rbind(df,tmp)}
}
colnames(df)<-c("Value","Type","Taxa")

p<-ggplot(df,aes(Type,Value,colour=Type))+ylab("Log-relative normalised")
p<-p+geom_boxplot()+geom_jitter()+theme_bw()+facet_wrap( ~ Taxa , scales="free_x",nrow=1)
p<-p+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+theme(strip.text.x = element_text(size = 16, colour = "black", angle = 90))
pdf("NB_significant.pdf",width=160,height=10)
print(p)
dev.off()

NB_MA.jpg

NB_significant.jpg

*Above figure is just left half of NB_significant.pdf

Based on Kruskal-Wallis Test with FDR

abund_table<-read.csv("All_Good_P2_C03.csv",row.names=1,check.names=FALSE)
abund_table<-t(abund_table)

meta_table<-read.csv("ENV_pitlatrine.csv",row.names=1,check.names=FALSE)
abund_table<-abund_table[rownames(abund_table) %in% rownames(meta_table),]
abund_table<-abund_table[,colSums(abund_table)>0]
meta_table<-meta_table[rownames(abund_table),]

grouping_info<-t(as.data.frame(strsplit(rownames(abund_table),"_")))
grouping_info<-as.data.frame(grouping_info)
rownames(grouping_info)<-rownames(abund_table)
colnames(grouping_info)<-c("Country","Latrine","Depth")

#Load taxonomy information
OTU_taxonomy<-read.csv("All_Good_P2_C03_Taxonomy.csv",row.names=1,check.names=FALSE)

groups<-as.factor(grouping_info$Country)

#Apply normalisation (either use relative or log-relative transformation)
#data<-abund_table/rowSums(abund_table)
data<-log((abund_table+1)/(rowSums(abund_table)+dim(abund_table)[2]))
data<-as.data.frame(data)

#Reference: http://www.bigre.ulb.ac.be/courses/statistics_bioinformatics/practicals/microarrays_berry_2010/berry_feature_selection.html
kruskal.wallis.alpha=0.001
kruskal.wallis.table <- data.frame()
for (i in 1:dim(data)[2]) {
 ks.test <- kruskal.test(data[,i], g=groups)
 # Store the result in the data frame
 kruskal.wallis.table <- rbind(kruskal.wallis.table,
                               data.frame(id=names(data)[i],
                                          p.value=ks.test$p.value))
 # Report number of values tested
 cat(paste("Kruskal-Wallis test for ",names(data)[i]," ", i, "/", dim(data)[2], "; p-value=", ks.test$p.value,"\n", sep=""))
}


kruskal.wallis.table$E.value <- kruskal.wallis.table$p.value * dim(kruskal.wallis.table)[1]

kruskal.wallis.table$FWER <- pbinom(q=0, p=kruskal.wallis.table$p.value, size=dim(kruskal.wallis.table)[1], lower.tail=FALSE)

kruskal.wallis.table <- kruskal.wallis.table[order(kruskal.wallis.table$p.value, decreasing=FALSE), ]
kruskal.wallis.table$q.value.factor <- dim(kruskal.wallis.table)[1] / 1:dim(kruskal.wallis.table)[1]
kruskal.wallis.table$q.value <- kruskal.wallis.table$p.value * kruskal.wallis.table$q.value.factor
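The two corrections computed above are easy to verify by hand: the E-value is just p times the number of tests (the expected number of false positives), and pbinom(q=0, size=n, p=p, lower.tail=FALSE) equals 1-(1-p)^n, the chance of at least one false positive among n independent tests at level p. A Python sketch (function name mine):

```python
def multitest_corrections(p, n):
    """E-value and FWER as computed in the Kruskal-Wallis table:
    E = p * n, and FWER = 1 - (1 - p)**n, which is what
    pbinom(q=0, size=n, p=p, lower.tail=FALSE) returns in R."""
    return p * n, 1 - (1 - p) ** n
```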
pdf("KW_correction.pdf")
plot(kruskal.wallis.table$p.value,
     kruskal.wallis.table$E.value,
     main='Multitesting corrections',
     xlab='Nominal p-value',
     ylab='Multitesting-corrected statistics',
     log='xy',
     col='blue',
     panel.first=grid(col='#BBBBBB',lty='solid'))
lines(kruskal.wallis.table$p.value,
      kruskal.wallis.table$FWER,
      pch=20,col='darkgreen', type='p')
lines(kruskal.wallis.table$p.value,
      kruskal.wallis.table$q.value,
      pch='+',col='darkred', type='p')
abline(h=kruskal.wallis.alpha, col='red', lwd=2)
#Legend labels match the series plotted: blue=E-value, dark green=FWER, dark red=q-value
legend('topleft', legend=c('E-value', 'FWER', 'q-value'), col=c('blue', 'darkgreen','darkred'), lwd=2,bg='white',bty='o')
dev.off()

last.significant.element <- max(which(kruskal.wallis.table$q.value <= kruskal.wallis.alpha))
selected <- 1:last.significant.element
diff.cat.factor <- kruskal.wallis.table$id[selected]
diff.cat <- as.vector(diff.cat.factor)

print(kruskal.wallis.table[selected,])
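Note the selection rule above: it keeps everything up to the LAST q-value below the cutoff in the ascending-p ordering, so a non-significant q-value sandwiched between significant ones is retained as well. A Python sketch of the same logic (function name mine):

```python
def select_significant(qvalues, alpha):
    """Indices retained by the tutorial's cutoff: everything up to and
    including the last position whose q-value is <= alpha, assuming the
    q-values come from rows already sorted by ascending p-value."""
    last = max(i for i, q in enumerate(qvalues) if q <= alpha)
    return list(range(last + 1))
```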

#Now we plot taxa significantly different between the categories
df<-NULL
for(i in diff.cat){
 tmp<-data.frame(data[,i],groups,rep(paste(paste(i,gsub(".*;","",gsub(";+$","",paste(sapply(OTU_taxonomy[i,],as.character),collapse=";")))),"\nq = ",sprintf("%.5g",kruskal.wallis.table[kruskal.wallis.table$id==i,"q.value"]),sep=""),dim(data)[1]))
 if(is.null(df)){df<-tmp} else { df<-rbind(df,tmp)}
}
colnames(df)<-c("Value","Type","Taxa")

p<-ggplot(df,aes(Type,Value,colour=Type))+ylab("Log-relative normalised")
p<-p+geom_boxplot()+geom_jitter()+theme_bw()+facet_wrap( ~ Taxa , scales="free_x",nrow=1)
p<-p+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+theme(strip.text.x = element_text(size = 16, colour = "black", angle = 90))
pdf("KW_significant.pdf",width=160,height=10)
print(p)
dev.off()

KW_correction.jpg

KW_significant.jpg

*Above figure is just left half of the KW_significant.pdf

23/04/2015: Tutorial: Multivariate Statistical Analysis of Microbial Communities in an Environmental Context

We will go through the following tutorial:

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/ecological.html

Please also read through this:

http://cc.oulu.fi/~jarioksa/opetus/metodi/vegantutor.pdf

09,16/04/2015: Tutorial: Illumina Amplicons OTU Construction with Noise Removal

This tutorial is a little bit different from http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/Illumina_workflow.html

as I have added error correction using the SPAdes assembler, which incorporates BayesHammer.

Why? Because this strategy, along with overlapping the paired-end reads, reduces errors for MiSeq. For justification, read through our recently published paper (look at Figure 10):

http://nar.oxfordjournals.org/content/early/2015/01/13/nar.gku1341.full

When you are done generating OTUs, you can use phyloseq to do the analysis (end bit of my tutorial on R code):

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/ecological.html

The tutorial starts here:

Log on to the server and create a folder OTU_TUTORIAL inside your directory, where you will do the majority of the analysis

[MScBioinf@quince-srv2 ~/uzi]$ mkdir OTU_TUTORIAL

[MScBioinf@quince-srv2 ~/uzi]$ cd OTU_TUTORIAL

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$

Your sequencing center will share an online link to your data after demultiplexing your samples and generating the FASTQ files. In most cases, these will be named SampleName_L001_R[1/2]_001.fastq when generated on an Illumina MiSeq or HiSeq. Say we start with an example dataset comprising 24 fecal samples (from 12 healthy children and 12 children with Crohn's disease). Let us copy the required files to the OTU_TUTORIAL folder.

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ cp /home/opt/tutorials/Raw/*.fastq .

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ ls

109-2_S109_L001_R1_001.fastq  13-1_S13_L001_R1_001.fastq

109-2_S109_L001_R2_001.fastq  13-1_S13_L001_R2_001.fastq

110-2_S110_L001_R1_001.fastq  132-2_S131_L001_R1_001.fastq

110-2_S110_L001_R2_001.fastq  132-2_S131_L001_R2_001.fastq

113-2_S113_L001_R1_001.fastq  20-1_S20_L001_R1_001.fastq

113-2_S113_L001_R2_001.fastq  20-1_S20_L001_R2_001.fastq

114-2_S114_L001_R1_001.fastq  27-1_S27_L001_R1_001.fastq

114-2_S114_L001_R2_001.fastq  27-1_S27_L001_R2_001.fastq

115-2_S115_L001_R1_001.fastq  32-1_S32_L001_R1_001.fastq

115-2_S115_L001_R2_001.fastq  32-1_S32_L001_R2_001.fastq

117-2_S117_L001_R1_001.fastq  38-1_S38_L001_R1_001.fastq

117-2_S117_L001_R2_001.fastq  38-1_S38_L001_R2_001.fastq

119-2_S119_L001_R1_001.fastq  45-1_S45_L001_R1_001.fastq

119-2_S119_L001_R2_001.fastq  45-1_S45_L001_R2_001.fastq

1-1_S1_L001_R1_001.fastq          51-1_S51_L001_R1_001.fastq

1-1_S1_L001_R2_001.fastq          51-1_S51_L001_R2_001.fastq

120-2_S120_L001_R1_001.fastq  56-1_S56_L001_R1_001.fastq

120-2_S120_L001_R2_001.fastq  56-1_S56_L001_R2_001.fastq

126-2_S125_L001_R1_001.fastq  62-1_S62_L001_R1_001.fastq

126-2_S125_L001_R2_001.fastq  62-1_S62_L001_R2_001.fastq

128-2_S127_L001_R1_001.fastq  68-1_S68_L001_R1_001.fastq

128-2_S127_L001_R2_001.fastq  68-1_S68_L001_R2_001.fastq

130-2_S129_L001_R1_001.fastq  7-1_S7_L001_R1_001.fastq

130-2_S129_L001_R2_001.fastq  7-1_S7_L001_R2_001.fastq

We want to organise the data in such a manner that we have a folder for each sample and within that folder we have a Raw folder where we will place the relevant forward and reverse paired-end FASTQ files. For this purpose, we will use a small bash one-liner to organize the data:

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ for i in $(awk -F"_" '{print $1}' <(ls *.fastq) | sort | uniq); do mkdir $i; mkdir $i/Raw; mv $i*.fastq $i/Raw/.; done

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ ls

109-2  110-2  114-2  117-2  120-2  128-2  13-1   20-1  32-1  45-1  56-1  68-1

1-1        113-2  115-2  119-2  126-2  130-2  132-2  27-1  38-1  51-1  62-1  7-1

The contents of 109-2 will be as follows:

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ ls 109-2
Raw
[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ ls 109-2/Raw
109-2_S109_L001_R1_001.fastq  109-2_S109_L001_R2_001.fastq
[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$
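The awk -F"_" '{print $1}' step in the one-liner above takes everything before the first underscore of a FASTQ filename as the sample name; the same extraction in Python (function name mine):

```python
def sample_name(fastq_filename):
    """Sample name = text before the first underscore, mirroring the
    awk -F"_" '{print $1}' step that groups FASTQ files per sample."""
    return fastq_filename.split("_")[0]
```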

Next we want to see if the files require quality trimming. This step is optional and most sequencing centres will have already quality trimmed your files. A way to check is by looking at length distribution of sequences, i.e., for each sample, we want to see how many sequences of a given length are obtained. The output is given as [LENGTH],[FREQUENCY]. Notice that I am checking it for forward reads only ${i}/Raw/*_R1_001.fastq:

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ for i in $(ls -d *); do echo $i":";bioawk -cfastx '{i[length($seq)]++}END{for(j in i)print j","i[j]}' ${i}/Raw/*_R1_001.fastq | sort -nrk2 -t",";done

109-2:

251,5382

1-1:

251,21165

110-2:

251,3745

113-2:

251,1907

114-2:

251,7536

115-2:

251,5585

117-2:

251,206085

119-2:

251,13616

120-2:

251,9784

126-2:

251,1689

128-2:

251,27086

130-2:

251,13278

13-1:

251,32388

132-2:

251,13130

20-1:

251,55596

27-1:

251,23450

32-1:

251,4586

38-1:

251,10527

45-1:

251,5007

51-1:

251,35597

56-1:

251,33292

62-1:

251,7358

68-1:

251,17025

7-1:

251,1
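The bioawk one-liner above is just a frequency table of read lengths; the same count in Python (function name mine):

```python
from collections import Counter

def length_distribution(seqs):
    """LENGTH,FREQUENCY table like the bioawk one-liner:
    count how many reads have each length."""
    return Counter(len(s) for s in seqs)
```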

As can be seen, all the sequences are of equal length, so we will use sickle to do quality trimming. We consider a 20 bp sliding window and trim the reads where the average quality score drops below 20, discarding any read whose length falls below 10 bp:

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ for i in $(ls -d *); do cd $i;cd Raw; R1=$(ls *_R1_*.fastq); R2=$(ls *_R2_*.fastq); cd .. ; sickle pe -f Raw/$R1 -r Raw/$R2 -o ${R1%.*}_trim.fastq -p ${R2%.*}_trim.fastq -s ${R1%.*}_singlet.fastq -q 20 -l 10 -t "sanger";cd ..; done

Now check the lengths distribution again:

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ for i in $(ls -d *); do echo $i":";bioawk -cfastx '{i[length($seq)]++}END{for(j in i)print j","i[j]}' $i/*_R1_*trim.fastq | sort -nrk2 -t",";done | head -30

109-2:

251,4623

250,301

249,26

149,19

147,14

137,14

225,11

223,11

214,11

138,11

222,10

201,8

187,8

148,8

218,7

150,7

224,6

212,6

209,6

204,6

203,6

159,6

158,6

146,6

140,6

248,5

226,5

221,5

213,5

We will use the SPAdes assembler, which will create a corrected folder for each sample containing the error-corrected paired-end reads:

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ for i in $(ls */ -d); do cd $i; /home/opt/SPAdes-2.5.0-Linux/bin/spades.py -1 *_R1_*trim.fastq -2 *_R2_*trim.fastq -o . --only-error-correction --careful --disable-gzip-output ; cd ..; done

Check to see if the corrected folder is generated successfully for each sample:

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ for i in $(ls -d *); do echo $i:; ls -1 $i; done

109-2:

109-2_S109_L001_R1_001_singlet.fastq

109-2_S109_L001_R1_001_trim.fastq

109-2_S109_L001_R2_001_trim.fastq

corrected

input_dataset.yaml

params.txt

Raw

spades.log

1-1:

1-1_S1_L001_R1_001_singlet.fastq

1-1_S1_L001_R1_001_trim.fastq

1-1_S1_L001_R2_001_trim.fastq

corrected

input_dataset.yaml

params.txt

Raw

spades.log

110-2:

110-2_S110_L001_R1_001_singlet.fastq

110-2_S110_L001_R1_001_trim.fastq

110-2_S110_L001_R2_001_trim.fastq

corrected

input_dataset.yaml

params.txt

Raw

spades.log

We will next use pandaseq to overlap our paired-end reads using a minimum overlap of 20 (-o 20). There is also a bug in SPAdes error correction: the resulting error-corrected reads lack the identifier that distinguishes the forward read from the reverse read. To work around this, we add a fictitious identifier with awk so that pandaseq does not complain:

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ for i in $(ls */ -d); do cd $i; /home/opt/pandaseq/pandaseq -f <(awk '/^@M01359/{$0=$0" 1:N:0:GGACTCCTGTAAGGAG"}1' corrected/*R1*.cor.fastq) -r <(awk '/^@M01359/{$0=$0" 2:N:0:GGACTCCTGTAAGGAG"}1' corrected/*R2*.cor.fastq) -B -d bfsrk -o 20 > $(basename ${i})".overlap.fasta"; cd ..; done

*Note that for your own samples, you have to change @M01359 to the instrument identifier found in your FASTQ files, otherwise the command will fail.
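A quick way to check which identifier your files use is to peek at the first header line of an R1 file. This is a sketch with a fabricated demo FASTQ standing in for the tutorial's Raw/<sample>_R1_001.fastq:

```shell
# Demo FASTQ (a fabricated stand-in for Raw/<sample>_R1_001.fastq)
mkdir -p Raw
printf '@M01359:97:000000000-A8H5P:1:1101:15675:1350 1:N:0:1\nACGT\n+\nFFFF\n' > Raw/demo_R1_001.fastq

# The instrument identifier is the text before the first ':' of a read header
head -1 Raw/*_R1_*.fastq | cut -d: -f1
```

Here this prints @M01359; whatever your own files print is what goes into the awk pattern above.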

Now that we have obtained the overlapped reads, we will generate the OTUs in another folder in the parent directory so as not to mess with the previous folder structure:

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ ls

109-2  1-1  110-2  113-2  114-2  115-2  117-2  119-2  120-2  126-2  128-2  130-2  13-1  132-2  20-1  27-1  32-1  38-1  45-1  51-1  56-1  62-1  68-1  7-1

[MScBioinf@quince-srv2 ~/uzi/OTU_TUTORIAL]$ cd ..

[MScBioinf@quince-srv2 ~/uzi]$ mkdir UP

[MScBioinf@quince-srv2 ~/uzi]$ cd UP

[MScBioinf@quince-srv2 ~/uzi/UP]$

We need to combine all the overlapped sequences (*.overlap.fasta) into a single multiplexed.fasta. The final labels are in usearch format, >barcodelabel=FolderName;SID, and the sequences in each sample are given internal identifiers S1, S2, and so on:

[MScBioinf@quince-srv2 ~/uzi/UP]$ for i in $(ls -d ../OTU_TUTORIAL/*/); do awk -v k=$(basename ${i}) '/^>/{$0=">barcodelabel="k";S"(++i)}1' < $i/*.overlap.fasta; done > multiplexed.fasta

Check the total number of reads and the format of the multiplexed file:

[MScBioinf@quince-srv2 ~/uzi/UP]$ grep -c ">" multiplexed.fasta

545569

[MScBioinf@quince-srv2 ~/uzi/UP]$ head multiplexed.fasta

>barcodelabel=109-2;S1

TACGTAGGGGGCAAGCGTTATCCGGAATTACTGGGTGTAAAGGGTGCGTAGGTGGTATGGCAAGTCAGAAGTGAAAACCCAGGGCTTAACTCTGGGACTGCTTTTGAAACTGTCAGACTGGAGTGCAGGAGAGGTAAGCGGAATTCCTAGTGTAGCGGTGAAATGCGTAGATATTAGGAGGAACATCAGTGGCGAAGGCGGCTTACTGGACTGAAACTGACACTGAGGCACGAAAGCGTGGGGAGCAAACAGG

>barcodelabel=109-2;S2

TACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCAATGGCGAAGGCAGGTCTCTGGGCCGTTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAG

>barcodelabel=109-2;S3

TACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCAATGGCGAAGGCAGGTCTCTGGGCCGTTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGG

>barcodelabel=109-2;S4

TACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCTGCGCCGGGTACGGGCGGGCTGGAGTGCGGTAGGGGAGGCTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCAATGGCGAAGGCAGGTCTCTGGGCCGTTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGG

>barcodelabel=109-2;S5

TACGTAGGGTGCAAGCGTTATCCGGAATTATTGGGCGTAAAGGGCTCGTAGGCGGTTCGTCGCGTCCGGTGTGAAAGTCCATCGCTTAACGGTGGATCCGCGCCGGGTACGGGCGGGCTTGAGTGCGGTAGGGGAGACTGGAATTCCCGGTGTAACGGTGGAATGTGTAGATATCGGGAAGAACACCAATGGCGAAGGCAGGTCTCTGGGCCGTTACTGACGCTGAGGAGCGAAAGCGTGGGGAGCGAACAGG

Linearize multiplexed.fasta (i.e., sequences spanning multiple lines should all be in one line):

[MScBioinf@quince-srv2 ~/uzi/UP]$ awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {print}' multiplexed.fasta > multiplexed_linearized.fasta

Dereplicate the sequences (the 32-bit version of usearch will fail as there is a 4GB memory cutoff). Let us dereplicate using a work-around and produce the output in usearch format:

[MScBioinf@quince-srv2 ~/uzi/UP]$ grep -v "^>" multiplexed_linearized.fasta | grep -v [^ACGTacgt] | sort -d | uniq -c | while read abundance sequence ; do hash=$(printf "${sequence}" | sha1sum); hash=${hash:0:40};printf ">%s;size=%d;\n%s\n" "${hash}" "${abundance}" "${sequence}"; done > multiplexed_linearized_dereplicated.fasta

Do abundance sort and discard singletons:

[MScBioinf@quince-srv2 ~/uzi/UP]$ ~/bin/usearch7.0.1001_i86linux32 -sortbysize multiplexed_linearized_dereplicated.fasta -output multiplexed_linearized_dereplicated_sorted.fasta -minsize 2

Perform OTU clustering and de novo chimera removal:

[MScBioinf@quince-srv2 ~/uzi/UP]$ ~/bin/usearch7.0.1001_i86linux32 -cluster_otus multiplexed_linearized_dereplicated_sorted.fasta -otus otus1.fa

Perform chimera filtering using a reference database. The cluster_otus command in usearch discards reads that have chimeric models built from more abundant reads. However, a few chimeras may be missed, especially if they have parents that are absent from the reads or present at very low abundance. It is therefore recommended to add a reference-based chimera filtering step using UCHIME if a suitable database is available. Use the uchime_ref command for this step with the OTU representative sequences as input and the -nonchimeras option to get a chimera-filtered set of OTU sequences. For the 16S gene, Robert Edgar recommends the gold database (http://drive5.com/uchime/gold.fa) rather than a large 16S database like Greengenes.

[MScBioinf@quince-srv2 ~/uzi/UP]$ ~/bin/usearch7.0.1001_i86linux32 -uchime_ref otus1.fa -db ~/bin/gold.fa -strand plus -nonchimeras otus2.fa

Label OTU sequences OTU_1, OTU_2,... We will use fasta_number.py from Robert Edgar's python scripts (http://drive5.com/python/python_scripts.tar.gz):

[MScBioinf@quince-srv2 ~/uzi/UP]$ python ~/bin/fasta_number.py otus2.fa OTU_ > otus.fa

Map reads including singletons back to OTUs:

[MScBioinf@quince-srv2 ~/uzi/UP]$ ~/bin/usearch7.0.1001_i86linux32 -usearch_global multiplexed_linearized.fasta -db otus.fa -strand plus -id 0.97 -uc map.uc

In case usearch runs out of memory (it did not in our case):

There is a 3GB limit on the free version of usearch, and if the above command fails, we have to split the original file into smaller chunks. Make two folders and download fasta_splitter.pl (http://kirill-kryukov.com/study/tools/fasta-splitter/ ):

[MScBioinf@quince-srv2 ~/uzi/UP]$ mkdir split_files

[MScBioinf@quince-srv2 ~/uzi/UP]$ mkdir uc_files

In split_files, break the fasta file into 100 equal parts:

[MScBioinf@quince-srv2 ~/uzi/UP]$ cd split_files

[MScBioinf@quince-srv2 ~/uzi/UP/split_files]$ perl ~/bin/fasta_splitter.pl -n-parts-total 100 ../multiplexed_linearized.fasta

[MScBioinf@quince-srv2 ~/uzi/UP/split_files]$ mv ../*.part-*.fasta .

[MScBioinf@quince-srv2 ~/uzi/UP/split_files]$ ls

multiplexed_linearized.part-001.fasta

multiplexed_linearized.part-002.fasta

multiplexed_linearized.part-003.fasta

multiplexed_linearized.part-004.fasta

multiplexed_linearized.part-005.fasta

multiplexed_linearized.part-006.fasta

multiplexed_linearized.part-007.fasta

multiplexed_linearized.part-008.fasta

multiplexed_linearized.part-009.fasta

multiplexed_linearized.part-010.fasta

...

Now run the command in split_files folder:

[MScBioinf@quince-srv2 ~/uzi/UP/split_files]$ for i in $(ls *.fasta); do ~/bin/usearch7.0.1001_i86linux32 -usearch_global $i -db ../otus.fa -strand plus -id 0.97 -uc ../uc_files/$i.map.uc; done

[MScBioinf@quince-srv2 ~/uzi/UP/split_files]$ ls ../uc_files

multiplexed_linearized.part-001.fasta.map.uc

multiplexed_linearized.part-002.fasta.map.uc

multiplexed_linearized.part-003.fasta.map.uc

multiplexed_linearized.part-004.fasta.map.uc

multiplexed_linearized.part-005.fasta.map.uc

multiplexed_linearized.part-006.fasta.map.uc

multiplexed_linearized.part-007.fasta.map.uc

multiplexed_linearized.part-008.fasta.map.uc

multiplexed_linearized.part-009.fasta.map.uc

multiplexed_linearized.part-010.fasta.map.uc

...

And combine all the *.map.uc files together:

[MScBioinf@quince-srv2 ~/uzi/UP/split_files]$ cd ..

[MScBioinf@quince-srv2 ~/uzi/UP]$ cat uc_files/* > map.uc

Generate a tab-delimited OTU table using uc2otutab.py from Robert Edgar's python scripts:

[MScBioinf@quince-srv2 ~/uzi/UP]$ python ~/bin/uc2otutab.py map.uc > otu_table.txt

Convert tab-delimited OTU table to a CSV file:

[MScBioinf@quince-srv2 ~/uzi/UP]$ tr "\\t" "," < otu_table.txt > otu_table.csv

Calculate statistics for OTU Construction:

[MScBioinf@quince-srv2 ~/uzi/UP]$ (echo TOTAL_READS:$(grep -c ">" multiplexed_linearized.fasta);echo TOTAL_READS_DEREPLICATED:$(grep -c ">" multiplexed_linearized_dereplicated.fasta);echo TOTAL_READS_DEREPLICATED_SINGLETONS_REMOVED:$(grep -c ">" multiplexed_linearized_dereplicated_sorted.fasta);echo OTUS_AFTER_DENOVO_CHIM_REM:$(grep -c ">" otus1.fa);echo OTUS_AFTER_DB_CHIM_REM:$(grep -c ">" otus.fa); echo FINAL_OTUS_AFTER_MATCHING:$(($(wc -l < otu_table.csv)-1)) )

TOTAL_READS:545569

TOTAL_READS_DEREPLICATED:81603

TOTAL_READS_DEREPLICATED_SINGLETONS_REMOVED:13046

OTUS_AFTER_DENOVO_CHIM_REM:510

OTUS_AFTER_DB_CHIM_REM:498

FINAL_OTUS_AFTER_MATCHING:498

Calculate statistics for the whole process up to overlapped-read generation:

[MScBioinf@quince-srv2 ~/uzi/UP]$ (echo -e "SAMPLE\tINITIAL_PE_READS\tTRIMMED_PE_READS\tPE_READS_WITH_CHANGED_BASES\tCHANGED_BASES\tFAILED_BASES\tTOTAL_BASES\tFINAL_PE_READS\tOVERLAP_READS";for i in $(ls ../OTU_TUTORIAL/*/ -d); do cd $i; if [ -s spades.log ]; then (echo -n $(basename ${i}); echo -ne '\t'$(($(wc -l < Raw/*_R1_*.fastq)/4));echo -ne '\t'$(($(wc -l < *_R1_*trim.fastq)/4)); echo -ne '\t'$(grep -i "Correction done" spades.log | grep -Po '(?<=in ).*(?= reads)'); echo -ne '\t'$(grep -i "Correction done" spades.log | grep -Po '(?<=Changed ).*(?= bases)'); echo -ne '\t'$(grep -i "Failed to correct" spades.log | grep -Po '(?<=correct ).*(?= bases)'); echo -ne '\t'$(grep -i "Failed to correct" spades.log | grep -Po '(?<=out of ).*(?=\.)'); echo -ne '\t'$(($(wc -l < corrected/*_R1_*.cor.fastq)/4)); echo -e '\t'$(grep -c ">" *.overlap.fasta) ); fi; cd ..; done)

SAMPLE    INITIAL_PE_READS    TRIMMED_PE_READS    PE_READS_WITH_CHANGED_BASES    CHANGED_BASES    FAILED_BASES    TOTAL_BASES    FINAL_PE_READS    OVERLAP_READS

109-2    5382    5368    5950    6553    976091    2633465    5352    5262

1-1    21165    21085    16464    18584    2332952    10373755    21007    20689

110-2    3745    3731    4139    4820    632337    1824064    3718    3643

113-2    1907    1904    2181    2400    342141    928343    1897    1856

114-2    7536    7518    8118    9774    1247564    3691854    7503    7402

115-2    5585    5564    6777    7292    970011    2688154    5548    5460

117-2    206085    205750    94371    103187    10505265    100681940    205399    203092

119-2    13616    13561    15282    19103    2464707    6636196    13530    13341

120-2    9784    9744    10552    11232    1541370    4779674    9713    9589

126-2    1689    1675    1978    2068    310653    812844    1667    1634

128-2    27086    27023    25396    26922    3462404    13273476    26964    26737

130-2    13278    13217    13510    14394    2157529    6507004    13176    13044

13-1    32388    32323    26874    28086    3204975    15785006    32261    31868

132-2    13130    13110    11183    11663    1587045    6479125    13078    12979

20-1    55596    55493    41505    43973    5003510    26993274    55402    54734

27-1    23450    23410    19390    20335    2644967    11498857    23334    22938

32-1    4586    4566    5366    5754    754967    2226446    4544    4465

38-1    10527    10499    13782    14076    2098349    5188696    10467    10346

45-1    5007    4992    5479    6124    773263    2387260    4973    4834

51-1    35597    35463    30262    32146    4194525    17422574    35362    35027

56-1    33292    33199    24805    27963    3351048    16372931    33118    32840

62-1    7358    7331    7736    8995    1181410    3589508    7299    7155

68-1    17025    16986    14167    14632    1836506    8354853    16919    16633

7-1    1    1    0    0    156    502    1    1

Generating the phylogenetic tree and OTU assignments:

There are two strategies to generate the phylogenetic tree, assuming you have generated otus.fa as above:

a) Not using QIIME

[MScBioinf@quince-srv2 ~/uzi/UP]$ mafft-ginsi otus.fa > otus.gfa

[MScBioinf@quince-srv2 ~/uzi/UP]$ FastTree -nt -gtr < otus.gfa > otus.tre

This otus.tre contains the phylogenetic tree in NEWICK format that you can use to calculate Unifrac distances.

b) Using QIIME

STEP 0: Getting the right version of QIIME

[MScBioinf@quince-srv2 ~/uzi/UP]$ bash

[MScBioinf@quince-srv2 ~/uzi/UP]$ export PYENV_ROOT="/home/opt/.pyenv"

[MScBioinf@quince-srv2 ~/uzi/UP]$ export PATH="$PYENV_ROOT/bin:$PATH"

[MScBioinf@quince-srv2 ~/uzi/UP]$ eval "$(pyenv init -)"

STEP 1: Aligning sequences

We'll be using a reference alignment to align our sequences. In QIIME, this reference alignment is core_set_aligned.fasta.imputed, and QIIME already knows where it is. The following is done using PyNAST, though alignment can also be done with MUSCLE and Infernal (http://qiime.org/scripts/align_seqs.html).

[MScBioinf@quince-srv2 ~/uzi/UP]$ align_seqs.py -i otus.fa -o alignment/

or, if you want to use MUSCLE:

[MScBioinf@quince-srv2 ~/uzi/UP]$ align_seqs.py -m muscle -i otus.fa -o alignment/

STEP 2: Filtering alignments

This alignment contains lots of gaps, and it includes hypervariable regions that make it difficult to build an accurate tree, so we'll filter it. Filtering an alignment of 16S rRNA gene sequences can involve a Lane mask; in QIIME, the Lane mask for the GG core set is lanemask_in_1s_and_0s:

[MScBioinf@quince-srv2 ~/uzi/UP]$ filter_alignment.py -i alignment/otus_aligned.fasta -o alignment

If you are using MUSCLE, then use the following (as discussed in https://groups.google.com/forum/#!topic/qiime-forum/iNr5YZWjIPI ):

[MScBioinf@quince-srv2 ~/uzi/UP]$ filter_alignment.py -i alignment/otus_aligned.fasta -o alignment -e 0.1 --suppress_lane_mask_filter

STEP 3: Make tree

The make_phylogeny.py script uses FastTree, an approximately-maximum-likelihood program, with a model of evolution suited to 16S rRNA gene sequences:

[MScBioinf@quince-srv2 ~/uzi/UP]$ make_phylogeny.py -i alignment/otus_aligned_pfiltered.fasta -o otus.tre

This otus.tre contains the phylogenetic tree in NEWICK format that you can then use to calculate Unifrac distances. However, I wouldn't recommend using PyNAST because it filters out many OTUs, so either use a) or use MUSCLE in b).

After method a) or b), we will also need taxonomic profiles (required for phyloseq); for this purpose, we will use the RDP classifier:

[MScBioinf@quince-srv2 ~/uzi/UP]$ java -Xmx1g -jar /home/opt/rdp_classifier_2.7/dist/classifier.jar classify -f filterbyconf -o otus_Assignments.txt otus.fa

Finally, generate a CSV file from the above assignment file that can be imported into R:

[MScBioinf@quince-srv2 ~/uzi/UP]$ <otus_Assignments.txt awk -F"\t" 'BEGIN{print "OTUs,Domain,Phylum,Class,Order,Family,Genus"}{gsub(" ","_",$0);gsub("\"","",$0);print $1","$3","$6","$9","$12","$15","$18}' > otus_Taxonomy.csv

So now you have the following files in the UP directory:

otus.fa

otu_table.csv

otus_Taxonomy.csv

otus.tre

You need these along with a meta_data.csv to perform the analysis in R. Go to

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/ecological.html

and search for phyloseq.R (it should be there at the end)

02/04/2015: Tutorial: Linux Scripting

a) Arrangement of files in the form of a folder:

for i in $(ls | awk -F"_" '{print $1}' | sort | uniq); do mkdir $i; mkdir $i/Raw; mv ${i}_*.fastq $i/Raw/.; done

b) Manipulating FASTA file:

cat test.fasta

>C1_10

ACTGTGTGTC

ATCTCTCTA

>C2_20

ATTCTTATA

>C3_30

ATTTTTCTTATACGT

>C4_40

ATTTTATTTATATATATATATATA

>C5_50

ATTATATATATATATAT

awk -F"_" '/>/{$0=">"$2}1' test.fasta

>10

ACTGTGTGTC

ATCTCTCTA

>20

ATTCTTATA

>30

ATTTTTCTTATACGT

>40

ATTTTATTTATATATATATATATA

>50

ATTATATATATATATAT

Homework: Do exercise 1 from

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/linux.html

26/03/2015: Tutorial: Bayesian Concordance Analysis

This analysis is applicable both to whole-genome metagenomic sequencing and to single-genome isolates.

First, let us talk about whole-genome metagenomic sequencing. CONCOCT bins contigs into clusters, each cluster comprising fragments of a single genome.

From the CONCOCT paper (http://dx.doi.org/10.1038/nmeth.3103), you can see below (first figure: lower subfigure) how we are able to distinguish between different E. coli and Bifidobacterium strains (i.e., we get species-level discrimination). From the top subfigure, you can see that we can obtain something similar to the abundances from a 16S analysis.

Once you have identified the clusters, you can annotate them using PROKKA ( www.vicbioinformatics.com/software.prokka.shtml ).

PROKKA, which bundles several pieces of software, will generate a GenBank file from which you can then use RPSBLAST to identify Clusters of Orthologous Groups (COGs). Each COG consists of a group of proteins found to be orthologous across at least three lineages and likely corresponds to an ancient conserved domain. For more information, check out the NCBI COG website. Since the COG database is significantly smaller than the NCBI non-redundant (NR) database, it provides a fast alternative for describing the functional characteristics of a single microbe or a community of microbes. I have written two scripts that can integrate with PROKKA to give COG information:

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/PROKKA_RPSBLAST.sh

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/PROKKA_CDD.py

We have used 525 genomes ( https://github.com/BinPro/CONCOCT/blob/master/scgs/gen_scg.txt ) to find COGs that occur as a single copy in genomes. Since there are very few COGs that occur exactly once in every genome, we applied a relaxed criterion: present in more than 97.0% of the genomes with an average frequency of less than 1.03. This resulted in 36 COGs (shown in the table below). Twenty-seven of these are shared with the list of 40 COGs selected in a similar way in an earlier study ( Ciccarelli, F.D. et al. Science 311, 1283–1287 (2006) ).
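As a sketch of that selection rule: the counts table below is fabricated (the real selection works from per-genome COG annotations of the 525 genomes), but given a tab-delimited table with one row per COG and one copy-number column per genome, an awk filter implements the "present in >97% of genomes, mean copy number <1.03" criterion directly:

```shell
# Demo counts table (fabricated): one row per COG, one copy-number column per genome
printf 'COG0001\t1\t1\t1\t1\nCOG0002\t1\t0\t1\t1\nCOG0003\t2\t1\t1\t1\n' > scg_counts.tsv

# Keep COGs present in >97% of genomes with an average copy number below 1.03
awk -F"\t" '{
  present = 0; total = 0
  for (i = 2; i <= NF; i++) { total += $i; if ($i > 0) present++ }
  n = NF - 1
  if (present/n > 0.97 && total/n < 1.03) print $1
}' scg_counts.tsv > single_copy_cogs.txt
```

In the demo, only COG0001 survives: COG0002 is missing from one genome and COG0003 has a duplicate copy. The table format itself is an assumption for illustration.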

Knowing that these 36-40 COGs occur only once per genome, their presence can serve as a proxy for the completeness of clusters: if we observe all of them, our recovered genome is essentially complete. For example, for a real sample, we can get a COG heatmap as follows (notice that the top row, Cluster 189, is all green, suggesting that we are recovering almost 100% of that genome):

[Image: nmeth.3103-F2.jpg]

[Images: COGs.jpg, COG_completeness.jpg]

However, cluster completeness is not what we will use the COGs for!

We will download different genomes from NCBI, hoping that some of our clusters are similar to one of them, and then search for the same 36-40 COGs in those reference sequences. Eventually, we will have two types of COG information:

a) COGs identified from your real WGS samples

b) COGs identified from your reference sequences (HMP, NCBI, anything)

We will then take the unique COGs from both the clusters and the reference sequences, and use a multiple-sequence alignment tool such as MAFFT or MUSCLE to generate a gapped alignment for each COG. We will get 36-40 gapped alignments this way, depending on the COGs you have chosen. Now, we have two approaches:

a) We concatenate all the alignments together for these 36 ~ 40 COGs and use FastTree to generate a tree

b) We iteratively build a tree for each COG using MrBayes (http://mrbayes.sourceforge.net/ ) and generate a consensus tree using Bucky: http://www.stat.wisc.edu/~ane/bucky/
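The concatenation step in approach a) can be sketched with awk. The demo alignments below are fabricated (named following the *.fasta.gapped convention used later for the per-gene files); each taxon's aligned sequences are simply joined end to end into one supermatrix before tree building:

```shell
# Demo per-gene alignments (fabricated); real files follow the *.fasta.gapped convention
printf '>Cluster51\nAC-GT\n>Ref1\nACGGT\n' > demoA.fasta.gapped
printf '>Cluster51\nTTTT\n>Ref1\nGGGG\n'  > demoB.fasta.gapped

# Join each taxon's aligned sequences end to end across all per-gene alignments
awk '/^>/ { h = $0; if (!(h in seen)) { seen[h] = 1; order[++n] = h }; next }
     { seq[h] = seq[h] $0 }
     END { for (i = 1; i <= n; i++) print order[i] "\n" seq[order[i]] }' \
    *.fasta.gapped > concatenated.fasta

# FastTree -nt -gtr < concatenated.fasta > concatenated.tre   # approach a) tree
```

This sketch assumes every taxon appears in every alignment (which holds if you keep only genes/COGs covered in all samples, as we do below); taxa missing from some files would need gap-padding, which this does not do.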

Eventually we will have a phylogenetic tree through which we can identify the closeness of your clusters to the reference sequences. For example, last week I was working on a WGS dataset from EGSB bioreactors and obtained the following ultrametricised tree (colored at phylum level) for some of these clusters (based on those COGs) using approach a) from above.

Notice Cluster 51 lying close to the Chlamydiae:

[Image: case1.jpg]

I wrote a lowest-common-ancestor algorithm using the phylogenetic tree (assuming the tree is provided as All.tree):

library(ape) # To load the tree

library(taxize) # To get taxonomic path from NCBI

library(phyloseq) # To plot the tree

library(ggplot2) # Plotting manipulation

library(geiger) # For finding tips of internal nodes

# Load the tree

phylogenetic_tree <- read.tree("All.tree")

# Keep a record of tip labels

old_labels<-phylogenetic_tree$tip.label

# We want to extract taxonomic names and gids from the tip labels

# For example for a given tip label:

# Planctomyces_brasiliensis_DSM_5305_uid60583_gi|325106586|ref|NC_015174.1|

# taxonomic name is "Planctomyces_brasiliensis_DSM_5305" and gid is "325106586"

# We extract both because in some tip labels gid doesn't return a record and

# so we use the taxonomic name instead

gids<-gsub("\\|.*","",gsub(".*_gi\\|","",old_labels))

tax_names<-gsub("_uid.*","",old_labels)

# We will store the taxonomy information in Taxonomy

Taxonomy<-data.frame()

for(i in seq_along(gids)){

  print(paste("Processing",i,"/",length(gids),":",tax_names[i],gids[i]))

 

  # genbank2uid() will fail for some gids, and in such case, we will pass the tax_name to classification()

  returned_data<-cbind(classification(tryCatch(genbank2uid(id=gids[i]),error=function(e) tax_names[i]),db="ncbi",verbose=T))

  # Create a temporary buffer to hold taxonomic information

  tmp<-data.frame(phylum="",class="",order="",family="",genus="",species="")

  # If a record is not found, the length of returned_data is 2

  if(length(returned_data)>2)

  {

   

        # Extract information for different taxonomic level breakdowns

        if("phylum" %in% colnames(returned_data)){

          tmp$phylum<-returned_data[,"phylum"]

        }

        if("class" %in% colnames(returned_data)){

          tmp$class<-returned_data[,"class"]

        }

        if("order" %in% colnames(returned_data)){

          tmp$order<-returned_data[,"order"]

        }

        if("family" %in% colnames(returned_data)){

          tmp$family<-returned_data[,"family"]

        }         

        if("genus" %in% colnames(returned_data)){

          tmp$genus<-returned_data[,"genus"]

        }

        if("species" %in% colnames(returned_data)){

          tmp$species<-returned_data[,"species"]

        } 

  }

 

  # Bind all the data together

  if(is.null(Taxonomy)){Taxonomy<-tmp} else {Taxonomy<-rbind(Taxonomy,tmp)}

}

# Modify the tree tip labels by assigning them gids

phylogenetic_tree$tip.label<-gids

# Same goes for Taxonomy table

rownames(Taxonomy)<-gids

# To color the leaf nodes based on a particular level, use the following

# Supported values are "phylum","class","order","family","genus","species"

level<-"phylum"

# Now we create a pseudo abundance table by extracting all the unique taxa at the level specified by "level"

# Size of the table is gids x unique taxa

unique_taxa<-unique(Taxonomy[,level])

unique_taxa<-unique_taxa[unique_taxa!=""]

color_table<-as.data.frame(matrix(ncol=length(unique_taxa),nrow=length(gids)))

names(color_table)<-unique_taxa

for(i in seq_along(unique_taxa)){

  color_table[,as.character(unique_taxa[i])]<-as.numeric(as.character(Taxonomy[,level])==as.character(unique_taxa[i]))*50

}

rownames(color_table)<-gids

color_table$Cluster<-as.numeric(grepl("Cluster",rownames(color_table)))*100

# Uncomment to save the data

#write.table(color_table,file="new_tree_color.csv",quote=F,sep=",",col.names=NA)

#write.table(Taxonomy,file="new_tree_taxonomy.csv",quote=F,sep=",",col.names=NA)

#write.tree(phylogenetic_tree,file="new_tree.tre")

#Convert the data to phyloseq format

OTU = otu_table(as.matrix(color_table), taxa_are_rows = TRUE)

SAM<-data.frame(colnames(color_table))

colnames(SAM)<-c("Name")

rownames(SAM)<-colnames(color_table)

SAM[,"Name"]<-as.character(SAM[,"Name"])

SAM$Type<-factor(as.numeric(grepl("Cluster",rownames(SAM)))+1,labels=c("Reference","Cluster"))

TAX = tax_table(as.matrix(Taxonomy))

Taxonomy[grep("Cluster",rownames(TAX)),"species"]<-rownames(Taxonomy)[grep("Cluster",rownames(TAX))]

TAX = tax_table(as.matrix(Taxonomy))

# Merge all the data

# We will also ultrametricise the tree with Grafen's method (Grafen, 1989) via compute.brlen(tree,method="Grafen")

# Reference: http://www.r-phylo.org/wiki/HowTo/DataTreeManipulation

physeq<-merge_phyloseq(phyloseq(OTU, TAX),sample_data(SAM),compute.brlen(phylogenetic_tree, method="Grafen"))

# Now plot the tree

pdf("Tree.pdf",height=42,width=20)

p<-plot_tree(physeq,color="Name",shape="Type",ladderize = "left",base.spacing=0.003,label.tips="species",nodelabf = nodeplotblank,

              plot.margin = 0.9,text.size=1.7,sizebase=5)

p<-p+scale_shape_manual(values=c(4,16))

p<-p+guides(colour=guide_legend(ncol=3))

print(p)

dev.off()

#LOWEST COMMON ANCESTOR

# Generate the distances between tips

t<-cophenetic(phy_tree(physeq))

Clusters_Taxonomy<-data.frame()

verbose<-T

for (i in phy_tree(physeq)$tip.label[grep("Cluster",phy_tree(physeq)$tip.label)]){

  s<-t[i,!names(t[i,]) %in% c(i)]

  #Get the neighbour node with minimum distance

  j<-names(s[s==min(s)])[1]

  if(verbose){print(paste(i,": -min-",j))}

  #If the node turns out to be another cluster node, recursively find the next non-cluster node

  while(grepl("Cluster",j)){s<-s[!names(s) %in% j]; j<-names(s[s==min(s)])[1];if(verbose){print(paste(i,": -newmin-",j))}} 

 

  if(verbose){

  print(paste("Leaf nodes of mrca of",i,"and",j,"are:"))

  print(tips(phy_tree(physeq),mrca(phy_tree(physeq))[i,j]))

  }

 

  #Get the leaf nodes of most recent common ancestor

  leaf_nodes_ancestor<-tips(phy_tree(physeq),mrca(phy_tree(physeq))[i,j])

 

  #Extract the non-cluster nodes

  assignments<-Taxonomy[leaf_nodes_ancestor[!grepl("Cluster",leaf_nodes_ancestor)],,drop=F]

 

  #Convert from factors to character (required downstream for table() to work properly)

  assignments[]<-lapply(assignments,as.character)

 

  #Create a dummy placeholder for assignments

  tmp2<-data.frame(phylum="",class="",order="",family="",genus="",species="")

  k<-1

   

  #Do an LCA

  while(k<7){cons<-table(assignments[assignments[,k]!="",k]);if(length(cons)==1){tmp2[,k]<-names(cons)}else{break};k<-k+1}

  rownames(tmp2)<-i

  #For the cases where we do not get any assignments for Clusters, get the Phylum level assignments with confidence

  if(as.character(tmp2$phylum)==""){

        ass_phylum<-sort(table(assignments$phylum),decreasing=T)[1]

        tmp2$phylum=paste(names(ass_phylum),sprintf("%.3g",ass_phylum/dim(assignments)[1]))

  }

 

  if(verbose){

        print("Assignments:")

        print(tmp2)

  }

  # Bind all the data together

  if(is.null(Clusters_Taxonomy)){Clusters_Taxonomy<-tmp2} else {Clusters_Taxonomy<-rbind(Clusters_Taxonomy,tmp2)}

}

and got the assignments for the above clusters as:

Cluster51     Chlamydiae    Chlamydiia    Chlamydiales

Cluster139    Chlamydiae    Chlamydiia    Chlamydiales

Cluster46     Chlamydiae    Chlamydiia    Chlamydiales

Cluster181    Chlamydiae    Chlamydiia    Chlamydiales

Cluster35     Chlamydiae    Chlamydiia    Chlamydiales

Cluster214    Chlamydiae    Chlamydiia    Chlamydiales

Cluster37     Chlamydiae    Chlamydiia    Chlamydiales

Here, we lacked enough reference sequences in the neighborhood of the above clusters, but if we look at another part of the tree where we have enough reference sequences in the neighborhood, say for Cluster 133:

[Image: case2.jpg]

we get species-level assignment as:

Cluster133    Actinobacteria    Actinobacteria    Actinomycetales    Propionibacteriaceae    Propionibacterium    Propionibacterium acnes

So notice that without using any metagenomic taxonomic profiling software such as MEGAN or TAXAassign, we have classified our unknown clusters based only on the LCA within the COG-based tree. Obviously, this approach is limited by how many reference sequences you can include and whether you use approach a) or approach b) to generate the tree.

Now I’ll discuss approach b), i.e., iteratively building a tree for each COG/gene using MrBayes (http://mrbayes.sourceforge.net/ ) and generating a consensus tree using BUCKy: http://www.stat.wisc.edu/~ane/bucky/

For WGS, we will have separate aligned files for COGs with the following naming convention:

COG1006.fasta.gapped

COG0532.fasta.gapped

and in each file we will have extracted COG sequences with names as >[CLUSTER_NAME] or >[REFERENCE_NAME]

Instead, we will use single-genome isolates (SIRN project; C. difficile samples), for which we assembled different ribotypes of C. difficile using SPAdes (http://bioinf.spbau.ru/spades ) and then used PROKKA on the assembled contigs to identify CDS regions and extract the genes common between them (see the previous tutorial). So we start with a folder containing alignment files for different genes, again with the same naming convention (so that you can use the same strategy for WGS analysis):

gyrA.fasta.gapped

tetM.fasta.gapped

Essentially, we would like to follow the right-hand part of the following workflow (the left-hand part is what we used in the WGS analysis):

Building on from the previous tutorial, convert your alignment files to NEXUS format (required for MrBayes) for those genes that are covered in all 48 samples:

for i in $((for i in $(ls *.fasta.gapped); do echo -e "$i,$(grep -c ">" $i)"; done) | sort -t"," -k2rn | awk -F"," '$2>=48{print $1}'); do /home/opt/trimal/source/readal -in $i -out $i.nexus -nexus ; done

Here, I have used readal that is useful for converting alignments in different formats. For details, visit: http://trimal.cgenomics.org/use_of_the_readal_v1.2

Now create a folder nexus, move the nexus files there, and do all the processing in that folder:

mkdir nexus

mv *.nexus nexus/.

cd nexus

Now run MrBayes for at most 20 genes at a time! Every time you run this command, it will choose 20 new unprocessed genes:

c_cnt=1;for i in $(ls *.nexus); do if [ ! -f $i.run1.t ] && [ $c_cnt -lt 21 ]; then echo -e "#nexus\nbegin mrbayes;\nset autoclose=yes nowarn=yes;\nexecute $i;\n\nlset nst=2 rates=gamma;\nprset brlenspr=Unconstrained:Exp(50.0);\nmcmc nruns=2 temp=0.2 ngen=30000 burninfrac=0.0909 Nchains=4 samplefreq=10 swapfreq=10 printfreq=10000 mcmcdiagn=yes diagnfreq=10000 filename=$i;\nquit;\nend;" > $i.cfg; /home/opt/mrbayes_3.2.2/src/mb $i.cfg; let c_cnt=c_cnt+1; fi; done

In the above code, we are generating a configuration file for MrBayes; one such file (for the agaC_1 gene) has the following contents:

cat agaC_1.fasta.gapped.nexus.cfg

#nexus

begin mrbayes;

set autoclose=yes nowarn=yes;

execute agaC_1.fasta.gapped.nexus;

lset nst=2 rates=gamma;

prset brlenspr=Unconstrained:Exp(50.0);

mcmc nruns=2 temp=0.2 ngen=30000 burninfrac=0.0909 Nchains=4 samplefreq=10 swapfreq=10 printfreq=10000 mcmcdiagn=yes diagnfreq=10000 filename=agaC_1.fasta.gapped.nexus;

quit;

end;

Notes (MrBayes Manual: http://mrbayes.sourceforge.net/mb3.2_manual.pdf ):

There are four steps to a typical Bayesian phylogenetic analysis using MrBayes:

  1. Read the Nexus data file
  2. Set the evolutionary model
  3. Run the analysis
  4. Summarize the samples

MrBayes uses MCMC sampling. To get this clear in your head, check the Markov Chains visualisation on the page: http://setosa.io/ev/markov-chains/ and the conditional probabilities: http://setosa.io/ev/conditional-probability/ 
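To make the MCMC idea concrete, here is a toy random-walk Metropolis sampler targeting a standard normal distribution. This is only an illustration of the accept/reject mechanics; MrBayes's actual moves (listed below) operate on trees, branch lengths and model parameters, not on a single real number:

```python
import math, random

def metropolis(log_target, n_steps=10000, step=1.0, seed=0):
    """Toy random-walk Metropolis sampler (illustration only)."""
    rng = random.Random(seed)
    x, samples = 0.0, []
    for _ in range(n_steps):
        y = x + rng.uniform(-step, step)       # symmetric proposal
        delta = log_target(y) - log_target(x)  # log acceptance ratio
        if rng.random() < math.exp(min(0.0, delta)):
            x = y                              # accept the move
        samples.append(x)                      # else keep current state
    return samples

# Target: standard normal (log density up to a constant)
draws = metropolis(lambda z: -0.5 * z * z)
```

After burn-in, the empirical mean and variance of `draws` approximate those of the target (0 and 1), which is exactly how the tree samples in the *.t files approximate the posterior over trees.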

In the above file,

         The MCMC sampler will use the following moves:

             With prob.  Chain will use move

                2.33 %   Multiplier(Alpha)

               11.63 %   ExtSPR(Tau,V)

               11.63 %   ExtTBR(Tau,V)

               11.63 %   NNI(Tau,V)

               11.63 %   ParsSPR(Tau,V)

               34.88 %   Multiplier(V)

               11.63 %   Nodeslider(V)

                4.65 %   TLMultiplier(V)

          Division 1 has 139 unique site patterns

          Initializing conditional likelihoods

          Using standard non-SSE likelihood calculator for division 1 (single-precision)

          Initial log likelihoods and log prior probs for run 1:

             Chain 1 -- -15400.765793 -- -282.836656

             Chain 2 -- -14625.120509 -- -282.836656

             Chain 3 -- -13272.529582 -- -282.836656

             Chain 4 -- -13093.756321 -- -282.836656

          Initial log likelihoods and log prior probs for run 2:

             Chain 1 -- -16334.424002 -- -282.836656

             Chain 2 -- -16780.137467 -- -282.836656

             Chain 3 -- -19025.246161 -- -282.836656

             Chain 4 -- -17644.836156 -- -282.836656

          Using a relative burnin of 9.1 % for diagnostics

          Chain results (30000 generations requested):

              0 -- [-15400.766] (-14625.121) (-13272.530) (-13093.756) * [-16334.424] (-16780.137) (-19025.246) (-17644.836)

          10000 -- (-1635.719) [-1536.657] (-1572.720) (-1551.809) * [-1553.579] (-1550.294) (-1609.916) (-1567.689) -- 0:11:26

          Average standard deviation of split frequencies: 0.062250

          20000 -- (-1567.059) [-1550.944] (-1576.853) (-1597.113) * (-1571.132) [-1551.685] (-1550.959) (-1601.969) -- 0:05:43

          Average standard deviation of split frequencies: 0.076375

Running MrBayes through the above loop will produce additional sets of files in the current folder; for example, for the acpP_1 gene, we will have the following files:

acpP_1.fasta.gapped.nexus.cfg

acpP_1.fasta.gapped.nexus.mcmc

acpP_1.fasta.gapped.nexus.run2.p

acpP_1.fasta.gapped.nexus.run1.t

acpP_1.fasta.gapped.nexus.run2.t

acpP_1.fasta.gapped.nexus.run1.p

We will then create a folder called bucky inside the nexus folder and move the tree files *.fasta.gapped.nexus.run?.t there for the genes we want to build the consensus tree with:

mkdir bucky

mv *.fasta.gapped.nexus.run?.t bucky/.

cd bucky

So why have we used MrBayes? Because BUCKy takes the posterior samples of gene trees produced by MrBayes as its input.

For example, I have considered the following 9 genes (trees from both runs):

ls -1

abgB_1.fasta.gapped.nexus.run1.t

abgB_1.fasta.gapped.nexus.run2.t

abgB_2.fasta.gapped.nexus.run1.t

abgB_2.fasta.gapped.nexus.run2.t

abgB_3.fasta.gapped.nexus.run1.t

abgB_3.fasta.gapped.nexus.run2.t

abgB_4.fasta.gapped.nexus.run1.t

abgB_4.fasta.gapped.nexus.run2.t

abgB_5.fasta.gapped.nexus.run1.t

abgB_5.fasta.gapped.nexus.run2.t

abgB_6.fasta.gapped.nexus.run1.t

abgB_6.fasta.gapped.nexus.run2.t

abgB_7.fasta.gapped.nexus.run1.t

abgB_7.fasta.gapped.nexus.run2.t

abgT_1.fasta.gapped.nexus.run1.t

abgT_1.fasta.gapped.nexus.run2.t

abgT_2.fasta.gapped.nexus.run1.t

abgT_2.fasta.gapped.nexus.run2.t

Now we are going to build the logic for a one-liner that extracts the unique filenames (up to the "nexus" portion):

for i in $(for i in $(ls *run?.t); do echo ${i:0:${#i} - 7}; done | uniq); do echo $i; done

abgB_1.fasta.gapped.nexus

abgB_2.fasta.gapped.nexus

abgB_3.fasta.gapped.nexus

abgB_4.fasta.gapped.nexus

abgB_5.fasta.gapped.nexus

abgB_6.fasta.gapped.nexus

abgB_7.fasta.gapped.nexus

abgT_1.fasta.gapped.nexus

abgT_2.fasta.gapped.nexus
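The bash substring expansion ${i:0:${#i} - 7} simply drops the 7-character .runN.t suffix; the same grouping in Python, for clarity:

```python
import re

def unique_prefixes(filenames):
    # Strip the trailing ".runN.t" to recover one name per gene
    return sorted({re.sub(r"\.run\d\.t$", "", f) for f in filenames})

unique_prefixes(["abgB_1.fasta.gapped.nexus.run1.t",
                 "abgB_1.fasta.gapped.nexus.run2.t"])
# -> ['abgB_1.fasta.gapped.nexus']
```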

Using the above logic, we are going to collate the information from both runs using BUCKy's mbsum to generate *.in files:

for i in $(for i in $(ls *run?.t); do echo ${i:0:${#i} - 7}; done | uniq); do /home/opt/buckyTutorial/bucky-1.4.3/src/mbsum -n 1001 -o $i.in $i.run?.t; done

ls -1 *.in

abgB_1.fasta.gapped.nexus.in

abgB_2.fasta.gapped.nexus.in

abgB_3.fasta.gapped.nexus.in

abgB_4.fasta.gapped.nexus.in

abgB_5.fasta.gapped.nexus.in

abgB_6.fasta.gapped.nexus.in

abgB_7.fasta.gapped.nexus.in

abgT_1.fasta.gapped.nexus.in

abgT_2.fasta.gapped.nexus.in

For each gene, we can calculate the posterior probability of the best-supported tree:

for i in $(ls *.in); do echo -e $i: $(awk '/^\(/{if(sum==0){best=$2};sum+=$2}END{print "Best supported tree with posterior probability:"best/sum}' $i); done

abgB_1.fasta.gapped.nexus.in: Best supported tree with posterior probability:0.001

abgB_2.fasta.gapped.nexus.in: Best supported tree with posterior probability:0.001

abgB_3.fasta.gapped.nexus.in: Best supported tree with posterior probability:0.00125

abgB_4.fasta.gapped.nexus.in: Best supported tree with posterior probability:0.001

abgB_5.fasta.gapped.nexus.in: Best supported tree with posterior probability:0.001

abgB_6.fasta.gapped.nexus.in: Best supported tree with posterior probability:0.001

abgB_7.fasta.gapped.nexus.in: Best supported tree with posterior probability:0.00125

abgT_1.fasta.gapped.nexus.in: Best supported tree with posterior probability:0.00075

abgT_2.fasta.gapped.nexus.in: Best supported tree with posterior probability:0.00075
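The awk one-liner above assumes that each tree line in an mbsum .in file starts with "(" and carries the tree's sampled count as its second whitespace-separated field, with trees listed in decreasing order of count. Under those same assumptions, the calculation in Python:

```python
def best_tree_posterior(lines):
    # Mirror of the awk one-liner: among lines that start with "(",
    # the second field is the tree's count; the first such line is
    # the best-supported tree (trees listed by decreasing count).
    counts = [float(l.split()[1]) for l in lines if l.startswith("(")]
    return counts[0] / sum(counts) if counts else None
```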

Now run BUCKy:

/home/opt/buckyTutorial/bucky-1.4.3/src/bucky -n 1000 -o results *.in

ls -1 results*

results.cluster

results.concordance

results.gene

results.input

results.out

Descriptions of these files are as follows:

.input → lists input files (loci) with their assigned ID

.out → similar to the screen output. Lists parameter values, and reports the average SD of the mean sample-wide CF (to assess convergence) and acceptance probabilities when swapping between cold/heated chains.

.cluster → posterior distribution on the number of clusters, i.e. the number of groups of genes that share the same tree.

.gene → For each gene, lists its support for each tree from the individual analysis (input), and from the combined analysis.

.concordance → Estimated population & concordance trees and more.

Note: You can also run BUCKy for a longer run: -n 100000 generations, -c 2 chains (cold and heated). The default Dirichlet parameter is α = 1.

What is this α parameter?
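α is the concentration parameter of the Dirichlet process prior that groups genes into clusters sharing the same tree: a small α favours all genes sharing one tree, a large α favours independent gene trees. Under this prior, two genes share the same tree a priori with probability 1/(1+α), which is easy to tabulate:

```python
def prior_prob_shared_tree(alpha):
    # Under a Dirichlet process prior with concentration alpha, two
    # genes a priori share the same tree with probability 1/(1+alpha)
    return 1.0 / (1.0 + alpha)

for a in (0.1, 1.0, 10.0):
    print(a, prior_prob_shared_tree(a))
```

So the default α = 1 corresponds to a 50% prior probability that any two genes share a tree.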

So what is the goal of BUCKy?

Goal is to infer the primary concordance tree, along with

Concordance factors (CFs) → measures of genomic support, the % of the genome having a clade

        Sample-wide CF: %genes/COGs in the sample

        Genome-wide CF: %genes/COGs in the genome

A concordance tree is built from clades with largest CFs (greedy approach)

Credibility intervals →  measure of statistical support

If you check results.concordance it will have the following values:

Four-way partitions in the Population Tree: sample-wide CF, coalescent units and Ties(if present)

{1,3,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48; 2|4; 18}    0.402, 0.109,  

{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,30,31,32,33,34,35,36,37,38,40,41,42,43,45,46,48; 44,47|29; 39}    0.729, 0.902,  

{1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,30,31,32,33,34,35,36,37,38,40,41,42,43,45,46,48; 29,39|44; 47}    0.676, 0.721,  

{1,3,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23,24,25,26,27,28,30,31,32,33,34,35,36,37,38,40,41,42,43,45,46,48; 29,39,44,47|2; 4,18}    0.462, 0.214,  

{1,3,5,6,7,8,9,10,11,12,13,14,15,16,17,19,20,21,22,23,24,25,26,27,28,30,31,32,33,34,35,36,37,38,40,41,42,43,45,46,48; 2,4,18|29,39; 44,47}    0.374, 0.062,  

{1,2,3,4,5,6,7,8,9,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,41,42,43,44,46,47,48; 14|40; 45}    0.410, 0.122,  

{1,3,5,6,7,8,9,10,11,12,13,15,16,17,19,20,21,22,23,24,25,26,27,28,30,31,32,33,34,35,36,37,38,41,42,43,46,48; 14,40,45|2,4,18; 29,39,44,47}    0.366, 0.050,  

{1,3,5,6,7,8,9,10,11,12,13,15,16,17,19,20,21,22,23,24,25,26,27,28,30,31,32,33,34,35,36,37,38,41,42,43,46,48; 2,4,18,29,39,44,47|14; 40,45}    0.382, 0.076,  

{1; 3,5,6,7,8,9,10,11,12,13,15,16,17,19,20,21,22,23,24,25,26,27,28,30,31,32,33,34,35,36,37,38,41,42,43,46,48|2,4,18,29,39,44,47; 14,40,45}    0.381, 0.074,  

{1,2,3,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,39,40,41,42,43,44,45,46,47; 38|5; 48}    0.414, 0.128,  

{1,2,3,4,5,6,7,9,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,42,43,44,45,46,47,48; 16|8; 41}    0.527, 0.344,  

{1,2,3,4,6,7,9,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,39,40,42,43,44,45,46,47; 8,16,41|5,48; 38}    0.423, 0.144,  

{1,2,3,4,6,7,9,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,39,40,42,43,44,45,46,47; 5,38,48|8,41; 16}    0.380, 0.073,  

{1,2,3,4,6,7,10,11,12,13,14,15,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,39,40,42,43,44,45,46,47; 9|5,38,48; 8,16,41}    0.454, 0.199,

Note: BUCKy can only handle data sets where all taxa are sampled across all genes. Version 1.4 will accept data sets where this is not the case, but will restrict the analysis to the set of taxa that are found in *all* genes. A taxon that is missing from a single gene will be excluded from the analysis. https://groups.google.com/forum/#!topic/bucky-users/TguVKyLbFsk

 

BUCKy fails for data sets with more than 50 taxa. https://groups.google.com/forum/#!topic/bucky-users/WsQarpcVzpo

I have uploaded the results to the BUCKy online tools; you can browse the plots by visiting the following link (note that they are available for 30 days from 22/03/2015):

http://ane-www.cs.wisc.edu/buckytools/buckytools.php?userfileid=4590747

However, we are going to use a bit of bioinformatics to draw the results ourselves.

Extracting tip labels:

<results.concordance awk '/translate/,/^$/' | awk '!/translate/ && !/^$/{gsub(",$","",$2);gsub(";$","",$2);print $1","$2}' > tip_labels.csv

Extracting concordance tree topology:

The primary concordance tree features relationships inferred to be true for a large proportion of genes. It is built as a greedy consensus: clades are ranked by their estimated CFs and included in the concordance tree one by one as long as they do not contradict a clade with a higher CF already in the tree.
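The greedy construction just described can be sketched with clades represented as taxon sets. Note that the compatibility test here is the rooted simplification (two clades are compatible if they are nested or disjoint), an approximation of the unrooted-split test BUCKy actually applies:

```python
def greedy_consensus(clades_with_cf):
    # Greedy consensus: rank clades by CF and accept each one only if
    # it is compatible with every clade already accepted.
    def compatible(a, b):
        # Rooted simplification: nested or disjoint clades are compatible
        return a <= b or b <= a or not (a & b)
    accepted = []
    for clade, cf in sorted(clades_with_cf, key=lambda t: -t[1]):
        if all(compatible(clade, c) for c, _ in accepted):
            accepted.append((clade, cf))
    return accepted
```

For example, with clades {A,B} (CF 0.9), {A,B,C} (CF 0.7) and {A,C} (CF 0.5), the first two are accepted and {A,C} is rejected because it conflicts with the higher-CF clade {A,B}.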

<results.concordance awk '/^Primary Concordance Tree Topology:/,/^$/{print}' | awk 'NR==2' > primary_concordance_tree.nwk

Extracting population tree:

The population tree usually is the same as the Primary Concordance Tree. This tree is expected to converge on the primary concordance tree when the cause of the discordance is incomplete lineage sorting. From a different point of view, the population tree will be different from the primary concordance tree in the 'too-greedy' zone as described by Degnan et al. (2009) or if discordance is due to processes other than ILS (e.g. hybridization, long branch attraction that causes tree errors, etc.). To determine the population tree, BUCKy implements a consensus method similar to the R*-consensus, based on unrooted quartets and which consistently identifies the species tree.

<results.concordance awk '/^Population Tree:/,/^$/{print}' | awk 'NR==2' > population_tree.nwk

Extracting population tree with branch lengths in estimated coalescent units:

<results.concordance awk '/^Population Tree, With Branch Lengths In Estimated Coalescent Units:/,/^$/{print}' | awk 'NR==2' > population_tree_branch_lengths.nwk

Extracting primary concordance tree with sample concordance factors:

<results.concordance awk '/^Primary Concordance Tree with Sample Concordance Factors:/,/^$/{print}' | awk 'NR==2' > primary_concordance_tree_concordance_factors.nwk
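All four extractions above follow the same pattern: find the block starting at a header line and ending at the first blank line, then keep its second line (the Newick string). Under that assumption about the file layout, a single Python helper could replace them:

```python
def extract_section_line(path, header, line_no=2):
    # Python analogue of `awk '/^Header/,/^$/' | awk 'NR==2'`: return
    # the line_no-th line of the block starting at `header` and ending
    # at the first blank line (i.e. the Newick string itself).
    block, inside = [], False
    with open(path) as fh:
        for line in fh:
            if line.startswith(header):
                inside = True
            if inside:
                if line.strip() == "" and block:
                    break
                block.append(line.rstrip("\n"))
    return block[line_no - 1] if len(block) >= line_no else None
```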

Having generated the data, we are going to plot these trees in R with the ape package by running the following code:

library(ape)

#Load all the trees from Bucky analysis

PT <- read.tree("population_tree.nwk")

PTBL <- read.tree("population_tree_branch_lengths.nwk")

PCT <- read.tree("primary_concordance_tree.nwk")

PCTCF <- read.tree("primary_concordance_tree_concordance_factors.nwk")

tip_labels<-read.csv("tip_labels.csv",header=F,row.names=1,check.names=FALSE)

#Change the labels of the trees

PT$tip.label<-as.character(tip_labels[PT$tip.label,])
PTBL$tip.label<-as.character(tip_labels[PTBL$tip.label,])
PCT$tip.label<-as.character(tip_labels[PCT$tip.label,])
PCTCF$tip.label<-as.character(tip_labels[PCTCF$tip.label,])

#Load the meta_table

meta_table<-read.csv("../metadata_99.txt",row.names=1,check.names=FALSE,sep="\t")

#Extract meta data for the tip labels

meta_table<-meta_table[PT$tip.label,]

colours <- c("#F0A3FF", "#0075DC", "#993F00","#4C005C","#2BCE48","#FFCC99","#808080","#94FFB5","#8F7C00","#9DCC00","#C20088","#003380","#FFA405","#FFA8BB","#426600","#FF0010","#5EF1F2","#00998F","#740AFF","#990000","#FFFF00")

tip_colours<-colours[as.numeric(as.factor(as.character(meta_table[,"Ribo-Category"])))]

legends<-data.frame(Type=as.character(meta_table[,"Ribo-Category"]),Colours=tip_colours)

legends$Type<-as.character(legends$Type)

legends$Colours<-as.character(legends$Colours)

legends<-t(as.data.frame(strsplit(unique(paste(legends$Type,legends$Colours,sep=":")),":")))

rownames(legends)<-NULL

colnames(legends)<-c("Type","Colour")

pdf("population_tree.pdf",height=10)

plot(PT,tip.color=tip_colours,show.tip.label=T,main="Population Tree")

nodelabels(cex=0.5,bg="white",frame="circle")

legend("bottomleft",lty=1,legend=legends[,"Type"],col=legends[,"Colour"],bty='o', cex=0.5,lwd=8)

dev.off()

pdf("population_tree_branch_lengths.pdf",height=10)

plot(PTBL,tip.color=tip_colours,show.tip.label=T,main="Population Tree, With Branch Lengths In Estimated Coalescent Units")

nodelabels(cex=0.5,bg="white",frame="circle")

legend("topright",lty=1,legend=legends[,"Type"],col=legends[,"Colour"],bty='o', cex=0.5,lwd=8)

dev.off()

pdf("primary_concordance_tree.pdf",height=10)

plot(PCT,tip.color=tip_colours,show.tip.label=T,main="Primary Concordance Tree Topology")

nodelabels(cex=0.5,bg="white",frame="circle")

legend("bottomleft",lty=1,legend=legends[,"Type"],col=legends[,"Colour"],bty='o', cex=0.5,lwd=8)

dev.off()

pdf("primary_concordance_tree_concordance_factors.pdf",height=10)

plot(PCTCF,tip.color=tip_colours,show.tip.label=T,main="Primary Concordance Tree with Sample Concordance Factors")

nodelabels(cex=0.5,bg="white",frame="circle")

legend("bottomleft",lty=1,legend=legends[,"Type"],col=legends[,"Colour"],bty='o', cex=0.5,lwd=8)

dev.off()

The following plots are produced:

primary_concordance_tree.jpg

population_tree.jpg

population_tree_branch_lengths.jpg

primary_concordance_tree_concordance_factors.jpg

Practice:

Extracting Uniprot IDs from genbank files:

grep -Po "(?<=UniProtKB:)\w*" /shared2/cosmika/denovo/prokka_fixed/TAY103_S55_spades/annotation/PROKKA_03252015/PROKKA_03252015.gbf | sort | uniq | head

A0QTF8

A0QTG1

A0QU63

A0QWV9

A0QYU6

A0R2D5

A0R3I8

A0R4Z6

A2REG0

A2RI45

Please refer to http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/subsetFASTAFASTAQ.html

Get the sequences from UniProt based on the supplied IDs:

grep -Po "(?<=UniProtKB:)\w*" /shared2/cosmika/denovo/prokka_fixed/TAY103_S55_spades/annotation/PROKKA_03252015/PROKKA_03252015.gbf | sort | uniq | while read l; do echo -e ">"$l"\n"$(curl -s http://www.uniprot.org/uniprot/$l.txt | awk '/SQ/,/\/\//{if ($0!~/^\/\// && $0!~/^SQ/) {gsub(" ","",$0); printf $0}}'); done | head

>A0QTF8

MDLINGMGTSPGYWRTPREPGNDHRRARLDVMAQRIVITGAGGMVGRVLADQAAAKGHTVLALTSSQCDITDEDAVRRFVANGDVVINCAAYTQVDKAEDEPERAHAVNAVGPGNLAKACAAVDAGLIHISTDYVFGAVDRDTPYEVDDETGPVNIYGRTKLAGEQAVLAAKPDAYVVRTAWVYRGGDGSDFVATMRRLAAGDGAIDVVADQVGSPTYTGDLVGALLQIVDGGVEPGILHAANAGVASRFDQARATFEAVGADPERVRPCGSDRHPRPAPRPSYTVLSSQRSAQAGLTPLRDWREALQDAVAAVVGATTDGPLPSTP

>A0QTG1

MSAAANAEHGAADRVEILPVPGLPEFRPGDDLVGSLAEAAPWLRDGDVLVVTSKVVSKCEGRIVAAPSDPEERDTLRRKLIDDEAVRVLARKGRTLITENAIGLVQAAAGVDGSNVGSTELALLPVDPDRSAATLREGLRERLGVTVGVVITDTMGRAWRTGQTDFAIGASGLTVLQGYAGSRDRHGNELVVTEVAVADEIAAAADLVKGKLTAIPVAVVRGLRLPDDGSTAHRLVRAGEDDLFWLGTAEAIELGRRQAQLLRRSVRRFSAEPVPHDAIEAAVGEALTAPAPHHTRPVRFVWVQDSETRTRLLDRMKEQWRADLTADGLDADAVDRRVARGQILYDAPELVIPFLVPDGAHSYPDDARTAAEHTMFTVAVGAAVQGLLVALAVRDIGSCWIGSTIFAADLVRAELELPDDWEPLGAIAIGYPEQTPQPLGPRDPVPTDELLVRK

>A0QU63

MTKKSASSNNKVVATNRKARHNYTILDTYEAGIVLMGTEVKSLREGQASLADAFATVDDGEIWLRNVHIAEYHHGTWTNHAPRRNRKLLLHRKQIDNLIGKIRDGNLTLVPLSIYFTDGKVKVELALARGKQAHDKRQDLARRDAQREVIRELGRRAKGKI

>A0QWV9

MTAEVKDELSRLVVNSVSARRAEVASLLRFAGGLHIVAGRVVVEAEVDLGIIARRLRKDIYDLYGYNAVVHVLSASGIRKNTRYVVRVANDGEALARQTGLLDMRGRPVRGLPAQVVGGSVGDAEAAWRGAFLAHGSLTEPGRSSALEVSCPGPEAALALVGAARRLGVSAKAREVRGSDRVVVRDGEAIGALLTRMGAQDTRLTWEERRMRREVRATANRLANFDDANLRRSARAAVAAAARVERALEILGDSVPDHLAAAGKLRVEHRQASLEELGRLADPPMTKDAVAGRIRRLLSMADRKAKQEGIPDTESAVTPDLLDDA

>A0QYU6

MARVKRALNAQKKRRTVLKASKGYRGQRSRLYRKAKEQQLHSLTYAYRDRRARKGEFRKLWISRINAAARANDITYNRLIQGLKAAGVEVDRKNLAELAVSDPAAFTALVDVARAALPEDVNAPSGEAA
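The awk inside the loop above slices the sequence out of a UniProt flat-file entry: everything between the SQ header line and the terminating //, with spaces removed. A Python equivalent operating on already-downloaded text (rather than calling curl):

```python
def sequence_from_flatfile(text):
    # Collect the residue lines between the "SQ" header and the "//"
    # terminator of a UniProt flat-file entry, stripping the spaces
    # that group residues into blocks of ten.
    seq, inside = [], False
    for line in text.splitlines():
        if line.startswith("//"):
            break
        if inside:
            seq.append(line.replace(" ", ""))
        if line.startswith("SQ"):
            inside = True
    return "".join(seq)
```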

To search a sequence against ProSite database, read section 10 on http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc141

19/03/2015: Tutorial: Comparing isolates of a genome based on their gene content

Read how to fix genbank files produced by PROKKA 1.7.2 at the end of my annotation webpage:

http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/annotation.html

This is the code I used to fix the genbank files produced by Cosmika:

ls /shared2/cosmika/denovo/spades2/*/annotation/PROKKA_*/PROKKA_*.gbf > genbank_files.txt

for i in $(cat genbank_files.txt); do cat $i | sed -e 's/\(NODE_[0-9]*\)_length_\([0-9]*\)_.* bp/\1\t\2 bp/g' > $(dirname $i)/$(basename $i).fixed; done

Once we have assembled all the isolates, annotated them to produce a genbank file each, and fixed those files using the above two one-liners, we again save the genbank file listing to a file:

ls /shared2/cosmika/denovo/spades2/*/annotation/PROKKA_*/PROKKA_*.gbf.fixed > genbank_files.txt

Generate a CSV file of the format [SAMPLE_NAME],[LOCATION_OF_GENBANK_FILE] from the above file:

awk '{split($0,a,"/");gsub("_spades$","",a[6]);print a[6]","$0}' genbank_files.txt > genbank_files.csv
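The awk above takes the sixth "/"-separated field of the path (the *_spades directory) and strips its _spades suffix; the same derivation in Python:

```python
import re

def sample_from_path(path):
    # Mirror of the awk: the sample directory is the 6th "/"-separated
    # field of the absolute path (awk's a[6]; Python is 0-indexed),
    # minus its trailing "_spades"
    field = path.split("/")[5]
    return re.sub(r"_spades$", "", field)

sample_from_path("/shared2/cosmika/denovo/spades2/HA-011_S26_spades/"
                 "annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed")
# -> 'HA-011_S26'
```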

cat genbank_files.csv

HA-011_S26,/shared2/cosmika/denovo/spades2/HA-011_S26_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HA-012_S32,/shared2/cosmika/denovo/spades2/HA-012_S32_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HA-013_S38,/shared2/cosmika/denovo/spades2/HA-013_S38_spades/annotation/PROKKA_03032015/PROKKA_03032015.gbf.fixed

HA-013_S38,/shared2/cosmika/denovo/spades2/HA-013_S38_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HA-014_S44,/shared2/cosmika/denovo/spades2/HA-014_S44_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HA-015_S19,/shared2/cosmika/denovo/spades2/HA-015_S19_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HA-016_S8,/shared2/cosmika/denovo/spades2/HA-016_S8_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HA-017_S15,/shared2/cosmika/denovo/spades2/HA-017_S15_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HD-004_S45,/shared2/cosmika/denovo/spades2/HD-004_S45_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HD-006_S3,/shared2/cosmika/denovo/spades2/HD-006_S3_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HH-009_S27,/shared2/cosmika/denovo/spades2/HH-009_S27_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HH-010_S9,/shared2/cosmika/denovo/spades2/HH-010_S9_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HH-011_S16,/shared2/cosmika/denovo/spades2/HH-011_S16_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HH-012_S22,/shared2/cosmika/denovo/spades2/HH-012_S22_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HH-013_S28,/shared2/cosmika/denovo/spades2/HH-013_S28_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-001_S34,/shared2/cosmika/denovo/spades2/HT-001_S34_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-002_S40,/shared2/cosmika/denovo/spades2/HT-002_S40_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-003_S46,/shared2/cosmika/denovo/spades2/HT-003_S46_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-004_S4,/shared2/cosmika/denovo/spades2/HT-004_S4_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-008_S10,/shared2/cosmika/denovo/spades2/HT-008_S10_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-014_S30,/shared2/cosmika/denovo/spades2/HT-014_S30_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-015_S36,/shared2/cosmika/denovo/spades2/HT-015_S36_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-017_S42,/shared2/cosmika/denovo/spades2/HT-017_S42_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-019_S1,/shared2/cosmika/denovo/spades2/HT-019_S1_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-021_S6,/shared2/cosmika/denovo/spades2/HT-021_S6_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-022_S13,/shared2/cosmika/denovo/spades2/HT-022_S13_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-023_S20,/shared2/cosmika/denovo/spades2/HT-023_S20_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-024_S25,/shared2/cosmika/denovo/spades2/HT-024_S25_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-025_S31,/shared2/cosmika/denovo/spades2/HT-025_S31_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

HT-026_S37,/shared2/cosmika/denovo/spades2/HT-026_S37_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

SA-017_S43,/shared2/cosmika/denovo/spades2/SA-017_S43_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

SA-018_S2,/shared2/cosmika/denovo/spades2/SA-018_S2_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

SA-019_S7,/shared2/cosmika/denovo/spades2/SA-019_S7_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

SA-020_S14,/shared2/cosmika/denovo/spades2/SA-020_S14_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

SA-022_S12,/shared2/cosmika/denovo/spades2/SA-022_S12_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

SD-005_S33,/shared2/cosmika/denovo/spades2/SD-005_S33_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

SD-006_S39,/shared2/cosmika/denovo/spades2/SD-006_S39_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

SH-009_S21,/shared2/cosmika/denovo/spades2/SH-009_S21_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

SH-010_S17,/shared2/cosmika/denovo/spades2/SH-010_S17_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

SH-011_S23,/shared2/cosmika/denovo/spades2/SH-011_S23_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

ST-015_S48,/shared2/cosmika/denovo/spades2/ST-015_S48_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

ST-016_S49,/shared2/cosmika/denovo/spades2/ST-016_S49_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

ST-017_S50,/shared2/cosmika/denovo/spades2/ST-017_S50_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

ST-019_S47,/shared2/cosmika/denovo/spades2/ST-019_S47_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

ST-020_S5,/shared2/cosmika/denovo/spades2/ST-020_S5_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

ST-023_S11,/shared2/cosmika/denovo/spades2/ST-023_S11_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

ST-024_S18,/shared2/cosmika/denovo/spades2/ST-024_S18_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

ST-026_S24,/shared2/cosmika/denovo/spades2/ST-026_S24_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

ST-027_S29,/shared2/cosmika/denovo/spades2/ST-027_S29_spades/annotation/PROKKA_03082015/PROKKA_03082015.gbf.fixed

We will then use the following Python script to create a Samples x Genes table, as well as FASTA files with gene sequences extracted from multiple samples:

#!/usr/bin/python

import sys

import csv

from Bio import SeqIO

# This function is useful for reading a csv file

def read_dict_file_csv(filename):

    reader = csv.reader(open(filename, 'rb'),delimiter=',')

    mydict = dict(x for x in reader)

    return mydict

mydict=read_dict_file_csv("genbank_files.csv")

#data structure to hold sample:gene info

sample_genes={}

#data structure to hold discovered genes

discovered_genes=[]

for i in mydict.keys():

    for seq_record in SeqIO.parse(mydict[i],"genbank"):

            for f in seq_record.features:

                    if f.type=="CDS":

                            gene_name=",".join(f.qualifiers.get('gene',[]))

                            #if it is a hypothetical protein then gene_name is empty

                            #for the sake of simplicity we only output resolved genes

                            #and you can uncomment the following two lines

                            #to get the hypothetical proteins out as well

   

                            #if gene_name=='':

                            #    gene_name=",".join(f.qualifiers.get('locus_tag',[]))

                            if gene_name!='':

                                    sample_genes[i+":"+gene_name]=",".join(f.qualifiers.get('translation',[]))

                                    discovered_genes.append(gene_name)

#get unique gene names

discovered_genes=list(set(discovered_genes))

#print discovered_genes

#Save gene table to a file

out=open("gene_table.csv",'w')

out.write('Genes,'+",".join(discovered_genes)+"\n")

for i in mydict.keys():

    out.write(i)

    for j in discovered_genes:

            if sample_genes.get(i+":"+j,None)==None:

                    out.write(",0")

            else:

                    out.write(",1")

    out.write("\n")

out.close()

#Generate a fasta file for each gene

for j in discovered_genes:

    out=open(j.replace("/","_")+".fasta",'w')

    for i in mydict.keys():

            extracted_sequence=sample_genes.get(i+":"+j,None)

            if extracted_sequence!=None:

                    out.write('>'+i+"\n")

                    out.write(extracted_sequence+"\n")

    out.close()

Name the above script genbank_genes.py and run it as:

python genbank_genes.py 2>/dev/null

It will produce the gene_table.csv file, as well as a separate FASTA file for each gene, in which each sequence header gives the sample it came from.

Now list those genes that are covered in more than 40 samples:

(for i in $(ls *.fasta); do echo -e "$i,$(grep -c ">" $i)"; done) | sort -t"," -k2rn | awk -F"," '$2>40'

Generate the phylogenetic tree for genes that are covered in more than 10 samples:

for i in $((for i in $(ls *.fasta); do echo -e "$i,$(grep -c ">" $i)"; done) | sort -t"," -k2rn | awk -F"," '$2>10{print $1}'); do mafft --retree 1 $i > $i.gapped; FastTree -gtr < $i.gapped > $i.gapped.tre ; done        

For example, the above Python program will produce gyrA.fasta as follows:

>HT-019_S1

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>ST-015_S48

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HA-013_S38

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HA-016_S8

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HA-012_S32

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>ST-016_S49

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>SA-019_S7

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFNLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>ST-017_S50

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HA-017_S15

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HH-010_S9

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HH-009_S27

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVEDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HH-012_S22

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>ST-027_S29

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>ST-026_S24

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGNAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HA-014_S44

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>SD-006_S39

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-024_S25

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFNLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HH-013_S28

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>ST-019_S47

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>SA-022_S12

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFNLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-002_S40

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>SH-010_S17

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>SH-011_S23

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>SD-005_S33

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-008_S10

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-003_S46

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-022_S13

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFNLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-021_S6

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>SA-020_S14

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVEDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-015_S36

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>ST-023_S11

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HD-004_S45

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-025_S31

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-023_S20

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>ST-020_S5

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFNLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-017_S42

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFNLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HD-006_S3

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HA-011_S26

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-026_S37

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVEDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>ST-024_S18

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HH-011_S16

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>SA-018_S2

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFNLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-001_S34

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>SH-009_S21

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HA-015_S19

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-004_S4

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>SA-017_S43

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

>HT-014_S30

MEENNKILPIEIAEEMKKSYIDYSMSVIAGRALPDVRDGLKPVHRRILYSMSELNLTPDKPYRKSARIVGDVLGKYHPHGDTAVYYAMVRMAQDFSTRALLVDGHGNFGSVDGDSPAAMRYTEAKMSKLSLELLRDIEKETVDFKPNFDESLKEPSVLPARYPNLLVNGSNGIAVGMATSIPPHNLAEVIDATVYLIDNPECSVDDLIKFVQGPDFPTAAIIMGKESIAEAYRTGRGKVKVRSRAFIEELPKGKQQIIVTEIPYQVNKAKLVERIAELVKEKRIEGISDLRDESNRNGMRIVIELKRDANANIVLNNLYKHSQMEDTFSIIMLALVDGQPRVLNLKQILYHYIKHQEDVVTRRTKFELNKAEARAHILEGLKIALDNIDAVISLIRASKTGQEAKLGLIEKFKLTEIQAQAILDMRLQRLTGLERDKIEAEYEDLIKKINRLKEILADERLLLNVIKDEITIIKENYSDERRTEIRHAEGEIDMRDLISDEEIAITLTHFGYIKRLPSDTYKSQKRGGRGISALTTREEDFVRHLVTTTTHSRLLFFTNKGRVFKLNAYEIPEGKRQAKGTAIVNLLQLSADEKIATLIPIDGNDENEYLLLATKKGIVKKTKREEFKNINKSGLIAIGLRDDDELIGVELTDGKQEVLLVTKEGMSIRFDENDIRYMGRTAMGVKGITLSKEDFVVSMNLCSKGTDVLVVSKNGFGKRTNIEEYRSQIRAGKGIKTYNISEKTGTIVGADMVNEDDEIMIINSDGVLIRIRVNEISLFGRVTSGVKLMKTNDEVNVVSIAKINIEEE

and the corresponding tree in Newick format (gyrA.fasta.gapped.tre) is:

((SA-019_S7:0.0,HT-024_S25:0.0,SA-022_S12:0.0,HT-022_S13:0.0,ST-020_S5:0.0,HT-017_S42:0.0,SA-018_S2:0.0):0.00120,(HH-009_S27:0.0,SA-020_S14:0.0,HT-026_S37:0.0):0.00120,((HT-019_S1:0.0,ST-015_S48:0.0,HA-013_S38:0.0,HA-016_S8:0.0,HA-012_S32:0.0,ST-016_S49:0.0,ST-017_S50:0.0,HA-017_S15:0.0,HH-010_S9:0.0,HH-012_S22:0.0,ST-027_S29:0.0,HA-014_S44:0.0,SD-006_S39:0.0,HH-013_S28:0.0,ST-019_S47:0.0,HT-002_S40:0.0,SH-010_S17:0.0,SH-011_S23:0.0,SD-005_S33:0.0,HT-008_S10:0.0,HT-003_S46:0.0,HT-021_S6:0.0,HT-015_S36:0.0,ST-023_S11:0.0,HD-004_S45:0.0,HT-025_S31:0.0,HT-023_S20:0.0,HD-006_S3:0.0,HA-011_S26:0.0,ST-024_S18:0.0,HH-011_S16:0.0,HT-001_S34:0.0,SH-009_S21:0.0,HA-015_S19:0.0,HT-004_S4:0.0,SA-017_S43:0.0,HT-014_S30:0.0):0.00055,ST-026_S24:0.00120)0.381:0.00055);

which you can plot in R as:

library(ape)

phylogenetic_tree <- read.tree("gyrA.fasta.gapped.tre")

plot(phylogenetic_tree)
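With ~50 tips, the default cladogram can be hard to read. As a sketch (using only standard ape functions; the plotting options shown are suggestions, not part of the original workflow), you can ladderize the tree, shrink the tip labels, and check some basic properties:

```r
library(ape)

# Read the tree produced above
phylogenetic_tree <- read.tree("gyrA.fasta.gapped.tre")

# Ladderize so clades are drawn in a consistent order,
# shrink tip labels, and use a fan layout to fit many tips
plot(ladderize(phylogenetic_tree), type = "fan", cex = 0.5)

# Basic summaries: number of tips and whether the tree is rooted
Ntip(phylogenetic_tree)
is.rooted(phylogenetic_tree)
```

Many of the branch lengths above are 0.0 because the gyrA sequences are identical across those samples, so the corresponding tips collapse into polytomies.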

We can also visualise the alignment, and draw an NMDS plot based on the distances between the sequences, using the following code (TO DO: Cosmika, try geom_tile from ggplot2, as discussed, to do a better job of drawing alignments. Take a hint from the heatmap I drew on my R Code for ecology webpage: http://userweb.eng.gla.ac.uk/umer.ijaz/bioinformatics/ecological.html ):

# To install alignfigR, use the following:

#library(devtools)

#devtools::install_github("sjspielman/alignfigR")

library(alignfigR)