To do:

2020-5-4 Baobab dataset continued

Configure SRA tool kit

But Diego mentioned that I should try to do it on the lab workstation instead

Fastq-dump just one dataset and figure out why the number of reads does not match Table S1, and read lengths are all around 220-230 bp

2020-4-3 Baobab dataset

Download all 18 datasets from sra: PRJNA579976

Trying to do this on mac and see how it goes…

Download SRA toolkit for mac

Had to go to system preferences to give permission for things that won’t run.

Sample info from Table S1:

Taxon        Sample ID        Sequencing Platform        Number of read pairs        Source                

Adansonia digitata        Adi001        MiSeq        10,698,792        Accession # UW11 (UWBG)        

Adansonia digitata        Adi002        MiSeq        10,785,012        Accession # UW2291 (UWBG)        

Adansonia digitata        Adi003        MiSeq        10,947,592        See Cron et al. 2016; GenBank ID: KU145771        

Adansonia grandidieri        Aga001        MiSeq        9,040,756        Accession #97-B002010-1 (GBDBG)        

Adansonia grandidieri        Aga002        MiSeq        8,093,124        Accession #03-B000192-1 (GBDBG)         

Adansonia gregorii        Age001        MiSeq        10,627,128        North Western Australia,  D.A.Baum                

Adansonia perrieri        Ape001        MiSeq        10,627,128        Accessions # 92-B000060-1 (GBDBG)                

Adansonia perrieri        Ape009        HiSeq        18,871,820        Northern Madagascar, Karimi-2014-09 (WIS)                

Adansonia madagascariensis        Ama006        HiSeq        18,105,976        Northern Madagascar, Karimi-2014-006 (WIS)                                

Adansonia madagascariensis        Ama018        HiSeq        17,421,708        Northern Madagascar,  Karimi-2014-018 (WIS)

Adansonia rubrostipa        Aru001        MiSeq        8,890,992        Southwestern Madagascar,  D.A.Baum 313 (MO)

Adansonia rubrostipa        Aru127        HiSeq        16,620,060        Western Madagascar, Karimi-2014-127 (WIS)

Adansonia suarezensis        Asu001        MiSeq        11,343,772        Accession #UW11 (GBDBG): Seed from Northern Madagascar, Baum 320A (WIS)

Adansonia za        Aza037        HiSeq        18,017,204        Northern Madagascar, Karimi-2014-37 (WIS)

Adansonia za        Aza135        HiSeq        18,420,364        Southern Madagascar, Karimi-2014-135 (WIS)

Bombax ceiba        Bce020        MiSeq        7,398,256        Accession #UW10 (UWBG)

Pseudobombax croizatii        Pcr070        MiSeq        7,398,256        Accession #UW1255 (UWBG): Seed from Puerto Ayacucho in Venezuela, Paul E. Berry (MO)

Scleronema micrantha        Smi165        MiSeq        7,398,256        See Alverson et al. (1999); GenBank: AF111735

2020-2-25 debugging phyparts for Yinjie

Run phyparts and ignore gene tree notes with <50 bootstrap

$ java -jar ~/apps/phyparts/target/phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -d RAxML_tree -m rooted_6_taxa.txt -o outS50 -s 50

Install the ETE toolkit following instructions

Install Minconda  (you can ignore this step if you already have Anaconda/Miniconda)

$ curl -L '' -o

bash -b -p ~/anaconda_ete/

export PATH=~/anaconda_ete/bin:$PATH;

Install ETE and external tools

$ conda install -c etetoolkit ete3 ete_toolchain

Simply cannot find ete3 from the $PATH

Instead, tried PIP

$ pip install ete3

That seemed to work

Install dependencies

$ pip install PyQt5

python -m pip install -U matplotlib

Check  again

$ ete3 build check

WARNING: external applications not found

Install using conda (recommended):

 conda install -c etetoolkit ete_toolchain

or manually compile from:

I gave up trying to install ETE tool kit...Yinjie’s dataset basically switched color coding. Now it’s solved.

2019-10-24 data to Stephen

Sorted out genome walking data set from the 2018 New Phyt WGD paper and sent Stephan both the homolog trees, alignments, and all seq fasta files. They are in the data_transfer folder in the shared Caryophyllales google drive folder

I also deleted Chen’s account on bigsmith

2018-12-19 Trouble shoot gen dup mapping scripts for Diego

Git not working after updating to OS 10.14

C02SX0HZGTF1:~ yangya$ xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun

C02SX0HZGTF1:~ yangya$ xcode-select --install

Now it is fixed

Downloaded the repo instead of git clone.

2018-12-4 cary353+ bait design

Still having issue with GUI doesn’t load when logging in.

Freed up some space in the system drive

Updated packages sudo apt-get dist-upgrade

yayang@eeb-bigsmith:~$ dmesg | tail -10

[73915.513427] [drm] nouveau 0000:06:00.0: illegal object class: 0x90b8

[73915.513429] [drm] nouveau 0000:06:00.0: Error creating object: -22 (3/0x000590b8)

[73915.513430] [drm] nouveau 0000:06:00.0: illegal object class: 0x90b5

[73915.513431] [drm] nouveau 0000:06:00.0: Error creating object: -22 (3/0x000490b5)

[73915.513433] [drm] nouveau 0000:06:00.0: illegal object class: 0x85b5

[73915.513434] [drm] nouveau 0000:06:00.0: Error creating object: -22 (3/0x000085b5)

[73916.467104] systemd-logind[4138]: New session c6 of user Debian-gdm.

[73940.625214] systemd-logind[4138]: New session 34 of user yayang.

[75977.029247] systemd-logind[4138]: New session 35 of user yayang.

[76082.020647] gnome-shell[10012] trap divide error ip:7fc2a3e64310 sp:7ffc97fe2d60 error:0 in[7fc2a3e2c000+10c000]

Rename all seq to ensure that formatting are uniform

2018-12-4 cary353+ bait design

Previously had a folder in the macbook pro but somehow cannot download it from Linux using dropbox. Doing it again

Obtain data:


Data types to get: mRNA (for intron-exon boundary), CDS (primary), peptides (primary), and whole genome assembly (for physical location)

Six genomes will be used for clustering

Searched for Caryophyllales in the NCBI Genomes database

Silene genome is still low quality with the new assembly

Phytozome: (transcripts and non-redundant peptides and CDS)

Amaranthus hypochondriacus v2.1 ER {Lightfoot, 2017 #2126}

Chenopodium quinoa v1.0 {Jarvis, 2017 #2128}

Beta valgaris subsp. vulgaris RefBeet-1.2 

Fagopyrum tataricum gene annotation V2 

Mesembryanthemum crystallinum shared by John Cushman and Won Yim

Spinach citation Xu C, Jiao C, Sun H, Cai X, Wang X, Ge C, Zheng Y, Liu W, Sun X, Xu Y, Deng J, Zhang Z, Huang S, Dai S, Mou B, Wang Q, Fei Z, Wang Q (2017) Draft genome of spinach and transcriptome diversity of 120 Spinacia accessions. Nature Communications 8:15275

Transcriptomes: We are using the 175 transcriptomes from Yang et al. 2018 New Phytologist. This dataset is comprehensive that it included XX out of XX family. It is less intensive as Walker 2018 (basically I trust the data I cleaned myself :P)


Emms, D.M. and Kelly, S. (2015) OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biology 16:157

Emms, D.M. and Kelly, S. (2018) OrthoFinder2: fast and accurate phylogenomic orthology analysis from gene sequences.bioRxiv

Update OrthoFinder to v2

cd ~/apps/

cd OrthoFinder/

git status

git pull

Obtain dependencies:

Mcl: sudo apt-get install mcl

Skipping installing DIAMOND or MMseqs2 as we are not dealing with massive data set



Download release from 

sudo cp ~/apps/fastme- /usr/local/bin/fastme

2018-11-2 Delphine’s Salsola dataset

Between Saldiv_Salsola_divaricata

Hyb parent 1

Salsola montana

Hyb parent 2

Not use Salsola oppositifolia 8x

Hammada scoparia 2x

Haloxylon ammodendron 2x

Halogeton_glomeratus 2x

python ../../src/ . Saldiv_Salsola_divaricata Salmont_Salsola_montana 4

python ../../src/ . Salmont_Salsola_montana Haam_Haloxylon_ammodendron 4

2018-10-31 Delphine’s Salsola dataset: C4 gene tree

C4 genes

Hagl_Halogeton_glomeratus might be contaminated by non-cary


Did mafft run on AT2G16400_1_1.fa. Alignment has some divergent sequences forced together


Remove MSB0496298SFB_Kali_collina@DN11787_c0_g1_i2 from

AT2G26080 fixed AT alignment at the end and removed empty columns

AT2G26900 Hagl_Halogeton_glomeratus@DN9629_c0_g3_i1 removed; remove empty columns

AT2G35120 remove odd seqs before start codon



AT3G01200 removed odd seq before start codon

AT3G01550 remove Hagl_Halogeton_glomeratus@DN17940_c0_g1_i1; removed odd seqs before start condon

AT3G04530 The two homeologs of Salsola divericata might be misassembled or there was cross over






AT3G56160 removed odd seqs before start codon



AT4G21210 removed odd seqs before start codon


AT4G31990_1_1.fa.mafft.aln removed odd seqs before start codon

AT4G37650 removed odd seqs before start codon


Odd to have non-monophyletic insertion, and verticilata goes to two different places (checked that this is not contamination)

AT4G37930 removed odd seqs before start codon





MSB0496298SFB_Kali_collina@DN11787_c0_g1_i2 removed from AT5G11670_1.fa.mafft.aln; remove empty columns

AT5G12860 removed odd seqs before start codon


 removed odd seqs before start codon


AT5G38410 removed odd seqs before start codon. Salsola soda goes to multiple places and the C2 species are not there





Removed odd seqs before start codon; removed clade that is 0.3 from the main homolog


Exactly the same tree as AT4G37870

yayang@eeb-bigsmith:~/salsola_Delphine/photosynthesis_genes/2_AT_baits_separated_by_locus$ cp AT*/*final* ../3_homolog_mafft_alignment/

Write unaligned file, and fix seq id with “/”

yayang@eeb-bigsmith:~/salsola_Delphine/photosynthesis_genes$ python 3_homolog_mafft_alignment/

Run prank wrapper

Done for all 50 trees. Looking good

Run pal2nal and bootstrap with 17 species, no Arabidopsis

Concatenate cds files for all 17 species, not Arabidopsis

yayang@eeb-bigsmith:~/salsola_Delphine/photosynthesis_genes/data_17taxa$ cat *.cds.fa >../17taxa.cds.fa

Write fasta files from mafft alignment but remove arabidopsis

3_homolog_mafft_alignment/yayang@eeb-bigsmith:/media/yayang/data1/YA/salsola_Delphine/photosynthesis_genes$ python src/ 3_homolog_mafft_alignment/

yayang@eeb-bigsmith:/media/yayang/data1/YA/salsola_Delphine/photosynthesis_genes$ mv 3_homolog_mafft_alignment/*.fa 6_prank_alignment_17taxa/

yayang@eeb-bigsmith:/media/yayang/data1/YA/salsola_Delphine/photosynthesis_genes$ python src/ 6_prank_alignment_17taxa/ 6_prank_alignment_17taxa/ .fa aa

Run pal2nal wrapper

Phyutility by 0.3

Run raxml with bootstrap


Selection analysis

Aminal acid changes


Once translation from raw Trinity out put is done do Ks on unfiltered blastp and plot it to 3 and to 0.5. Done for 5. Running the rest.

WGD mapping

Rooted tree with at least xx out of xx taxa

2018-10-30 Delphine’s Salsola dataset

C4 data set

Wrote a new script to parse ATbaits.pep.fa

$ python src/ AT_baits/ATbaits.pep.fa .


Remove EosaxSFB_Eokochia_saxicola@DN18943_c0_g2_i2 as it has misassembled sequence in the second half and otherwise identical to EosaxSFB_Eokochia_saxicola@DN18943_c0_g2_i1


yayang@eeb-bigsmith:~/salsola_Delphine/photosynthesis_genes/AT1G23310$ python ../src/ AT1G23310.swipe.fa.mafft.aln AT1G23310_1_1.subtree

20 tips read

Renamed the output fasta file and removed empty columns


Remove CarveSFB_Caroxylon_vermiculatum@DN78990_c0_g1_i1

Remove empty columns


Removed odd seqs before start codon in AT1G32470_1_1.fa.mafft.aln


Removed odd seqs before start codon in AT1G48030_1_1.fa.mafft.aln


Removed odd seqs before start codon in AT1G49810.swipe.fa.mafft.aln


AT1G53310_1.fa.mafft.aln nothing to adjust but just used it as the final alignment


Removed odd seqs before start codon in AT1G67090.swipe.fa.mafft.aln

Seq variable. Cannot find the C2 taxa but can find Salsola montana


Remove Saldiv198_Salsola_divaricata@DN38252_c0_g1_i1 from AT1G68010_1.fa.mafft.aln and removed empty columns


AT1G79750_1.fa.mafft.aln as final. Some Kali seqs look odd but keeping them for now


Removed odd seqs before start codong from AT1G80380.swipe.fa.mafft.aln


AT2G13360.swipe.fa.mafft.aln no edit

2018-10-29 Delphine’s Salsola dataset

Check the bigsmith machine

GUI freeze on bigsmith

Couldn’t log into GUI using ssh yayang@ Realized that had to use VPN

Restarted the workstation

Update the system following as it haven’t been updated for at least 2 years

$ sudo apt-get update
$ sudo apt-get dist-upgrade

Check for crash plan to make sure that the system is backed up

$ sudo /usr/local/crashplan/bin/CrashPlanDesktop

Workstation is currently running the Debian 8. The current release is Debian 9.5. Looks like Debian 10 will be released early 2019. Will hold on for updating untill then unless the backup and GUI are still having issues

Baited search on C4 genes

Start from PEPCK

Downloaded source code from to src just for this analysis

Modified the bait homolog scripts to remove trim tips and simplify it

From AT1G08650_1.fa.mafft.aln, remove the clade with branch length 0.8, and remove sequences before start codon

Remove empty columns

Save as the final alignment

Looking great.

2018-10-26 Ks for Delphine’s Salsola dataset

Noticed that Diego used full species name in individual pep and cds id. Need to modify code for it. Two scripts were modified from the clustering repo →  /home/yayang/salsola_Delphine/src/ (no cd-hit since already did corset)

ks_plot_cds →  /home/yayang/salsola_Delphine/src/

Looks like corset is quite agressive. Both species below are 4x

2017-10-11 Set up the macbook pro for analysis

Git clone from RAxML

$ make -f Makefile.AVX2.mac

$ sudo cp raxmlHPC-AVX2 /usr/bin/raxml

Download mafft mac package and unpacked it.

Seems to work with multi-threading

Download jalview without Java

Somehow everything worked. Amazing.

2017-9-27 First test on Droseraceae Sanger seqs


Get the PyPHLAWD code

$ git clone

Get the PHLAWD database maker

$ git clone 

Looking at phlawd_db_maker readme 

Make sure that wget is installed

$ wget

Looks good

$ cd phlawd_db_maker/

$ cmake .

~/apps/phlawd_db_maker/deps/sqlitewrapped-1.3.1$ make

~/apps/phlawd_db_maker/deps/sqlitewrapped-1.3.1$ cp libsqlitewrapped.a ../linux/

Back to the ~/apps/phlawd_db_maker/ dir and type make, it always say

[ 25%] Building CXX object CMakeFiles/phlawd_db_maker.dir/src/utils.cpp.o
/home/yayang/apps/phlawd_db_maker/src/utils.cpp:132:42: warning: unknown escape sequence: '\.'
/home/yayang/apps/phlawd_db_maker/src/utils.cpp: In function ‘int getdir(std::string, std::vector<std::basic_string<char> >&)’:
/home/yayang/apps/phlawd_db_maker/src/utils.cpp:123:29: error: ‘errno’ was not declared in this scope
        cout << "Error(" << errno << ") opening " << dir << endl;
CMakeFiles/phlawd_db_maker.dir/build.make:123: recipe for target 'CMakeFiles/phlawd_db_maker.dir/src/utils.cpp.o' failed
make[2]: *** [CMakeFiles/phlawd_db_maker.dir/src/utils.cpp.o] Error 1
CMakeFiles/Makefile2:60: recipe for target 'CMakeFiles/phlawd_db_maker.dir/all' failed
make[1]: *** [CMakeFiles/phlawd_db_maker.dir/all] Error 2
Makefile:76: recipe for target 'all' failed
make: *** [all] Error 2

Issue filed on github

OK give up and do it old fashioned way

Go to genbank and download all the Droseraceae rbcl

Lots of random junk

Try using a bait to BLASTN against nr/nt

>AB072553.1 Drosera trinervia chloroplast rbcL gene for ribulose bisphosphate carboxylase oxygenase large subunit, partial cds



















Set the Max target sequences to 250

In the results page, go to taxonomy viewer, and download all the 112 Droseraceae results in FASTA (text) format

Remove five genomes

/media/YA/drosera$ grep '>' Droseraceae_rbcl.fa | grep 'genome'

>KU168830.1 Drosera rotundifolia chloroplast, complete genome

>KY679199.1 Drosera regia chloroplast, complete genome

>KY679200.1 Aldrovanda vesiculosa voucher PERTH_08722560 chloroplast, complete genome

>KY651214.1 Drosera erythrorhiza voucher PERTH_06332374 plastid, complete genome (replaced with >NC_035241.1:50938-52365 Drosera erythrorhiza voucher PERTH_06332374 plastid, complete genome to preserve the species sampling )

>KY679201.1 Dionaea muscipula chloroplast, complete genome

Got 108 sequences

Fix names with special characters and spaces

Than run with bootstrap on. Well my first Sanger-tree in many, many years. Looks like there are lots of unpublished 18S genes as well but I would not expect much variation

Try clustering with MCL

GenBank filter "Droseraceae"[Organism] NOT genome

200 seqs, fixed names

$ makeblastdb -in droseraceae_all_download_shorten_name.fa -dbtype nucl -out all.fa
$ blastn -db all.fa -query all.fa -evalue 10 -num_threads 4 -max_target_seqs 1000 -out all.rawblast -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend =sstart send evalue bitscore'
$ python ~/clustering/ all.rawblast 0.4

$ mcl all.rawblast.hit-frac0.4.minusLogEvalue --abc -te 5 -tf 'gq(5)' -I 1.4 -o hit-frac0.4_I1.4_e5

$ python ~/clustering/ droseraceae_all_download_shorten_name.fa hit-frac0.4_I1.4_e5 20 .

Reading mcl output file

Reading the fasta file droseraceae_all_download_shorten_name.fa

2 clusters with at least 20 taxa written to .

No it didn’t work. Broken up too much due to small data set size

Bait the ITS old fashioned way

Starting from

>JN388076.1 Drosera pulchella 18S ribosomal RNA gene, partial sequence; internal transcribed spacer 1, 5.8S ribosomal RNA gene, and internal transcribed spacer 2, complete sequence; and 28S ribosomal RNA gene, partial sequence

Out put 1000, but just won’t hit Aldrovanda…

Manually added 5 from Aldrovanda and Dioneae

Looks like the tree is quite different compared to rbcL almost like the pattern seen in island radiations. Really interesting...

2017-6-9 K-mer continued

Standard protocol for the six genomes:

#Check download

$ md5sum <raw read fq.gz files>


~/apps/Trimmomatic-0.36$ java -jar ~/apps/Trimmomatic-0.36/trimmomatic-0.36.jar PE -phred33 /media/yayang/data/YA/genomeSFB/La_1.fq.gz /media/yayang/data/YA/genomeSFB/La_2.fq.gz /media/yayang/data/YA/genomeSFB/La_1_paired.fq /media/yayang/data/YA/genomeSFB/La_1_unpaired.fq /media/yayang/data/YA/genomeSFB/La_2_paired.fq /media/yayang/data/YA/genomeSFB/La_2_unpaired.fq ILLUMINACLIP:TruSeq3-PE-2.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36


$ ~/apps/FastQC/fastqc *.fq

#jellifish with paired reads

$ ~/apps/jellyfish-2.2.6/bin/jellyfish count -C -m 21 -s 1000000000 -t 10 *paired.fq -o La_paired_reads.jf

$ ./jellyfish histo -t 10 /media/yayang/data/YA/genomeSFB/La_paired_reads.jf > /media/yayang/data/YA/genomeSFB/La_paired_reads.histo

Load histo file to, change read length to 150

Remove the .gf file and gzip the .fq files; remove untrimmed reads


Model didn’t converge. Looking at the raw coverage it’s a genome >1.5 GB


property                      min               max              

Heterozygosity                0.144151%         0.161645%        

Genome Haploid Length         231,725,070 bp    232,190,735 bp    

Genome Repeat Length          36,787,288 bp     36,861,215 bp    

Genome Unique Length          194,937,781 bp    195,329,521 bp    

Model Fit                     89.7417%          91.8797%          

Read Error Rate               0.505607%         0.505607%  

Ma-Macarthuria astralis (?)

property                      min               max              
Heterozygosity                2.83277%          2.86394%          
Genome Haploid Length         386,944,999 bp    387,739,025 bp    
Genome Repeat Length          131,480,368 bp    131,750,170 bp    
Genome Unique Length          255,464,632 bp    255,988,855 bp    
Model Fit                     92.8859%          96.7025%          
Read Error Rate               0.388363%         0.388363%        

This is really high heterozygosity. Could be allopolyploids?


property                      min               max              

Heterozygosity                1.1998%           1.21038%          

Genome Haploid Length         650,532,664 bp    651,746,276 bp    

Genome Repeat Length          198,160,523 bp    198,530,205 bp    

Genome Unique Length          452,372,141 bp    453,216,071 bp    

Model Fit                     93.9298%          98.1359%          

Read Error Rate               0.301755%         0.301755%  

1C=0.74 from C-value database


property                      min               max              

Heterozygosity                2.83101%          2.85311%          

Genome Haploid Length         390,060,728 bp    390,782,032 bp    

Genome Repeat Length          138,243,600 bp    138,499,242 bp    

Genome Unique Length          251,817,128 bp    252,282,790 bp    

Model Fit                     93.1384%          97.6165%          

Read Error Rate               0.459713%         0.459713%  

Ploidy level - no existing tool without reference

Ploidy estimation, but needs genome Also requires BAM genome mapping as input

2017-6-5 K-mer continued

Processing rest 6 genomes (8 in total)

(never use GUI to unpack really large files; it crashes the workstation)

$ ./fastqc <path to Kc_1.fq.gz>

$ ./fastqc <path to Kc_2.fq.gz>

Looking at fastqc results: need to trim adapters. Otherwise look ok

Download and Trimmomatic binary, and move TruSeq3-PE.fa to the same directory as the .jar file

yayang@eeb-bigsmith:~/apps/Trimmomatic-0.36$ java -jar ./trimmomatic-0.36.jar PE -phred33 /media/yayang/data/YA/genomeSFB/Kc_md5sum_OK/Kc_1.fq.gz /media/yayang/data/YA/genomeSFB/Kc_md5sum_OK/Kc_2.fq.gz /media/yayang/data/YA/genomeSFB/Kc_md5sum_OK/Kc_1_trim_paired.fq /media/yayang/data/YA/genomeSFB/Kc_md5sum_OK/Kc_1_trim_unpaired.fq /media/yayang/data/YA/genomeSFB/Kc_md5sum_OK/Kc_2_trim_paired.fq /media/yayang/data/YA/genomeSFB/Kc_md5sum_OK/Kc_2_trim_unpaired.fq ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Checked the trimmed reads and still some adapters in paired reads. In unpaired reads the adaptor is almost unclipped.

Need to use TreSeq3-PE-2 instead.

Output file sizes are a little smaller and FastQC looks a lot better

Using similar scripts as 2017-5-21

~/apps/jellyfish-2.2.6/bin/jellyfish count -C -m 21 -s 1000000000 -t 20 *paired.fq -o Kc_paired_reads.jf

Only using paired reads after trimming


property                      min               max              

Heterozygosity                0.344643%         0.358824%        

Genome Haploid Length         506,159,088 bp    507,278,688 bp    

Genome Repeat Length          168,162,003 bp    168,533,970 bp    

Genome Unique Length          337,997,085 bp    338,744,719 bp    

Model Fit                     89.8932%          95.418%          

Read Error Rate               0.252167%         0.252167%  

2017-5-22 K-mer continued

Did exactly the same for Spergularia media as for Corrigiola litoralis yesterday

Did not converge. Looked up chromosome counts both species are n=9.

“Basal” Caryophyllaceae Polycarpaea all have genome size of 0.4-0.5G

Spergularia media

Try using forward reads _1

GenomeScope version 1.0
k = 21

property                      min               max              
Heterozygosity                1.14926%          1.16774%          
Genome Haploid Length         218,004,034 bp    218,859,428 bp    
Genome Repeat Length          57,586,080 bp     57,812,034 bp    
Genome Unique Length          160,417,953 bp    161,047,394 bp    
Model Fit                     95.7331%          99.3259%          
Read Error Rate               0.256082%         0.256082%        

Reverse reads_2

property                      min               max              

Heterozygosity                1.05629%          1.07298%          

Genome Haploid Length         223,189,637 bp    223,944,113 bp    

Genome Repeat Length          59,497,720 bp     59,698,848 bp    

Genome Unique Length          163,691,916 bp    164,245,265 bp    

Model Fit                     95.6041%          99.3431%          

Read Error Rate               0.516712%         0.516712%    

A few additional tools below but seems that GenomeScope already did what I need


It uses genome size N = Total no. of k-mers/Coverage. This is only the size of non-repetitive areas, which can be just half of the genome. Submitted Aug 2013, cited 18 times but never seen a peer-reviewed version come out

A simple perl script wrapping Jellyfish

2017-5-21 K-mer based genome size estimation

Read processing → Jellyfish → downstream analyses


Download Linux binary v0.11.5 from

Unpack, cd into the directory

chmod 755 fastqc

Copied the executable to the path but that didn’t work

sudo cp fastqc /usr/local/bin

fastqc '/media/yayang/data/YA/genomeSFB/DataForYa/Cor/CL_1.fq.gz'
Error: Could not find or load main class
Running it from the original directary

$ ./fastqc '/media/yayang/data/YA/genomeSFB/DataForYa/Cor/CL_1.fq.gz'

Looks like it takes .gz files fine


Vurture, GW, Sedlazeck, FJ, Nattestad, M, Underwood, CJ, Fang, H, Gurtowski, J, Schatz, MC (2017) Bioinformatics doi: 

K = 21 should work with most genomes, and >25x coverage is recommended for accurate estimation

Install jellyfish

Download v 2.2.6 binary from

Unpack, cd into the directory

$ ./configure

$ make

$ sudo make install

$ make check

Looks good. Check on a large data set

$ make check BIG=1

All passed

$ ~/apps/jellyfish-2.2.6/bin/jellyfish count -C -m 21 -s 1000000000 -t 20 *.fq -o CL_reads.jf

Note you should adjust the memory (-s) and threads (-t) parameter according to your server. This example will use 20 threads (-t) and 1GB of RAM (-s).

Jellyfish doesn’t take .gz files.

$ ~/apps/jellyfish-2.2.6/bin$ ./jellyfish histo -t 10 '/media/yayang/data/YA/genomeSFB/DataForYa/Cor/CL_reads.jf' > '/media/yayang/data/YA/genomeSFB/DataForYa/Cor/CL_reads.histo'


Upload the .histo file to GenomeScope, and use read length of 150 

GenomeScope version 1.0
k = 21

property                      min               max              
Heterozygosity                0.251358%         0.266698%        
Genome Haploid Length         231,450,535 bp    232,014,712 bp    
Genome Repeat Length          35,264,796 bp     35,350,756 bp    
Genome Unique Length          196,185,739 bp    196,663,956 bp    
Model Fit                     92.4455%          95.049%          
Read Error Rate              
0.556803%         0.556803%  

2017-3-19 Ks plots for WGD paper

In the clustering dir, modified the original Ks plots scripts so that it

  1. Use CDS instead of pep for more recent eventss
  2. Within as well as between species
  3. Plot up to Ks=1, instead of Ks=3

To test for allopolyploids

Previously tested using final homologs for Ks plot but that is redundant

Also when mapped to two adjacent nodes topology test is needed.

Plotting within and between species Ks plots










2017-3-15 Alternative filtering for WGD paper

Modified clustering/

So that require gensister <= spsister set to make sure only map to that particular node on species tree. Won’t simply map to a deeper node if there is phylogenetic uncertainty, neither will it map shallower with missing data

Worked really well. Turns out that this is the same as MAPS

2017-3-10 Figures for WGD paper

Played with clustering/

When only allopolyploidy is at play, all bipartitions from gene tree will be concordant. It is the taxa present at both sides being non-random. Missing Ks-peak at species front is probably a better indicator

Wrote a script ~/clustering/ to show that our cutoff for WGD is not arbitrary but according to a bimodal distribution of percent duplications

python ~/clustering/ dup_perc_filter50_global.filter43

Go to R






2017-3-7 summarize species trees for WGD paper

Map duplication using mrca

3_AMAR/analysis/dup$ python ~/clustering/ inclades_filter35 RAxML_bestTree.AMAR.rooted 35 filter35

Easiest way to get rooted species tree is to open it in figtree, root it, and export as newick

Probably need to do the concordance-only mapping... we’ll see

2017-3-6 summarize species trees for WGD paper

Running ASTRAL for each subclades exactly as for the overall species tree with ASTRAL version 4.10.12. It output internal branch lengths, versus previous versions doesn’t

$ cat *.tre >all_ML.trees

$ ls *.trees >all_boot.trees

$ java -jar ~/apps/ASTRAL/Astral/astral.4.10.12.jar  -i all_ML.trees -b all_boot.trees -g -r 100 -o astral_boot.out

$ tail -1 astral_boot.out >../astral_boot.tre

Odd that it keeps giving error message

Caused by: ')' expected

Still need to use the script clustering/ to add extra ) and remove node labels for the newer version. It is good to use it anyways to ensure that ML tree and boot trees are in the correct order

5_NCORE/analysis$ python ~/clustering/ 5_homolog 36 ASTRAL


java -jar ~/apps/ASTRAL/Astral/astral.4.10.12.jar  -i all_ML.trees -b all_boot.trees -g -r 100 -o astral_boot.out

tail -1 astral_boot.out >../astral_boot.tre

Compare the bootstrap support percentages the new version gives the same value as the old version but added branch length. Tried with or without ML tree branch lengths as input the output branch lengths are the same.

Copy over the all_ML.trees written by, remove the extra “)” before “;” and calculate IC/ICA scores

$ sed -i 's/);/;/g' all_ML.trees

$ raxml -f i -T 2 -t RAxML_bestTree.ncore -z all_ML.trees -m GTRCAT -n ICA

2017-3-3 summarize species trees for WGD paper

Test latest version of RAxML

Running RAxML. For eeb-bigsmith the AVX instruction doesn’t work. It gives an “illegal instruction” error. Have to use the SSE3 version (machine not new enough)

Benchmark speed with gcc vs. icc (without optimization)

Copy over the pthreads sse3 makefile, keep everything except change CC=gcc to CC=icc

Using the gcc version

$ time /home/yayang/apps/standard-RAxML/raxmlHPC-PTHREADS-SSE3-gcc -T 4 -f a -x 12345 -# 200 -p 12345 -s homolog7695_1_1_1.subtree.aln-cln -n homolog7695_1_1_1.subtree.aln-cln -m PROTCATAUTO

Time for BS model parameter optimization 105.573528
Bootstrap[0]: Time 121.397746 seconds, bootstrap likelihood -18651.825026, best rearrangement setting 13
Bootstrap[1]: Time 15.156944 seconds, bootstrap likelihood -20086.527596, best rearrangement setting 13
Bootstrap[2]: Time 15.822543 seconds, bootstrap likelihood -18013.316711, best rearrangement setting 12
Bootstrap[3]: Time 15.496137 seconds, bootstrap likelihood -19849.625207, best rearrangement setting 15
Bootstrap[4]: Time 12.483267 seconds, bootstrap likelihood -19897.080057, best rearrangement setting 10
Bootstrap[5]: Time 11.798428 seconds, bootstrap likelihood -18457.845984, best rearrangement setting 10

Using the icc version with the same command

Time for BS model parameter optimization 78.101595
Bootstrap[0]: Time 87.618615 seconds, bootstrap likelihood -18651.825026, best rearrangement setting 13
Bootstrap[1]: Time 8.929033 seconds, bootstrap likelihood -20086.527596, best rearrangement setting 13
Bootstrap[2]: Time 9.147727 seconds, bootstrap likelihood -18013.316711, best rearrangement setting 12
Bootstrap[3]: Time 9.106802 seconds, bootstrap likelihood -19849.625207, best rearrangement setting 15

Looks like using icc is indeed about 30% faster than gcc. 

Add this line to ~/.bash_aliases

alias raxml='~/apps/standard-RAxML/raxmlHPC-PTHREADS-SSE3-icc’


Update to the latest ASTRAL

$ cd ~/apps/ASTRAL

$ git pull

Unzip the executable

Run ASTRAL and bootstrap both the sites and genes from all 634 genome walking orthologs

$ cat *.tre >all_ML.trees

$ ls *.trees >all_boot.trees

$ java -jar ~/apps/ASTRAL/Astral/astral.4.10.12.jar  -i all_ML.trees -b all_boot.trees -g -r 100 -o astral_boot.out

$ tail -1 astral_boot.out >astral_boot.tre

ICA scores w/ partial gene trees

$ raxml -f i -T 2 -t 7_ortho_filter160/RAxML_bestTree.filter150-160 -z 8_astral/all_ML.trees -m PROTCATAUTO -n ICA

The branch length is usually set to an arbitrary 1. It’s 0.9 something perhaps due to the partial gene trees.

To visualize the tree in figtree, replace all the arbitrary branch lengths with 1, and then use regular expression in geany



The lossless vs. probabilistic IC/ICA values are very different.

2017-3-1 summarize species trees for WGD paper

Update raxml. The newer version does ICA scores differently

$ cd ~/apps

$ git clone

$ cd standard-RAxML

$ make -f Makefile.AVX.PTHREADS.gcc

Running initial rsync to external hard drive after formatting to exFAT


Feb 16, 2015 Test Trinotate

Try blastp using the original peps against uniprot_unirep90 on bigmem3 with 19 cores:

$ blastp -query longest_orfs.pep -db uniprot_uniref90.trinotate.pep -num_threads 19 -max_target_seqs 1 -outfmt 6 > IlleSFB.uniref90.blastp.outfmt6

The run started 3pm on Feb 18 and finished Feb 20 22:07

$ blastn -db all.fa -query all.fa -evalue 10 -num_threads 10 -max_target_seqs 10 -out all.rawblastn -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore'

$ mcl all.rawblast.hit-frac0.4.minusLogEvalue --abc -te 5 -tf 'gq(5)' -I 1.4 -o hit-frac0.4_I1.4_e5

$ python ~/phylogenomic_dataset_construction/ all.fa hit-frac0.4_I1.4_e5 3 clusters/

$ awk '{print $1}' IlleSFB.uniref90.blastp.outfmt6 | uniq | wc -l


Feb 18, 2015 Test Trimmatic

Use the beginning sequence of TruSeq adapter to search unfiltered (*.fq) versus filtered reads


Lots of adapters as expected

Filtered by in-house scripts based on simple blastn:


None! :)


Lots of adapters found, followed by poly-A seqs. Hmm…


Only one found

Unzip the paired-end output from Trimmatic to take a closer look

$ gzip -d TelSFB_1.fq.P.qtrim.gz TelSFB_2.fq.P.qtrim.gz


Again lots of adapters!

Try using user-specified adapter sequences to feed Trimmatic

Now grep again


No adapter found! Supply custom adapters, and ignore readthrough since we are using short reads, not like MiSeq that gives long reads. All other default settings given by Trinity looks reasonable:


Feb 18, 2015 Test Transdecoder v2

As of v2 blastp to Beta and Arabidopsis is fast and effective, whereas hmmscan takes 2-3 days and do not make use of more than 3 cores. Trying to figure out whether hmm is needed at all

Total initial longest orfs

$ grep '>' IlleSFB.Trinity.fasta.transdecoder_dir/longest_orfs.pep | wc -l


Number that had pfam hits

$ awk '{print $4}' IlleSFB.pfam.domtblout | uniq | wc -l

42042 (this included some header lines)

Number that had blastp hits

$ awk '{print $1}' IlleSFB.blastp.outfmt6 | uniq | wc -l


Compare the pfam vs. blastp hits

$ awk '{print $4}' IlleSFB.pfam.domtblout | sort | uniq >pfam

remove all the pfam output headers

$ awk '{print $1}' IlleSFB.blastp.outfmt6 | sort | uniq >blastp

with open("blastp","r") as infile:

        blastp_query_ids = set(

with open("pfam","r") as infile:

        pfam_query_ids = set(

print len(blastp_query_ids), len(pfam_query_ids)

print len(blastp_query_ids - pfam_query_ids)

print len(pfam_query_ids - blastp_query_ids)

48644 42030

So there are 639 that have pfam hits but not blastp hits to either Arabidopsis or Beta. Hand check some of these:

These are interesting but

→  For phylogenomic analysis, only by blasting to plants are probably good enough.

Feb 25, 2015 test new seq processing pipeline

Added a test dir in the optimize assembler dir and put it on bitbucket:


Has ~1M  paired-end reads using KAPA stranded kit

Trimmatic, read pairs, stranded assembly, stranded translation
total reference coverage in bp: 179161
total number of reference seqids covered: 1051
redundancy: 2.50903901047

Filter reads, read pairs, stranded assembly, stranded translation

total reference coverage in bp: 173734

total number of reference seqids covered: 1028

redundancy: 2.51848249027

Trimmomatic, read pairs, non-stranded assembly, non-stranded translation
total reference coverage in bp: 176169
total number of reference seqids covered: 1041
redundancy: 3.12199807877

Filter reads, read pairs, non-stranded assembly, non-stranded translation

total reference coverage in bp: 180121

total number of reference seqids covered: 1058

redundancy: 3.10586011342

Trimmomatic, only forward single end reads,  non-stranded assembly, non-stranded translation

total reference coverage in bp: 65772

total number of reference seqids covered: 390

redundancy: 3.20512820513

Filter reads, only forward single end reads,  non-stranded assembly, non-stranded translation
total reference coverage in bp: 65193
total number of reference seqids covered: 387
redundancy: 3.23514211886

–> Read pairs info are used and the stranded lib prep and assembly worked nicely

–> Use Trimmatic for trimming and not the filtering scripts!

Mar 3, 2015 Test old thinning scripts vs. trimmomatic

Old thinning settings:




Trimmomatic settings: SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:25

Both using Trinity v2.0.3 to 2.0.5 and transdecoder v2 with blastp to Beta vulgaris for coverage

Amaranthaceae Agriophyllum_squarrosum


total reference coverage in bp: 5931459

total number of reference seqids covered: 14241

redundancy: 4.20019661541


total reference coverage in bp: 5936103

total number of reference seqids covered: 14244

redundancy: 4.23308059534

Mar 9, 2015 Test Stephen’s conflict concordance calculator

Install the maven integration in Eclipse:

Have to use the older version 1.3. The default m2e 1.5 doesn’t work

RNBN        Molluginaceae_Mollugo_cerviana

SCAO        Molluginaceae_Mollugo_nudicaulis

KJAA        Molluginaceae_Mollugo_pentaphylla

NXTS        Molluginaceae_Mollugo_verticillata

$ java -jar phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 0 -d tre -m rooted.tre -s 0







$ java -jar phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 0 -d tre -m rooted.tre -s 50






$ java -jar phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 0 -d tre -m rooted.tre -s 80






$ java -jar phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -d tre -m rooted.tre -s 0

Mapping tree:rooted.tre

Read 2885 trees


$ java -jar phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -d tre -m rooted.tre -s 50


$ java -jar phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -d tre -m rooted.tre -s 80


Turns out that KJAA is from two data sets, in which the juvenile set is a mix with NXTS

Mar 11, 2015 Test Molluginaceae topology

Download HURS from 1KP to replace KJAA

Conflict and concordance:

Tree to map on: (HURS,(NXTS,(SCAO,RNBN))); This is the ML-CA tree

java -jar phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 0 -d extracted_clades -m rooted.ingroup.tre -s 0


        alt (261):HURS,RNBN  SCAO,NXTS

        alt (292):HURS,NXTS,RNBN  SCAO

        alt (311):SCAO,HURS NXTS,RNBN

        alt (333):SCAO,HURS,NXTS  RNBN

        alt (567):SCAO,HURS,RNBN  NXTS

        alt (676):HURS,NXTS  SCAO,RNBN # second, versus 877 for the first


        alt (337):SCAO,NXTS HURS,RNBN

        alt (255):HURS,RNBN SCAO,NXTS

        alt (290):HURS,NXTS,RNBN SCAO

        alt (298):SCAO,HURS NXTS,RNBN

        alt (345):SCAO,HURS,NXTS RNBN

        alt (394):NXTS,RNBN SCAO,HURS # second, versus 1223 for the first





bootstrap filter 80:


        alt (121):HURS,RNBN SCAO,NXTS

        alt (125):HURS,NXTS,RNBN SCAO

        alt (165):SCAO,HURS NXTS,RNBN

        alt (213):SCAO,HURS,NXTS RNBN

        alt (323):SCAO,HURS,RNBN NXTS

        alt (392):HURS,NXTS SCAO,RNBN


        alt (170):SCAO,NXTS HURS,RNBN

        alt (117):HURS,RNBN SCAO,NXTS

        alt (121):HURS,NXTS,RNBN SCAO

        alt (159):SCAO,HURS NXTS,RNBN

        alt (216):SCAO,HURS,NXTS RNBN

        alt (213):NXTS,RNBN SCAO,HURS





        alt (121):HURS,RNBN SCAO,NXTS

        alt (125):HURS,NXTS,RNBN SCAO

        alt (159):SCAO,HURS NXTS,RNBN

        alt (165):SCAO,HURS NXTS,RNBN

        alt (213):SCAO,HURS,NXTS RNBN

            alt (216):SCAO,HURS,NXTS RNBN

        alt (323):SCAO,HURS,RNBN NXTS

        alt (392):HURS,NXTS SCAO,RNBN

bootstrap filter 90:


        alt (100):HURS,RNBN SCAO,NXTS

        alt (104):HURS,NXTS,RNBN SCAO

        alt (125):SCAO,HURS NXTS,RNBN

        alt (174):SCAO,HURS,NXTS RNBN

        alt (275):SCAO,HURS,RNBN NXTS

        alt (333):HURS,NXTS SCAO,RNBN


        alt (139):SCAO,NXTS HURS,RNBN

        alt (97):HURS,RNBN SCAO,NXTS

        alt (102):HURS,NXTS,RNBN SCAO

        alt (120):SCAO,HURS NXTS,RNBN

        alt (178):SCAO,HURS,NXTS RNBN

        alt (173):NXTS,RNBN SCAO,HURS





bootstrap filter 95 (why not...):


        alt (71):HURS,RNBN SCAO,NXTS

        alt (80):HURS,NXTS,RNBN SCAO

        alt (100):SCAO,HURS NXTS,RNBN

        alt (134):SCAO,HURS,NXTS RNBN

        alt (222):SCAO,HURS,RNBN NXTS

        alt (281):HURS,NXTS SCAO,RNBN


        alt (111):SCAO,NXTS HURS,RNBN

        alt (68):HURS,RNBN SCAO,NXTS

        alt (77):HURS,NXTS,RNBN SCAO

        alt (96):SCAO,HURS NXTS,RNBN

        alt (136):SCAO,HURS,NXTS RNBN

        alt (141):NXTS,RNBN SCAO,HURS





–> (SCAO,RNBN) is the overwhelming majority.


Boot 0

java -jar phyparts-0.0.1-SNAPSHOT-jar-with-dependencies.jar -a 1 -d extracted_clades -m rooted.ingroup.tre -s 0

Mapping tree:rooted.ingroup.tre

Read 2606 trees


Boot 50


Boot 80


–> Cannot see much dups at all

Tree to map on: (HURS,NXTS),(SCAO,RNBN)

Bootstrap 80:


        alt (164):SCAO,NXTS HURS,RNBN

        alt (117):HURS,RNBN SCAO,NXTS

        alt (533):SCAO,NXTS,RNBN HURS

        alt (160):SCAO,HURS NXTS,RNBN

        alt (323):SCAO,HURS,RNBN NXTS

        alt (211):NXTS,RNBN SCAO,HURS


        alt (170):SCAO,NXTS HURS,RNBN

        alt (117):HURS,RNBN SCAO,NXTS

        alt (121):HURS,NXTS,RNBN SCAO

        alt (159):SCAO,HURS NXTS,RNBN

        alt (216):SCAO,HURS,NXTS RNBN

        alt (213):NXTS,RNBN SCAO,HURS





OK, the original one was the dominant topology

April 11, 2015 Test minimum read length for assembly


Here’s the command for minimum read length = 40. All the rest are just changing the read filtering settings

Trinity version: Trinity_v2.0.2
ulimit -s unlimited
/export/apps/trinity/trinityrnaseq-2.0.3/Trinity --seqType fq --trimmomatic --quality_trimming_params "ILLUMINACLIP:/export/people/yangya/phylogenomic_dataset_construction/data/TruSeq_adapters:2:30:10 SLIDINGWINDOW:4:5 LEADING:5 TRAILING:5 MINLEN:40" --max_memory 100G --CPU 2 --full_cleanup --SS_lib_type RF --output TelSFB.trinity --left
../Sample_41627/41627_ATGTCA_L006_R1_010.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R1_006.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R1_009.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R1_007.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R1_005.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R1_008.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R1_003.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R1_004.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R1_001.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R1_002.fastq.gz --right ../Sample_41627/41627_ATGTCA_L006_R2_010.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R2_003.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R2_002.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R2_008.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R2_006.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R2_001.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R2_007.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R2_004.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R2_009.fastq.gz,../Sample_41627/41627_ATGTCA_L006_R2_005.fastq.gz
/export/people/yangya/apps/TransDecoder/TransDecoder.LongOrfs -t TelSFB.Trinity.fasta -S
blastp -query TelSFB.Trinity.fasta.transdecoder_dir/longest_orfs.pep -db /export/people/yangya/apps/TransDecoder/blastp_database/db -max_target_seqs 1 -outfmt 6 -evalue 10 -num_threads 2 > TelSFB.blastp.outfmt6
/export/people/yangya/apps/TransDecoder/TransDecoder.Predict -t TelSFB.Trinity.fasta --retain_blastp_hits TelSFB.blastp.outfmt6 --cpu 2

More stringent filtering is at a cost of both less genes assembled and more genes broken into multiple pieces. However, the real thing that matters are the conserved genes

Min read len


ref cov bp

ref cov # genes






















The ks plots all look quite the same. Perhaps because Tel is indeed not a recent polyploid anyways

May 11, 2015 process TSA data sets and between-taxa Ks in cluster directory. Turns out to be fairly straightforward except that need to force num_core in to be str()

May 14, 2015 Between-species Ks for testing allopolyploidy

The analysis_5SRA_added one was run using

python ~/clustering/ --indir=data --outdir=. --threads=8 --reltip=0.1 --abstip=0.3 --deep=0.2 --ortho=RT --inout=in_out --max_mis_taxa=20 --test n

The new data from Colobanthus quitensis would be really interesting, since it is currently the sole representative of subfamily Sagineae and is an Antarctica cushion with many literatures available.

Extract ingroup clades with at least 8 taxa:

$ python ~/clustering/ 5_homolog .tre inclades_min8taxa 8 in_out .cary

Got 10302 rooted cary clades

$ python ~/clustering/ ../inclades_min8taxa/ '/media/data/YA/cary_project/3_Caryo_Acha/analysis_5SRA_added/2_clustering/all.fa' '/media/data/YA/cary_project/3_Caryo_Acha/seq/all_pep.fa' 4 '/media/data/YA/cary_project/taxon_table'

Fixed the plotting and xlim so that the plots look better now.

No evidence for polyploidy yet but wait for all sample to finish running. Seems that TJES (Spergularia media) has lots of phylogenetic uncertainty and just have been switching phylogenetic locations

Get species tree for plotting using phypart using both RT and 121 with full taxon occupancies. Write fasta from all.fa.cut and prank orthologs


$ python ~/clustering/ 5_homolog .tre 6_ortho/tre 27 in_out

$ python ~/clustering/ . . .fa dna


filter with 28 taxa

Wrote a script to prepare astral input files from the pipeline homolog trees

$ python ~/clustering/ ../5_homolog 28 .

$ java -jar ~/apps/ASTRAL/Astral/astral.4.7.8.jar -i all_ML.trees -o astral.out

$ java -jar ~/apps/ASTRAL/Astral/astral.4.7.8.jar -i all_ML.trees -b all_boot.trees -g -r 100 -o astral_boot.out

$ tail -1 astral_boot.out >astral_boot.tre

Root the austral output tree in figtree and save as newick

Map duplications to the rooted austral output

$ python ~/clustering/ '/media/data/YA/cary_project/3_Caryo_Acha/analysis_5SRA_added/inclades_min8taxa' '/media/data/YA/cary_project/3_Caryo_Acha/analysis_5SRA_added/ASTRAL/astral.out.rooted'

Added arabidopsis, tomato and vitis, plus Colobanthus_quitensis

python ~/clustering/ --indir=data --outdir=. --threads=8 --reltip=0.1 --abstip=0.3 --deep=0.2 --ortho=RT --inout=in_out --max_mis_taxa=20 --test n

Update May 15: that didn’t work out. All the non-Caryophyllales are too far away and attach to the base randomly, even Vitis is too far away to be informative. Get a better annotation for Beta instead. Colobanthus is definitely a distinctive and interesting one though

June 18, 2015 31 taxa, 29 species data set

Downloaded Refbeet 1.2 from the beta vulgaris resourses website. Refbeet 1.2 had 29,088 seqs, compared to 29,831 seqs in Refbeet 1.1. I am hoping this is because the new version has less artifects

Also corrected a bug in adapter filtering for SRA data sets.

python ~/clustering/ --indir=data --outdir=. --threads=8 --reltip=0.1 --abstip=0.3 --deep=0.2 --ortho=RT --inout=in_out --max_mis_taxa=21 --test n

July 17, 2015 pattern search

Giving up working on pattern search based on the 96 taxa data set. Two few Caryophyllaceae taxa and lots of false positives.

Tested on swipe to all_pep.fa. However, swipe only output the top 100 and it is by no means as exhaustive as filtering from  top 10 from each taxa.

bitscore of 20 or so are very random. Don’t worry about short sequences. Can’t do much with them anyways. Change the baiting criteria to 50 bitscore minimal, and at least 10% of the highest overall.

July 20, 2015 genome walking, 43 genomes only

Shell scripts look like:

python ../../src/ chr4/Bv4_070950_jkpx ../../genome_pep 5 2 ../3_results_genome_only/

When cut by 0.5, will cut appart some fast-evolving genes –> Cut by 1.0 only and take

Looks like

python ../src/ 2_baits/chr1 43genomes.pep.fa 2 3_results_genome_new

 python ../../src/ . ee to extract cary clades

Sometimes Beta and Dica don’t stay together due to low phylogenetic information such as ubiquitin proteins and ribosomal proteins.

Aug 11, 2015 Maeda’s arogenate dehydrogenases

Got sequences from both BvADH1 and BvADH2 from KEGG and blasted against Beta vulgaris 1.2. Turns out to be Bv8_193180 and Bv8_193190, not on chr2 but closely related.

Copied over all transcriptomes plus Dica and use the BvADH1 and BvADH2 as bait.

$  python src/ ADH.pep.fa all_transcriptome_Dica_pep . 2

Baiting 10 max per transcriptome

Adding all the genomes from baiting once.


$ python ../src/ . .tre 0.5 1


WPYJ@16699 1.08103251874

MJM2311@30920 3.30811201396

RUUB@74258 1.08092328713

MJM1162@69292 0.766611520665

$ python ../src/ . . y

$ python ../src/ . .mm 0.6 50 .

$ python ../src/ ADH.swipe_genome.fa ADH_1.subtree

write fasta


remove coqu@78223 from ADH_1.subtree.fa to write ADH_1_rm.subtree.fa. clear contamination and placed outside of Caryophyllales.



The NAD+ biding domain GFGNFG is duplicated in AtADH1 but not AtADH2 or Caryophyllales.

prank for 3 iterations

prank -d=./ADH_1_rm.subtree.fa.temp -o=./ADH_1_rm.subtree.fa.prank.aln  -protein

killed when iteration No. 4 is running

phyutility -aa -clean 0.3 -in ./ADH_1_rm.subtree.fa.prank.aln -out ./ADH_1_rm.subtree.fa.prank.aln-pht

trim and mask using the same script for cds and pep

python ../../src/ . .mm 1 50 .

April 30, 2015: bait MYB from the 202 taxa data set

First, only bait out genomes.

$ python src/ MYB.pep.fa ../all_pep 4

evalue cutoff is 0.00001, max to bait is 10, and bitscore cutoff is 50

Got the file MYB.pep.fa that had 411 sequences. Adding additional Beta vulgaris sequences that are not in RefBeet v1.1:







The starting fasta file MYB.swipe.fa had a total of 414 sequences.

Align, trim and tree building using the default script

Get BAli-Phy from Append bin to the pass

Jul 9, 2015: partial genome walking

## Gather the baits

From Sam: start at the MYB (identified by Hatelstadt et al.)  that is to the 5’ of the CYP/DODA cluster. Then there is actually a second MYB that appears to be an Amaranthaceae-specific paralog of the 5’ MYB, that is to the 3’ side of the CYP/DODA cluster. I would recommend this as an end point.  Although we don’t know if the 5’ MYB is the novel paralog, and this 3’ MYB has not been characterised.

The MYB identified by Hatlestad et al. 2014 is:

>gi|356968735|gb|JF432080.1| Beta vulgaris R2R3 MYB transcriptional regulator mRNA, complete cds

>gi|356968736|gb|AET43457.1| R2R3 MYB transcriptional regulator [Beta vulgaris]

Search this sequence against the Beta vulgaris genome v1.2

makeblastdb -in Beta.pep.fa -parse_seqids -dbtype prot -out Beta.pep.fa

swipe -d Beta.pep.fa -i MYB.pep.fa -p blastp -o MYB.pep.fa.swipe -m 8 -e 1

These are the top hits

Beta_vulgaris@gi_356968736_gb_AET43457_1        lcl|Beta@Bv4_084080_rwwj_t1        50.00        134        67        0        16        149        14        147        2e-35        145.2

Beta_vulgaris@gi_356968736_gb_AET43457_1        lcl|Beta@Bv_000930_crae_t1        57.63        118        50        0        16        133        14        131        3.3e-35        144.4

Beta_vulgaris@gi_356968736_gb_AET43457_1        lcl|Beta@Bv5_114810_nmrg_t1        40.10        202        114        3        14        211        12        210        5.7e-35        143.7

Beta_vulgaris@gi_356968736_gb_AET43457_1        lcl|Beta@Bv1_005970_qxpi_t1        46.26        147        78        1        16        162        14        159        1.7e-34        142.1

Beta_vulgaris@gi_356968736_gb_AET43457_1        lcl|Beta@Bv2_027660_ihfg_t1        48.63        146        64        2        16        151        13        157        8.2e-34        139.8

So the previously un-annotated MYB genes in Beta vulgaris genome annotation v1.1 are still unannotated in v1.2

MYB (Hatelstadt, Bv2g027795_jkkr) ------> DODA -> CYP76AD1 ----> MYB (Bv2g030925_ralf)

According to the RefBeet genome size (567 MB) and browsing the genome, the Stracke paper used RafBeet v1.2 instead of 1.1 (569 MB). Both the absolute and relative coordinates are different between Beta vulgaris genome annotation v1.1 and v1.2.

DODA (id ptjc) had the relative coordinate of Bv2_030650 in v1.1 but is Bv2_029920 in v1.2

CYP76AD1 (ucyh) had the relative coordinate of Bv2_g030670 in v1.1 but is Bv2_029930 in v1.2

The mysterious gene with no hits between DODA and CYP76AD1 in v1.1 is gone in v1.2. It had low cDNA coverage and didn't have any significant hits in nr so it was probably mis-annotated.

So the region to walk on v1.2 would be

MYB (Hatelstadt, Bv2g027795_jkkr) ------> DODA (Bv2_029920_ptjc) -> CYP76AD1 (Bv2_029930_ucyh)----> MYB (Bv2g030925_ralf)

Get the R2R3-MYB genes from Stracke et al. 2014

According to Stracke, both AET43456 and AET43457 (different cultivars from Hatlestad et al. 2014) correspond to  Bv2g027795_jkkr

Copied out all peptide sequences annotated from v1.2 that are between chr2 027795 and 030925

This region includes 4 R2R3-MYB genes

027795 is in the MYB1 lineage almost identical to the one from Hatelstadt

030925 is a Beta-specific duplication in MYB1, sister to 027795, and duplicated after diverging from *Alternathera*, with long terminal branch

The other two R2R3-MYB genes are not in the MYB part we pulled out.

Added jkkr, mxck and ralf, which are previously unannotated, to the bait.

## Cluster the baits

makeblastdb -in Beta_genome_section.fa -parse_seqids -dbtype prot -out Beta_genome_section.fa

swipe -d Beta_genome_section.fa -i Beta_genome_section.fa -p blastp -o Beta_genome_section.fa.swipe -m 8 -e 10

Filtered results by:

- Bitscore > 50

- Remove self hits

- remove lcl|

Looks like there are some segmental duplication that baits can be clustered. Splice variants should be merged too.

python ../src/ Beta_genome_section.fa.swipe Beta_genome_section.fa baits/

bait with maximum 20 seqs, minimal bitscore 50, and evalue cutoff 1 (xxxx_1.fa) and 0.75 (xxxx_1_1.fa). Some also had a third cut of 0.5 if needed.

All runs look like:

python ../../src/ nzck.pep.fa ../../all_pep 20 2

Search for Bv2_ for all the genes that are on chromosome 2 (some times multiple are on one gene tree)

aeat is kind of interesting

in erxy (12 baits), 027900, 027930, 027950, 027970 is interesting too. Pull these out and use these four baits to pull 10 per taxa and run again. It's a receptor kinase

sqtu: DNA_processing_A superfamily: cytokinin riboside 5'-monophosphate phosphoribohydrolase LOG8-like

zarw: no idea what is going on. High rate of evolution, lots of horizontal gene transfer among families (?? or massive contamination ??). Behaves like a transposon but doesn't get annotated as a transposon. Use the cds to take another look



Also is interesting: with only 2 caryophyllaceae and no molluginaceae vesicle-associated membrane protein 711 [Beta vulgaris subsp. vulgaris]



1 molluginaceae and no caryophyllaceae

Its closely-realted lineage is highly represented by Cayrophyllaceae

There are some worrisome contamination of polygonaceae got nested in Caryophyllaceae. Will need to clean this up

nswe contains NB-ARC domain and is one of the R genes. It's highly duplicated

nzck is super conserved ADP-ribosylation factor 1-like protein. Almost identical to arabidopsis, and multiple, nealy identical copies in each species

Jul 10, 2015: optimize genome walking on the 205 taxa data set

Added the new spinach genome annotation and replace the sra set


on all 163 transcriptome peptides. The shell scripts all look like

cd-hit -i AAXJ.pep.fa -o AAXJ.cdhit.pep.fa  -c 0.99 -n 5 -T 4

This way redundant seqs would be less of an issue and I can only take top 10 per taxon instead of 20, making alignment and tree-building much easier and more accurate

Copied all 162 taxonID.cdhit.pep.fa

Prepare baits

Remove peptides that are less than 30 aa in lengths, sort out baits by chromosome (just for easier parallel) and sort splice variants into one bait file

Genome walking is conceptually similar to the idea of phylome


The commands look like

$ python ../../src/ chr1/Bv1_000010_iiuw ../../all_pep/ 10 2 ../3_results

Downloaded all the polyphenol oxidase of Caryophyllales from GenBank,  and use one to bait

swipe -d all.pep.fa  -i  ppo.pep.fa -a 8 -p blastp -o ppo.pep.fa.swipe -m 8 -e 0.00001

Aug 17, 2015: Check Beta walking results (genomes only)

Of the 26841 total genes in Beta vulgaris, only 18861 had .subtree files. The rest are often very short seqs. Using Beta vulgaris for genome walking may miss some genes due to the quality of annotation of Beta vulgaris

Swipe each of the cd-hit transcriptome peptide data set against Beta.pep.fa

$ makeblastdb -in Beta.pep.fa -parse_seqids -dbtype prot -out Beta.pep.fa

Swipe all the 162 transcriptomes against Beta.pep.fa

Aug 18, 2015: test OrthoFinder on 4 genomes in Malphigiales

python -f '/media/data/YA/euph_project/test_orthofinder_4genomes/data' -t 6

I didn’t like they filter for length normalized reciprocal best hits


Feb 7, 2014 Cleanup cary rates data sets

- Remove EGOS since HMFE is the combined data set including EGOS

- Added LDEL

- Changes some spellings but eventually need to find a way to check

Did a little test on MJM1070 (ILU1). So there is a tradeoff between redundancy vs. completeness

stranded        parafly        pfam        direction        total_coverage(bp)        total_coverage(num)        >=60%        >=80%        redundancy

yes        yes        yes        both        3202636        11292        6745        5332        6.468030464

yes        yes        yes        plus        3075471        11005        6460        5008        3.344661517

no        no        no        N/A        3724364        11812        7849        6197        3.66313918

yes        no        yes        both        4015653        12805        8631        6969        6.856384225

yes        no        yes        plus        3948266        12383        8468        6865        5.560607284

So as of now it’s better to use trinity stranded with original Butterfly settings

Feb 10, 2014 Cleanup cary rates data sets

download bedtools-2.17.0 (apt-get doesn’t work)

cd into the dir


all the executables are in the bin

Download Utricularia gabba genome and gff files from

Remove “lcl|” from contig names so that contig names and .gff files match

$ ~/apps/bedtools-2.17.0/bin/bedtools getfasta -fi Utricularia_gibba_genome_19475_name_changed.fa -bed Utricularia_gibba_gid19475.bed -name -s -fo Utricularia_gibba.fa

Somehow the bedtools -split doesn’t work. It only output the unspliced genes

I wrote my own tool for format conversion:


It worked but the bed file only had the gene location including all the intron


- Translated using transdecoder without Pfam.

- Some were translated in both + and - strand

- Some genes are not complete due to partial contig/scaffold

- Mostly good. Removed all the minus strand results from both pep and cds before cd-hit

Phytozome (19 in total)

- Removed Phaseolus vulgaris, Cucumber, Brassica rapa, Gossypium raimondii, Theobroma cacao,Citrus sinensis, Citrus clementina. These are from unpublished genome from the outgroups. It doesn’t seem to add anything

- Also removed Arabidopsis lyrata since it doesn’t add things and its close relative A. thaliana is available

- Fix seq names:

$ sed -i 's/|/__/g' *.fa

$ sed -i 's/:/_/g' *.fa

$ sed -i 's/\./_/g' *.fa

$ sed -i 's/ transcript.*//g' Gmax_275_Wm82.a2.v1.*

$ sed -i 's/ /__/g' Gmax_275_Wm82.a2.v1.*

$ sed -i 's/=/_/g' Gmax_275_Wm82.a2.v1.*

$ sed -i 's/-/_/g' *.fa

$ sed -i 's/__polypeptide.*//g' Gmax_275_Wm82.a2.v1.*

- Notices that there are 15 more Carica papaya protein sequences than cds

May 9, 2014 Process 454 data sets

Registered and downloaded Newbler but  the installation guide is always looking for libraries that I cannot find :(. I Really don’t like these commercial softwares that have a GUI and pretend to be “easy to use” but not really...

Download seqclean

Installed MIRA.

Couldn’t get it to run. Give up

Jun 3, 2014 Benchmark framedp vs transdecoder using grape data sets

farmedp was able to translate twice as many original transcripts and 10% more coverage. Although it takes about 3 days to run each data set. Just let it run on the background for now.


Dec 17, 2012 Test Illumina TruSeq stranded library prep kit

Percent plus / minus after translation

data_set        total_pep        minus_pep        percent_minus

MJM1649        42458                5537                13.04%

MJM1652        29568                3050                10.32%

MJM1697        43026                2669                6.20%

MJM1736        33774                2534                7.50%

MJM1737        42919                4369                10.18%

MJM1741        42316                3215                7.60%

MJM1749        31828                3843                12.07%

MJM1751        40489                4215                10.41%

MJM1755        42688                4930                11.55%

MJM1771        45810                4410                9.63%

SFB29                35243                3740                10.61%

SFB31                37428                3850                10.29%

- Those that are minus strand are short. Consistent with 1% negative strand as claimed by the stranded kit.

- Compared both fixing the name to \1 and \2 versus not. The first step of stranded assembly in Trinity fixes the names as part of the pipeline and the final assembly all look the same

Check pep coverage against Arabidopsis proteome (27416 peptides)

Very close around 50%, 63.5 for the Brassicaceae. Check subsampled data sets too

data_set        #uniq_hits        perc

MJM1649        14182                51.73%

MJM1652        12349                45.04%

MJM1697        13726                50.07%

MJM1736        13843                50.49%

MJM1737        14327                52.26%

MJM1741        13527                49.34%

MJM1749        13278                48.43%

MJM1751        13809                50.37%

MJM1755        17407                63.49%

MJM1771        14038                51.20%

SFB29                13362                48.74%

SFB31                13544                49.40%        

- Looking good! Although blasting to genome distantly related the percentage is higher than the Ricinus percentage. The top hits certainly do not represent orthologs but it should be close

Checking the six 2012 samples using NEB Stranded Enzyme mix with Illumina TruSeq v2 (non-stranded)

- Originally I assembled them with default settings  of trinity version 2013-2-25 and then blastx for direction and found around 60% plus and 40% minus.

- Assembled all six data sets using stranded option of Trinity version 20131110 (not pasafly), translate, and see the direction of the peps and cdss.

- The fasta output for ILU1, ILU2 and ILU3 are all about twice the size as the old non-stranded assembly. However, this can be from the different trinity versions too

- Transdecoder 20131117 using default parameters without pfam

        plus                minus                perc_minus

ILU1        77935                22394                22.32%

ILU2        195299        36797                15.85%

ILU3        120391        35482                22.76%

Seems to be a sort-of-worked directional prep. Use non-directional assembly for these six samples

Check accumulation curve to decide whether multiplexing 8 or 10 is reasonable (currently 6) subsamples to 75%, 50% and 25%

75% and full set are fairly similar but occasionally the full set gives longer assemblies.

9/4/2013 Test ARC assember


Install SPAdes

download from

simply add the bin to the PATH

Then install ARC

download from

cd into the dir

python install

set up the configuration file. cd into the working dir with the config file, and type /home/yayang/apps/ARC-1.0.0/bin/ARC

Since blat doesn’t take fastq, had to change the fastq read files to fasta if using blat for mapping

None of the newbler ones worked.

Mapping percentages are low: 456543/6220964. = 0.07338782220890525 for blat using default settings;

as for bowtie2:

6220964 reads; of these:

  6220964 (100.00%) were paired; of these:

    6140417 (98.71%) aligned concordantly 0 times

    56458 (0.91%) aligned concordantly exactly 1 time

    24089 (0.39%) aligned concordantly >1 times


    6140417 pairs aligned concordantly 0 times; of these:

      17111 (0.28%) aligned discordantly 1 time


    6123306 pairs aligned 0 times concordantly or discordantly; of these:

      12246612 mates make up the pairs; of these:

        11877039 (96.98%) aligned 0 times

        123191 (1.01%) aligned exactly 1 time

        246382 (2.01%) aligned >1 times

4.54% overall alignment rate

Running bowtie2-spades for four days but it never finish. Give up on it.

Jan 22, 2013 Sort out data sets from SRA & GenBank


Combining two data sets that are about the same time

- Ma 2013 plosone: 162969 sequences. GAHZ versus GAIA are control vs. salt treatment.

- Fan 2013 plosone: SRA database accession SRX302090, reads SRR896619. These are single-end 108 bp reads. Assembled with Trinity default single-end.

Combined these two, change seq names and cd-hit-est

$ cd-hit-est -i all.fa -o all.cdhit.fa -c 0.99 -n 10 -r 1 -T 8

Jan 23, 2013 Using a Brassicaceae transcriptome to benchmark translation tools


transdecoder 42,688 peptides, of which 37758 were plus direction

$ awk '{print $1}' MJM1755.fa.rawblastp | sort | uniq | wc -l

41250 peptides had Arabidopsis thaliana hits, of them 37095 were plus

Checked individual peptides that didn’t have Arabidopsis hits. They are alternative short peptides that are off frame, often around 100 aa or shorter. Tried using FrameDP to translate MJM1755 but it was taking more than a day. Giving up on it

Jan 23, 2013 Benchmark Trinity vs. SOAPdenovo-Trans

Benchmark Trinity v.20131110 vs. SOAPdenovo-Trans v1.03. I concatenated the two resulting blastp results for the third one. Wrote a script python for doing this.

file_name                       coverage(bp)        coverage(#genes)        #>=60%        #>=80%        redundancy

SRR866906.scafSeq           2123918.0               12542                   2445            1131            2.5008770531

SRR866906.Trinity            4991129.0               14854                   11018           8843            3.70014810825

SRR866906.scafSeq-Trinity   5119053.0               15696                   11342          9024            5.5

So Trinity has improved dramatically over the past few versions versus SOAPdenovo-Trans is not good now. I don’t really feel like combining them. It will bring in lots of misassemblies

Jan 28, 2013 RSEM & exemplar for Mike and Norman

perl ~/apps/trinityrnaseq_r20131110/util/RSEM_util/ --transcripts 2_assembly_trinity20131110/MJM1755.Trinity.fasta --seqType fq --SS_lib_type RF --left 1_cleaning_files/MJM1755_R1.fastq.thinned.cleaned --right 1_cleaning_files/MJM1755_R2.fastq.thinned.cleaned --thread_count 10 --output_dir 2_assembly_trinity20131110/RSEM --prefix MJM1755

Updated the file in optimize_assembler repository.

Does the exemplar picked by RSEM has the same gene coverage as the full set?

Using the Brassicaceae MJM1755 to test it.

Subsample the blastp to Arabidopsis result by the exemplar names and run the

file_name        total_coverage(bp)        total_coverage(num_genes)        #assembled_to_>=60%        #assembled_to_>=80%        redundancy        perc_gene_outof27416

MJM1649        4159029        10937        8484        7363        2.0950900613        39.89%

MJM1652        3648004        9741        7568        6492        1.8704445129        35.53%

MJM1697        4104333        10735        8486        7394        2.5109455054        39.16%

MJM1736        4278863        10947        8902        7666        1.9495752261        39.93%

MJM1737        4151019        11071        8528        7324        2.1540059615        40.38%

MJM1741        3912492        10554        7942        6743        2.3389236309        38.50%

MJM1749        4321398        10873        8805        7601        1.9027867194        39.66%

MJM1751        4100880        10741        8421        7265        2.2792105018        39.18%

MJM1771        4105418        10896        8408        7206        2.4195117474        39.74%

SFB29                3765255        10456        7720        6410        1.9265493497        38.14%

SFB31                4099979        10687        8364        7084        2.0685880041        38.98%

MJM1755        6808363        16687        14370        12891        1.4677892971        60.87%

MJM1755        7096059        17113        14935        13461        2.2904809209        62.42%

So picking the highest covered isoform per subcomponent is not perfect. Do not do it for phylogenomics but probably a good idea for designing baits to get cleaner sequences


1. Assemble using Trinity

2. blastx

3. cd-hit-est to reduce redundancy

4. Connected components

5. mafft

7. fasttree

8. Mafft local-pair, trimAl, dna only

9. raxml, dna only

10. prank

11 mm,pp


6/19/2012 (model orthologous group based clustering)

6/21/2012 (check cluster results)

6/22/2012 (mcl)


6/29/2012 (compare clustering methods)

7/6/2012 (write individual homolog files)

7/8/2012 (process alignment and build trees)

7/9/2012 (check alignment)

7/10/2012 (test pruning paralogs)

7/11/2012 (fix alignment)

7/2012/2012 (matrix for NSF full proposal)

7/17/2012 (matrix for NSF full proposal)

9/4/2012 - 10/4/2012 (test clustering and merge assemblies)

10/10/2012 (test hifix with 13 eudicot proteomes)

11/15/2012 (test mcl)

1/24/2013 (benchmark SWIPE)

1/15/2013 (visualize cluster results)

2/6/2013 (graph tools)

2/14/2013 (test translation)

3/26/2013 (test OMA HOGs)

3/27/2013 (prepare 10 spps dataset for clustering)

3/28/2013 (test clustering methods based on all-by-all blastn)



3/29/2013 (align and check clusters)

4/2/2013 (Cluster using phytozome clusters)

4/4/2013 (10spp DNA all-by-all)

4/8/2013 (all-by-all)

4/11/2013 (processing 27 phytozome eudicot + 7spp)

4/12/2013 (checking the alignment)

4/17/2013 (find connected components –> mafft –> clean –> raxml--> cut long branches –> write homolog files)

4/24/2013 (prepare files for SMBE talk)

4/25/2013 (prepare files for SMBE talk)

4/26/2013 (blastx to run for 3 weeks)

5/29-31/2013 (clustering)

100 taxa dataset (73 ingroup + 27 eudicot outgroup):

6/3/2013 (clustering aa seq for the 100 spp dataset)

6/7/2013 (flip DNA directions with blast output)

6/10/2013 (check alignment results and cut)

6/11/2013 (third alignment)

6/18/2013 (streamline pipeline)

6/21/2013 (Bio.Phylo to networkx)

7/2/2013 (swipe)

7/3/2013 (matrix stats)

715/2013 (summarizing jackknife results)

7/17/2013 (pep connected components)

7/18/2013 (pep align-cut)

8/2/2013 (test MCL)

8/6/2013 (finishing up clustering)

8/8/2013 (fsa)

8/9/2013 (raxml)

8/12/2013 (tree stats)

8/15/2013 (sate for improving pep alignment)

8/19/2013 (pep final raxml)

8/21/2013 (cary life history contrast)

8/22/2013 (pep contrast)

8/23/2013 (rates)

8/26/2013 (rates)

Key facts about the Kolmogorov-Smirnov test

Interpreting the P value

9/3/2013 (concatenate –> species tree)

pep species tree

dna species tree

9/4/2013 (all gene tree bootstrap results)



9/5/2013 (jackknife pep)

9/9/2013 (gene tree discordance)

9/10/2013 (tblastx,traitRate)

9/12/2013  (GO categories)

9/13/2013 (clade, contrast vs. GO)

9/16/2013 (sign test for woody/herbaceous)

sign test for all five clades

10/10/2013 (back translation)

10/15/2013 (HyPhy)

To do:

To do later:


1. Assemble using Trinity

$ ulimit -s unlimited

$ perl ./ --seqType fq --JM 20G --left /media/eebsmithnfs1/NEXTGEN_projects/cary_phylogeo/data/ILU_assembly/fastq/ILU6-read_.fq.thinned.shuffled.1 --right /media/eebsmithnfs1/NEXTGEN_projects/cary_phylogeo/data/ILU_assembly/fastq/ILU6-read_.fq.thinned.shuffled.2 --CPU 11 --output ILU6

Will need to re-assembly using SOAPdenovo-Trans once the bugs are fixed and it takes larger datasets

2. blastx

DNA cut clean (in optimize assembler folder)



translation used pre-blastx, pre-cd-hit-est DNA seqs.. using transdecoder(ILU1-6) or framedp (beet) with default values. db for framedp is the eudicot 27 species. Both framedp and transdecoder output multiple peptide sequences per DNA seq. Be careful when changing the names and save the position information

3. cd-hit-est to reduce redundancy


The 0.995 cutoff for DNA is probably too conservative. It is to preserve ILU2

$ cd-hit-est -i ILU1.cutfa -o ILU1.cut-cdhit.fa -c 0.995 -n 10 -r 1 -T 1


$ cd-hit -i AAXJ.pep -o -c 0.99 -n 5 -T 1

4. Connected components


all-by-all blastn to get connected components

$ >all.fa.rawblastn

$ >all.rawblastn.edges

$ > .ccs with list of connected components, and .edge files for each connected components. filter min_ingroup_taxa = 4 at this step

$ > .fa for each cc

with SEQ_LEN_FILTER set to 300

$ >_R.fa for each cc

Total 23985 connected components

For cc1:

awk '{if($4>700)print $1"\t"$2"\t"$3"\t"$4}' cc1.edges >cc1.edges.filterbit700


cmd = "blastp -db ../all.pep.db -query "+DIR+i+" -evalue 0.00001 -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore' -out "+DIR+i[:-3]+"blastp -num_threads=11 -max_target_seqs 100"

- Way slower than blastn. Could use swipe too but it seems to be equally slow. Different from mutual best hits, connected components should be robust and not too dependent on blastn heuristics

- For pep, no flipping direction is needed. However regions without any blast coverage should be cut to get rid of messy seq after frameshift.



filter min_ingroup_taxa = 8 at this step

$ python ~/clustering/ all.pep all.rawblastp_bitscore-filter350.edges.ccs cc_350

with SEQ_LEN_FILTER set to 100


Although ccs1 is still huge do not remove it. If it’s protein seq it doensn’t matter if it’s from chloroplast operon


5. mafft

local-pair to .aln files


cc24 crashes. Used auto (FFT-NS-2) for both cc24 and cc24-2 after cut by 4. Lumen protein? Cleaned alignment looks ok. Same thing happed to cc2, probably due to empty seq left after phyutility. fixed it.


6. clean alignment by filtering columns with less than 0.1 occupancy


$ python inDIR outDIR


MIN_CHR = 100

Originally tried using trimAl. It turned out to be too stringent and completely get rid of seqs that doesn’t align well, and can lose some minor homologs all together. Also, trimAl is not flexible enough to deal with very divergent alignments that violates its assumptions. Even just use -gt 0.9 it still trims away everything.


$ python inDIR outDIR


MIN_CHR = 50

also phyutility needs an -aa flag for amino acid

*Should use the column occupancy curve instead of a fixed percentage for the phyutility cut.

7. fasttree


from -cln files and then break up trees progressively using branch length cutoffs 4.0, 2.0 and finally 1.0. Do not use cutoff lower than 1.0 unless using trimal for trimming matrix. Lower than 1.0 will break up outgroups



Not using trimAl because when the matrix is very messy trimAl will remove most columns



$ inDIR outDIR

output fasta files that are prank-ready to the outDIR. Skip to step 10

8. Mafft local-pair, trimAl, dna only

Not using prank yet because prank cannot handle seqs that are fairly diverse

Occationally trim al  produces empty alignment for clusters that are too diverse, for example cc3025

$ python inDIR outDIR

9. raxml, dna only


cd into the folder


Remove intermediate files


use 1.5 to cut, and 8 for minimal ingroups


Testing a 0.75. Since prank is having problem.

10. prank






$ python ~/clustering/ . .

$ python ~/clustering/ . .



- Removed 4 datasets: JBGU, HTDC, QAIR, HSXO

11 mm,pp

$ cut by 0.5


$ python ~/clustering/ ccs1_mm >ccs1_pp.log

12 final ortho files


- if two tips are from the same taxa (first four characters the same), remove one leaving the one with more characters in the cleaned alignment

- if two tips are from the same taxa and are paraphyletic, we are unable to tell whether it’s from gene duplication or seq variation from assembly or biological reasons. Again, remove one leaving the one with more characters in the cleaned alignment

- Assuming no hybridization, the final split should be split by speciation, the rest by gene duplication



Setting a high cutoff for min_taxa to avoid special cases

Remove outliers and small datasets

Remove 4 datasets: JBGU, HTDC, QAIR, HSXO

Use final trimal wrapper again to trim alignment at 0.1, 50


1. cd

translation by transdecoder from ILU[1-6].fa before blastx cleanup. Also translated beet using transdecoder. Too slow for other datasets so

6/19/2012 (model orthologous group based clustering)

Download protein family information from PLAZA2.5 ftp site

Also download 13 eudicot protein fasta files from PLAZA2.5

Download tomato and potato protein fasta files from

Make a blast database using all 13 eudicot

cat prote* > proteome_eudicots

makeblastdb -in proteome_eudicots -dbtype prot -parse_seqids

1. blast potato/tomato against 13 eudicot

time /scratch/ncbi-blast-2.2.26+/bin/blastp -db /scratch/proteome_eudicots -query ITAG2.3_proteins.fasta -evalue 0.00001 -outfmt 5 -out ITAG.blast.xml -num_threads=39 -max_target_seqs 1

time /scratch/ncbi-blast-2.2.26+/bin/blastp -db /scratch/proteome_eudicots -query PGSC_DM_v3.4_pep.fasta -evalue 0.00001 -outfmt 5 -out PGSC.blast.xml -num_threads=39 -max_target_seqs 1

2. blast four 1kp sequences against 13 eudicot

time /scratch/ncbi-blast-2.2.26+/bin/blastp -db /scratch/proteome_eudicots -query -evalue 0.00001 -outfmt 5 -out IWIS.blast.xml -num_threads=39 -max_target_seqs 1

time /scratch/ncbi-blast-2.2.26+/bin/blastp -db /scratch/proteome_eudicots -query -evalue 0.00001 -outfmt 5 -out KDCH.blast.xml -num_threads=39 -max_target_seqs 1

time /scratch/ncbi-blast-2.2.26+/bin/blastp -db /scratch/proteome_eudicots -query -evalue 0.00001 -outfmt 5 -out KTQI.blast.xml -num_threads=39 -max_target_seqs 1

time /scratch/ncbi-blast-2.2.26+/bin/blastp -db /scratch/proteome_eudicots -query -evalue 0.00001 -outfmt 5 -out LBZM.blast.xml -num_threads=39 -max_target_seqs 1

3. blast potato+tomato+four 1kp against itself (all by all)

:/[^ -|]/ to find bad characters

In the ITAG part of the concatenated file, I changed ‘â’ to ‘a’ (I didn’t make any change on the original uncatenated files)

>Solyc09g082990.2.1 genomic_reference:SL2.40ch09 gene_region:64083974-64088590 transcript_region:SL2.40ch09:64083974..64088590- go_terms:GO:0048040 functional_description:"GDP-D-mannose-3â ,5â -epimerase 2"

Similarly, in the PGSC part of the concatenated all-by-all file, I changed all the tabs to ‘ | ’ in vi:

:%s/\t/ | /g

time ncbi-blast-2.2.26+/bin/blastp -db all-by-all -query -evalue 0.00001 -outfmt 5 -out IWIS_all-by-all_blast.xml -num_threads=39  -max_target_seqs 5

time ncbi-blast-2.2.26+/bin/blastp -db all-by-all -query -evalue 0.00001 -outfmt 5 -out KDCH_all-by-all_blast.xml -num_threads=39  -max_target_seqs 5

time ncbi-blast-2.2.26+/bin/blastp -db all-by-all -query -evalue 0.00001 -outfmt 5 -out KTQI_all-by-all_blast.xml -num_threads=39  -max_target_seqs 5

time ncbi-blast-2.2.26+/bin/blastp -db all-by-all -query -evalue 0.00001 -outfmt 5 -out LBZM_all-by-all_blast.xml -num_threads=39  -max_target_seqs 5

time ncbi-blast-2.2.26+/bin/blastp -db all-by-all -query ITAG2.3_proteins.fasta -evalue 0.00001 -outfmt 5 -out ITAG_all-by-all_blast.xml -num_threads=39  -max_target_seqs 5

time ncbi-blast-2.2.26+/bin/blastp -db all-by-all -query PGSC_DM_v3.4_pep.fasta -evalue 0.00001 -outfmt 5 -out PGSC_all-by-all_blast.xml -num_threads=39  -max_target_seqs 5

6/21/2012 (check cluster results)

1. Tomato against 13 eudicots: some have matching GO but some are hard to tell since GO terms can be nested

Concatenate all the 13 eudicot GO lists together minus header:

sed '1d' ./PLAZA2.5/go.*.csv > temp

2. Tomato, potato and 4 1kp species against 13 eudicots. Wrote a script

Print out # of uniq gene families

awk -F "\t" '{print $1}' temp | uniq | wc -l

6/22/2012 (mcl)

All-by-all blasts: only saved 5 top hits (4 actually minus hit to itself). Save >20 in the future

Cat all the sequences, make a blast database, and blast each fasta file against the blast database. Check distribution of e-value in R, and pick the cutoff of e-value for clustering


mcl can be found here

$ mcl mcl_input --abc -te 12 -tf 'gq(0)' -I 2.1 -o mcl_out_0_thresh_2.1

where mcl_input is the file with

id1 id2 -log10(evalue)

on each line

and -te 12 are the number of cores for multicore machine (12 is a little overkill , so it could be more like 8 or 10 if you have other things running)

-tf 'gq(0)' is the cutoff for the -log10(evalue). this is 0 which just mean 5 [our default blast cutoff of 1e-5]. you could make that 20 or even more for more stringent cutoffs

-I 2.1 is the inflation value. higher numbers make smaller clusters and lower numbers produce larger clusters

6/29/2012 (compare clustering methods)

All by all: tested e-value 5, 10, 20 and 30; inflation values 1.4, 2, 4, and 6. Use high e-value cut-off (20) and low inflation values (2). However, these values are dependent on datasets.

To eudicots: less # of clusters than from all by all

May cluster to eudicots first and then do all by all for the rest.

7/6/2012 (write individual homolog files)

Wrote a script to write individual ortholog files. Make two dictionaries and they using blast hits to connect them.

hit_id ------ortho_dict--------> ortholog id




seq_id-----fasta_dict--------> sequences

Had to modify it separately for potato and tomato due to the sequence id is broken up by either tab or space.

Alignment with Mafft. Done within less than 10 hours.

Check the alignment in MacClade

- Sometimes one species has two sequences on either side of the protein that are probably one gene but didn’t get assembled

- Some weird seq in alignment that might need to be filtered out. No idea how they got clustered together

- Some alignment obviously has more than one gene per species. This may cause problems if the alignment only has small number of species and do not get broken up

- Some alignment files are empty - they only have a single sequence to begin with

- Some larger files look like random short sequences throwing together. I blasted two of the biggest one: they are retrotransposons!

7/8/2012 (process alignment and build trees)

cd aligned

perl ~/brlab_nextgen/Gblockswrapper

mkdir gb

mv *.aln-gb gb/

rm *.htm

cd gb

perl ~/brlab_nextgen/fasta2phylip

perl ~/brlab_nextgen/raxwrapper

7/9/2012 (check alignment)

Added a filter to so that only aa chain >= 100 will go to Mafft

Use Phyutility instead of Gblocks for cleaning up gaps in alignment

phyutility -clean 0.2 -aa -in INFILE.aln -out OUTFILE.aln.cln

The clean up still doesn’t look ideal

Made a phyutility wrapper to replace Gbox

7/10/2012 (test pruning paralogs)


Ouput: .mm tree files

Chopped up sequences:

- paraphyly

- single sequence that mess things up

- chop ends of sequences that are messy

7/11/2012 (fix alignment)

Final matrix for NSF full proposal: 55 1kp + potato/tomato + 5 ILU = 62 taxa in 62 datasets

Use only the aligned part for paralog pruning will only include more conserved regions and get rid of chimeras, and then use the full sequence for tree-building.

7/2012/2012 (matrix for NSF full proposal)

Download beet protein data from

There is no beet protein sequence available for download

Tried 0.3, 0.5 and 0.7 for the -cleanmessy parameter in phyutility, and 0.5 seems to work well. Since some datasets are really choppy, lowered gap filter to 0.1

Matrix construction pipeline:

Orthology files built from sequence segments with blast hits →

grep to check that all files are represented

Mafft alignment →

Phyutility clean up gaps and characters that doesn’t align→

Gblocks get rid of regions that doesn’t align (relaxed filters and do both with and without Gblockes) Gblocks handle sequences flanking gaps as non-conserved. Therefore give up on Gblocks. Instead Stephen wrote a hidden feature in Phyutility to delete characters in an alignment that doesn’t align well

Filter sequences in cleaned matrix that are shorter than 30% of the matrix →

Fast2Phylip →

RAxML gene family trees →

Monophyly masking →

Paralog pruning →

Mafft align each ortholog s.s. →

check and decide how to clean up →

Fast2Phylip →

final RAxML step

7/17/2012 (matrix for NSF full proposal)

In order to determine occupancy (how many taxa we need for approximately how much missing data), we can do the following

cd into the directory with the cleaned ortholog (passed paralogy pruning) alignments

python ~/brlab_nextgen/ . aln-cln_stats

This will get all the requisite information into an outfile aln-cln_stats (can change the name). Then run R and do

a = read.table('aln-cln_stats')

downo = order(a[,1],decreasing=T)


downp = a[downo,1] / TOTAL NUMBER OF TAXA


To see how many taxa are for each of these occupancies checking

cc = (cumsum(downp)/1:length(downp))

down15 = a[downo,1] >= NUMBEROFTAXATOCHECK (e.g., 15)

(cc[down15])[length((cc[down15]))] = PERCENTAGEOCCUPIED

#sp        occupancy

22        0.5095899

27        0.5936639

Then, ran the script like this to get a list of alignments to include

python ~/brlab_nextgen/ ../new_cleaned/ 50 27 ../27_50occup

Filter alignment with paralogs, i.e. unusual long branches

Also tried using fasttree for fast tree reconstruction. Branch length, esp. for long branch length are quite different. Had to use 0.8 for filtering branch length when using fasttree, but 2.5 for RAxML results.

get a list of alignments, and make a shell file to run concatenation in phyutility

make a models file by copy the concatenated; delete everything else except the partition generated by phyutility

delete lines at the beginning

set wrap!


60_occup took 13.4 hrs to finish on bigmem2 with 9 cores, and on lab linux 14.6 hours with 11 cores

50_occup took 21.5 hrs to finish on bigmem1 with 9 cores

Check how well represented the taxa are in the clusters

9/4/2012 - 10/4/2012 (test clustering and merge assemblies)


- All-by-all blast: e-value predominantly 180

- Run using e value cutoff of 180, but inflation value of 1.4 2, 4 and 6 for clustering give exactly the same results. All four runs give exactly the same clustering results

- Wrote a script to fix the direction by bl2seq, and to merge multiple-k and prank to align on the go.

- Checking alignment shows that some non-related sequences were put into the same alignment. Either inflation value need to be higher or there may be a bug in fixing the direction

Similarity scores

The e-value is dependent on size of database and often get saturated. There are many similarity scores that have been developed, and some of these statistics are compared in the SVM papers.

bl2seq do not depend on overall dataset size. However, it deals with a single sequence file at a time, and even you can pipe it, it is still not efficient. It is also not easy to run in parallel.

- Compare to the protein alignment to see whether things improve

SVM and HMM has been extensively used for protein family clustering. However, MCL may work for our dataset since it’s a lot simpler and has clear-cut blast results.Machine learning algorithms are more for comparing among distantly related model organisms

Test FrameDP instead of prot4EST

yayang@blacklablinux1:~/test_eudicots-sol$ perl ~/apps/FrameDP/bin/ --cfg ~/brlab_nextgen/FrameDP_yy_eudicots-sol.cfg --infile ~/WMLW_51.50.exemplar.fa --outdir ~/test_eudicots-sol (two processors running)

yayang@blacklablinux1:~/apps/framedp-1.2.0/db$ perl ~/apps/FrameDP/bin/ --cfg ~/brlab_nextgen/FrameDP_yy.cfg --infile ~/WMLW_51.50.exemplar.fa --outdir ~/test(ten processors running)

Too bad that frameDP doesn’t do extension or deal with seq with no hits..

I benchmarked translators: translation/bench_mark_translators.ods

blat everything against phytozome peptides

between ramedp and transdecoder: framedp always return better reference coverage

between BGI and Trinity, BGI gives higher coverage regardless of translator

10/10/2012 (test hifix with 13 eudicot proteomes)

To evaluate the performance of silix/hifix, trying it on the Eudicot dataset downloaded from PLAZA 2.5 first.

$ cd ~/clustering

$ cat prot* >~/clustering/

$ makeblastdb -in -parse_seqids -dbtype prot

$ mv

#input fasta file has to end with .fasta, otherwise hifix cannot find the file

$ blastp -db -query -evalue 0.00001 -outfmt 6 -out eudicots_all-by-all.blastp -num_threads=10 -max_target_seqs 100

$ silix  eudicots_all-by-all.blastp  -f FAM --net >  eudicots_SLX.fnodes

$ mkdir hifix_temp

$ hifix eudicots_SLX.fnodes -d ~/clustering/hifix_temp > eudicots_HFX.fnodes

Runs fine but I killed the hifix run after a day. It’s too slow. Either have to speed it up or do something else. The combination of silix-hifix is likely the way to go, but may need some modification. At least hifix is written in Python and should be relatively easy to modify

11/15/2012 (test mcl)

I created a repository that everyone should have access to. I also created some utility scripts including one for making a simple all by all blast outfile.

Right now I am doing this as the all by all blast line

blastn -db 18S_26S.unall.db -evalue 1 -query 18S_26S.unall -num_threads 8 -max_target_seqs 100 -out 18S_26S.unall.rawblast -outfmt "6 qseqid sseqid qlen slen evalue length pident nident mismatch positive ppos bitscore score"

for initial mcl test

can just test various parts of the table with awk so

for evalues

awk -F "\t" '{print $1"\t"$2"\t"$5}' 18S_26S.unall.rawblast > 18S_26S.unall.rawblast.5

transform to -log10 and greater than 0 and I of 2.1

mcl 18S_26S.unall.rawblast.5 --abc --abc-neg-log10 -te 12 -tf 'gq(0)' -I 2.1

transform to -log10 greater than 20 evalue and I of 2.1

mcl 18S_26S.unall.rawblast.5 --abc --abc-neg-log10 -te 12 -tf 'gq(20)' -I 2.1

4 clusters in each case

for pident

awk -F "\t" '{print $1"\t"$2"\t"$7}' 18S_26S.unall.rawblast > 18S_26S.unall.rawblast.7

mcl 18S_26S.unall.rawblast.7 --abc -te 12 -tf 'gq(0)' -I 2.1

3 clusters

26S_Oryza seems to be a problem

trying atpB and matK

blastn -db atpB_matK.unall.db -evalue 1 -query atpB_matK.unall -num_threads 8 -max_target_seqs 100 -out atpB_matK.unall.rawblast -outfmt "6 qseqid sseqid qlen slen evalue length pident nident mismatch positive ppos bitscore score"

for evalues

awk -F "\t" '{print $1"\t"$2"\t"$5}' atpB_matK.unall.rawblast > atpB_matK.unall.rawblast.5

mcl atpB_matK.unall.rawblast.5 --abc --abc-neg-log10 -te 12 -tf 'gq(0)' -I 2.1

for pident

awk -F "\t" '{print $1"\t"$2"\t"$7}' atpB_matK.unall.rawblast > atpB_matK.unall.rawblast.7

mcl atpB_matK.unall.rawblast.7 --abc -te 12 -tf 'gq(0)' -I 2.1

1/24/2013 (benchmark SWIPE)

Test SWIPE v.2.05

- Need to build a binary blast style database first. Before that, get rid of all the “|” in the seq ids so that the flag -parse_seqids won’t attempt to follow the ncbi protein naming conventions

$ makeblastdb -in all -dbtype prot -parse_seqids

$ swipe -d all -i all -M BLOSUM62 -a 8 -p blastp -o all.swipe -m 8 -e 1

To benchmark SWIPE vs. blastp:

Took the first 100 sequences form WMLW to make test.fa

$ time swipe -d all -i test.fa -M BLOSUM62 -a 20 -p blastp -o all.swipe -m 8 -e 1

real        0m17.747s

user        4m42.854s

sys        0m3.024s

$ time blastp -db all -query test.fa -evalue 1 -outfmt 6 -out test.blastp -num_threads=20 -max_target_seqs 100

real        0m18.378s

user        1m7.712s

sys        0m0.100s

1/15/2013 (visualize cluster results)

$ sed -i 's/lcl|//g' all.swipe.reformated

$ awk -F '\t' '{if ($12>=100 && $1!=$2)print $1,$2,$3,$4,$12}' all.swipe.reformatted >all.swipe.reformated.filtered

swipe output file columns:

0-Query id                1-Subject id        2-% identity        3-alignment length        4-mismatches

5-gap openings        6-q. start                7-q. end                8-s. start                        9-s. end 10-e-value                11-bit score

2/6/2013 (graph tools)

Python graph libraries:

igraph: Works with over a million nodes

networkx: Written in C and only use python for interface


2/14/2013 (test translation)

Stop codons in blastx alignment

Found in xml files, and got carried over to pro4EST results, except the old 1kp translation before ILUs.

- Blast versions? Tried both blast 2.2.26 and 2.2.27: no difference. Both output stop codons in alignments. Same in the online version of blast.

- Stop codons in the middle of blast hits often associated with Ns in DNA, and obvious frameshift and shifting back. There’s no way to correct this using markov models that doesn’t take homology into account (or there is any?)

3/26/2013 (test OMA HOGs)

OMA only takes protein sequences. It does not allow “*”, which is stop in aa seqs. Checked that both the potato and tomato non-redundant protein sets do not have internal stop codons. Removed all stop codons at the end. No stop at the BGI aa output

Four species: potato, tomato, XSSD, WMLW

Will take 2 days just to run the all by all comparison. Definitely need to parallel this part in the future. However, I can’t get bsub to work, and the instruction            

bsub: Command not found.

               You need to modify your ~/.login file, add lines:

                     if ( -r /usr/local/etc/d0local.login ) then
                           source /usr/local/etc/d0local.login

doesn’t seem to work either

3/27/2013 (prepare 10 spps dataset for clustering)

tomato, potato: no filtering. Directly use the downloaded non-redundant aas

Blastx these eight files against all eudicot from phytozome v9 and removed chimeras. Some of the bigger files are from combined mixed tissues







Cut trans chimeras with the size filter set to 300 bp

$  makeblastdb -in 10spp -dbtype nucl -parse_seqids

$ swipe -d 10spp -i 10spp -a 12 -p blastn -o 10pp.swipe -m 8 -e 0.01

swipe took 14 hours to finish 0.5%. It’s not really optimized for DNA. blastn finished in 5 min! Comparing blastn vs. swipe the output are very similar, except the evalue and scores are slightly different, the ordering of output differs, and blastn missing minor hits, up to evalue 10-26 

Looking at the initial swipe results - probably need to do a cap3 step before blastx to merge highly similar seqs

3/28/2013 (test clustering methods based on all-by-all blastn)

Preparing blast files:

$ blastn -db 10spp_DNA.fa.db -query 10spp_DNA.fa -perc_identity 20 -evalue 10000 -num_threads 10 -max_target_seqs 100 -out 10spp_DNA.fa.rawblastn -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore'


Downloaded and unpacked the package

Check and download dependencies:

~/apps/OrthoSelect$ perl perl_scripts/ -o Linux -p t

Says bioperl cannot be downloaded but I have bioperl already

Checked bioperl using

$ perl -e 'use Bio::Perl'

No error message returned so bioperl has been installed correctly

Couldn’t pass the first step.

yayang@blacklablinux1:~/apps/OrthoSelect/perl_scripts$ perl

Can't locate lib/ in @INC (@INC contains: /home/yayang/perl5/lib/perl5/x86_64-linux-gnu-thread-multi /home/yayang/perl5/lib/perl5/x86_64-linux-gnu-thread-multi /home/yayang/perl5/lib/perl5 /etc/perl /usr/local/lib/perl/5.14.2 /usr/local/share/perl/5.14.2 /usr/lib/perl5 /usr/share/perl5 /usr/lib/perl/5.14 /usr/share/perl/5.14 /usr/local/lib/site_perl .) at line 21.

I’ve no idea what “” is, and searching doesn’t return anything


awk -F '\t' '{if($1!=$3)print $1,$3,$16}' 10spp_DNA.fa.rawblastn >10pps.bitscore

sed -i 's/  / /g' 10pps.bitscore

awk -F '\t' '{if($1!=$3)print $1,$3,$15}' 10spp_DNA.fa.rawblastn >10pps.evalue

sed -i 's/  / /g' 10pps.evalue

awk '{if($3==0.0)print $1,$2,180;else print $1,$2,-log($3)/log(10)}' 10pps.evalue >10pps.evalue-log

3/29/2013 (align and check clusters)

Looking at mcl results from 10spp DNA

- Seems that using bitscore for mcl the groups are very sensitive to parameters. Also bitscore doesn’t take into account of seq size. Back to use evalue

- Among the mcl ouput using evalues, evalue cutoff of 5 or 10 looks more reasonable than 0 or 180

Numbers of clusters: The higher the -log(evalue) cutoff, and the higher the inflation value, the more broken up the clusters

   42394 10pps.evalue-log.mcl_out_Othresh1.4_e10

   40874 10pps.evalue-log.mcl_out_Othresh1.4_e5

   49595 10pps.evalue-log.mcl_out_Othresh2_e10

   48127 10pps.evalue-log.mcl_out_Othresh2_e5

   61622 10pps.evalue-log.mcl_out_Othresh4_e10

   60165 10pps.evalue-log.mcl_out_Othresh4_e5

   67703 10pps.evalue-log.mcl_out_Othresh6_e10

   66248 10pps.evalue-log.mcl_out_Othresh6_e5

- Took the results of 10pps.evalue-log.mcl_out_Othresh2_e5 and did alignment using mafft to fix DNA direction and then use prank to do align the alignment output of mafft. However, neither --adjustdirectionaccurately nor --adjustdirection seem to adjust the direction well

Try to use prank to make translation, align and translate back

$ prank -d=homolog241.fa.aln -o=homolog241.fa.aln.pep -translate -once -DNA -F

Mismatch in backtranslation: (BJKT@2012117;556) --- - != X.

Backtranslation failed. Exiting.

4/2/2013 (Cluster using phytozome clusters)

Downloaded eudicot gene clusters from phytozome v. 9.1 >Tools >BioMart

Get rid of “|”

$ sed -i 's/|/__/g' original_mart_expert.txt >reformated.fa

wrote a script to remove duplicated seqs:

$ python reformated.fa db

$ makeblastdb -in phytozome_v9.1_genefamilies.fa -dbtype prot -parse_seqids

Made alignment with the eudicot clusters. They looked ok, although clearly sometimes have multiple paralogs in one cluster

Tried using cap3 -o 200 -p 99 on downloaded BGI DNA assemblies. Cap3 jumps between seqs to make consensus, and there’s not much to combine. –> do not need to combine DNA. Simply get rid of the “scaffold-” at the beginning of each seq

Downloaded translation by BGI but they only translated about 30% of all seq that have significant hits. Still doing a blastp to see what’s going on.

4/4/2013 (10spp DNA all-by-all)

DNA all-by-all with tighter cut-offs

higher similarity cutoff and higher length cutoff.

Looking at number of clusters from mcl results of DNA all-by-all blastn again:

    41719 10pps.bitscore.mcl_out_Othresh1.4_bit0

    46378 10pps.bitscore.mcl_out_Othresh1.4_bit100

    57530 10pps.bitscore.mcl_out_Othresh1.4_bit200

    71664 10pps.bitscore.mcl_out_Othresh1.4_bit300

    49206 10pps.bitscore.mcl_out_Othresh2_bit0

    53595 10pps.bitscore.mcl_out_Othresh2_bit100

    63904 10pps.bitscore.mcl_out_Othresh2_bit200

    77048 10pps.bitscore.mcl_out_Othresh2_bit300

    57757 10pps.bitscore.mcl_out_Othresh4_bit0

    61927 10pps.bitscore.mcl_out_Othresh4_bit100

    71473 10pps.bitscore.mcl_out_Othresh4_bit200

    83571 10pps.bitscore.mcl_out_Othresh4_bit300

    61956 10pps.bitscore.mcl_out_Othresh6_bit0

    66050 10pps.bitscore.mcl_out_Othresh6_bit100

    75195 10pps.bitscore.mcl_out_Othresh6_bit200

    86685 10pps.bitscore.mcl_out_Othresh6_bit300

    40553 10pps.evalue-log.mcl_out_Othresh1.4_e0

    42394 10pps.evalue-log.mcl_out_Othresh1.4_e10

   118082 10pps.evalue-log.mcl_out_Othresh1.4_e180

    40874 10pps.evalue-log.mcl_out_Othresh1.4_e5

    47815 10pps.evalue-log.mcl_out_Othresh2_e0

    49595 10pps.evalue-log.mcl_out_Othresh2_e10

   120821 10pps.evalue-log.mcl_out_Othresh2_e180

    48127 10pps.evalue-log.mcl_out_Othresh2_e5

    59854 10pps.evalue-log.mcl_out_Othresh4_e0

    61622 10pps.evalue-log.mcl_out_Othresh4_e10

   125413 10pps.evalue-log.mcl_out_Othresh4_e180

    60165 10pps.evalue-log.mcl_out_Othresh4_e5

    65941 10pps.evalue-log.mcl_out_Othresh6_e0

    67703 10pps.evalue-log.mcl_out_Othresh6_e10

   126956 10pps.evalue-log.mcl_out_Othresh6_e180

    66248 10pps.evalue-log.mcl_out_Othresh6_e5

Checking the previous mcl from all-by-all blastn 10spp output. Sometimes a single query-hit pair has multiple HSPs, each has low bitscore or high evalue. Using evalue cutoff of 10 is a bit low. Trying 20:

$ mcl 10pps.evalue-log --abc -te 10 -tf 'gq(20)' -I 2 -o 10pps.evalue-log.mcl_out_Othresh2_e20

- Making clusters with output length filter = 600 looks way better than 300

- Perhaps breaking them into very tight clusters and then attach them in a way similar to hifix. If doing this in DNA then have to constantly flipping the directions. This may cause some problems

Writing a script to compare between different MCL parameters to see where the differences lies

aa all-by-all together with model eudicot

- Cat all phytozome model eudicot with cluster info with 7 of the 10spp aa datasets (minus potato tomato that already in the phtozome; beet that doesn’t have translation for download)

$ blastp -db ./all.db -query all.fa -evalue 0.00001 -outfmt '6 qseqid qlen sseqid slen pident nident length mismatch gapopen qstart qend sstart send evalue bitscore' -out all.blastp -num_threads=39 -max_target_seqs 100

4/8/2013 (all-by-all)

DNA all-by-all together with model eudicot

$ makeblastdb -in all.fa -parse_seqids -dbtype nucl -out all.DNA.fa.db

$ time blastn -db all.fa.db -query all.DNA.fa -evalue 0.00001 -num_threads 10 -max_target_seqs 100 -out all.fa.rawblastn -outfmt '6 qseqid qlen sseqid slen frames pident nident length mismatch gapopen qstart qend sstart send evalue bitscore'

Blastn is super fast. It takes no time compared to blastx or blastp

$ awk -F '\t' '{if($1!=$3 && $16>300)print $1,$3,$15}' all.fa.rawblastn >all.evalue

$ sed -i 's/  / /g' all.evalue

$ awk '{if($3==0.0)print $1,$2,180;else print $1,$2,-log($3)/log(10)}' all.evalue >all.evalue-log

$ mcl all.evalue-log --abc -te 10 -tf 'gq(20)' -I 2 -o all.evalue-log.mcl_out_Othresh2_e20

- Alignments looks great, especially the ones come out of prank. The only problem is that prank is slow, and it only uses one core.

- Seq length cutoff may be too aggressive given the backbone alignment. Can use 300 instead of 600

- Added lines to filter out clusters that have low number of ingroup seqs

Pushing the matrix construction through:

  1. Clean the matrix: phyutility get rid of columns with more than half empty. I’m not doing the clean messy or the filtering short seq step since prank seems to always make gaps for messy regions, and handle local alignment quite well
  2. Using --localpair --maxiterate 1000 for mafft followed by prank is way too slow. Didn’t finish 27 eudicot + 7 1kp after 4 days, without parallel for prank. The mafft --auto still picked --localpair. Finished alignment for 34 spps in 1.5 days.
  3. Wrote a python version of fasta2phylip to replace the old perl one. Couldn’t use the standard biopython library since it truncates seq names to 10 characters. The relaxed format is supported in recent biopython libraries but I wrote my own anyways.

4/11/2013 (processing 27 phytozome eudicot + 7spp)

- Wrote a script to find connected components from MCL input file and output cluster files in the same format as MCL. With inflation value = 1 MCL never converges.

- Check to see how MCL correspond to phytozome clusters

e value \ Inflation
















Checked using awk. It turns out that there’s no evalue that is lower than 20 at all

Checked distribution of pident: nothing lower than 70.

Checked distribution of nidentity and realized that there are many values < 100

Using these filters to go from rawblastn to evalue-log:

$ awk -F '\t' '{if($6>0.7 && $7>200 && $7/$2>0.3) print $1,$3,$15}' all.fa.rawblastn >all.evalue

$ sed -i 's/  / /g' all.evalue

$ awk '{if($3==0.0)print $1,$2,180;else print $1,$2,-log($3)/log(10)}' all.evalue >all.evalue-log

Turns out that $7/$2>0.3 (nmatch/query length) is probably too stringent. 0.2 is better

These are the numbers, including single seq clusters

389269 mcl_I1_e30  

422365 mcl_I1.4_e30

456892 mcl_I2_e30

4/12/2013 (checking the alignment)

- Connected component sometimes looks like just have one gene, good alignment, and eudicot ortho id is the same, but got broken up at I = 1.4. Probably because of incomplete seqs.

- The phytozome ortho id looks good in small gene families, but suspicious in large gene families. I don’t feel like using best hit to eudicot ortho groups for assigning ortho groups. Also fast-evolving seqs form taxa like nepenthus may have higher blastn and bitscores to a random gene family

- The mafft direction-flipping is not always accurate, especially when homology is remote. There are model seqs that come in opposite directions with different gene family numbers

- Tried pulling out a connected component and do mcl on it. It’s very sensitive to evalues and broke up the cluster way too much even under I = 1.001

use 7spp DNA to blastn against eudicot DNA:

$ time blastn -db db -query 7spp.fa -evalue 0.00001 -outfmt '6 qseqid qlen sseqid slen pident nident length mismatch gapopen qstart qend sstart send evalue bitscore' -out 7spp.rawblastn -num_threads=10 -max_target_seqs 3

real        5m9.445s

user        16m18.825s

sys        0m6.988s

4/17/2013 (find connected components –> mafft –> clean –> raxml--> cut long branches –> write homolog files)

$ python ~/clustering/ '/home/yayang/test_clustering/10spp/6_all-by-all_DNA_phytozome-7pp/all.fa' '/home/yayang/test_clustering/10spp/6_all-by-all_DNA_phytozome-7pp/mcl_I1_e30_connected-components' .

Only align connected components end with “00”

$ python ~/clustering/ .

$ python ~/clustering/ aligned

(gap_filter = 0.2)

$ python ~/clustering/ .

Looking at the tree and the corresponding alignment. I think the long branch is caused by wrong direction of the seqs. I don’t have a good solution about this. Probably it doesn’t matter for now. Using longbranch_cutoff = 2.0 is good for now. Also tried prank. It puts in way too many gaps, 2and makes the difference between well-aligned parts and non-homologous parts less obvious.

definitely can’t use best hit for clustering. Some of the pre-defined families doesn’t make sense on the tree. They’re polyphyletic

I modified the find_connected_component file to have a more stringent filtering on raw blast results and streamlined code a bit. Doing it all over again:

$ python ~/clustering/ all.fa.rawblastn

(using pident_cut = 0.6 and bitscore_cut = 200)

output 356,514 connected components

$ mkdir connected_components

$ python ~/clustering/ all.fa all.fa.rawblastn.ccs connected_components

output 13,903 fasta files that have 2 or more ingroup sequences

made a new directory ccs00 that have all connected components end with 00.

using Raxml. May change to fasttree for final dataset

4/24/2013 (prepare files for SMBE talk)

- De-shuffle the cleaned and shuffled reads of ILU1-6, and check like this to make sure the de-shuffle works correctly:

$ grep ' 1:' ILU6-read_.fq.thinned.shuffled.2

- Assemble using Trinity 2013-2-25 since it assembles many more genes. SOAPdenovo-Trans would be nice if it doesn’t crash in ILU2 and has better documentation. The command looks like $ perl ./ --seqType fq --JM 20G --left /media/eebsmithnfs1/NEXTGEN_projects/cary_phylogeo/data/ILU_assembly/fastq/ILU6-read_.fq.thinned.shuffled.1 --right /media/eebsmithnfs1/NEXTGEN_projects/cary_phylogeo/data/ILU_assembly/fastq/ILU6-read_.fq.thinned.shuffled.2 --CPU 11 --output ILU6

4/25/2013 (prepare files for SMBE talk)

Downloaded 70 1kp SOAPdenovo-Trans assemblies and translations. Got a summary of available taxa to Mike Moore. Three potential places for more labwork:

- Portulacaceae + Cactaceae + relatives

- Carnivorous plants

- Phytolaccoid clade

Either way need to troubleshoot RNA extraction

4/26/2013 (blastx to run for 3 weeks)


- Fixed seq names on all 66 1kp, beet, and 6 ILU seqs

- Run blastx on all 73 using exactly the same cutoffs. Keep the contaminated dataset WGET and HSXO for now

5/29-31/2013 (clustering)

100 taxa dataset (73 ingroup + 27 eudicot outgroup):

Add “@” to ingroup seq names and shorten names:

$ sed -i 's/>\([A-Z][A-Z][A-Z][A-Z1-6]\)-/>\1@/g' *.cutfa

$ sed -i 's/-.*//g' *.cutfa

Now the seq names all look like:


Also looked at ncbi sra database for additional datasets

Downloaded 3

Also downloaded the precompiled sra tookit

cd-hit-est to reduce redundancy

cd-hit-est -i all.fa -o all.cdhit.fa -c 0.99 -n 10 -r 1 -T 8

Probably using 0.995 is more conservative than 0.99. More redunction in ILU datasets than BGI ones since Trinity was used for ILU and are more redundant

Fix the criteria for pulling together connected components

Changed to a more conservative one. This way seqs that doesn’t have significant overlap with anything else will be kicked out of any clusters. Plotted the perc Q range from the raw blastn output and got a fairly flat distribution. This definitely has to be filtered.

–> Clusters look much better! Very long seqs can be pulled to the cluster by shorter matches

Filters used for the first round of clustering in the 100 taxa dataset:

pident_cut = 0.8

bitscore_cut = 500.0

perc_Q_range_cut = 0.7

Tried parameters to break down the largest cluster (34m). It turns out to be an chloroplast operon. Just ignore it for now. Also tried breaking down the 6m cluster2, ended up with the largest cluster being 4m. Keeping them for now.

Testing alignment


Always get an error. No idea how to solve it

ERROR: Running MUMmer failed with exit code 134.

terminate called after throwing an instance of 'fsa::Standard_exception'

Aborted (core dumped)

Giving up for now.


- Tested auto vs. local pair for alignment. As long as the clusters are tight enough they do not seem to make difference.

- --addustdirectionaccurately vs. --adjustdirection, the former is much slower but more accurate.

- Problem with mafft is that the flip direction part cannot be paralleled. It takes the most time. Have to manually parallel it

- Modified the mafft_align_folder script to only align datasets with >= 3 ingroups

- By manually parallel all alignment was done using flip direction accurately with the 7ssp+27outgroup dataset!

- Definitlely use Mafft local pire for first alignment. Prank makes more reasonable alighment but really seperates things out and it’ll be hard to make a cut

Looking at the alignment:

- The largest clusters will likely need some human intervention

- Occasionally model seq have both directions in the same connected components due to divergent sequences pulled together. At this point I think it shouldn’t matter since the uncertainly is mainly in the outgroup with the ingroups fairly tight

- Wrote a script to find messy alignment by detecting the same phytozome gene family being aligned in both directions. However, apparently seqs in the phytozome database do occur in reverse direction occasionally. Even if outgroups were occasionally flipped it doesn’t seem to affect ingroups


- FastTree is much faster than RAxML

- Using a gamma flag to prevent short branches

6/3/2013 (clustering aa seq for the 100 spp dataset)

BGI dataset

66 BGI translations by GeneWIse: In the translation folder downloaded from BGI, the dna are the ones have database hit, and are about 30% of the assembled seqs.

Fixed names and they look like:


Translated seq vs. DNA seq do not correspond to each other. Will need to find out how to match the DNA vs. aa alignments later

Translate the 6 ILU and the one beet dataset

Uploaded beet EST data to It only translates inframe sections and do not extend. It’s very conservative. See how it goes. It’s probably good this way since we’re also using DNA

Your unique Job ID is: trn1369840057_397

Submitted 11am, May 29

By 11am, Jun3 there is still no response. I think they suspended the server

Also downloaded wise2 from and installed it

cd into src directory

make all

all the executables are now in the src/bin/ directory

A translation pipeline is needed since wise is just a pairwise alignment tool.

Decide on not trying ESTscan. It’s similar to transdecoder, using a training set for the HMM model and do the rest without blast info. Whereas GeneWise and FrameDP both use blast results. Compare FrameDP, TransDecoder vs reference would be useful for now.

6/7/2013 (flip DNA directions with blast output)

Updated blast+ to 2.2.28 on lab linux

The problem is that when alignment software assumes homology when there’s not much there, it mess up everything except one well-aligned block.

Going back to test changing the gaprate in prank. Instead of the default 0.025, looking at 0.01 vs. 0.001

6/10/2013 (check alignment results and cut)

- 23986 connected components in total

- removed cc1 (chloroplast), with 23985 left

- 10 prank runs on lab linux, finished 3206 over the weekend (2.5 days)

- Prank made very clean alignment. However, after fasttree it has very little power to separate genes. Use mafft local-pair for the first alignment

- After second alignment, prank still make unreasonable alignment, probably due to assume homology

- Use a second mafft local-pair. Also tried using prank for the second alignment (divided connected components)

- Tried using 0.5 as phyutility filter. It’s too high. Some seq were completely removed

- Use 0.2 for the second mafft local pair, with 0.1 for the first

6/11/2013 (third alignment)

- Tested using mafft local-pair vs. prank once or twice

- After trimming set to 0.2, amount of chr left mafft local pair > prank once > prank twice

6/18/2013 (streamline pipeline)

- Realized that the remove kink function would have error if cut at the base, and the rest two are tips. Fixed it and another bug in Re-started all the runs.

- Downloaded the Apr 2013 version of prank and trying to see how many iteration, and whether to use -F. Tested once -F, once, twice, and three times. The more iteration, the bigger the alignment. Visualizing alignment and the cleaned alignment I cannot really tell any difference. They all have the same number of sequences, just the column number slightly different. Made trees using raxml. The resulting branch length are slightly different, and deep nodes have different topology. However, I don’t think it would affect pruning. -> use -once -DNA -F for future prank

- All the tree manipulations are buggy due to rooted trees, especially when it comes to monophyly/paraphyly masking and paralog pruning

6/21/2013 (Bio.Phylo to networkx)

Installed matplotlib by

$ sudo apt-get install python-matplotlib

Also installed ipython

$ sudo apt-get install ipython-notebook

Turned out that there is no usable python library to connect tree to graph. Joseph suggested using R

7/2/2013 (swipe)

Not sure why I forgot about swipe and did blastp instead. Perhaps it doesn’t matter that much for the kind of clustering method I’m using - it’s fairly tolerant to heuristics

$ time swipe -d all.pep.db -i all.pep -M BLOSUM62 -a 10 -p blastp -o all.swipe -m 8 -e 0.00001

killed it after running for 10 hrs. Way too slow.

7/3/2013 (matrix stats)

cd into the dir with all the trimmed final alignments

python ~/clustering/ . ../aln-cln_stats

In R:


a = read.table('aln-cln_stats')

downo = order(a[,1],decreasing=T)

plot(a[downo,1],t='l',col='blue',lwd=3,ylim=c(1,96),xlim=c(1,10000),xlab='gene region',ylab='number of taxa')


a = read.table('aln-cln_stats')

downo = order(a[,1],decreasing=T)

plot(a[downo,1]/96,t='l',col='blue',lwd=3,ylim=c(0,1),xlim=c(1,10000),xlab='gene region',ylab='percent taxa')

downp = a[downo,1] / TOTAL NUMBER OF TAXA


To see how many taxa are for each of these occupancies checking

cc = (cumsum(downp)/1:length(downp))

down = a[downo,1] >= NUMBEROFTAXATOCHECK (e.g., 15)

(cc[down])[length((cc[down]))] = PERCENTAGEOCCUPIED

#sp        occupancy

82        0.8971208

47        0.7001222

14        0.5058244

$ python ~/clustering/ ../12a_96taxa_orthologs/aln-cln 50 82 82_90occup

$ python ~/clustering/ ../12a_96taxa_orthologs/aln-cln 50 47 47_70occup

$ python ~/clustering/ ../12a_96taxa_orthologs/aln-cln 50 14 14_50occup

count how many genes are in each matrix:

$ wc -l *occup

 10457        14_50occup

  5507        47_70occup

  1068        82_90occup

Concatenate using phyutility:

Make the file by taking the 88_90occup file, swap \n with space, and add “phyutility -concat -out 90_occup.nex -in” in front. The shell script looks like:

phyutility -concat -out 90_occup.nex -in …(separated by space)

Move this shell script to the dir with all the .aln-cln files. Run it. It outputs a nexus file with header info on the partitions. Manually change it into a phylip format file: remove headers leaving only the dimension; remove the “;” at the end; remove the tab in front of each taxa by using %s/^\t//g in vi (don’t do it in geany it will crash)

Also make the .models file with the nex header

module purge

module load openmpi/1.4.4-intel

raxmlHPC-PTHREADS-SSE3-icc -T 9 -m PROTCATWAG -p 6666 -q 90occup.models  -s 90occup.phy -n 90occup.bestTree

Raxml always found empty columns and will reduce them. Running from the reduced matrix and model file on bigmem1, but the original file on Joseph’s machine. The full one on Joseph’s machine took ~12 hrs to finish, but the bigmem1 one was only close to finish after about 18 hours.

Check taxa representation in genes:

$ python ~/clustering/ 51_70occup ../12_final_orthologs 51_70occup_taxa_rep

In the lowest (15_50occp) dataset,

Checking outliers and very small datasets to exclude:

JBGU Amaranthus palmeri always have very long branch attached to random places in the outgroup. It is also a small dataset

HTDC is always in the outgroup too and has an extremely small dataset

QAIR has only 3 genes in the 15_50occup dataset. Remove

HSXO is not even in the 5_50occup dataset. Remove

WGET Kochia (Amaranthaceae) was contaminated in analysis 2012, but looks all right now. BGI probably redid it. Keep

JLOV pereskia is small too but looks good

CGGO is very small but looks good

Wrote a script to remove these four taxa from all aligned fasta files. Redid the to remove .mm files with the four to remove.

After getting the 96 taxa alignment, and unpruned .mm trees, calculate taxa representation again

Select .mm trees that have monophyletic outgroup and re-root them.

For trees with one single outgroup taxa have to root one clade into the ingroup. Only use ingroup with >5 names in subsequent analysis

$ python ~/clustering/

2545 trees with at least 60 ingroup leaves were checked, in which rate shift of >1.5 contrast were detected in 2531 trees.


Only looked at trees with at least 60 ingroup taxa.

- If require at least two taxa overlap, out of 1903 trees checked, 6743 duplication sets found in 1539 trees

- if last least one taxa overlap, out of 1903 trees checked, duplication sets found in 1836 trees

Use to reroot the best tree from raxml, and then


To map bipartitions from rooted trees of non-overlapping sets to change the node labels from ratio to percentage to write exclusion files

715/2013 (summarizing jackknife results)

raxml can make unrooted consensus trees, whereas phyutility requires rooting

cat all the result trees into jk30_all_trees

$ raxmlHPC-PTHREADS-SSE3 -J MR -z jk30_all_trees -T 2 -m GTRCAT -n jk30

7/17/2013 (pep connected components)

Clustering pep:                                        

Clustering without eudicot to all blastp. Comparing this with the reall all by all and see whether there is any difference. It would be different in small clusters but very small clusters are of no use anyways.

Modified the script so that it records range, but do not record direction of hits

Trying clustering with a very loose cutoff.

pident_cut = 0.5

bitscore_cut = 150.0

perc_qrange_cut = 0.7

got only 4137 ccs with >= 8 ingroup taxa. Definitely need tighter bitscore_cut

number of ccs: bitscrore_cut 250 (5687 ccs) and bitscore_cut350 (6180 ccs)

7/18/2013 (pep align-cut)

Checked the difference among fasttree defalt, WAG vs WAG-gamma, and cannot tell any significant difference in branch length. Use WAG fit now.

The largest connected component was pulled together by trans-membrane domain. Definitely need more stringent bit score cutoff to break it

8/2/2013 (test MCL)

for DNA:

Tested ProGraphMSA. It’s basically a prank wrapper.

Played with alignment. Using genafpair instead of localpair

Played with trimal. Instead of using auto use a fixed similarity score

python dna_mcl_trimal_wrapper . 0.1

Need to check trimmed alignment to make sure that it doesn't remove everything. Put an output skinny alignment at the end script on the trimal wrapper. Take these out and use

I’m not filtering retrotransposon at this stage. We know so little about them!

Also using 0.5 will kick out some fast-evolving proteins but those are difficult to deal with anyways in the end.

Since DNA has the UTRs, and are tend to be forced to align, trimal should be used for initial trimming with, with occupancy set to 0.1, fixed min_char = 150. Changed dna cuts from 3,1 to 2,0.5. For clusters cut to 0, use (min_column_occu = 0.05, min_char = 150). occupancy cutoff is set to 0.1 for cuoff1, and 0.3 for cutoff 2

It seems that the reason that mcl gives more clusters is that the model seqs are able to pull many short, poorly assembled transcript together. Without mcl these seqs will be “peeled away” from the clusters by the more stringent blast filters

Always have to check the size of matrix if the first trimming step uses trimal

8/6/2013 (finishing up clustering)

dna: to prank after cutting, and phyutility cutting with occupancy cutoff of 0.1 (prank output tend to be very sparse)...didn’t work out. Too variable for prank to handle

pep: prank and mafft at the same time. Give up on prank...too many unreasonable gaps. Final aligning still use mafft genafpair, and clean with trimal -gt 0.2, -st 0.0001, min_chr  max(occupancy*length,50)

Tried macse again but it still didn’t finish after running overnight for a 400-seq dataset. It finished for a much smaller dataset but it seems that it forces global alignment.

8/8/2013 (fsa)

Although fsa install instructions says MUMmer is only needed for long seq, for DNA MUMmer is required. Put MUMmer in the path when installing otherwise fsa runs will fail without an helpful error message.

for dna: use final 0.3 occupancy for cleaning, min_chr = 300

for pep: use final 0.3  occupancy for cleaning, min_chr = 100

FSA still push blocks out and force things to align, making topology not well-aligned

Compare pep fsa vs. mafftt

for mafft: -st 0.0002, min_char = max(float(occupancy)*length,100, occupancy = 0.3

for fsa: use final 0.3 occupancy for trimal, min_chr = 100

For pep, use fsa for now. fsa pushes things out, but also pushes possible frame-shift out. Both mafft and fsa are worse than dna for toplogy because of very conservative motifs and very labile regions, and therefore lack of informative sites after trimming.

Compare dna fsa vs. mafft:

I’ve already known that mafft does a terrible job with dna...

for fsa:  use final 0.3 occupancy for trimal cleaning, min_char=max(float(occupancy)*length,100

8/9/2013 (raxml)

Raxml tend to give longer branch estimation than fasttree

for pep: final mafft, trimal 100,0.3 both columns and rows, raxml. Long branch cutoff

for dna: fsa, trimal 300, 0.2 for columns only. Long branch cutoff  = 0.5

8/12/2013 (tree stats)

Looked at the dna_mcl raxml output from fsa with both trimal and phyutility cleanups. The backbone is bad. Probably prank-based method would work better. Use phyutility with 0.2 for columns only and MIN_LENTH = 300

By looking at alignments: for dna alignment, the leftover part after cutting are mostly the conserved motif, since the rest are very hard to align even with aa. For aa however, since there’s not much changes in the motif part, the majority of phylogenetic signal comes from the non-conserved regions, and is therefore not only have less informative sites and highly rely on the aa substitution matrix, but also highly rely on the alignment, for which none of the alignment program seem to be able to handle, esp. for the rate heterogeneity both among columns and rows.


cut branches longer than 0.5

also cut ingroup  tips that are >0.2 and have >10 contrast


separate out rooted clades with >= 65/69 ingroups or 20/22 outgroups



Comparing the outgroup inference to De Smet 2013 PNAS: out method cannot detect any terminal duplications, or any multiple rounds of duplications on one single internal node. Resolution is highly dependent on taxon sampling

8/15/2013 (sate for improving pep alignment)

install sate:

install dentroPhy by downloading the package from

cd into the dir, and type

sudo python install

Then download sate source code, cd into sate-core, and type

sudo python develop

It creates a number of links for external tools (not using. Stick with prank)

For final alignment cleanup, use trimal with min_column occupancy = 0.3 for pep, min 100

Also tried using a user-defined guide tree for mafft. However, mafft requires a very strict input format, and the ruby script for converting newick format guidetree to the native mafft format never works

Prank itself is iterative. Since it is parametric it deals poorly with rate heterogeneity

8/19/2013 (pep final raxml)

Use 0.6 for final long branch cut instead of 0.5 for dna.

a mess at cc148-1.

The pep dataset is messier in topology and branch lengths, probably due to less characters and possible frameshift.

Aneupolyploid probably played a role in the peaks of duplication. It is continum or discrete

Extensive chromosomal variation in a recently formed natural allopolyploid species, Tragopogon miscellus(Asteraceae).

Just by looking at the cary genome duplications, it appears that genome duplications are followed by increase of net speciation. However, genome duplication may be associated with life history changes as well. the paper “Recently Formed Polyploid Plants Diversify at Lower Rates” needs further scrutiny, especially its method  of detecting recently formed polyploids - by chromosome counts only

 Well-studied polyploid species

commonly exhibit 25–75 % retention of paralogous gene pairs from the most

recent WGD event (reviewed in Lynch 2007 ; Otto 2007 ),

removed the outlier line from the file all_no_dup

0.0485752872        3        2.4424601199e-06        2        19886.8527548

8/21/2013 (cary life history contrast)

Fixed the script for extracting clade. Use two ingroup taxon number cutoffs:

65 for cary, and 20 for rosids for gene duplication

For life history with dna dataset:


[1] 9111


[1] 7391

> 7391/9111

[1] 0.8112172

8/22/2013 (pep contrast)

For life history with pep 60-20 dataset:

> length(a[,1])

[1] 18871

> sum(a[,1]>0)

[1] 13435

> 13435/18871.

[1] 0.711939

For life history with pep 65-20 dataset:

> length(a[,1])

[1] 7513

> sum(a[,1]>0)

[1] 5259

> 13435/18871.

[1] 0.711939

Prepare alignment files for bootstrap

4418 alignments

Then clean by phyutility 0.1,50


8/23/2013 (rates)

fixed the phylo3 library so that reroot doesn’t shift the node labels any more

For pep

tested bootstrap on pep dataset, and Using only extracted clades with average bootstrap values >= 70 helped a lot with mapping duplications

the pattern look quite a bit cleaner!!

Keep on boostrapping the pep alignments written from mm trees

for dna:

Bootstrap the dna dataset:

modified “sate” to “.final.fa” in and applied to the dna dataset, and used the same with 0.1 and 300

make new alignment from mm that has clades in extracted_filter60

accidentally put the dna bootstrap files to the bigsmith pep_mcl dir

8/26/2013 (rates)

While using bootstrap value filter for map duplications, for rates it should be fine without bootstrap values for model clades but directly cut from mm trees


> null=read.table('out_null')[,1]

> model=read.table('modelout_dup')[,1]

> dup=read.table('out_dup')[,1]

> ks.test(null,model)

data:  null and model

D = 0.1163, p-value = 0.004245

alternative hypothesis: two-sided

> ks.test(null,dup)

data:  null and dup

D = 0.0771, p-value = 5.402e-07

alternative hypothesis: two-sided

> ks.test(model,dup)

data:  model and dup

D = 0.1617, p-value = 4.339e-05

alternative hypothesis: two-sided

try filter by 10:

> null=read.table('out_null_filter10')[,1]

> model=read.table('modelout_dup_filter10')[,1]

> dup=read.table('out_dup_filter10')[,1]

> ks.test(null,model)

data:  null and model

D = 0.116, p-value = 0.004387

alternative hypothesis: two-sided

> ks.test(null,dup)

data:  null and dup

D = 0.0569, p-value = 0.0006272

alternative hypothesis: two-sided

Warning message:

In ks.test(null, dup) :

  p-values will be approximate in the presence of ties

> ks.test(model,dup)

data:  model and dup

D = 0.145, p-value = 0.0003645

alternative hypothesis: two-sided

Warning message:

In ks.test(model, dup) :

  p-values will be approximate in the presence of ties

pep subset of about 1000 bootstraptrees. Use bs filter for cary but not for model

> null=read.table('out_null')[,1]

> model=read.table('modelout_dup')[,1]

> dup=read.table('out_dup')[,1]

> ks.test(null,dup)

data:  null and dup

D = 0.0698, p-value = 0.03344

alternative hypothesis: two-sided

> ks.test(null,model)

data:  null and model

D = 0.0314, p-value = 0.2619

alternative hypothesis: two-sided

> ks.test(dup,model)

data:  dup and model

D = 0.0589, p-value = 0.1788

alternative hypothesis: two-sided

None is significant for pep after filtering by 10

Key facts about the Kolmogorov-Smirnov test

•The two sample Kolmogorov-Smirnov test is a  nonparametric test that compares the cumulative distributions of two data sets(1,2).

•The test is nonparametric. It does not assume that data are sampled from Gaussian distributions (or any other defined distributions).

•The results will not change if you transform all the values to logarithms or reciprocals or any transformation. The KS test report the maximum difference between the two cumulative distributions, and calculates a P value from that and the sample sizes. A transformation will stretch (even rearrange if you pick a strange transformation) the X axis of the frequency distribution, but cannot change the maximum distance between two frequency distributions.

•Converting all values to their ranks also would not change the maximum difference between the cumulative frequency distributions (pages 35-36 of Lehmann, reference 2). Thus, although the test analyzes the actual data, it is equivalent to an analysis of  ranks. Thus the test is fairly robust to outliers (like the Mann-Whitney test).

•The null hypothesis is that both groups were sampled from populations with identical distributions. It tests for any violation of that null hypothesis -- different medians, different variances, or different distributions.

•Because it tests for more deviations from the null hypothesis than does the Mann-Whitney test, it has less power to detect a shift in the median but more power to detect changes in the shape of the distributions (Lehmann, page 39).

•Since the test does not compare any particular parameter (i.e. mean or median), it does not report any confidence interval.

• Don't use the Kolmogorov-Smirnov test if the outcome (Y values) are categorical, with many ties. Use it only for ratio or interval data, where ties are rare.

•The concept of one- and two-tail P values only makes sense when you are looking at an outcome that has two possible directions (i.e. difference between two means). Two cumulative distributions can differ in lots of ways, so the concept of tails is not really appropriate. the P value reported by Prism essentially has many tails. Some texts call this a two-tail P value.

Interpreting the P value

The P value is the answer to this question:

If the two samples were randomly sampled from identical populations, what is the probability that the two cumulative frequency distributions would be as far apart as observed? More precisely, what is the chance that the value of the Komogorov-Smirnov D statistic would be as large or larger than observed?

If the P value is small, conclude that the two groups were sampled from populations with different distributions. The populations may differ in median, variability or the shape of the distribution.

This test has poor statistical efficiency.

Many nonparm stats are based on ranks and therefore

measure differences in location. The K-S examines a

single maximum difference between two distributions.

If a statistical difference is found between the

distributions of X and Y, the test provides no insight as to

what caused the difference. The difference could be due

to differences in location (mean), variation (standard

deviation), presence of outliers, type of skewness, type of

kurtosis, number of modes, and so on.

**Therefore it’s not important how significant the p-value is, but what makes the significance go away


1. pep has higher contrasts than DNA in general

9/3/2013 (concatenate –> species tree)

pep species tree

python ~/clustering/


MIN_CHR = 50

cd into the dir with all the trimmed final alignments

python ~/clustering/ . ../aln-cln_stats

In R:


a = read.table('aln-cln_stats')

downo = order(a[,1],decreasing=T)

plot(a[downo,1],t='l',col='blue',lwd=3,ylim=c(1,96),xlim=c(1,20000),xlab='gene region',ylab='number of taxa')


a = read.table('aln-cln_stats')

downo = order(a[,1],decreasing=T)

plot(a[downo,1]/96,t='l',col='blue',lwd=3,ylim=c(0,1),xlim=c(1,20000),xlab='gene region',ylab='percent taxa')

downp = a[downo,1] / 96


To see how many taxa are for each of these occupancies checking

cc = (cumsum(downp)/1:length(downp))

down = a[downo,1] >= NUMBEROFTAXATOCHECK (e.g., 15)

(cc[down])[length((cc[down]))] = PERCENTAGEOCCUPIED

#sp        occupancy

90        0.9496444

82        0.8986288

40        0.7031581

14        0.5010091


a = read.table('aln-cln_stats_cary')

downo = order(a[,1],decreasing=T)

plot(a[downo,1],t='l',col='blue',lwd=3,ylim=c(1,70),xlim=c(1,20000),xlab='gene region',ylab='number of taxa')


a = read.table('aln-cln_stats_cary')

downo = order(a[,1],decreasing=T)

plot(a[downo,1]/69,t='l',col='blue',lwd=3,ylim=c(0,1),xlim=c(1,20000),xlab='gene region',ylab='percent taxa')

downp = a[downo,1] / 69


To see how many taxa are for each of these occupancies checking

cc = (cumsum(downp)/1:length(downp))

down = a[downo,1] >= NUMBEROFTAXATOCHECK (e.g., 15)

(cc[down])[length((cc[down]))] = PERCENTAGEOCCUPIED

#sp        occupancy

60        0.9049526

10        0.4917345

$ python ~/clustering/ . 50 90 ../90taxa_95occup

$ python ~/clustering/ . 50 82 ../82taxa_90occup

count how many genes are in each matrix:

$ wc -l *occup

 2519 82taxa_90occup

  416 90taxa_95occup

Concatenate using phyutility:

Make the file by taking the 88_90occup file, swap \n with space, and add “phyutility -concat -aa -out ../90taxa_95occup.nex -in ” in front. The shell script looks like: phyutility -concat -aa -out ../90taxa_95occup.nex -in …(separated by space)

Move this shell script to the dir with all the .aln-cln files. Run it. It outputs a nexus file with header info on the partitions. Do not open the nex file - it will crash any text editor. Use this script to change the format.

Also make the .models file with the nex header

module purge

module load openmpi/1.4.4-intel

raxmlHPC-PTHREADS-SSE3-icc -T 9 -m PROTCATWAG -p 6666 -q 90taxa_95occup.models  -s 90taxa_95occup.phy -n 90taxa_95occup

Raxml always found empty columns and will reduce them. Running from the reduced matrix and model file on bigmem1, but the original file on Joseph’s machine. The full one on Joseph’s machine took ~12 hrs to finish, but the bigmem1 one was only close to finish after about 18 hours.

Check taxa representation in genes:

$ python ~/clustering/ 51_70occup ../12_final_orthologs 51_70occup_taxa_rep

In the lowest (15_50occp) dataset,

dna species tree

The file ending for alignment files are “.final.fa.aln” instead of “.sate.aln” as for pep

python ~/clustering/


MIN_CHR = 150

cd into the dir with all the trimmed final alignments

python ~/clustering/ . ../aln-cln_stats

In R:


a = read.table('aln-cln_stats')

downo = order(a[,1],decreasing=T)

plot(a[downo,1],t='l',col='blue',lwd=3,ylim=c(1,96),xlim=c(1,20000),xlab='gene region',ylab='number of taxa')


new = read.table('aln-cln_stats')

old = read.table('aln-cln_stats_old')

downnew = order(new[,1],decreasing=T)

downold = order(old[,1],decreasing=T)

plot(new[downnew,1],t='l',col='green',lwd=3,ylim=c(1,96),xlim=c(1,10000),xlab='gene,region',ylab='number of taxa')


legend(40,c("without mcl","with mcl"),col=c("brown","green"),lwd=3)


a = read.table('aln-cln_stats')

downo = order(a[,1],decreasing=T)

plot(a[downo,1]/96,t='l',col='blue',lwd=3,ylim=c(0,1),xlim=c(1,20000),xlab='gene region',ylab='percent taxa')

downp = a[downo,1] / 96


To see how many taxa are for each of these occupancies checking

cc = (cumsum(downp)/1:length(downp))

down = a[downo,1] >= NUMBEROFTAXATOCHECK (e.g., 15)

(cc[down])[length((cc[down]))] = PERCENTAGEOCCUPIED

#sp        occupancy

89        0.9516129

82        0.9009362

52        0.7004967

17        0.5016369


a = read.table('aln-cln_stats_cary')

downo = order(a[,1],decreasing=T)

plot(a[downo,1],t='l',col='blue',lwd=3,ylim=c(1,70),xlim=c(1,20000),xlab='gene region',ylab='number of taxa')


a = read.table('aln-cln_stats_cary')

downo = order(a[,1],decreasing=T)

plot(a[downo,1]/69,t='l',col='blue',lwd=3,ylim=c(0,1),xlim=c(1,20000),xlab='gene region',ylab='percent taxa')


new = read.table('aln-cln_stats_cary')

old = read.table('aln-cln_stats_cary_old')

downnew = order(new[,1],decreasing=T)

downold = order(old[,1],decreasing=T)

plot(new[downnew,1],t='l',col='green',lwd=3,ylim=c(1,69),xlim=c(1,10000),xlab='gene,region',ylab='number of taxa')


legend(40,c("without mcl","with mcl"),col=c("brown","green"),lwd=3)

downp = a[downo,1] / 69


To see how many taxa are for each of these occupancies checking

cc = (cumsum(downp)/1:length(downp))

down = a[downo,1] >= NUMBEROFTAXATOCHECK (e.g., 15)

(cc[down])[length((cc[down]))] = PERCENTAGEOCCUPIED

#sp        occupancy

60        0.9053055

10        0.4323362

Also calculated the no mcl, no iter-cut results prior to the conference

Compared to pep occupancy, dna occupancy is higher at the high end, but drops off faster. Compared to without mcl (also without cc1), mcl results lost lots of outgroups, but is better at the lower end.

cd into the dir with .aln-cln files

$ python ~/clustering/ . 150 89 ../89taxa_95occup

$ python ~/clustering/ . 150 82 ../82taxa_90occup

count how many genes are in each matrix:

$ wc -l *occup

 2519 82taxa_90occup

  416 90taxa_95occup

Concatenate using phyutility:

Make the file by taking the 88_90occup file, swap \n with space, and add “phyutility -concat -aa -out ../90taxa_95occup.nex -in ” in front. The shell script looks like: phyutility -concat -aa -out ../90taxa_95occup.nex -in …(separated by space)

Move this shell script to the dir with all the .aln-cln files. Run it. It outputs a nexus file with header info on the partitions. Do not open the nex file - it will crash any text editor. Use this script to change the format.

Also make the .models file with the nex header

module purge

module load openmpi/1.4.4-intel

raxmlHPC-PTHREADS-SSE3-icc -T 9 -m PROTCATWAG -p 6666 -q 90taxa_95occup.models  -s 90taxa_95occup.phy -n 90taxa_95occup

There is a fine ballance between mcl’s efficiency vs. signal sharpness. DNA dataset suffered from loosing outgroups

9/4/2013 (all gene tree bootstrap results)


All bootstrap finished. However, cc6-1 best tree never finish after several days. Using the mm best tree for putting bootstrap numbers on

$ raxmlHPC-PTHREADS-SSE3 -f b -t -z -T 2 -m GTRCAT -n bipartitions

Found a total of 471 taxa in first tree of tree collection

Expecting all remaining trees in collection to have the same taxon set

RAxML output files with the run ID <bipartitions> already exist

in directory /home/yayang/cary/dna_mcl_8_mm_rates/bootstrap/ ...... exiting

Removed cc6-1 from the data set.

2769 trees remain


ls -lrct to check whether there is any runs got stuck

cc4-1 has the same problem of ML search never finish

$ raxmlHPC-PTHREADS-SSE3 -f b -t -z RAxML_bootstrap.cc4-1.sate-mm.aln-cln.phy -T 2 -m PROTCATWAG -n bipartitions

There are 4420 alignment written from cary clades >= 60 taxa.

Map duplications to species tree:

After bootstrap and extract cary clades again, there’s slightly decrease of number of clades passing the size filter. This is normal due to the uncertainly at the base

python ~/clustering/

9/5/2013 (jackknife pep)

To reroot species tree using fixed outgroup

$ python ~/clustering/ 95occup.models .

416 genes in total

291 genes excluded in each round of jackknife analysis

The shell script for running jackknife on bigmem looks like:

module purge

module load openmpi/1.4.4-intel

raxmlHPC-PTHREADS-SSE3-icc -F -T 9 -E jk30_exclude101 -m PROTCATWAG -p 6666 -q 95occup.models -s 95occup.phy -n 95occup.jk30.jk30_exclude101

raxmlHPC-PTHREADS-SSE3-icc -F -T 9 -m PROTCATWAG -p 6666 -q 95occup.models.jk30_exclude101 -s 95occup.phy.jk30_exclude101 -n 95occup.jk30_exclude101

Script for mapping jk bipartition to the best tree:

$ cat *result* >jk30_all_trees

$ raxmlHPC-PTHREADS-SSE3 -f b -t '/home/yayang/cary/pep_mcl_9_pp_species_tree/90taxa_95occup_bestTree/pep_RAxML_bestTree.95occup.reroot' -z jk30_all_trees -T 2 -m PROTCATWAG -n jk30_bipart

Did the same for the 82taxa_90occup dataset

9/9/2013 (gene tree discordance)

Testing SaguaroGW

Download the source code from 

Unzip, cd into the directory, and type make

Further reading the paper and website makes me think that it is modeling based on SNP..

Try mapping either use bipartition or graph instead

9/10/2013 (tblastx,traitRate)


For dna dataset, used the new detect chimera script, cut, and used 0.99 instead of 0.995 for cd-hit-est

tblastx is extremely slow and the output file is very large due to many HSPs for each query-hit pair. Divide it up into subsets dna datasets, remove the four problematic datasets with 96 taxa remain.

I killed all three tblastx runs a day later. Right now I don’t see we need it. In the future we can always use protein for clustering and match the dna seqs.

At this point it looks circular that you have to date the tree to get an ultrametric tree and do this. Sister clade comparison is good enough


download the source code from

To install, I followed the instructions on traitRate webpage exactly

1. Go to directory libs/phylogeny/ by typing: "cd download_dir/libs/phylogeny/"

2. To compile the library type: "make doubleRep"

3. Go to the traitRate source directory by typing: "cd ../../programs/traitRate/"

4. Compile the program by typing: "make doubleRep".

sudo cp traitRate.doubleRep /usr/bin

9/12/2013  (GO categories)

GO categories for each mm

Download the annotation information file for each of the 27 eudicots from phytozome > bulk data

Columns of the list according to my guess:

1-peptide id



4-transcript (isoforms? it sometimes differ from 3) Checked the potato file that downloaded non-redundant sequences are sometimes in column 3 and sometimes 3. Have to use both for transcripts.

The rest of the terms doesn’t follow the same format

Concatenate all the info files together

$ cat *.txt >all_27spp_info.txt

$ all_27spp_info.txt

This produces two simple list of pep go and dna go

produce match of homolog id to GO terms

9/13/2013 (clade, contrast vs. GO)

9/16/2013 (sign test for woody/herbaceous)

sign test for all five clades

wc -l contrast*

   364 contrast1

  1033 contrast2

  3580 contrast3

  1169 contrast4

  1049 contrast5

$ grep '-' contrast1 | wc -l


$ grep '-' contrast2 | wc -l


$ grep '-' contrast3 | wc -l


$ grep '-' contrast4 | wc -l


$ grep '-' contrast5 | wc -l



contrast1 = clade E, total 364, negative 122

> binom.test(122,364)

        Exact binomial test

data:  122 and 364

number of successes = 122, number of trials = 364, p-value = 3.101e-10

alternative hypothesis: true probability of success is not equal to 0.5

95 percent confidence interval:

 0.2868127 0.3862244

sample estimates:

probability of success


contrast2 = clade D, total 1033, negative 105

> binom.test(105,1033)

        Exact binomial test

data:  105 and 1033

number of successes = 105, number of trials = 1033, p-value < 2.2e-16

alternative hypothesis: true probability of success is not equal to 0.5

95 percent confidence interval:

 0.08388985 0.12170714

sample estimates:

probability of success


contrast3 = clade C, total 3580, negative 1085

> binom.test(1085,3580)

        Exact binomial test

data:  1085 and 3580

number of successes = 1085, number of trials = 3580, p-value < 2.2e-16

alternative hypothesis: true probability of success is not equal to 0.5

95 percent confidence interval:

 0.2880440 0.3184234

sample estimates:

probability of success


contrast4 = clade B, total 1169, negative 30

> binom.test(30,1169)

        Exact binomial test

data:  30 and 1169

number of successes = 30, number of trials = 1169, p-value < 2.2e-16

alternative hypothesis: true probability of success is not equal to 0.5

95 percent confidence interval:

 0.01738016 0.03643398

sample estimates:

probability of success


contrast5 = clade A, total 1049, negative 18

> binom.test(105,1033)

        Exact binomial test

data:  105 and 1033

number of successes = 105, number of trials = 1033, p-value < 2.2e-16

alternative hypothesis: true probability of success is not equal to 0.5

95 percent confidence interval:

 0.08388985 0.12170714

sample estimates:

probability of success





                mean                median

contrast1        0.330246        0.2134092

contrast2        1.080695        0.824468

contrast3        0.6273645        0.2765135

contrast4        1.732226        1.500113

contrast5        2.11223        1.773047

10/10/2013 (back translation)

Input dna fasta files:

- For 1KP, use the DNA from the translation dir so that the seq names match

- For both ILU and Beta, use the cds from translation folder. Fix seq names. Pal2nal doesn’t do reverse compliment so have to use cds

- Fix names for all so that 1KP ones has exact match and ILU and Beta matches the beginning

- Use peptide number to match between eudicot transcripts and peptides

1. Extract all mm trees with min taxon limit of 5 (basically no limit)


Input: cary.dna from concatenating cds from 69 ingroup data sets

Tree: extracted clades from all mm trees with min taxon limit of 5 (basically no limit)

Output unaligned cds for each of the 15361 extracted clades

Has to use cds with direction fixed since pal2nal doesn’t check for reverse direction

Cannot use the phytozome downloads since the peptide name dosen’t match. Have to only use ingroups


back translation for all pep alignments, regardless of outgroups

Had to change all the names in  cary_extracted_clades_min_taxa_5, devided trees from “-” to “_” when re-rooting. Not that there are still “-” in cary clades

- 63 finished in 2.5 days. Will take 2 years to finish all 15000 genes with 10 cours running on the lab linux

- to read dirs of codeml python wrapper output and cary out chi2 test when the likelihood ratio is >0.

Had to kill the codeml run with model 1 (every branch has its own omega). Run didn’t finish over 3 days.

10/15/2013 (HyPhy)

Only GUI and cann’t easity pipe it!

11/5/2013 (codeml and GO)


- Running only model 1a and 2a to save time. We are looking at genes with most significant overall selection signal anyways

- Divided all 15361 cary clades into 10 dirs and the runs should be done by Friday


- focuses on human. Not useful for us


Test set job id 104957435.

$ python ~/clustering/ all_mm arabidopsis_locusid_mm

$ python ~/clustering/ arabidopsis_locusid_mm codeml_ch2_stats_background

$ python ~/clustering/ arabidopsis_locusid_mm codeml_ch2_stats_selected_pvaluecut0.01

This subset job id 901280580 (GO),

218247092 (all other 3 tests)

Done. Basically our genes are mostly widely expressed genes among tissue types

Simple GO enrichment

1. Paralogs with contrast >= 1 vs. <1

input data: all 15361 cary clades in the dir 1-cary_extracted_clades_min_taxa_5

$ python ~/clustering/ 1-cary_extracted_clades_min_taxa_5 cary_dup

of 7310 background cary clades with at least one node having at least two duplicated taxa between the two sides, 2821 have at least 1 contrasts >= 1, and 1344 has at least 1 contrasts >= 2

12 columns are converted into ??? because of stop codons

 6 columns are converted into ??? because of stop codons

3 columns are converted into ??? because of stop codons

11/18/2013 (millipede and insect datasets)

SRA toolkit

doanload from

- extract fastq files:

$ /home/yayang/apps/sratoolkit.2.3.3-4-ubuntu64/bin/fastq-dump.2.3.3-3 -A SRR941771  --split-files

- Fix seq nems

$ sed -i 's/ length=50/\/1/g' *_1.fastq

$ sed -i 's/ length=50/\/2/g' *_2.fastq

- Clean reads

$ python ~/optimize_assembler/ SRR941771_1.fastq SRR941771_2.fastq SRR941771

The SRA one in insect data set is old illumina phred scores

Everything else is normal. That’s why you have to use R to plot

Using settings


trim_tail_quals = 20         #trim seq at the tail end with quality score lower than this number

keep_len = 30                         #only keep sequences longer than this after trimming ends

Amino acid and transcripts from genome annotation for the outgroup Ixodes was downloaded from

Daphnia pulex downloaded from

11/21/2013 (translation)

download the latest version (20131117) of transdecoder and make with pfam binary files

$ make prep_pfam

this downloads pfam domain info and save to TransDecoder_r20131117/pfam/Pfam-AB.hmm.bin

download parafly-r2013-01-21

$ ./configure --prefix=/usr/bin

$ sudo make install

Then install mpi

$ sudo apt-get install mpich2 openmpi-bin

$ export LD_LIBRARY_PATH=/home/yayang/apps/TransDecoder_r20131117/util/lib64

$ ./TransDecoder -t ./Trinity.fa --search_pfam /home/yayang/apps/TransDecoder_r20131117/pfam/Pfam-AB.hmm.bin --reuse --workdir trinity-transdecoder-pfam --CPU 5