2023 Taiwan Marine Bioinformatics Practical Workshop
Genomics is beautiful, powerful, and useful
Dimensions of a Genome
Source: Gracey and Cossins (2003)
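Tsai-Ming Lu (呂在明), Institute of Cellular and Organismic Biology (ICOB), Academia Sinica (AS)
Mei-Fang Lin (林梅芳), Department of Marine Biotechnology and Resources (Dep. Mar. Res.), National Sun Yat-sen University (NSYSU)
Yi-Jyun Luo (駱乙君), Biodiversity Research Center (BRC), Academia Sinica (AS)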
TW Marine Bioinfo workshop
We provide:
Practical training for whoever needs it.
(You will need to teach yourself other things, including sequencing theory, statistics, etc.)
Mei-Fang Lin (林梅芳)
Marine Genomics and Evolution Lab (海洋基因體與演化實驗室)
- Genomics
- Marine Biology
Phylogenomics
Transcriptomics
Comparative genomics, Population genomics
- Molecular Evolution
Cnidarian
Echinoderm
The instructors: Yi-Jyun Luo (駱乙君)
Source: https://biocorecrg.github.io/PHINDaccess_RNAseq_2020/
The dataset
>Seq1
MTLVAEHLLMDTFGSDFDSLPPSLFKDFPEDGFNMKKKSMTSIEEDIMSDYSFPPTPPISPGCSSIASEIGDPERIQPVCDELEDDFNFAAEEKSLYFQENDFKDILIKDCMWNG
ASCII (American Standard Code for Information Interchange)
In the real dataset
In this experiment:
Mei-Fang Lin, Shunichi Takahashi, Sylvain Forêt, Simon K. Davy, David J. Miller. Transcriptomic analyses highlight the likely metabolic consequences of colonization of a cnidarian host by native or non-native Symbiodinium species. Biol Open, 2019.
Alga-infection in corallimorpharian
Stage 1
Stage 2
Stage 3
Control
Group C
Group D
A1R1
C2R1
F2R1
A1R2
C2R2
F2R2
A1R3
C2R3
F2R3
A2R1
D2R1
E1R1
A2R2
D2R2
E1R2
A2R3
D2R3
E1R3
B2R1
C1R1
E2R1
F1R1
B2R2
C1R2
E2R2
F1R2
B2R3
C1R3
E2R3
F1R3
SRR8470268
SRR8470259
SRR8470262
SRR8470265
SRR8470256
SRR8470253
Obtain the subset data
Full assembly (nucleotides)
Full assembly (amino acids)
Subset for assembly (Illumina HiSeq 2500)
Quality assessment
Mapping
Break
General workflow of RNA-seq data analysis
Zhao et al. 2016
RNA-seq
Different transcript length with different coverage levels
Total observed read counts
Normalized read counts
FPKM: fragments per kilobase of transcript per million mapped reads
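The FPKM definition above can be sketched as a small calculation. This is a minimal illustration only: the transcript names, counts, lengths, and library size are hypothetical values chosen to show why normalizing by length matters, and real pipelines compute these values from the mapping results.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    length_kb = transcript_length_bp / 1_000
    mapped_millions = total_mapped_fragments / 1_000_000
    return fragment_count / (length_kb * mapped_millions)


# Hypothetical example: the long transcript collects 4x the reads
# only because it is 4x longer, not because it is more expressed.
counts = {"tx_short": 100, "tx_long": 400}       # observed fragment counts
lengths = {"tx_short": 500, "tx_long": 2000}     # transcript lengths in bp
total_mapped = 2_000_000                         # mapped fragments in the library

fpkm_values = {tx: fpkm(counts[tx], lengths[tx], total_mapped) for tx in counts}
for tx, value in fpkm_values.items():
    print(tx, value)
```

After normalization both hypothetical transcripts get the same FPKM, although their raw read counts differ fourfold.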
Single-cell RNA-seq
RNA-seq vs Single-cell RNA-seq
| | RNA-seq | Single-cell RNA-seq |
| Expression measurement | average expression level | distribution of expression levels |
| Advantages | | |
| Disadvantages | | |
Trinity \
--left file1_R1.fq.gz,file2_R1.fq.gz,file3_R1.fq.gz \
--right file1_R2.fq.gz,file2_R2.fq.gz,file3_R2.fq.gz \
--CPU 2 \
--max_memory 1G \
--seqType
De novo assembly of reads using Trinity
>TRINITY_DN1000_c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]
AATCTTTTTTGGTATTGGCAGTACTGTGCTCTGGGTAGTGATTAGGGCAAAAGAAGACAC
ACAATAAAGAACCAGGTGTTAGACGTCAGCAAGTCAAGGCCTTGGTTCTCAGCAGACAGA
AGACAGCCCTTCTCAATCCTCATCCCTTCCCTGAACAGACATGTCTTCTGCAAGCTTCTC
CAAGTCAGTTGTTCACAGGAACATCATCAGAATAAATTTGAAATTATGATTAGTATCTGA
TAAAGCA
“transcripts” (gene)
“isoform”
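As a rough illustration of the "gene" and "isoform" labels, a Trinity identifier can be split into its two parts: everything up to `_g<N>` names the Trinity 'gene', and the trailing `_i<N>` names the isoform. The helper name `split_trinity_id` is made up for this sketch, and the pattern assumes the `TRINITY_DN..._c..._g..._i...` naming seen in the header above.

```python
import re

# Assumed naming scheme: TRINITY_DN<cluster>_c<component>_g<gene>_i<isoform>
TRINITY_ID = re.compile(r"^(TRINITY_DN\d+_c\d+_g\d+)_(i\d+)$")


def split_trinity_id(header):
    """Return (gene_id, isoform_id) from a Trinity FASTA header line."""
    # The sequence ID is the first whitespace-separated token after '>'.
    seq_id = header.lstrip(">").split()[0]
    match = TRINITY_ID.match(seq_id)
    if match is None:
        raise ValueError(f"not a Trinity-style identifier: {seq_id!r}")
    return match.group(1), match.group(2)


header = ">TRINITY_DN1000_c115_g5_i1 len=247 path=[31015:0-148 23018:149-246]"
gene_id, isoform_id = split_trinity_id(header)
print(gene_id, isoform_id)
```

Grouping transcripts by `gene_id` is how the "longest isoform per gene" statistics can be reproduced from the FASTA file.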
Examine assembly stats
$TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 377
Total trinity transcripts: 384
Percent GC: 38.66
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 3373
Contig N20: 2605
Contig N30: 2219
Contig N40: 1936
Contig N50: 1703
Median contig length: 772
Average contig: 1047.80
Total assembled bases: 402355
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 3373
Contig N20: 2605
Contig N30: 2216
Contig N40: 1936
Contig N50: 1695
Median contig length: 772
Average contig: 1041.98
Total assembled bases: 392826
N50
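Contig lengths: 10, 11, 8, 7, 4, 3
All = 43 (the genome size)
50 % = 21.5 (half of the genome size)
Sorted from largest to smallest: 11 + 10 = 21 (still below 21.5)
11 + 10 + 8 = 29 (reaches 21.5)
So N50 = 8
N50 is the size of the contig which, together with all larger contigs, contains at least half of the total sequence of a particular genome.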
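The N50 arithmetic on this slide can be checked with a few lines of Python. This is a minimal sketch using the six contig lengths from the example; the function name `n50` is chosen here for illustration and is not part of any tool mentioned in the workshop.

```python
def n50(lengths):
    """Length of the contig at which the running total, taken from the
    largest contig downward, first reaches half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


contigs = [10, 11, 8, 7, 4, 3]   # contig lengths from the slide
result = n50(contigs)
print(result)
```

The same function can be applied to the sequence lengths in `Trinity.fasta` to compare against the Contig N50 reported by `TrinityStats.pl`.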
BUSCO
Based on evolutionarily-informed expectations of gene content of near-universal single-copy orthologs
busco -i Trinity.fasta -l [LINEAGE] -o [OUTPUT_NAME] -m transcriptome
C:89.0%[S:85.8%,D:3.2%],F:6.9%,M:4.1%,n:3023
In the report:
C: Complete
S: single-copy
D: duplicated
F: Fragmented
M: Missing
n: total number of BUSCO groups searched
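To make the categories concrete, the one-line summary can be pulled apart with a regular expression. This is only a sketch: `parse_busco_summary` is a hypothetical helper, and the pattern assumes the summary line has exactly the layout shown in the example above.

```python
import re

SUMMARY = re.compile(
    r"C:([\d.]+)%\[S:([\d.]+)%,D:([\d.]+)%\],F:([\d.]+)%,M:([\d.]+)%,n:(\d+)"
)


def parse_busco_summary(line):
    """Turn a BUSCO one-line summary into a dict of percentages plus n."""
    match = SUMMARY.search(line)
    if match is None:
        raise ValueError("line does not look like a BUSCO summary")
    complete, single, duplicated, fragmented, missing = (
        float(value) for value in match.groups()[:5]
    )
    return {
        "complete": complete,
        "single_copy": single,
        "duplicated": duplicated,
        "fragmented": fragmented,
        "missing": missing,
        "n": int(match.group(6)),
    }


summary = parse_busco_summary("C:89.0%[S:85.8%,D:3.2%],F:6.9%,M:4.1%,n:3023")
print(summary)
```

Two quick sanity checks follow from the definitions: single-copy plus duplicated equals complete, and complete plus fragmented plus missing adds up to 100%.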
Popular tools
Differential expression analysis
Trinity/Analysis/DifferentialExpression/run_DE_analysis.pl
--matrix mapping_result
--samples_file samples.txt
--method DESeq2/edgeR
--output
Working environment
Gene set enrichment analysis (GSEA)
Input file for GSEA:
1st column: Gene ID
2nd column: Description
3rd column: cross-sample normalized data (1 sample/col)
…
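The column layout above can be sketched as a tab-delimited table. The header labels `NAME` and `DESCRIPTION` are an assumption about what the GSEA text format expects, the gene IDs and expression values are invented for illustration, and `write_gsea_table` is a hypothetical helper; only the sample names come from the experiment slides.

```python
import csv
import io


def write_gsea_table(rows, sample_names, handle):
    """Write gene ID, description, then one normalized value per sample."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["NAME", "DESCRIPTION", *sample_names])
    for gene_id, description, values in rows:
        if len(values) != len(sample_names):
            raise ValueError(f"{gene_id}: expected {len(sample_names)} values")
        writer.writerow([gene_id, description, *values])


# Hypothetical genes and values; "na" is used as a placeholder description.
samples = ["A1R1", "A1R2"]
rows = [
    ("gene_0001", "na", [12.5, 11.0]),
    ("gene_0002", "na", [0.0, 3.25]),
]

buffer = io.StringIO()
write_gsea_table(rows, samples, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch self-contained; passing an open file handle instead would produce the input file on disk.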