3 of 30

Tsai-Ming, Lu (ICOB, AS)

呂在明（中研院細生所）

Mei-Fang, Lin (Dep. Mar. Res., NSYSU)

林梅芳（中山大學海資系）

Yi-Jyun, Luo (BRC, AS)

駱乙君（中研院生多中心）

4 of 30

TW Marine Bioinfo workshop

Practical training to whoever needed.

(You will need to teach yourself more other things, including the sequencing theories, statistics…etc)

We provide

5 of 30

林梅芳 (Mei-Fang Lin)

Marine Genomics and Evolution Lab.

海洋基因體與演化實驗室

- Genomics

- Marine Biology

Phylogenomics

Transcriptomics

Comparative genomics, Population genomics

- Molecular Evolution

Cnidarian

Echinoderm

6 of 30

The instructors-乙君

7 of 30

Source: https://biocorecrg.github.io/PHINDaccess_RNAseq_2020/

8 of 30

The dataset

Fasta format

Fastq format

>Seq1

MTLVAEHLLMDTFGSDFDSLPPSLFKDFPEDGFNMKKKSMTSIEEDIMSDYSFPPTPPISPGCSSIASEIGDPERIQPVCDELEDDFNFAAEEKSLYFQENDFKDILIKDCMWNG

ASCII (American Standard Code for Information Interchange)

9 of 30

In the real dataset

In this experiment:

30 samples in total
Each sample has about 10-18 million reads
Reference transcriptome/genome available
Illumina pair-end 100 bp

Mei-Fang Lin,Shunichi Takahashi,Sylvain Forêt,Simon K. Davy,David J. Miller, Transcriptomic analyses highlight the likely metabolic consequences of colonization of a cnidarian host by native or non-native Symbiodinium species, Biol Open, 2019

10 of 30

Alga-infection in corallimorpharian

Stage 1

Stage 2

Stage 3

Control

Group C

Group D

A1R1

C2R1

F2R1

A1R2

C2R2

F2R2

A1R3

C2R3

F2R3

A2R1

D2R1

E1R1

A2R2

D2R2

E1R2

A2R3

D2R3

E1R3

B2R1

C1R1

E2R1

F1R1

B2R2

C1R2

E2R2

F1R2

B2R3

C1R3

E2R3

F1R3

SRR8470268

SRR8470259

SRR8470262

SRR8470265

SRR8470256

SRR8470253

11 of 30

Obtain the subset data

https://github.com/MeiFangLin/TW_bioinfo_2022

Full assembly (nucleotides)

Transcriptome_c1_k6.fasta

Full assembly (amino acids)

Transcriptome_c1_k6.fasta.transdecoder.pep

Subset for assembly (Illumina HiSeq 2500)

12 of 30

Quality assessment

fastqc file.fastq.gz -o

15 of 30

General workflow of RNA-seq data analysis

Zhao et al. 2016

17 of 30

Different transcript length with different coverage levels

Total observed read counts

Normalized read counts

FPKM: fragments per kilo- base of transcript per million mapped reads

18 of 30

Single-cell RNA-seq

19 of 30

RNA-seq vs Single-cell RNA-seq

	RNA-seq	Single-cell RNA-seq
Expression masurement	average expression level	distribution of expression levels
Advantages	Useful for comparative transcriptomics Useful for quantifying expression signatures from ensembles	Useful for studying cell-specific changes in transcriptome Datasets range increasing
Disavantages	Insufficient for studying complex tissue systems Useless for studying stochastic nature of gene expression	Required new analysing methods Still very expensive

22 of 30

Trinity \

--left file1_R1.fq.gz,file2_R1.fq.gz,file3_R1.fq.gz \

--right file1_R2.fq.gz,file2_R2.fq.gz,file3_R2.fq.gz \

--CPU 2 \

--max_memory 1G \

--seqType

head trinity_out_dir/Trinity.fasta

De novo assembly of reads using Trinity

Trinity -help

23 of 30

>TRINITY_DN1000_c115_g5_i1 len=247 path=[31015:0-148 23018:149-246] AATCTTTTTTGGTATTGGCAGTACTGTGCTCTGGGTAGTGATTAGGGCAAAAGAAGACAC ACAATAAAGAACCAGGTGTTAGACGTCAGCAAGTCAAGGCCTTGGTTCTCAGCAGACAGA AGACAGCCCTTCTCAATCCTCATCCCTTCCCTGAACAGACATGTCTTCTGCAAGCTTCTC CAAGTCAGTTGTTCACAGGAACATCATCAGAATAAATTTGAAATTATGATTAGTATCTGA TAAAGCA

“transcripts” (gene)

“isoform”

24 of 30

Examine assembly stats

$TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta

################################

## Counts of transcripts, etc.

################################

Total trinity 'genes': 377

Total trinity transcripts: 384

Percent GC: 38.66

########################################

Stats based on ALL transcript contigs:

########################################

Contig N10: 3373

Contig N20: 2605

Contig N30: 2219

Contig N40: 1936

Contig N50: 1703

Median contig length: 772

Average contig: 1047.80

Total assembled bases: 402355

#####################################################

## Stats based on ONLY LONGEST ISOFORM per 'GENE':

#####################################################

Contig N10: 3373

Contig N20: 2605

Contig N30: 2216

Contig N40: 1936

Contig N50: 1695

Median contig length: 772

Average contig: 1041.98

Total assembled bases: 392826

25 of 30

N50

N50 statistic defines assembly quality in terms of contiguity

All = 43 (the genome size)

50 % = 21.5 (half of the genome size)

11+10+8= 29

So N50 = 8

the size of the contig which, along with the larger contigs, contain half of sequence of a particular genome

26 of 30

BUSCO

Based on evolutionarily-informed expectations of gene content of near-universal single-copy orthologs

busco -i Trinity.fasta -l [LINEAGE] -o [OUTPUT_NAME] -m transcriptome

C:89.0%[S:85.8%,D:3.2%],F:6.9%,M:4.1%,n:3023

In the report:

Complete

single-copy

duplicated

Fragmented

Missing

27 of 30

Popular tools

Differential expression analysis

edgeR
DESeq2 (refer to PHINDaccess_RNAseq_2020)
limma

Trinity/Analysis/DifferentialExpression/run_DE_analysis.pl

--matrix mapping_result

--samples_file samples.txt

--method DESeq2/edgeR

--output

28 of 30

Working environment

29 of 30