1 of 30

2023 Taiwan Marine Bioinformatics Practical Workshop

2 of 30

Genomics is beautiful, powerful, and useful

2

Dimensions of a Genome

Source: Gracey and Cossins (2003)

3 of 30

Tsai-Ming, Lu (ICOB, AS)

呂在明 (中研院細生所)

Mei-Fang, Lin (Dep. Mar. Res., NSYSU)

林梅芳(中山大學海資系)

Yi-Jyun, Luo (BRC, AS)

駱乙君(中研院生多中心)

4 of 30

TW Marine Bioinfo workshop

Practical training to whoever needed.

(You will need to teach yourself more other things, including the sequencing theories, statistics…etc)

We provide

5 of 30

林梅芳 (Mei-Fang Lin)

Marine Genomics and Evolution Lab.

海洋基因體與演化實驗室

5

- Genomics

- Marine Biology

Phylogenomics

Transcriptomics

Comparative genomics, Population genomics

- Molecular Evolution

Cnidarian

Echinoderm

6 of 30

The instructors-乙君

7 of 30

Source: https://biocorecrg.github.io/PHINDaccess_RNAseq_2020/

8 of 30

The dataset

  • Fasta format

  • Fastq format

>Seq1

MTLVAEHLLMDTFGSDFDSLPPSLFKDFPEDGFNMKKKSMTSIEEDIMSDYSFPPTPPISPGCSSIASEIGDPERIQPVCDELEDDFNFAAEEKSLYFQENDFKDILIKDCMWNG

ASCII (American Standard Code for Information Interchange)

9 of 30

In the real dataset

In this experiment:

  • 30 samples in total
  • Each sample has about 10-18 million reads
  • Reference transcriptome/genome available
  • Illumina pair-end 100 bp

Mei-Fang Lin,Shunichi Takahashi,Sylvain Forêt,Simon K. Davy,David J. Miller, Transcriptomic analyses highlight the likely metabolic consequences of colonization of a cnidarian host by native or non-native Symbiodinium species, Biol Open, 2019

10 of 30

Alga-infection in corallimorpharian

Stage 1

Stage 2

Stage 3

Control

Group C

Group D

A1R1

C2R1

F2R1

A1R2

C2R2

F2R2

A1R3

C2R3

F2R3

A2R1

D2R1

E1R1

A2R2

D2R2

E1R2

A2R3

D2R3

E1R3

B2R1

C1R1

E2R1

F1R1

B2R2

C1R2

E2R2

F1R2

B2R3

C1R3

E2R3

F1R3

SRR8470268

SRR8470259

SRR8470262

SRR8470265

SRR8470256

SRR8470253

11 of 30

Obtain the subset data

  • https://github.com/MeiFangLin/TW_bioinfo_2022

Full assembly (nucleotides)

  • Transcriptome_c1_k6.fasta

Full assembly (amino acids)

  • Transcriptome_c1_k6.fasta.transdecoder.pep

Subset for assembly (Illumina HiSeq 2500)

12 of 30

Quality assessment

  • fastqc file.fastq.gz -o

13 of 30

Mapping

14 of 30

14

Break

15 of 30

General workflow of RNA-seq data analysis

Zhao et al. 2016

16 of 30

RNA-seq

17 of 30

Different transcript length with different coverage levels

Total observed read counts

Normalized read counts

FPKM: fragments per kilo- base of transcript per million mapped reads

18 of 30

Single-cell RNA-seq

19 of 30

RNA-seq vs Single-cell RNA-seq

RNA-seq

Single-cell RNA-seq

Expression masurement

average expression level

distribution of expression levels

Advantages

  • Useful for comparative transcriptomics
  • Useful for quantifying expression signatures from ensembles
  • Useful for studying cell-specific changes in transcriptome
  • Datasets range increasing

Disavantages

  • Insufficient for studying complex tissue systems
  • Useless for studying stochastic nature of gene expression
  • Required new analysing methods
  • Still very expensive

20 of 30

Trinity

21 of 30

22 of 30

  • Trinity \

--left file1_R1.fq.gz,file2_R1.fq.gz,file3_R1.fq.gz \

--right file1_R2.fq.gz,file2_R2.fq.gz,file3_R2.fq.gz \

--CPU 2 \

--max_memory 1G \

--seqType

  • head trinity_out_dir/Trinity.fasta

De novo assembly of reads using Trinity

  • Trinity -help

23 of 30

>TRINITY_DN1000_c115_g5_i1 len=247 path=[31015:0-148 23018:149-246] AATCTTTTTTGGTATTGGCAGTACTGTGCTCTGGGTAGTGATTAGGGCAAAAGAAGACAC ACAATAAAGAACCAGGTGTTAGACGTCAGCAAGTCAAGGCCTTGGTTCTCAGCAGACAGA AGACAGCCCTTCTCAATCCTCATCCCTTCCCTGAACAGACATGTCTTCTGCAAGCTTCTC CAAGTCAGTTGTTCACAGGAACATCATCAGAATAAATTTGAAATTATGATTAGTATCTGA TAAAGCA

“transcripts” (gene)

“isoform”

24 of 30

Examine assembly stats

$TRINITY_HOME/util/TrinityStats.pl trinity_out_dir/Trinity.fasta

################################

## Counts of transcripts, etc.

################################

Total trinity 'genes': 377

Total trinity transcripts: 384

Percent GC: 38.66

########################################

Stats based on ALL transcript contigs:

########################################

Contig N10: 3373

Contig N20: 2605

Contig N30: 2219

Contig N40: 1936

Contig N50: 1703

Median contig length: 772

Average contig: 1047.80

Total assembled bases: 402355

#####################################################

## Stats based on ONLY LONGEST ISOFORM per 'GENE':

#####################################################

Contig N10: 3373

Contig N20: 2605

Contig N30: 2216

Contig N40: 1936

Contig N50: 1695

Median contig length: 772

Average contig: 1041.98

Total assembled bases: 392826

25 of 30

N50

  • N50 statistic defines assembly quality in terms of contiguity

10

11

8

7

4

3

All = 43 (the genome size)

50 % = 21.5 (half of the genome size)

11+10+8= 29

So N50 = 8

the size of the contig which, along with the larger contigs, contain half of sequence of a particular genome

26 of 30

BUSCO

Based on evolutionarily-informed expectations of gene content of near-universal single-copy orthologs

busco -i Trinity.fasta -l [LINEAGE] -o [OUTPUT_NAME] -m transcriptome

C:89.0%[S:85.8%,D:3.2%],F:6.9%,M:4.1%,n:3023

In the report:

Complete

single-copy

duplicated

Fragmented

Missing

27 of 30

Popular tools

Differential expression analysis

  • edgeR
  • DESeq2 (refer to PHINDaccess_RNAseq_2020)
  • limma

Trinity/Analysis/DifferentialExpression/run_DE_analysis.pl

--matrix mapping_result

--samples_file samples.txt

--method DESeq2/edgeR

--output

28 of 30

Working environment

29 of 30

Alga-infection in corallimorpharian

Stage 1

Stage 2

Stage 3

Control

Group C

Group D

A1R1

C2R1

F2R1

A1R2

C2R2

F2R2

A1R3

C2R3

F2R3

A2R1

D2R1

E1R1

A2R2

D2R2

E1R2

A2R3

D2R3

E1R3

B2R1

C1R1

E2R1

F1R1

B2R2

C1R2

E2R2

F1R2

B2R3

C1R3

E2R3

F1R3

SRR8470268

SRR8470259

SRR8470262

SRR8470265

SRR8470256

SRR8470253

30 of 30

Gene set enrichment analysis (GSEA)

Input file for GSEA:

1st column: Gene ID

2nd column: Description

3rd column: cross-sample normalized data (1 sample/col)

  1. Extract Ctrl and Grp3 data at stage3 from GSE125433_allCounts_S003.txt
  2. Select genes with read count >10 in both Ctrl and Grp3.
  3. Change the gene ID to GSEA acceptable identifier
  4. 6006 genes have corresponding Human Swiss-Prot IDs.