1 of 92

Choosing Genomics Tools

Candace Savonen and Carrie Wright

https://bit.ly/genomics_itn

2 of 92

Except where otherwise indicated, the contents of this slide presentation are available for use under the Creative Commons Attribution 4.0 license.

You are free to adapt and share the work, but you must give appropriate credit, provide a link to the license, and indicate if changes were made.

Sample attribution: [Title of work] by the Johns Hopkins Data Science Lab. CC-BY 4.0

Terms of Use

3 of 92

Schedule for today

  • Introduction
    • Slido icebreaker
    • What’s the ITN?
  • Genomic Data Overview/discussions
    • Choosing between tools
    • Good metadata guidelines
    • Very general overview of sequencing data
    • Good genome annotating guidelines
  • Genomic Data Presentations:
    • Jacob Greene
    • Cailin Jordan
  • Xena Activity - Mary Goldman

https://bit.ly/genomics_itn

4 of 92

Join at slido.com #7176191

5 of 92

Have your phone

(or a separate tab) handy for interactive polls!

Join at slido.com #7176191

6 of 92

What's your favorite candy?

7 of 92

What is your email?

8 of 92

What would you like to learn from this workshop?

9 of 92

Informatics Technology for Cancer Research (ITCR)

10 of 92

Informatics Technology for Cancer Research (ITCR)

… and more!

11 of 92

What is the ITN?

ITCR Training Network

Catalyzing informatics research through training opportunities

12 of 92

We are all busy - especially researchers!

https://media.giphy.com/media/q6RoNkLlFNjaw/giphy.gif

13 of 92

Technology is changing quickly & it’s hard to keep up!

ITCR developers keep making more awesome software!

https://media.giphy.com/media/lRnUWhmllPI9a/giphy.gif

14 of 92

Our guiding principle…

15 of 92

Research will advance faster if good informatics tools are accessible to a broad audience

16 of 92

Democratizing informatics also holds great power to improve diversity in research

https://c.tenor.com/lOM2TVfL0joAAAAM/democracy-mypostcard.gif

17 of 92

User preparedness

Gap

Tool usability

Informatics research is hindered by a gap between different types of experts

CC-BY jhudatascience.org - Image made by Candace Savonen using https://getavataaars.com/ and https://thenounproject.com/

18 of 92

User preparedness

Gap

Tool usability

Catalyzing Informatics for Research

CC-BY jhudatascience.org - Image made by Candace Savonen using https://getavataaars.com/ and https://thenounproject.com/

19 of 92

Elements of ITN:

  • Make courses about informatics

  • Make tools for researchers to do outreach

  • Provide live education opportunities

  • Enhance community engagement in cancer research

20 of 92

ITN courses

21 of 92

Current ITN courses: itcrtraining.org/courses

Management

Software Development

Tools and Resources

Best Practices

Leadership for Cancer Informatics Research

Documentation & Usability

Computing for Cancer Informatics

Introduction to Reproducibility

AI for Software Development

Introduction to Overleaf and LaTeX for Writing Scientific Articles

Advanced Reproducibility

Software Development beyond Coding (coming soon!)

Choosing Genomics Tools

Ethical Data Handling

NIH Data management and Sharing Policy

22 of 92

23 of 92

Image by Candace Savonen with Avataars and Openmoji.org

Your data are ready.

24 of 92

Image by Candace Savonen with Avataars, pixabay and openmoji.org

Genomic data

What is this and what do I do with it?

25 of 92

CC-BY

Concepts discussed in Choosing -omics Tools course:

What does your genomic data type represent?

What are the most common data processing steps for your data type?

Find resources, tools and tutorials to help you process and interpret your data

26 of 92

General Chapters

Data Specific Chapters

27 of 92

What kind of genomic data are you working with most frequently?

28 of 92

A Wikipedia for -omic analysis

Datatypes included so far:

  • RNA-seq
  • scRNA-seq
  • WGS/WXS
  • ATAC-seq
  • ChIP-seq
  • Microarrays
  • Methylation data

And we hope to add more! Let us know if you’d like to contribute! (Stipends available for grad students)

29 of 92

A Wikipedia for -omic analysis

This is a “living” course - as technologies and data type handling recommendations change, our course will too!

  • Adding new data types
  • Updating recommendations
  • Adding new tools and methods

30 of 92

Genomics data analysis workflows

31 of 92

Genomics workflows in a very general sense

Image by Candace Savonen using IconFinder

Raw Data

Normalized Data

Summarized Data

Plots and Results!

32 of 92

To inform our computational steps, we need to know a bit about the origins of our raw data!

33 of 92

Image by Candace Savonen using IconFinder

Raw Data

Normalized Data

Summarized Data

Plots and Results!

34 of 92

Let’s talk a bit about how the genomic sausage is made!

35 of 92

Where do the raw data come from?

Made with Biorender

36 of 92

What do we need to know about this process in terms of data analysis?

37 of 92

Made with Biorender

38 of 92

What are metadata?

39 of 92

Let’s say you wanted to do an analysis with some data…

40 of 92

41 of 92

Metadata: Anything and everything that should be known about your samples!

A B C D

E F G H

sample_id  mouse_id  processing_date  treatment
A          1         3-10-21          None
B          1         4-12-21          None
C          2         3-10-21          None
D          2         4-12-21          None
E          3         3-10-21          Morphine
F          3         4-12-21          Morphine
G          4         3-10-21          Morphine
H          4         4-12-21          Morphine

I know everything I need to know about these samples from their metadata!

42 of 92

What are important things to keep in mind when creating metadata?

43 of 92

Examples of metadata categories:

  • Patient/organism of origin
  • Patient/organism information
    • Demographics
    • Disease state
    • Treatment state
    • Time point (if applicable)
  • Processing information
    • Batch information
    • Processing details (e.g., isolation methods: poly-A vs. ribo-minus)
  • Anything that should be known about the samples and their handling!

44 of 92

If you have human data, the metadata are probably loaded with PII (personally identifiable information) and/or PHI (protected health information)

45 of 92

Rules for creating metadata (from Broman & Woo, 2017)

Be Consistent

Choose good names for things

Write Dates as YYYY-MM-DD

No Empty Cells

Put Just One Thing in a Cell

Make it a Rectangle

46 of 92

Rules for creating metadata continued (from Broman & Woo, 2017)

Create a Data Dictionary

No Calculations in the Raw Data Files

Do Not Use Font Color or Highlighting as Data

Make Backups

Use Data Validation to Avoid Errors
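
A few of these rules can be checked automatically. The sketch below is one possible approach, not something taken from Broman & Woo: it assumes the metadata are saved as CSV text, reuses the column names from the mouse example, and treats any column whose name ends in `_date` as a date column. The function name `check_metadata` and the example rows are made up for illustration.

```python
import csv
import io
import re
from datetime import datetime

# Assumption: date columns are recognizable because their names end in "_date".
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

def check_metadata(csv_text):
    """Return a list of human-readable problems found in a metadata table."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header, records = rows[0], rows[1:]
    for line_number, record in enumerate(records, start=2):
        # "Make it a Rectangle": every row needs as many cells as the header.
        if len(record) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} cells, found {len(record)}"
            )
            continue
        for column, value in zip(header, record):
            if value.strip() == "":
                # "No Empty Cells"
                problems.append(f"line {line_number}: empty cell in '{column}'")
            elif column.endswith("_date"):
                # "Write Dates as YYYY-MM-DD": check the shape, then the calendar.
                try:
                    if not ISO_DATE.fullmatch(value):
                        raise ValueError
                    datetime.strptime(value, "%Y-%m-%d")
                except ValueError:
                    problems.append(
                        f"line {line_number}: '{value}' in '{column}' is not YYYY-MM-DD"
                    )
    return problems

# Hypothetical metadata with one badly formatted date and one empty cell.
example = """sample_id,mouse_id,processing_date,treatment
A,1,2021-03-10,None
B,1,4-12-21,None
C,2,2021-03-10,
"""

for problem in check_metadata(example):
    print(problem)
```

Running this on the example reports the date on line 3 and the empty treatment cell on line 4. Checks like these complement, rather than replace, a data dictionary and the validation features built into spreadsheet software.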

47 of 92

48 of 92

49 of 92

How does sequencing work?

Made with Biorender

50 of 92

Sequence related biases

  • GC bias - guanine-cytosine pairs melt at a higher temperature, so sequences with a lot of G’s and C’s can be amplified and sequenced unevenly

  • Sequence complexity - certain sequences are more likely to have primers bind to them (and so are more likely to be sequenced)

  • Length bias - longer targets are more likely to be amplified or sequenced

These biases are worsened by PCR amplification!
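
To get a feel for GC bias, it can help to look at the GC content of your own sequences. The following is a minimal sketch in plain Python; the function name `gc_fraction` and the three short reads are invented for illustration, and established QC tools report this kind of summary in far more detail.

```python
def gc_fraction(sequence):
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C."""
    sequence = sequence.upper()
    gc = sum(sequence.count(base) for base in "GC")
    at = sum(sequence.count(base) for base in "AT")
    total = gc + at  # ambiguous bases such as N are ignored
    return gc / total if total else 0.0

# Made-up reads: AT-rich, GC-rich, and one containing ambiguous N bases.
reads = {
    "read_1": "ATATATTTAA",
    "read_2": "GCGCGGCCAT",
    "read_3": "ACGTNNACGT",
}

for name, seq in reads.items():
    print(name, round(gc_fraction(seq), 2))
```

If reads from very GC-rich or very GC-poor regions are consistently under- or over-represented, that is a hint to look for a bias-correction option in the tool you are using.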

51 of 92

Some tools have algorithms that can mitigate these biases – you may have to use the right options!

52 of 92

How does sequencing work?

Made with Biorender

53 of 92

What parts of the genome are you targeting?

54 of 92

Single-end vs paired-end

Image from https://open.oregonstate.education/appliedbioinformatics/chapter/chapter-6/

55 of 92

How does sequencing work?

Made with Biorender

56 of 92

A very very general sequencing file format workflow

Image by Candace Savonen using SmartDraw

57 of 92

Depth and coverage

Image made by Candace Savonen with Biorender
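
As a rough worked example, the sketch below assumes the common usage where "depth" is how many reads overlap a position and "coverage" (breadth) is the fraction of the target seen at least once; terminology varies between tools. The function names and numbers are illustrative only.

```python
def mean_depth(number_of_reads, read_length, target_size):
    """Expected average depth: total sequenced bases divided by target size."""
    return number_of_reads * read_length / target_size

def breadth_of_coverage(per_base_depth, minimum_depth=1):
    """Fraction of positions covered by at least `minimum_depth` reads."""
    covered = sum(1 for depth in per_base_depth if depth >= minimum_depth)
    return covered / len(per_base_depth)

# One million 100 bp reads over a hypothetical 5 Mb target.
print(mean_depth(1_000_000, 100, 5_000_000))

# Hypothetical per-base depths for a five-base region.
print(breadth_of_coverage([0, 2, 3, 0, 5]))
```

The first value is an average only: real depth is uneven across the target, which is why breadth at a minimum depth is often reported alongside it.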

58 of 92

Alignment

Image from Biorender

59 of 92

What types of file formats are you most commonly working with for your genomic data?

60 of 92

READs! What are they?

Image by Candace Savonen using SmartDraw

61 of 92

What is a FASTQ file even?
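
In short, each read in a FASTQ file takes four lines: an identifier starting with `@`, the sequence, a separator line starting with `+`, and a quality string with one character per base. The sketch below assumes unwrapped four-line records and Phred+33 quality encoding; the function name `parse_fastq` and the example read are made up, and real pipelines would normally use an established parser.

```python
def parse_fastq(lines):
    """Yield (read_id, sequence, quality_scores) from an iterable of FASTQ lines.

    Assumes each record is exactly four lines and qualities are Phred+33.
    """
    lines = [line.rstrip("\n") for line in lines]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        # Phred+33: quality score = ASCII code of the character minus 33.
        scores = [ord(character) - 33 for character in quality]
        yield header[1:], sequence, scores

# A single invented read.
example = ["@read_1\n", "GATTACA\n", "+\n", "IIII#5?\n"]

for read_id, sequence, scores in parse_fastq(example):
    print(read_id, sequence, scores)
```

Higher scores mean more confident base calls, so the `#` in the example marks a base the sequencer was much less sure about than its neighbors.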

62 of 92

What tools would you like to use to cook your data?

Image by Candace Savonen using IconFinder

Raw Data

Normalized Data

Summarized Data

Plots and Results!

63 of 92

What programs or languages do you use to process and handle your data?

64 of 92

65 of 92

Programming languages common for genomics tools:

R programming – great for stats and genomic data

Python - a bit more versatile and generally applicable, computationally powerful

For more resources for learning these: https://hutchdatascience.org/code_review/more_resources.html

66 of 92

Reproducible analysis tools with a GUI

67 of 92

Image from Jeremy Goecks

68 of 92

69 of 92

70 of 92

71 of 92

“Approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions”

  • Ziemann, Eren, El-Osta, 2016
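
The classic failure is a gene symbol such as SEPT2 or MARCH1 being silently turned into a date like "2-Sep" or "1-Mar". The sketch below is one simple way to screen a gene list for such entries; the function name `find_converted_gene_names` and the example list are hypothetical, and the pattern only catches the day-month display style, not every format a spreadsheet might produce.

```python
import re

# Matches entries such as "2-Sep" or "1-Mar" (day, hyphen, month abbreviation).
DATE_LIKE = re.compile(
    r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$",
    re.IGNORECASE,
)

def find_converted_gene_names(gene_list):
    """Return entries that look like dates rather than gene symbols."""
    return [gene for gene in gene_list if DATE_LIKE.match(gene.strip())]

# Invented gene list as it might look after a round trip through a spreadsheet.
genes = ["TP53", "2-Sep", "1-Mar", "BRCA1", "DEC1"]

print(find_converted_gene_names(genes))
```

A check like this flags the problem after the fact; avoiding it in the first place means importing gene columns as text or keeping gene lists out of spreadsheets.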

72 of 92

73 of 92

How do you go about choosing what tools to use with your data?

74 of 92

Considerations for choosing tools:

Is it appropriate for your data type?

Is it an interface or programming language you feel comfortable with?

How much computing power do you have?

Are there benchmarking papers that compare the tool options?

Is the tool well documented and usable?

Is the tool well-maintained?

Is the tool generally accepted by the field?

75 of 92

76 of 92

77 of 92

What annotations are you generally looking to describe your data with?

78 of 92

What tools would you like to use to cook your data?

Image by Candace Savonen using IconFinder

Raw Data

Normalized Data

Summarized Data

Plots and Results!

79 of 92

Reference genomes

Made with Biorender

80 of 92

Genome versions - they are important!

From the Genome Reference Consortium

https://www.ncbi.nlm.nih.gov/grc/human

Different version names (for human)
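
The same human assembly often goes by a UCSC-style name and a Genome Reference Consortium (or NCBI) name, for example hg38 and GRCh38. Below is a small sketch of a lookup that checks whether two files claim the same build before you combine them; the helper names are made up, and only three widely used human builds are listed.

```python
# UCSC-style names mapped to the GRC/NCBI names for the same human assemblies.
HUMAN_BUILD_ALIASES = {
    "hg18": "NCBI36",
    "hg19": "GRCh37",
    "hg38": "GRCh38",
}

def canonical_build(name):
    """Return the GRC/NCBI name for a human build given under either naming style."""
    cleaned = name.strip().lower()
    for ucsc_name, grc_name in HUMAN_BUILD_ALIASES.items():
        if cleaned in (ucsc_name.lower(), grc_name.lower()):
            return grc_name
    raise ValueError(f"unrecognized human genome build: {name!r}")

def same_build(first, second):
    """True when both names refer to the same assembly."""
    return canonical_build(first) == canonical_build(second)

print(same_build("hg38", "GRCh38"))
print(same_build("hg19", "GRCh38"))
```

Even when two names match, files from different providers can still differ in details such as chromosome naming ("chr1" versus "1"), so it is worth checking those as well.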

81 of 92

82 of 92

83 of 92

Ensembl annotation data

84 of 92

What does a GTF file look like?
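
A GTF file is tab-separated text with nine columns per feature: sequence name, source, feature type, start, end, score, strand, frame, and a final attributes column of `key "value";` pairs. The sketch below parses one line under simplifying assumptions (no semicolons inside quoted values, and repeated attribute keys keep only the last value). The function name `parse_gtf_line` and the example line are invented and do not come from a real annotation.

```python
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attributes",
]

def parse_gtf_line(line):
    """Parse one GTF feature line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated fields, found {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    # Simplification: split on ';' and assume values contain no semicolons.
    for item in record["attributes"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip().strip('"')
    record["attributes"] = attributes
    return record

# Invented example line (not from a real annotation file).
example_line = (
    'chr1\texample\tgene\t1000\t5000\t.\t+\t.\t'
    'gene_id "GENE0001"; gene_name "EXAMPLE1";'
)

record = parse_gtf_line(example_line)
print(record["feature"], record["start"], record["end"], record["attributes"])
```

Real files also contain comment lines starting with `#`, which a full reader would skip before parsing.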

85 of 92

Cailin Jordan (5-7 min)

86 of 92

Jacob Greene (5-7 min)

87 of 92

88 of 92

How likely are you to use what you learned in your daily work?

89 of 92

How likely would you be to recommend this workshop?

90 of 92

What did you like most about the workshop?

91 of 92

Please share any recommendations you have for improvements.

92 of 92

Demographics Survey