Choosing Genomics Tools
Candace Savonen and Carrie Wright
https://bit.ly/genomics_itn
Except where otherwise indicated, the contents of this slide presentation are available for use under the Creative Commons Attribution 4.0 license.
You are free to adapt and share the work, but you must give appropriate credit, provide a link to the license, and indicate if changes were made.
Sample attribution: [Title of work] by the Johns Hopkins Data Science Lab. CC-BY 4.0
Terms of Use
Schedule for today
https://bit.ly/genomics_itn
Join at slido.com #7176191
Have your phone (or a separate tab) handy for interactive polls!
What's your favorite candy?
What is your email?
What would you like to learn from this workshop?
Informatics Technology for Cancer Research (ITCR)
ITCR tools: itcr.cancer.gov/informatics-tools
What is the ITN?
ITCR Training Network
Catalyzing informatics research through training opportunities
We are all busy - especially researchers!
https://media.giphy.com/media/q6RoNkLlFNjaw/giphy.gif
Technology is changing quickly & it’s hard to keep up!
ITCR developers keep making more awesome software!
https://media.giphy.com/media/lRnUWhmllPI9a/giphy.gif
Our guiding principle…
Research will advance faster if good informatics tools are accessible to a broad audience
Democratizing informatics also holds great power to improve diversity in research
https://c.tenor.com/lOM2TVfL0joAAAAM/democracy-mypostcard.gif
User preparedness
Gap
Tool usability
Informatics research is hindered by a gap between different types of experts
CC-BY jhudatascience.org - Image made by Candace Savonen using https://getavataaars.com/ and https://thenounproject.com/
Catalyzing Informatics for Research
Elements of ITN:
ITN courses
Current ITN courses: itcrtraining.org/courses
Management | Software Development | Tools and Resources | Best Practices |
Leadership for Cancer Informatics Research | Documentation & Usability | Computing for Cancer Informatics | Introduction to Reproducibility |
| AI for Software Development | Introduction to Overleaf and LaTeX for Writing Scientific Articles | Advanced Reproducibility |
| Software Development beyond Coding (coming soon!) | Choosing Genomics Tools | Ethical Data Handling |
| | | NIH Data management and Sharing Policy |
Image by Candace Savonen with Avataars and Openmoji.org
Your data are ready.
Image by Candace Savonen with Avataars, pixabay and openmoji.org
Genomic data
What is this and what do I do with it?
CC-BY
Concepts discussed in Choosing -omics Tools course:
What does your genomic data type represent?
What are the most common data processing steps for your data type?
Find resources, tools and tutorials to help you process and interpret your data
General Chapters
Data Specific Chapters
What kind of genomic data are you working with most frequently?
A Wikipedia for -omic analysis
Data types included so far:
And hope to add more! Let us know if you’d like to contribute! (Stipends available for grad students)
A Wikipedia for -omic analysis
This is a “living” course - as technologies and data type handling recommendations change, our course will too!
Genomics data analysis workflows
Genomics workflows in a very general sense
Image by Candace Savonen using IconFinder
Raw Data
Normalized Data
Summarized Data
Plots and Results!
To inform us on our computational steps, we need to know a bit about the origins of our raw data!
Let’s talk a bit about how the genomic sausage is made!
Where do the raw data come from?
Made with Biorender
What do we need to know about this process in terms of data analysis?
Made with Biorender
What are metadata?
Let’s say you wanted to do an analysis with some data…
Metadata: Anything and everything that should be known about your samples!
A B C D
E F G H
sample_id | mouse_id | processing_date | treatment | … |
A | 1 | 3-10-21 | None | … |
B | 1 | 4-12-21 | None | … |
C | 2 | 3-10-21 | None | … |
D | 2 | 4-12-21 | None | … |
E | 3 | 3-10-21 | Morphine | … |
F | 3 | 4-12-21 | Morphine | … |
G | 4 | 3-10-21 | Morphine | … |
H | 4 | 4-12-21 | Morphine | … |
I know everything I need to know about these samples from their metadata!
What are important things to keep in mind when creating metadata?
Examples of metadata categories:
If you have human data, the metadata is probably loaded with personally identifiable information (PII) and/or protected health information (PHI)
Rules for creating metadata (from Broman & Woo, 2017)
Be Consistent
Choose good names for things
Write Dates as YYYY-MM-DD
No Empty Cells
Put Just One Thing in a Cell
Make it a Rectangle
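As a small illustration of the "Write Dates as YYYY-MM-DD" rule, the sketch below converts the processing dates from the example metadata table. It assumes those dates were written as month-day-year, which the table itself does not state; the function name `to_iso_date` is made up for this example.

```python
from datetime import datetime


def to_iso_date(text, source_format="%m-%d-%y"):
    """Convert a date string written in source_format to YYYY-MM-DD.

    The default format (month-day-two-digit-year) is an assumption about
    how the example metadata dates were recorded.
    """
    return datetime.strptime(text, source_format).date().isoformat()


# Dates copied from the processing_date column of the example table.
processing_dates = ["3-10-21", "4-12-21"]
iso_dates = [to_iso_date(d) for d in processing_dates]
print(iso_dates)  # ['2021-03-10', '2021-04-12']
```

Writing dates this way removes the ambiguity the conversion had to guess at: "3-10-21" could be March 10 or October 3, while "2021-03-10" can only be read one way.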
Rules for creating metadata continued (from Broman & Woo, 2017)
Create a Data Dictionary
No Calculations in the Raw Data Files
Do Not Use Font Color or Highlighting as Data
Make Backups
Use Data Validation to Avoid Errors
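Two of the rules above ("No Empty Cells" and "Make it a Rectangle") can be checked automatically. The following is a minimal sketch, assuming the metadata are stored as a comma-separated file with a header row; the function name `check_metadata` and the small example table are invented for illustration.

```python
import csv
import io


def check_metadata(csv_text):
    """Return a list of human-readable problems found in a metadata table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, records = rows[0], rows[1:]
    problems = []
    # The header is line 1, so the first record is line 2.
    for line_number, record in enumerate(records, start=2):
        if len(record) != len(header):
            # Not a rectangle: this row has too many or too few cells.
            problems.append(
                f"line {line_number}: expected {len(header)} cells, found {len(record)}"
            )
            continue
        for column, value in zip(header, record):
            if value.strip() == "":
                problems.append(f"line {line_number}: empty cell in column '{column}'")
    return problems


# Hypothetical table with one empty cell and one short row.
example = """sample_id,mouse_id,processing_date,treatment
A,1,2021-03-10,None
B,1,,None
C,2,2021-03-10
"""

for problem in check_metadata(example):
    print(problem)
```

Running a check like this every time the metadata file changes is one lightweight form of data validation; spreadsheet programs also offer built-in validation rules that serve the same purpose.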
How does sequencing work?
Made with Biorender
Sequence related biases
These biases are worsened by PCR amplification!
Some tools have algorithms that can mitigate these biases – you may have to use the right options!
What parts of the genome are you targeting?
Single-end vs paired-end
Image from https://open.oregonstate.education/appliedbioinformatics/chapter/chapter-6/
A very very general sequencing file format workflow
Image by Candace Savonen using SmartDraw
Depth and coverage
Image made by Candace Savonen with Biorender
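The two terms are used somewhat differently across sources. The sketch below assumes one common reading: depth is the number of reads overlapping a position, and coverage (breadth) is the fraction of positions overlapped by at least one read. The alignment coordinates are toy values made up for illustration, not real data.

```python
def per_base_depth(reference_length, alignments):
    """Count how many reads overlap each reference position.

    alignments holds (start, end) pairs, 0-based with an exclusive end.
    """
    depth = [0] * reference_length
    for start, end in alignments:
        for position in range(max(start, 0), min(end, reference_length)):
            depth[position] += 1
    return depth


# Three toy reads of length 5 aligned to a 10-base reference.
toy_alignments = [(0, 5), (2, 7), (3, 8)]
depth = per_base_depth(10, toy_alignments)

mean_depth = sum(depth) / len(depth)
breadth = sum(1 for d in depth if d > 0) / len(depth)

print(depth)       # [1, 1, 2, 3, 3, 2, 2, 1, 0, 0]
print(mean_depth)  # 1.5
print(breadth)     # 0.8
```

The mean depth here also equals (number of reads × read length) / reference length = (3 × 5) / 10, which is the usual back-of-the-envelope estimate before any alignment is done.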
Alignment
Image from Biorender
What types of file formats are you most commonly working with for your genomic data?
Reads! What are they?
Image by Candace Savonen using SmartDraw
What is a FASTQ file even?
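A FASTQ file stores each read as four lines: an identifier starting with `@`, the base sequence, a separator line starting with `+`, and one quality character per base. The sketch below reads that layout. It assumes unwrapped four-line records and the Phred+33 quality encoding (which most current files use, but not all); the function name `read_fastq` and the toy record are invented for illustration.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, qualities) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no extra information here
        quality_text = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got: {header!r}")
        if len(sequence) != len(quality_text):
            raise ValueError(f"sequence and quality lengths differ for {header!r}")
        # Phred+33: quality score = ASCII code of the character minus 33.
        qualities = [ord(character) - 33 for character in quality_text]
        yield header[1:], sequence, qualities


# A made-up record: four high-quality bases followed by a low-quality N.
toy_fastq = "@read_1\nACGTN\n+\nIIII#\n"

for read_id, sequence, qualities in read_fastq(io.StringIO(toy_fastq)):
    print(read_id, sequence, qualities)  # read_1 ACGTN [40, 40, 40, 40, 2]
```

For real analyses an established parser is a safer choice than hand-written code, since it will also handle compressed files and format edge cases.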
What tools would you like to use to cook your data?
What programs or languages do you use to process and handle your data?
Programming languages common in genomics:
R programming – great for stats and genomic data
Python – a bit more versatile and generally applicable, computationally powerful
For more resources for learning these: https://hutchdatascience.org/code_review/more_resources.html
GUI-based reproducible analysis tools
Image from Jeremy Goecks
From UFOtekkie’s twitter
“Approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions”
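The conversions in question happen when a spreadsheet program reads a gene symbol such as SEPT2 or MARCH1 as a date. The sketch below flags symbols in a gene list that look at risk before the list is opened in a spreadsheet. The pattern is an illustrative guess at date-like symbols, not an exhaustive or authoritative rule, and the function name is made up.

```python
import re

# Month-like prefix followed by one or two digits, e.g. SEPT2 or MARCH1.
# Illustrative only: spreadsheet auto-conversion has other triggers too.
DATE_LIKE = re.compile(
    r"^(SEPT|SEP|MARCH|MAR|DEC|OCT|NOV|FEB|APR|JAN|JUN|JUL|AUG)\d{1,2}$",
    re.IGNORECASE,
)


def flag_date_like_symbols(symbols):
    """Return the gene symbols that a spreadsheet might reinterpret as dates."""
    return [symbol for symbol in symbols if DATE_LIKE.match(symbol)]


genes = ["TP53", "SEPT2", "MARCH1", "BRCA1", "DEC1"]
print(flag_date_like_symbols(genes))  # ['SEPT2', 'MARCH1', 'DEC1']
```

Keeping gene lists in plain-text files handled by code, or importing the column explicitly as text, avoids the problem altogether.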
How do you go about choosing what tools to use with your data?
Considerations for choosing tools:
Is it appropriate for your data type?
Is it an interface or programming language you feel comfortable with?
How much computing power do you have?
Are there benchmarking papers that compare the tool options?
Is the tool well documented and usable?
Is the tool well-maintained?
Is the tool generally accepted by the field?
What annotations are you generally looking to describe your data with?
Reference genomes
Made with Biorender
Genome versions - they are important!
From the Genome Reference Consortium
https://www.ncbi.nlm.nih.gov/grc/human
Different version names (for human)
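The same human assembly often goes by two names: the UCSC-style name (hg19, hg38) and the Genome Reference Consortium name (GRCh37, GRCh38). The sketch below checks whether two files claim the same build before they are combined. The helper names are invented for illustration, and the alias table covers only these two builds.

```python
# UCSC-style name -> Genome Reference Consortium name (two recent human builds only).
BUILD_ALIASES = {"hg19": "GRCh37", "hg38": "GRCh38"}


def normalize_build(name):
    """Map either naming style to the GRC name, ignoring case and whitespace."""
    cleaned = name.strip().lower()
    for ucsc_name, grc_name in BUILD_ALIASES.items():
        if cleaned in (ucsc_name.lower(), grc_name.lower()):
            return grc_name
    raise ValueError(f"unrecognized genome build: {name!r}")


def same_build(first, second):
    """True when two build names refer to the same assembly version."""
    return normalize_build(first) == normalize_build(second)


print(same_build("hg38", "GRCh38"))  # True
print(same_build("hg19", "GRCh38"))  # False
```

Matching names are necessary but not sufficient: resources built on the same assembly can still differ in details such as chromosome naming ("chr1" versus "1"), so it is worth checking those conventions as well.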
What data file do I need?
Ensembl annotation data
What does a GTF file look like?
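A GTF file is tab-separated text with nine columns per feature: sequence name, source, feature type, start, end, score, strand, frame, and a final attribute column of `key "value";` pairs. The sketch below parses one such line. The example record is made up for illustration (it is not a real gene), and the attribute parsing is deliberately simple: it would break on values that themselves contain semicolons.

```python
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]


def parse_gtf_line(line):
    """Split one GTF feature line into a dict, with attributes as a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated fields, found {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    for item in record["attribute"].split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final semicolon
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    record["attribute"] = attributes
    return record


# Invented example line; real files also contain '#' comment lines at the top.
toy_line = '1\texample_source\tgene\t1000\t2000\t.\t+\t.\tgene_id "GENE0001"; gene_name "EXAMPLE1";'

record = parse_gtf_line(toy_line)
print(record["feature"], record["attribute"]["gene_name"])  # gene EXAMPLE1
# GTF coordinates are 1-based and include both ends.
print(record["end"] - record["start"] + 1)  # 1001
```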
Cailin Jordan (5-7 min)
Jacob Greene (5-7 min)
How likely are you to use what you learned in your daily work?
How likely would you be to recommend this workshop?
What did you like most about the workshop?
Please share any recommendations you have for improvements.
Demographics Survey