1 of 8

Genotype-phenotype associations

Ben C Calverley

CBB440: Applied Bioinformatics and Computational Biology

2 of 8

Welcome!

Who am I?

  • Huddersfield, UK
  • MMath Mathematics, University of Oxford
  • MSc Theoretical Physics, King’s College London
  • PhD Quantitative and Biophysical Biology, University of Manchester
  • Postdoc in Balch Lab, Scripps Research
  • Co-chair, Scripps Research Pride Alliance

3 of 8

Learning objectives

  • Understand the meaning of genotype and phenotype.
  • Learn about CRISP-DM (Cross-Industry Standard Process for Data Mining), a widely used process model for data science projects.
  • Learn how to explore and understand data and form hypotheses for testing.
  • Understand key principles of genotype-phenotype association studies.
  • Learn statistical and computational techniques for modeling these relationships.
  • Gain hands-on experience with tools and real-world data.

4 of 8

Why?

  • Personalised medicine
  • Deep learning & AI
  • Fundamental understanding
  • Specific diseases
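    • NPC (Niemann-Pick disease type C)
    • AAT (alpha-1 antitrypsin deficiency)
    • CF (cystic fibrosis)
    • CAD (coronary artery disease)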

5 of 8

Module overview

Week 1: CRISP-DM, data exploration, hypotheses

Week 2: Different analysis methods

Week 3: Visualising and analysing results

Session 1

  • What is CRISP-DM?
  • Synthesising data
  • Exploring and preparing data for analysis
  • PCA and dimensionality reduction

  • GWAS

Session 2

  • Exercises in data exploration and preparation
  • Generating hypotheses
  • Exercises
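
As a taste of two topics listed above (synthesising data and GWAS), here is a minimal sketch rather than the course's own code. It simulates genotypes as 0/1/2 minor-allele counts, gives one variant a real effect on a binary phenotype, and then tests every variant with a Cochran-Armitage trend test. All names and parameter values (simulate_cohort, trend_test, the effect size, the allele frequency) are hypothetical choices made for illustration.

```python
import math
import random


def simulate_cohort(n_people=2000, n_variants=20, causal=3,
                    effect=0.8, maf=0.3, seed=0):
    """Synthesise genotypes (0/1/2 minor-allele counts) and a binary phenotype.

    Assumption: variants are independent and in Hardy-Weinberg equilibrium,
    and only the variant at index `causal` changes disease risk.
    """
    rng = random.Random(seed)
    genotypes, phenotypes = [], []
    for _ in range(n_people):
        # Two independent allele draws per variant.
        g = [(rng.random() < maf) + (rng.random() < maf)
             for _ in range(n_variants)]
        # Logistic risk model driven by the single causal variant.
        logit = -1.0 + effect * (g[causal] - 2 * maf)
        risk = 1 / (1 + math.exp(-logit))
        genotypes.append(g)
        phenotypes.append(1 if rng.random() < risk else 0)
    return genotypes, phenotypes


def trend_test(genotype_column, phenotypes):
    """Cochran-Armitage trend test: chi2 = N * r^2 with 1 degree of freedom."""
    n = len(phenotypes)
    mean_g = sum(genotype_column) / n
    mean_y = sum(phenotypes) / n
    cov = sum((g - mean_g) * (y - mean_y)
              for g, y in zip(genotype_column, phenotypes))
    var_g = sum((g - mean_g) ** 2 for g in genotype_column)
    var_y = sum((y - mean_y) ** 2 for y in phenotypes)
    if var_g == 0 or var_y == 0:
        return 0.0, 1.0  # no variation, so nothing to test
    chi2 = n * cov * cov / (var_g * var_y)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def scan(genotypes, phenotypes):
    """Test each variant in turn, as a GWAS does genome-wide."""
    n_variants = len(genotypes[0])
    return [trend_test([row[j] for row in genotypes], phenotypes)
            for j in range(n_variants)]


genotypes, phenotypes = simulate_cohort()
p_values = [p for _, p in scan(genotypes, phenotypes)]
top = min(range(len(p_values)), key=p_values.__getitem__)
print(f"most associated variant: {top}, p = {p_values[top]:.2e}")
```

Because many variants are tested, the usual significance threshold is tightened, for example with a Bonferroni correction (0.05 divided by the number of variants). Real studies also adjust for covariates and population structure, which this sketch ignores.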

6 of 8

Coronary Artery Disease

In the United States:

  • Heart disease is the leading cause of death for men, women, and people of most racial and ethnic groups.
  • One person dies every 33 seconds from cardiovascular disease.
  • In 2022, 702,880 people died from heart disease. That's the equivalent of 1 in every 5 deaths.
  • Coronary heart disease is the most common type of heart disease. It killed 371,506 people in 2022.
  • About 1 in 20 adults aged 20 and older have CAD.
  • In the United States, someone has a heart attack every 40 seconds.

Source: https://www.cdc.gov/heart-disease/data-research/facts-stats/index.html
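
The figures above can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted on this slide. Note that the 33-second figure refers to cardiovascular disease, a broader category than heart disease, so it implies more deaths per year than the 702,880 heart disease deaths. The variable names are illustrative, and "1 in 5" is a rounded figure, so the implied total is only approximate.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

heart_disease_deaths = 702_880  # 2022, from the slide
chd_deaths = 371_506            # coronary heart disease, 2022, from the slide

# Average time between heart disease deaths.
seconds_per_heart_death = SECONDS_PER_YEAR / heart_disease_deaths

# Share of heart disease deaths attributed to coronary heart disease.
chd_share = chd_deaths / heart_disease_deaths

# "1 in every 5 deaths" implies a rough total for all causes.
implied_total_deaths = heart_disease_deaths * 5

# "One person dies every 33 seconds" implies this many cardiovascular deaths.
implied_cvd_deaths = SECONDS_PER_YEAR / 33

print(f"one heart disease death every {seconds_per_heart_death:.0f} s")
print(f"coronary heart disease share: {chd_share:.0%}")
print(f"implied deaths from all causes: {implied_total_deaths:,}")
print(f"implied cardiovascular deaths: {implied_cvd_deaths:,.0f}")
```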

7 of 8

Our dataset – UK Biobank

  • Genetics: Whole genome sequencing for all 500,000 participants, whole exome sequencing for 470,000 participants, genotyping (800,000 genome-wide variants and imputation to 90 million variants).
  • Health linkages: Linkage to a wide range of electronic health-related records, including death, cancer, hospital admissions and primary care records. 
  • Biomarkers: Data on more than 30 key biochemistry markers from all participants, taken from samples collected at recruitment and the first repeat assessment. 
  • Activity monitor: Physical activity data over a 7-day period collected via a wrist-worn activity monitor for 100,000 participants plus a seasonal follow-up on a subset.
  • Online questionnaires: Data on a range of exposures and health outcomes that are difficult to assess via routine health records, including diet, food preferences, work history, pain, cognitive function, digestive health and mental health.
  • Samples: Blood and urine were collected from all participants, and saliva from 100,000.

Source: https://www.ukbiobank.ac.uk/enable-your-research/about-our-data

8 of 8

To the code!
