1 of 34

Datasets

Sebastian

@s_urchs (slack: @surchs)

2 of 34

Before we start:

3 of 34

If you ask yourself...

how

  • do I choose a good dataset?
  • do I find open datasets?
  • do I get the data?
  • do I work with the data?

then this is for you!

4 of 34

How do I choose a good dataset?

Ask yourself:

  • How easily can I get access?
  • How “raw” is it?
  • How useful is it?

Copied liberally from Chris Gorgolewski’s slides (@chrisgorgo)

5 of 34

Ease of access

Signed Data Usage Agreement

Access through managed database

“Just get it”

Direct download

“It’s right there”

Hard

Easy

6 of 34

How raw is the data set?

Raw

  • DICOM
  • Idiosyncratic organization
  • Must be converted

Organized

  • Ideally standard organization (BIDS)
  • Text-based metadata
  • Must be preprocessed

Preprocessed

  • Minimally or fully preprocessed
  • Standardized organization
  • Data Quality metrics
  • Must be analyzed

Derivatives

  • Statistical maps
  • Summary metrics

More control

Less work

7 of 34

How useful is the data set?

Data Quality

  • Is there a publication describing the data?
  • Did other studies re-use the data?
  • Ask around

Metadata Quality

  • Are the data well described (inclusion, acquisition, processing)?
  • Are the metrics you are interested in available?
  • How much data are missing?

Data Cost

  • How large is the data set (storage space, download)?
  • Are the data preprocessed?
  • Could you preprocess them (time, resources, knowledge)?
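
A quick way to answer "how much data are missing?" is to count missing values per column in the dataset's tabular metadata. The sketch below assumes a tab-separated table in the style of a BIDS participants file, where missing values are written as "n/a". The other missing markers, the function name, and the example table are made up for illustration.

```python
import csv
import io

# Values treated as missing. BIDS uses "n/a"; the other markers are assumptions.
MISSING_MARKERS = {"", "n/a", "na", "nan"}


def missing_fraction(tsv_text):
    """Return {column: fraction of rows whose value is missing}."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    if not rows:
        return {}
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column in counts:
            # Short rows yield None for absent cells; treat those as missing too.
            value = (row[column] or "").strip().lower()
            if value in MISSING_MARKERS:
                counts[column] += 1
    return {column: count / len(rows) for column, count in counts.items()}


# Made-up example table, not taken from any real dataset.
example = (
    "participant_id\tage\tsex\n"
    "sub-01\t25\tF\n"
    "sub-02\tn/a\tM\n"
    "sub-03\t31\tn/a\n"
    "sub-04\tn/a\tF\n"
)

print(missing_fraction(example))
```

For a real dataset, read the file's text and pass it to the same function.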

8 of 34

How do I find open datasets?

9 of 34

FCP-INDI

10 of 34

OpenNeuro

11 of 34

Canadian Open Neuroscience Portal

12 of 34

openMorph

13 of 34

nilearn

14 of 34

How do I get the data?

“I just want the data”

“I’m just testing things”

“I want to know what happened to my data”

15 of 34

Nilearn example

  1. Pick a dataset
  2. Ask nilearn to get it
  3. Load the data

16 of 34

Nilearn - pick a dataset

17 of 34

Nilearn - get the data

18 of 34

Nilearn - access the data

19 of 34

Nilearn - access the data
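
The three nilearn steps could look roughly like the sketch below. The dataset on the original slides is not recoverable here, so the choice of `fetch_adhd` is an assumption; any other fetcher in `nilearn.datasets` follows the same pattern. The helper names (`fetch_example`, `load_first_scan`, `summarize`) are made up for illustration.

```python
def fetch_example(n_subjects=1, data_dir=None):
    """Steps 1 and 2: pick a dataset and ask nilearn to download it."""
    # Imported here so this file can be loaded without nilearn installed.
    from nilearn import datasets

    # fetch_adhd is one of nilearn's dataset fetchers; picking it is an assumption.
    return datasets.fetch_adhd(n_subjects=n_subjects, data_dir=data_dir)


def load_first_scan(dataset):
    """Step 3: load the first functional image listed in the fetched dataset."""
    from nilearn import image

    return image.load_img(dataset["func"][0])


def summarize(dataset):
    """Count the entries under each list-like key of a dict-like dataset."""
    return {
        key: len(value)
        for key, value in dataset.items()
        if isinstance(value, (list, tuple))
    }
```

Calling `fetch_example()` downloads data on first use and reuses the local copy afterwards. `summarize(fetch_example())` then shows how many files came back, and `load_first_scan(...)` returns an image object whose `.shape` you can inspect.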

20 of 34

Amazon S3 example

  1. Pick a dataset
  2. Log in to the bucket
  3. Download the data

21 of 34

Amazon S3 - pick a dataset

  • Go to https://openneuro.org/, search for ‘midnight’

  • Click ‘download’

22 of 34

Amazon S3 - pick a dataset

  • Copy the AWS S3 path

23 of 34

Amazon S3 - get the data

  • Enter the info from OpenNeuro in the “Path” field.

  • Copy the line from OpenNeuro
  • Replace “sync” with “ls”
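
The "replace sync with ls" step can be sketched as a small helper that turns the copied download line into a listing command, so you can look before you download. This assumes the copied line has the shape `aws s3 sync [flags] s3://... local-folder`. The dataset ID `ds000000` below is a placeholder, not a real accession number, and `sync_to_ls` is a made-up name.

```python
import shlex


def sync_to_ls(sync_command):
    """Turn an 'aws s3 sync' line into the matching 'aws s3 ls' line.

    Only flag-style options (such as --no-sign-request) are carried over.
    The local target folder is dropped because 'ls' takes just the S3 path.
    """
    tokens = shlex.split(sync_command)
    if tokens[:3] != ["aws", "s3", "sync"]:
        raise ValueError("expected a command starting with 'aws s3 sync'")
    rest = tokens[3:]
    flags = [token for token in rest if token.startswith("--")]
    s3_paths = [token for token in rest if token.startswith("s3://")]
    if len(s3_paths) != 1:
        raise ValueError("expected exactly one s3:// path")
    s3_path = s3_paths[0]
    # A trailing slash lists the folder's contents rather than the folder itself.
    if not s3_path.endswith("/"):
        s3_path += "/"
    return shlex.join(["aws", "s3", "ls", *flags, s3_path])


# Placeholder line; paste the one copied from the dataset's download page instead.
copied = "aws s3 sync --no-sign-request s3://openneuro.org/ds000000 ds000000-download/"
print(sync_to_ls(copied))
```

Run the printed command in a terminal with the AWS command line tools installed. Once the listing looks right, run the original sync line to download.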

26 of 34

How do I work with the data?

  • Respect the data
    • Sign and follow data usage agreement
    • Securely store identifiable information
    • Cite the data source
  • Document what you do
    • Where did you download from?
    • What command did you use (script if possible)?
    • How did you select data?
    • What processing did you use?
  • Organize your project
    • Use version control!
    • Use virtualization
    • Follow standards (BIDS)
    • You will thank yourself later
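
One lightweight way to "document what you do" is to save a small record next to the data that answers the four questions above. This is a minimal sketch. The field names, the `DownloadRecord` class, and the example values (including the placeholder dataset ID `ds000000`) are made up for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DownloadRecord:
    source: str      # Where did you download from?
    command: str     # What command did you use?
    selection: str   # How did you select data?
    processing: list = field(default_factory=list)  # What processing did you use?


def to_json(record):
    """Serialize the record as readable JSON with stable key order."""
    return json.dumps(asdict(record), indent=2, sort_keys=True)


# Hypothetical example values.
record = DownloadRecord(
    source="https://openneuro.org/",
    command="aws s3 sync --no-sign-request s3://openneuro.org/ds000000 ds000000-download/",
    selection="all subjects, resting-state runs only",
    processing=["none yet"],
)

print(to_json(record))
```

Write the output to a file in the project folder and put it under version control together with your scripts.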

27 of 34

I want more

data

28 of 34

More databases

https://zenodo.org/

Repository for data associated with publications

Digital object identifier

Any license

Hosted by CERN (cool)

https://figshare.com/

Repository for data

Digital object identifier

Creative Commons license

Has commercial side

https://datasetsearch.research.google.com/

Dataset search engine

Doesn’t store anything

Lets you search other databases

29 of 34

Preprocessed data

30 of 34

Longitudinal / reproducibility data

31 of 34

Derivative data

https://neurovault.org/

Repository of statistical maps of completed studies

http://neurosynth.org/

Aggregated activation data maps with keyword search

Neurosynth-genes has the gene expression data from the Allen Institute for Brain Science

32 of 34

Good people to talk to

https://neurostars.org/

Great resource for asking questions and getting feedback

33 of 34

Cool data that take longer to get

https://www.ukbiobank.ac.uk/

> 100,000 individuals

Deep metadata

Genetics (prospective whole genome sequencing)

Extensive imaging data

Medical records

http://portal.brain-map.org/

Human brain gene expression maps

Histological and developmental atlases

Extensive mouse data

https://db.humanconnectome.org/

Very high resolution data

Publicly available data for 1200 healthy individuals

Long, repeated imaging data (task and resting state)

Deep metadata

34 of 34

Share that brain

Katherine Karlsgodt, 2019