1 of 32

Data management best practices

March 2024

Dayane Araújo

www.ebi.ac.uk/training

2 of 32

Welcome to this virtual course

Course and events organisers

  • Dayane Araújo — Scientific Training Officer (dayane@ebi.ac.uk)
  • Brenda Stride — Postdoctoral programme Manager (stride@embl.de)
  • Sophie Spencer — Events Organiser (sophie@ebi.ac.uk)

3 of 32

After this session you should be able to

  • Define open science and explain the FAIR principles.
  • Summarise EMBL’s position with regard to open science and access appropriate sources of further information.
  • Locate appropriate data resources, metadata standards and tools to support you in making the outputs of your research both open and FAIR.
  • Participate in cultural change in support of fair research assessment and accurate attribution, for you, your co-authors and those who support your science.

4 of 32

Helpful information

5 of 32

Course handbook

  • Includes all relevant information about the course
  • Presentation slides and other relevant materials
  • Accessible for 6 months after the course

Bookmark it!

6 of 32

How to ask questions

  • Raise your hand and speak up
  • Write in the chat box
  • Write in the Q&A document

7 of 32

The importance of “good” data management

8 of 32

3 reasons to share your data…

Selfish

    • You need to know what is happening / has happened to your group’s data
    • Accessing your own data, reviewing and analysing, getting recognition

Scientific

    • Traceability, reproducibility, reliability

Good Citizen

    • Sharing with the wider community
    • Much of the data generated is only used to answer one or two very specific questions, yet it is potentially valuable to many others
    • Being more ‘green’: others may not need to do lab work that has already been done if it is well described and shared

9 of 32

Key thing!

Start with the plan…

“Your primary collaborator is yourself six months from now, and your past self doesn’t answer e-mails.”

Rachael Ainsworth, astrophysicist at the University of Manchester, UK

10 of 32

Data management plan (DMP)

EMBL requires that all projects have a data management plan. This applies to any project that is funded by grants, is part of a PhD or postdoctoral project, or is intended to support a scientific article.

11 of 32

Creating a data management plan

  • By default, EMBL researchers should use the EMBL Data Management Plan Template

12 of 32

How to get started?

Data management checklist

13 of 32

Best practice tips

14 of 32

Organising data

15 of 32

File names and folder structures

  • Are your folder structure and file naming convention consistent?
  • Are they clear to, and adopted by, everyone in your team? What about collaborators?
  • Do you use a README.txt file explaining naming conventions, folder structure and changes?
  • Are your files consistently versioned?
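
A naming convention is easier to keep consistent when it can be checked automatically. The sketch below assumes a hypothetical convention of the form YYYY-MM-DD_project_description_vNN.ext; the pattern and the example file names are illustrative only and are not a convention prescribed by the course.

```python
import re

# Hypothetical convention: YYYY-MM-DD_project_description_vNN.ext
# (an assumption for illustration; adapt the pattern to your team's rules)
NAME_PATTERN = re.compile(
    r"^\d{4}-\d{2}-\d{2}_[a-z0-9]+_[a-z0-9-]+_v\d{2}\.[a-z0-9]+$"
)


def follows_convention(filename: str) -> bool:
    """Return True if the file name matches the assumed convention."""
    return NAME_PATTERN.match(filename) is not None


def find_nonconforming(filenames: list[str]) -> list[str]:
    """Return the names that break the convention, in their original order."""
    return [name for name in filenames if not follows_convention(name)]


if __name__ == "__main__":
    names = [
        "2024-03-05_zebrafish_rnaseq-counts_v01.csv",
        "final_FINAL_really.xlsx",
    ]
    print(find_nonconforming(names))
```

Files such as README.txt would need to be excluded from a check like this, or the pattern extended to allow them.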

16 of 32

Analysing data

17 of 32

Planning analyses

When planning an analysis, we need to consider and keep track of:

  • The format of the data at all stages
  • Any pre-processing steps
  • Tools/software used, including which release/version
  • Parameters we select when using tools

Workflows go a step beyond keeping track of these elements by allowing us to build in decisions we make, such as parameters we select, ensuring reproducibility of our analyses.
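
Even without a workflow system, the items listed above can be captured in a small machine-readable record saved next to the results. The following is a minimal sketch; the function names, field names and example values are assumptions made for illustration.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def package_version(name):
    """Return the installed version of a package, or a note if it is absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return "not installed"


def build_analysis_record(input_format, preprocessing_steps, parameters, packages=()):
    """Collect data format, pre-processing, software versions and parameters."""
    software = {"python": platform.python_version()}
    for name in packages:
        software[name] = package_version(name)
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "input_format": input_format,
        "preprocessing_steps": list(preprocessing_steps),
        "software": software,
        "parameters": dict(parameters),
    }


# Example values are hypothetical.
record = build_analysis_record(
    input_format="FASTQ",
    preprocessing_steps=["adapter trimming"],
    parameters={"min_quality": 20},
    packages=["numpy"],
)
print(json.dumps(record, indent=2))
```

Writing the printed JSON to a file alongside the analysis output keeps the parameters and versions together with the results they produced.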

18 of 32

Workflows

Computational workflows allow for automation of multi-step analyses and support the reproducibility of analyses.

Workflows can include tools written in different programming languages. A workflow consists of a set of rules, which each have an input and output.
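
The idea of rules with inputs and outputs can be illustrated with a toy example. This is a sketch of the concept only, not how any particular workflow system is implemented; the Rule class, run_workflow function and the data are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    name: str
    inputs: list[str]          # names of the data items the rule needs
    output: str                # name of the data item the rule produces
    action: Callable[..., object]


def run_workflow(rules, initial_data):
    """Run each rule as soon as all of its inputs are available."""
    data = dict(initial_data)
    pending = list(rules)
    while pending:
        ready = [r for r in pending if all(i in data for i in r.inputs)]
        if not ready:
            stuck = [r.name for r in pending]
            raise ValueError(f"Cannot run rules with missing inputs: {stuck}")
        for rule in ready:
            data[rule.output] = rule.action(*(data[i] for i in rule.inputs))
            pending.remove(rule)
    return data


# The rules are listed out of order on purpose: the execution order is
# derived from the inputs and outputs, not from the order they are written in.
rules = [
    Rule("summarise", ["clean"], "mean", lambda xs: sum(xs) / len(xs)),
    Rule("filter", ["raw"], "clean", lambda xs: [x for x in xs if x >= 0]),
]
result = run_workflow(rules, {"raw": [4, -1, 6]})
print(result["mean"])  # prints 5.0
```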

19 of 32

Tools for developing workflows

There are many different platforms and tools for creating workflows, such as:

  • Nextflow - used by scientists to write, deploy and share data-intensive workflows. It is scalable and can be integrated with GitHub.

  • Snakemake - an open-source workflow management system used to create reproducible and scalable data analyses.

  • Galaxy - an open-source platform for data analysis that allows users to create and use workflows of tools, including running code, through its graphical web interface.

Learn more about workflows with this short introduction.

20 of 32

Storing data

21 of 32

Storing your data

  • Is it backed up?
  • Does your university provide storage?
  • Is it in a place where you, and your collaborators, can find and access it?
  • How long will the data be accessible?
  • What if someone asks you for your data in 5 or 10 years’ time?
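
One way to answer "Is it backed up?" with some confidence is to compare checksums of the original files against the backup copy. The sketch below assumes both copies are reachable as local folders; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file's path (relative to the folder) to its checksum."""
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def compare_manifests(original: dict[str, str], backup: dict[str, str]) -> list[str]:
    """Return relative paths that are missing from, or differ in, the backup."""
    return sorted(
        name for name, checksum in original.items()
        if backup.get(name) != checksum
    )
```

Calling compare_manifests(build_manifest(original_folder), build_manifest(backup_folder)) returns an empty list when every original file has an identical copy in the backup.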

22 of 32

Record keeping

23 of 32

Good record keeping

Have you tried:

  • Electronic lab notebooks
  • Computational notebooks, such as Jupyter Notebooks

24 of 32

Sharing data

25 of 32

Good data management = good data

26 of 32

Choosing a data repository

General purpose

Discipline-specific

Adapted from ‘Managing and making the most of your data’ – Marta Teperek and Yasemin Türkyilmaz-van der Velden

27 of 32

Ontologies – adding more information

Ontologies make it easier to…

  • Understand and interpret data
  • Search and find links between data
  • Combine datasets
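
A toy example can show how shared terms help when searching and combining data. The synonym pairs below come from this slide's example terms; which label is treated as preferred, and the parent relationships, are assumptions made for illustration. A real ontology defines these relationships itself.

```python
# Synonym pairs from the slide's example; the preferred label is an assumption.
SYNONYMS = {
    "cardiac atrium": "atrium of heart",
    "cardiac ventricle": "ventricle of heart",
}

# Assumed parent terms, for illustration only.
PARENT = {
    "atrium of heart": "heart",
    "ventricle of heart": "heart",
    "cerebellum": "brain",
    "hypothalamus": "brain",
    "heart": "organism part",
    "brain": "organism part",
}


def normalise(term):
    """Map a label to its preferred form so that synonyms compare equal."""
    term = term.strip().lower()
    return SYNONYMS.get(term, term)


def ancestors(term):
    """List the broader terms above a term, nearest first."""
    chain = []
    current = normalise(term)
    while current in PARENT:
        current = PARENT[current]
        chain.append(current)
    return chain


def is_part_of(term, ancestor):
    """Return True if the term sits below the given broader term."""
    return normalise(ancestor) in ancestors(term)


# Samples annotated with different labels can still be found together.
samples = {"s1": "cardiac atrium", "s2": "cerebellum", "s3": "ventricle of heart"}
heart_samples = sorted(
    name for name, tissue in samples.items() if is_part_of(tissue, "heart")
)
print(heart_samples)  # prints ['s1', 's3']
```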

[Figure: example ontology terms: organism part, brain, cerebellum, hypothalamus, heart, cardiac atrium (= atrium of heart), cardiac ventricle (= ventricle of heart)]

28 of 32

Finding ontologies

29 of 32

What if I am only using public data?

30 of 32

You can still apply these practices!

  • What was the dataset identifier?
  • Where and when did you access and download it?
  • What did you do to the data?
  • Have you got the correct citations for the data?
  • Don’t forget the tools and software as well!
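
The answers to these questions can be stored in a small record kept alongside the downloaded files. This is a minimal sketch; the class name, the field names, the identifier, the URL and the tool name are all hypothetical placeholders.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class DatasetAccessRecord:
    identifier: str            # what was the dataset identifier?
    source_url: str            # where did you access it?
    accessed_on: str           # when did you download it? (ISO date)
    citation: str              # how should the data be cited?
    processing_steps: list[str] = field(default_factory=list)  # what you did
    tools: dict[str, str] = field(default_factory=dict)        # tool -> version

    def to_json(self) -> str:
        """Serialise the record so it can be saved next to the data."""
        return json.dumps(asdict(self), indent=2)


# All values below are placeholders, not a real dataset.
record = DatasetAccessRecord(
    identifier="EXAMPLE-0001",
    source_url="https://repository.example.org/EXAMPLE-0001",
    accessed_on=date(2024, 3, 5).isoformat(),
    citation="Doe J. 2024. Example dataset. Example Repository. EXAMPLE-0001",
)
record.processing_steps.append("Removed samples with missing metadata")
record.tools["example-tool"] = "1.2.0"
print(record.to_json())
```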

31 of 32

How to cite data

Minimum information:

Author(s). Year. Title. Repository. (Version). Identifier

Check with the database for any specific way to cite it.
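
The minimum-information pattern can be assembled programmatically from its fields. The sketch below treats the version as optional, as the brackets in the pattern suggest; the function name, the "Version" wording and the example values (including the identifier) are assumptions, and a database's own citation guidance takes precedence.

```python
def format_data_citation(authors, year, title, repository, identifier, version=None):
    """Build 'Author(s). Year. Title. Repository. (Version). Identifier'."""
    parts = ["; ".join(authors), str(year), title, repository]
    if version:
        parts.append(f"Version {version}")
    parts.append(identifier)
    # Strip trailing full stops so joining never produces a double full stop.
    return ". ".join(part.rstrip(".") for part in parts)


# Placeholder values, not a real dataset.
citation = format_data_citation(
    authors=["Doe J", "Roe R"],
    year=2024,
    title="Example dataset",
    repository="Example Repository",
    identifier="doi:10.0000/example",
    version="2",
)
print(citation)
```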

32 of 32

Any questions?
