1 of 38

Data reuse

Find, access and evaluate research data for reuse.

David Rayner, Training coordinator, Swedish National Data Service (SND)

University of Gothenburg - Chalmers University of Technology - Karolinska Institutet - KTH Royal Institute of Technology - Lund University - Stockholm University - Swedish University of Agricultural Sciences - Umeå University - Uppsala University

2 of 38

Overview of the lesson

Swedish National Data Service

3 of 38

What does “reuse research data” mean?

“using research data produced or collected by others for your own research”. You can:

  • build on previous research
  • synthesize data from multiple studies to derive broader insights (“meta-analysis”)
  • obtain “ground-truthing” and calibration data for your research
  • avoid unnecessary experiments and unnecessary costs
  • ease the burden on over-researched populations.

4 of 38

Other definitions of data “reuse”?

  • in the simplest situation, data are collected by one individual, for a specific research project, and the first “use” is by that individual to ask a specific research question. If that same individual returns to that same dataset later, whether for the same or a later project, that usually would be considered a “use”

Pasquetto, Irene V., Randles, Bernadette M., and Borgman, Christine L. 2017. On the Reuse of Scientific Data. Data Science Journal 16: art8. DOI:10.5334/dsj-2017-008

  • A common characterization of “reuse” is as secondary use for purposes other than that for which the data were originally collected

Huggett J. Reuse Remix Recycle: Repurposing Archaeological Digital Data. Advances in Archaeological Practice. 2018;6(2):93-104. doi:10.1017/aap.2018.1

5 of 38

What is research data reused for?

Image: U3167879 (CC BY-SA 4.0)

  • New studies.
  • Feasibility studies, e.g. for project proposals.
  • Initialization, calibration, verification.
  • Teaching.

Lost or Found? Discovering Data Needed for Research

https://doi.org/10.1162/99608f92.e38165eb

6 of 38

Lost or Found? Discovering Data Needed for Research

https://doi.org/10.1162/99608f92.e38165eb

What is research data reused for?

7 of 38

Collection

Lost or Found? Discovering Data Needed for Research

https://doi.org/10.1162/99608f92.e38165eb

8 of 38

That’s interesting, because…

Based on narrative interviews with 8 participants about their experiences of reusing qualitative data:

  • Prior connection to the original data and original investigators (who collected and owned data) was the condition for reuse.
  • Choosing data from someone they already know is an important part of their trust judgment of data, because qualitative data is the end product of original investigators’ worldview, research philosophy, and experiences.
  • The researchers usually relied on original investigators during the process of understanding data for reuse.

“Making a square fit into a circle”: Researchers’ experiences reusing qualitative data

https://doi.org/10.1002/meet.2014.14505101140

9 of 38

So, our research question for today is:

  • Although “Knowing the Creator” ranked low as a criterion for deciding whether to use secondary data across all respondents in the Gregory et al. study, did it rank higher among social scientists?
  • To answer this question, we are going to need to find some data!

10 of 38

Finding data

11 of 38

Finding data – Resources for seeking data.

  • People - colleagues, collaborators, supervisors, data authors and support staff
  • General search engines
  • Domain data repositories and collections
  • Literature
  • Government data portals
  • Platforms with user-created data
  • Websites
  • General data repositories
  • Museums, libraries
  • Data vendors
  • Internal systems
  • Social media forums
  • Industry associations

12 of 38

Exercise 1: Find some data!

  • Split into groups of 3 or 4 people.
  • Each group is assigned one search strategy:
    • A: Literature search
    • B: Repository search
    • C: Web search

13 of 38

Strategy A – Literature Search

  • Use the title/citation provided to find the research article.
  • Read the research article – does it provide a link/reference to the data?

Lost or Found? Discovering Data Needed for Research

https://doi.org/10.1162/99608f92.e38165eb

14 of 38

Strategy B – Data repositories megasearch!

  • https://explore.openaire.eu => Search

15 of 38

Strategy C – General web search

  • Google, Bing, 🦆🦆💨, AI tools.

16 of 38

17 of 38

Access data

18 of 38

Access data.

  • Freely accessible data – check Terms of Use.
  • Personal data? – treat it as if you had collected it yourself. Register your personal data handling.
  • “Restricted” access from a repository? Follow restrictions!
  • You learned about data by reading a paper or at a conference?
    • Email corresponding author of article.
    • Can be good to offer collaboration!
    • Ask about terms of use if they provide data! You can propose treating the data as CC BY and see if they agree.

19 of 38

Exercise 2: Let’s access some data!

20 of 38

Exercise 2: Let’s access some data!

  • Might this dataset also help us answer our research question?
  • Hard to tell, documentation is not very good 😲
  • We can request the dataset to see what it contains!
  • Task – in your groups:
    • Go to the dataset.
    • Read the metadata (you don’t need to read the publication that is cited!)
    • ONE PERSON in the group -> “+ Add request for data”, follow the instructions!
    • Finish your request with “SciLifeLab course test request – please discard!”

Accessibility and reuse of research data - a study of the attitudes among professors and graduate students, 2009

https://doi.org/10.5878/3h5f-mj41

21 of 38

Can you trust the data?

22 of 38

23 of 38

Paper mills in research

24 of 38

25 of 38

26 of 38

Can you trust the data?

  • Are the data hosted by a reputable agency or trusted research data repository?
  • Are the data from a dataset with a permanent identifier (e.g., a DOI)?
  • Is there a research paper associated with the dataset?
  • Have the data been reviewed or curated?
  • Are the data used by other researchers?
  • Is the data source clearly stated (if dataset contains other data)?
  • Do the data follow established standards?
  • Are contact details available for further inquiries?
  • Are there active support channels, discussion groups for the data?
  • Is there a guide for how to use the data?

27 of 38

Exercise 3: Can you trust the data?

  • In your groups
    • Look through the list of items for “Can you trust the data?”
    • Is there anything that you think is important that is missing? Add it in!
    • Rank the items in order from most important to least important. You can have ties.
  • Look at the DANS dataset landing page, and the DANS website.
    • Assess the dataset using the criteria from “Can you trust the data?”
    • Start with the items you ranked as most important.
    • Is this dataset trustworthy??

28 of 38

Evaluate data suitability.

Profile the data, then decide if it’s worth working with!

29 of 38

Are the data suitable for your needs? Ask yourself: what do you need?

  • Do you need specific data, or are you looking for data to help answer a research question? For example: PM10 and NO2 concentrations, or “air quality data” in general?
  • Do you need source data (e.g. observations, time series), or can you work with derived data such as incidence counts, cross-tabulations, or temporally or spatially aggregated data?
  • Can you “make do” with related data?
    • Related species?
    • Close geographical area, close in time to your study?
    • Survey question that is “similar enough” to be comparable?
  • Often helps to look at the data or make a histogram or line-plot.
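
A quick look at the distribution does not need a plotting library. The sketch below prints a text histogram using only the Python standard library. The function name `text_histogram`, the bin width and the PM10 readings are all made up for illustration.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return one line per occupied bin: the bin's lower edge and a bar of '#' marks."""
    bins = Counter((value // bin_width) * bin_width for value in values)
    return [f"{edge:>6} | {'#' * bins[edge]}" for edge in sorted(bins)]


# Hypothetical PM10 readings, invented for this example.
pm10 = [12, 14, 18, 21, 22, 25, 27, 31, 33, 95]
for line in text_histogram(pm10, 10):
    print(line)
```

Empty bins are skipped, so a jump in the left-hand labels points to an isolated value. Whether such a value is a genuine extreme or an error is something the documentation (or the data author) has to answer.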

30 of 38

How standardized is data collection?

“Many of us who have actually conducted clinical research, managed clinical studies and data collection and analysis, and curated data sets have concerns about the details [of other datasets].

Special problems arise if data are to be combined from independent studies and considered comparable. How heterogeneous were the study populations? Were the eligibility criteria the same? Can it be assumed that the differences in study populations, data collection and analysis, and treatments, both protocol-specified and unspecified, can be ignored?”

Data Sharing

10.1056/NEJMe15165

31 of 38

Data incompatibilities.

  • Data heterogeneity
    • e.g. “2020”, “01”, “01” vs “2020-01-01” vs “1 Jan 2020”
  • Different classification schemes, controlled vocabularies.
    • Re-grouping categorical data is less work than categorizing free-text data!
    • Search & Replace, OpenRefine.
  • Geographic aggregations don’t match.
    • e.g. Climate/ecological data might be presented for desert, semi-desert, tundra. Can be hard to combine with social science data for administrative areas.
  • Temporal resolution.
    • Cannot study extreme events using annual data.
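
As an illustration of the date example above, the three representations can be brought to one form with the Python standard library. This is a minimal sketch: the helper names `parse_date` and `date_from_parts` and the list of formats are assumptions, and `%b` matches English month abbreviations only under an English or default locale.

```python
from datetime import date, datetime

# Formats to try, in order. This list is an assumption; extend it for your data.
KNOWN_FORMATS = ("%Y-%m-%d", "%d %b %Y")


def parse_date(value):
    """Return a date for a string in one of KNOWN_FORMATS, or None if none match."""
    text = value.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None


def date_from_parts(year, month, day):
    """Build a date from separate year, month and day columns."""
    return date(int(year), int(month), int(day))


print(parse_date("2020-01-01"))             # single ISO-style string
print(parse_date("1 Jan 2020"))             # day, abbreviated month name, year
print(date_from_parts("2020", "01", "01"))  # three separate columns
print(parse_date("sometime in 2020"))       # no format matches, so None
```

Returning `None` for unparseable values, rather than guessing, keeps the problem rows visible so they can be checked by hand.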

32 of 38

Evaluate the data quality

33 of 38

Do the data have sufficient quality?

Some warning signs!

  • Inconsistent data representations
    • e.g. different date formats used in same column
  • Duplicate records.
  • Many missing values.
  • Different scales used in the same column
    • e.g. kB, MB, GB
  • Lots of survey responses are “other” or given as free-text comments.
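
Several of these warning signs can be counted before you invest time in a dataset. The sketch below uses only the Python standard library and assumes each record is a dictionary of strings. The names `profile`, `units_used` and `MISSING`, and the sample rows, are hypothetical.

```python
from collections import Counter

# Strings treated as "missing". This set is an assumption; check the dataset's
# documentation for its own missing-value codes.
MISSING = {"", "NA", "N/A", "null"}


def profile(rows):
    """Count duplicate records and missing values per column in a list of dicts."""
    record_counts = Counter(tuple(sorted(row.items())) for row in rows)
    duplicates = sum(n - 1 for n in record_counts.values() if n > 1)
    missing = Counter()
    for row in rows:
        for column, value in row.items():
            if value is None or value.strip() in MISSING:
                missing[column] += 1
    return {"rows": len(rows), "duplicates": duplicates, "missing": dict(missing)}


def units_used(values):
    """Return the set of unit suffixes found in values such as '512 kB'."""
    units = set()
    for value in values:
        parts = value.split()
        if len(parts) == 2:
            units.add(parts[1])
    return units


# Hypothetical records, invented for this example.
rows = [
    {"id": "1", "size": "512 kB", "collected": "2020-01-01"},
    {"id": "2", "size": "3 MB", "collected": ""},
    {"id": "2", "size": "3 MB", "collected": ""},
    {"id": "3", "size": "1 GB", "collected": "1 Jan 2020"},
]
print(profile(rows))
print(units_used(row["size"] for row in rows))
```

More than one unit in the result of `units_used` means the column has to be converted to a common scale before any comparison.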

Photo by Marcel Strauß on Unsplash

34 of 38

Is the documentation good enough?

You should be able to find the following information about the data from the dataset metadata or an associated research publication:

  • why the data were collected/generated
  • who collected/generated the data
  • how and when the data were collected
  • how the data were processed
  • any quality assurance procedures that were used.

This information will also help you decide whether the data are suitable for your needs.

35 of 38

36 of 38

Exercise 4: Data Suitability & Quality – Final Exercise!

Task A (half the groups)

  • From the DANS dataset, download and open datadiscovery_questionnaire.pdf
  • We are interested in extracting the rows for researchers who specialize in Social Science (see D1, Part 4: Demographics, “In which subject discipline do you specialize?”)
    • Is there a subject choice for “Social Science”?
    • Are there any other subject choices that should be counted as “Social Science”?
    • Are there any subject choices that are not easy to include/exclude as Social Science?

37 of 38

Exercise 4: Data Suitability & Quality – Final Exercise!

Task B (half the groups)

  • From the DANS dataset, download and open the datafile datadiscovery_researchers.csv
  • Scroll to column disc_other (column 148, or “ET” in Excel)
    • Roughly what fraction of respondents used the “Other, Please specify” field?
    • Do many of these appear to be engaged in Social Science?
    • Would you feel you need to reclassify the “Other” responses before extracting the rows for researchers who specialize in Social Science?
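
If you prefer a script to scrolling in Excel, the first question can be estimated as sketched below. Only the column name `disc_other` comes from the exercise. The function name `other_fraction` and the sample rows are hypothetical, and the sketch assumes that an empty cell means the field was not used.

```python
import csv
import io


def other_fraction(rows, column="disc_other"):
    """Return (fraction of rows with a non-empty value in column, list of those values)."""
    total = 0
    answers = []
    for row in rows:
        total += 1
        text = (row.get(column) or "").strip()
        if text:
            answers.append(text)
    fraction = len(answers) / total if total else 0.0
    return fraction, answers


# Hypothetical stand-in for the real file; the values are invented.
sample = io.StringIO(
    "id,disc_other\n"
    "1,\n"
    "2,Sociology of education\n"
    "3,\n"
    "4,Marine biology\n"
)
fraction, answers = other_fraction(csv.DictReader(sample))
print(f"{fraction:.0%} of respondents filled in the field")
for text in answers:
    print("-", text)
```

To run it on the downloaded file, pass `csv.DictReader(open_file)` for `datadiscovery_researchers.csv` instead of the sample. Looking the column up by name avoids depending on its position in the file. Printing the free-text answers supports the other two questions, which still need human judgement.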

38 of 38

Summary! Now you know about…

  • What research data “reuse” means.
  • Strategies for finding research data.
  • How to apply for access to research data.
  • A checklist you can use to assess if you trust the data.
  • Some suggestions on how to assess if the data are suitable for your use.
  • Some suggestions on how to check data quality.
  • GOOD LUCK!!
