1 of 56

JRN data management

Data management orientation for new students, some nagging…

Greg Maurer (Lead IM)

2 of 56

Orientation to JRN data management

  1. LTER and you the researcher
  2. Data collection best practices
  3. Metadata – what? why? how?
  4. Advanced topics in data analysis & management
  5. Publishing a dataset

This will go a bit fast – you can always contact me at jornada.data@nmsu.edu. Also see the Jornada IM website at https://jornada-im.github.io

3 of 56

Modern expectations in scientific research

  1. Quality data and metadata
  2. Open science & reproducible research
    • Accessible data, methods, and publications
    • Can other researchers confirm the results?
  3. Community efforts & collaboration among research networks
  4. Data reuse

But be aware – these ideals are NOT fully realized

5 of 56

Agenda

  1. LTER and you the researcher
  2. Data collection best practices
  3. Metadata – what? why? how?
  4. Advanced topics in data analysis & management
  5. Publishing a dataset

6 of 56

Long Term Ecological Research (and you)

  • Grew out of the International Biological Program (1968 to 1974)
  • First sites funded by NSF in ~1980 (Jornada Basin LTER in 1982)
  • Early effort to understand Earth and its living systems.
  • Much of LTER’s value lies in the “long term” data.
    • At the leading edge of “open science” for many years
    • Data is open to current & future scientists (like you), managers, policy makers, the public, etc.

7 of 56

LTER: one of many data creation networks

  • Data collection in ecology/earth systems has grown exponentially.
  • Modern research draws on many networks.
  • Open research and data interoperability are key.

8 of 56

Want to make the most of LTER data you collect?

  • It is a grand contribution to science! Publish a paper on it!
  • When you do… share the data and metadata.
  • Open research and data sharing:
    • Helps Science!!!
    • FUN!!!
    • Required by NSF and other funding agencies
    • Required by ESA Journals
    • Required by AGU Journals

Don’t be Sméagol

9 of 56

Specific requirements for Jornada researchers

Generally, we follow the LTER Network Data Access policy to...

  1. Submit data and metadata to Jornada IMs yearly
  2. Publish data and metadata at the time research results are peer-reviewed and published, or no later than 2 years after collection.

Additionally we ask that you:

  1. Keep project information up to date in our database (we’ll remind you).
  2. Acknowledge Jornada support and data
    1. Acknowledge Jornada support in papers, dissertations, and other publications.
    2. Cite the Jornada data used in your research.

10 of 56

Who can help you get research done?

John Anderson

JRN Research Site Manager

  • Approves new LTER projects
  • May help with experimental design
  • The Jornada “know-it-all”

Greg Maurer

JRN Information Manager

  • Data curation/archiving
  • Publishing LTER data
  • Data analysis/janitorial, good coding practices police

And, of course, any PI you work with at the Jornada (or another) LTER site.

11 of 56

There is a large Jornada IM team…

  • USDA: John Ragosta and Darren James
  • Programming & qa/qc: Geovany Ramirez
  • Deputy IMs: Shelly Valdovinos, Madeleine Soss
  • EDI Data Science Summer Fellow: Brianda Hernandez-Rosales

Contact us! jornada.data@nmsu.edu

12 of 56

Agenda

  1. LTER and you the researcher
  2. Data collection best practices
  3. Metadata – what? why? how?
  4. Advanced topics in data analysis & management
  5. Publishing a dataset

13 of 56

How should you collect your data?

Think about:

  1. Data type
  2. Data structure
  3. File format
  4. How will you use it?

14 of 56

Scientific data types

  • Tabular
    • Variables and observations are arranged in rows and columns.
    • Most data can be represented this way.
  • Special cases
    • Imagery (Photos/remote sensing/UAV/Phenocams)
    • High frequency time series
    • Geospatial data
    • Genomic (or other “omics”) data
    • These may have unique structures and/or file formats.

15 of 56

Data structure

  • Depends on data type
    • (some data types have only 1 common structure)
  • For tabular data:
    • What variables should be collected?
    • What do rows and columns represent?

Best Practices:

  • Use the simplest possible data structure.
  • Organize data by observational unit during collection.
  • Maximize the information for each observation.

16 of 56

For tables, try “Tidy” data

“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data”

Wickham, H. 2014. Tidy data. Journal of Statistical Software 59(10). https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf

Wide form (or “untidy”) data.

Observations and variables in both rows and columns

Long form (“tidy”) data.

Columns represent variables

Rows represent observations
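To make the wide-versus-long distinction concrete, here is a minimal sketch using only the Python standard library. The table, the column names (`plot`, `cover_2019`, `cover_2020`), and the values are made up for illustration; the point is only that a variable hidden in the column headers (the year) becomes its own column.

```python
import csv
import io

# Hypothetical wide-form table: one row per plot, one column per year.
WIDE_CSV = """plot,cover_2019,cover_2020
A1,12.5,14.0
A2,8.0,NA
"""

def wide_to_long(wide_text):
    """Return one dict per (plot, year) observation from the wide table."""
    long_rows = []
    for row in csv.DictReader(io.StringIO(wide_text)):
        for column, value in row.items():
            if column == "plot":
                continue
            # Headers like "cover_2019" carry a variable (the year) in their name.
            year = int(column.split("_")[1])
            long_rows.append({"plot": row["plot"], "year": year, "cover_pct": value})
    return long_rows

long_rows = wide_to_long(WIDE_CSV)
for r in long_rows:
    print(r)
```

In the long form, every row is one observation and every column is one variable, so adding a new year adds rows rather than columns.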

17 of 56

Some data “no-no’s”

  • Don’t aggregate multiple variables into one - be granular.
  • Make variable names descriptive.
  • Don’t forget a missing value indicator (and description).
    • Avoid zero, or blank space

Not good

Better
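A short sketch of why an explicit missing value indicator matters. The data and the choice of `NA` as the code are assumptions for illustration; whatever code you choose should be described in your metadata.

```python
import csv
import io

MISSING = "NA"  # assumed missing-value code; describe it in the metadata

# Hypothetical data: a 0 is a real measurement, "NA" means not measured.
DATA = """site,date,shrub_count
P1,2019-07-10,0
P2,2019-07-10,NA
P3,2019-07-10,7
"""

def read_counts(text, missing=MISSING):
    """Parse shrub_count as int, mapping the missing-value code to None."""
    counts = {}
    for row in csv.DictReader(io.StringIO(text)):
        raw = row["shrub_count"]
        counts[row["site"]] = None if raw == missing else int(raw)
    return counts

counts = read_counts(DATA)
# Only measured values enter the mean; a zero counts, a missing value does not.
measured = [c for c in counts.values() if c is not None]
mean_count = sum(measured) / len(measured)
print(counts, mean_count)
```

If the missing value had been recorded as zero, the mean would silently include a measurement that was never made.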

18 of 56

Follow standards for dates and categorical vars

  • Use ISO date format: YYYY-MM-DD
    • Is 7/10/2019 July 10th or October 7th? It depends!
  • Be careful about categorical variables
    • C & T categories could mean control & treatment, or Creosote & Tarbush
  • Use species codes linked to a taxonomic authority!
    • USDA Plants Database (https://plants.usda.gov/)
      • BOER4 = Bouteloua eriopoda (Torr.) Torr.
    • ITIS taxon numbers are good as a reference, but not user friendly
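The date ambiguity above is easy to demonstrate. This sketch uses only Python's standard `datetime` module and parses the same string under two different conventions.

```python
from datetime import date, datetime

ambiguous = "7/10/2019"

# The same string gives two different dates depending on the assumed convention.
as_month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()
as_day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()

print(as_month_first.isoformat())  # 2019-07-10
print(as_day_first.isoformat())    # 2019-10-07

# An ISO 8601 string has only one reading and round-trips cleanly.
iso_text = "2019-07-10"
parsed = date.fromisoformat(iso_text)
print(parsed.isoformat() == iso_text)
```

A side benefit of YYYY-MM-DD is that sorting the text values also sorts them chronologically.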

19 of 56

File formats

  • Some are specific to data type.
  • Some are more accessible and versatile than others.
  • When in doubt, try a delimited text file:
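As one possible illustration, here is how a small comma-delimited table could be written and read back with Python's standard `csv` module. The column names and values are hypothetical, and the sketch writes to an in-memory buffer rather than a real file.

```python
import csv
import io

# Hypothetical observations; column names and values are made up for illustration.
rows = [
    {"date": "2019-07-10", "plot": "A1", "species_code": "BOER4", "cover_pct": "12.5"},
    {"date": "2019-07-10", "plot": "A2", "species_code": "BOER4", "cover_pct": "NA"},
]

# Swap the buffer for open("cover.csv", "w", newline="") to write an actual file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["date", "plot", "species_code", "cover_pct"])
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)

# Any tool that reads delimited text can recover the same table.
read_back = list(csv.DictReader(io.StringIO(csv_text)))
```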

20 of 56

File formats should encourage data re-use

  • Avoid:
    • tables within tables
    • multiple, linked sheets
  • Meet the complicated Excel spreadsheet (enemy of IMs):

21 of 56

Simple data structures and file formats….

  • work with a variety of tools
  • make data analysis and interpretation easier.
  • are easier to archive and publish.
  • are easier to describe with metadata.

22 of 56

Finally – protect the data you collect!

Backup all raw data immediately!!!

  • Jornada IMs can help if you submit data

Document data edits after collection (QA/QC).

Collect appropriate metadata right away!
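One way to make the backup step routine is to script it and check that the copy matches the original. The sketch below is an assumption about how you might do this, not a Jornada procedure; the file name and directory layout are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def backup_raw_file(source, backup_dir):
    """Copy source into backup_dir and confirm the copy is byte-identical."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target

# Demonstration in a temporary directory with a made-up raw file.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw_2019-07-10.csv"
    raw.write_text("plot,cover_pct\nA1,12.5\n")
    copy = backup_raw_file(raw, Path(tmp) / "backup")
    backup_ok = copy.exists() and sha256_of(copy) == sha256_of(raw)
print(backup_ok)
```

A backup on the same disk as the original protects against accidental edits but not against hardware failure, so a second location is still needed.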

23 of 56

Agenda

  1. LTER and you the researcher
  2. Data collection best practices
  3. Metadata – what? why? how?
  4. Advanced topics in data analysis & management
  5. Publishing a dataset

24 of 56

Challenge: Understanding data after it is collected

Without Metadata, the usable information content of data declines over time.

Michener et al. 1997. Ecological Applications

25 of 56

What is metadata?

  • Data about the data
    • What, where, when, who, how, why?
    • Allows re-use of the data
    • Ideally you collect metadata with the data
  • Describes and accompanies your data file(s)
  • Machine readable is a plus.
  • Descriptive content in a standardized format

[Figure: an EML metadata file paired with a data file]

26 of 56

What is being measured/observed?

  • Attributes of all variables in the data file.
    • Name, units, missing value
    • For categorical data: what are the possible values and their meanings?
  • What do the observations represent?
    • Field observations, lab results, sensor output…
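Attribute metadata like this can double as a check on the data file. The sketch below assumes a hypothetical table and a hand-written attribute description (plain Python, not EML), then flags columns with no description and category codes with no definition.

```python
import csv
import io

# Hypothetical attribute metadata for a small table.
ATTRIBUTES = {
    "date": {"definition": "Date of observation", "unit": "YYYY-MM-DD"},
    "treatment": {
        "definition": "Experimental treatment",
        "codes": {"C": "control", "T": "treatment"},
    },
    "cover_pct": {"definition": "Canopy cover", "unit": "percent", "missing": "NA"},
}

DATA = """date,treatment,cover_pct
2019-07-10,C,12.5
2019-07-10,X,NA
"""

def check_against_metadata(text, attributes):
    """Return a list of problems: undescribed columns or undefined category codes."""
    problems = []
    reader = csv.DictReader(io.StringIO(text))
    for name in reader.fieldnames:
        if name not in attributes:
            problems.append(f"column '{name}' has no metadata")
    # Data rows start on line 2 of the file, after the header.
    for line_no, row in enumerate(reader, start=2):
        for name, value in row.items():
            codes = attributes.get(name, {}).get("codes")
            if codes is not None and value not in codes:
                problems.append(f"line {line_no}: '{value}' is not a defined {name} code")
    return problems

problems = check_against_metadata(DATA, ATTRIBUTES)
print(problems)
```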

 

27 of 56

Where and when were the data observed?

  • Geographic coordinates or bounds of each observational unit.
    • Include coordinate system and datum, preferably.
    • May best be kept in a GIS file (.kmz, .shp, but be sure you can tie to observations)
  • Temporal range (start & end dates) and frequency.
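The temporal range and geographic bounds can often be derived directly from the observations. A minimal sketch, assuming made-up coordinates in decimal degrees (WGS84 assumed) and ISO-formatted dates:

```python
from datetime import date

# Hypothetical observations: (ISO date, latitude, longitude), decimal degrees, WGS84 assumed.
observations = [
    ("2019-07-10", 32.61, -106.74),
    ("2019-03-02", 32.58, -106.80),
    ("2020-10-15", 32.65, -106.71),
]

dates = [date.fromisoformat(d) for d, _, _ in observations]
lats = [lat for _, lat, _ in observations]
lons = [lon for _, _, lon in observations]

# Temporal range plus a bounding box for the whole dataset.
coverage = {
    "begin_date": min(dates).isoformat(),
    "end_date": max(dates).isoformat(),
    "south": min(lats),
    "north": max(lats),
    "west": min(lons),
    "east": max(lons),
}
print(coverage)
```

The coordinate system and datum still have to be stated by a person; they cannot be recovered from the numbers alone.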

 

28 of 56

How were the data collected? By & for whom?

  • Description of the method of data collection or creation
    • Instruments, field or lab methods, data QA/QC, derived data method.
  • Who collected the data? (ORCID)
  • Who has the rights to use the data?
  • Who manages the data?

 

CC BY license

29 of 56

Other metadata?

  • Abstract: Why was the data collected?
  • Related publications and datasets
    • Papers using/citing your data
    • Papers describing a critical method
    • Data needed to derive your values (ancillary, synthesis & reanalysis)
    • Model training data
  • Anything else? Think decades down the road...

30 of 56

The metadata format is flexible, just…

[Figure: metadata (EML, or XML, or text file, etc.) alongside a data file]

Keep your metadata and data together!
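One simple way to keep the two together is to write a metadata file next to every data file, with a matching name. This is a sketch under assumptions: the JSON format, the `_metadata.json` naming pattern, and the example fields are illustrative choices, not a Jornada or EML convention.

```python
import json
import tempfile
from pathlib import Path

def write_with_sidecar(data_path, data_text, metadata):
    """Write a data file and a matching metadata file in the same directory."""
    data_path = Path(data_path)
    data_path.write_text(data_text)
    # e.g. cover.csv -> cover_metadata.json (hypothetical naming pattern)
    sidecar = data_path.with_name(data_path.stem + "_metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2))
    return sidecar

# Demonstration in a temporary directory with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    meta = {"title": "Example cover data", "missing_value": "NA"}
    sidecar = write_with_sidecar(Path(tmp) / "cover.csv", "plot,cover_pct\nA1,12.5\n", meta)
    sidecar_name = sidecar.name
    loaded = json.loads(sidecar.read_text())
print(sidecar_name, loaded)
```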

31 of 56

Agenda

  1. LTER and you the researcher
  2. Data collection best practices
  3. Metadata – what? why? how?
  4. Advanced topics in data analysis & management
  5. Publishing a dataset

32 of 56

Do you need advanced data management?

  • Your yellow notebook is full
  • Complex data analysis workflows
  • You use “big data”
  • Complex models (empirical, process-based, ML, etc.)
  • You write lots of code or use pivot tables to analyze your data

Sometimes research and data management get complicated!

33 of 56

Complex analysis workflows (an example)

  • How do you document and keep track of this?
  • What code and intermediate data should you archive/publish?

[Workflow diagram for my_project, with these elements: JRN long-term data, your data files, ancillary data from NASA, data clean script, gapfill script, new data file, model training data, ML model, data viz scripts, metadata, and publication]

34 of 56

Complex analysis workflows

  • How do you document and keep track of this?
  • What code and intermediate data should you archive/publish?

Rules of thumb:

  • Preserve & publish the rawest data another scientist might use.
  • Publish data for new research results.
  • Include metadata with all data you publish.
  • Consider publishing analysis code & documentation.
  • Intermediate products can be described by code or metadata (usually).

35 of 56

Ultimate goal - reproducible research

Ask yourself these questions:

  • Could another researcher independently verify your results?
  • Could you?

Then....

  • Write documentation
  • Comment, organize, and publish analysis and viz code

Happy side effect: this makes life easier for you too.

36 of 56

Keeping good documentation

  1. Starts in the field or lab - keep detailed notes as you collect data.
  2. Make an electronic copy of handwritten notes.
  3. Keep a “data analysis log” with your project.
  4. Document like a software engineer.
    • /docs directory with each project
    • document each function or script (inputs & outputs)

37 of 56

Write good code!

  1. Use functions, loops, and other language conventions (organize & reuse code)
  2. Document inputs/outputs
  3. Comment your code
  4. Consider “literate computing” approaches.
    • R markdown files, Jupyter notebooks, MATLAB notebooks
  5. Learn about coding!
    • https://software-carpentry.org

38 of 56

A Jupyter notebook (.ipynb file): a literate computing system

Renders in your browser (even if hosted on Github)

Executable R code (or Python, Julia, etc.)

Figure outputs are displayed inline with code and formatted text.

Statistical and other tabular outputs

Formatted text

39 of 56

Tracking how your project changes

  1. Use a version control system for code.
    • Git, Mercurial, etc.
  2. Version the documentation with the code.
  3. Integrate with collaboration & project management systems.
    • GitHub, Bitbucket, etc.

40 of 56

Publishable parts of complex workflows

  1. Critical code (data munging, statistics, viz) that led to a research result.
  2. Model parameter files, model state, AND model outputs?
  3. Links to, or archived versions of, ancillary data files
    • Links should be a DOI
    • This could occur in your metadata

Where and how should you publish them? ASK!

jornada.data@nmsu.edu

41 of 56

Agenda

  1. LTER and you the researcher
  2. Data collection best practices
  3. Metadata – what? why? how?
  4. Advanced topics in data analysis & management
  5. Publishing a dataset

42 of 56

The data publishing process

  1. Researcher collects quality data and metadata
    • Don’t forget metadata!
  2. Submit dataset materials to IM yearly
    • Metadata templates
    • ezEML
    • Data will be securely archived
  3. Review, QA/QC, & editing process via templates, email, and Zoom meetings
  4. Publish to open data repository & cite the dataset
    • EDI for generalist data, or a specialized repo
    • portal.edirepository.org

43 of 56

What should you publish?

[Figure: combinations of metadata (EML, XML, text, etc.), data files, and code (script.R, ipynb) that might be published together]

If you are unsure - ask!

jornada.data@nmsu.edu

44 of 56

IM Team has resources and time to help

  • See our For Researchers web page to find...
    • Metadata templates and contact info
    • Guidelines and requirements on submission and publishing of data
  • See the Jornada IM pages to find...
    • Metadata standards and suggestions
    • Full documentation on the Jornada IM system (in progress)
  • The IM team will help you...
    • QC your data file
    • Create a publishable metadata file (EML)
    • Publish it all to EDI (or other repos as appropriate)

45 of 56

Summary

  1. Most environmental science research data is considered a public resource
  2. Jornada LTER is committed to quality data and data re-use – you should be too!
  3. Collecting quality data AND metadata is an essential activity for researchers.
  4. Documentation and versioning will help you down the road
  5. The JRN IM team can help you package, publish, and cite your data (jornada.data@nmsu.edu & https://jornada-im.github.io)

46 of 56

Further reading 1

  • See the https://jornada-im.github.io site for Jornada-specific resources

47 of 56

Further reading 2

  • FAIR principles for scientific data: https://doi.org/10.1038/sdata.2016.18

48 of 56

Extra slides follow

49 of 56

Our data management priorities

  1. Contribute to high quality, impactful ecology research at all stages of the data life cycle
  2. Oversee, or directly handle, publication of all Jornada datasets
  3. Act as a source of data-related leadership and education
    • Committee & working group participation, open-source software development, outreach activity, and teaching.

Openness towards diverse stakeholders including scientists, managers, students, & the public.

50 of 56

Finding IM information

Information management

Dr. Greg Maurer

Lead IM

New Mexico State University

One click from front-page

  • Research approval
  • Data requirements
  • Forms & Metadata templates

One more click to

  • Detailed instructions
  • Standards & software tools
  • Full IM documentation

(some of these are works-in-progress)

51 of 56

What we ask of JRN researchers

  • Submit data and metadata (Yearly)
  • Update us when your project changes
  • Publish your data
    • at publication or w/in 2 years of collection
  • Cite Jornada funding and data
  • WE ARE PAID TO HELP YOU WITH ALL OF THIS!

52 of 56

Data discovery and access

Also:

  • Interactive data viewer
  • Spatial data catalog

53 of 56

Reminders: Steps to publish a JRN dataset

  • Researcher collects quality data and metadata
    • Don’t forget metadata!
  • Review, QA/QC, & editing process via templates, email, and Zoom meetings
  • Publish to open data repository & cite the dataset
    • EDI for generalist data, or a specialized repo
    • portal.edirepository.org

54 of 56

Reminders: Regular IM events

  • R workgroup - Darren James ~ once per month
    • Topics vary, but center on data analysis with R
  • Data Therapy Thursdays – Greg Maurer
    • Open format, informal meetings on data access, management, publishing, analysis…
  • Workshops… Spring and Fall, at least
    • Scientific programming
    • Data management
    • Data analysis, statistics, visualization

55 of 56

Upcoming areas of improvement

  • Highlight the value of JRN’s “core” long-term datasets
    • What are our “core” data?
    • Better data discovery, access, & interpretation of data on the website
  • Get researchers (more) involved in publishing & citing data
    • Metadata templates and apps, more training…
  • More students (undergrad & grad) trained and participating in data management & data science activities at JRN
    • Any ideas? Want to help out?

Some of these are in response to the midterm review

56 of 56

Sharing and citing data helps keep the Jornada funded

ask questions: jornada.data@nmsu.edu