1 of 56

JRN data management

Data management orientation for new students, some nagging…

Greg Maurer (Lead IM)

2 of 56

Orientation to JRN data management

LTER and you the researcher
Data collection best practices
Metadata – what? why? how?
Advanced topics in data analysis & management
Publishing a dataset

This will go a bit fast – you can always contact me at jornada.data@nmsu.edu. Also see the Jornada IM website at https://jornada-im.github.io

3 of 56

Modern expectations in scientific research

Quality data and metadata
Open science & reproducible research

Accessible data, methods, and publications
Can other researchers confirm the results?

Community efforts & collaboration among research networks
Data reuse

4 of 56

Modern expectations in scientific research

But be aware – these ideals are NOT fully realized

5 of 56

Agenda

LTER and you the researcher
Data collection best practices
Metadata – what? why? how?
Advanced topics in data analysis & management
Publishing a dataset

6 of 56

Long Term Ecological Research (and you)

Grew out of the International Biological Program (1968 to 1974)
First sites funded by NSF in ~1980 (Jornada Basin LTER in 1982)
Early effort to understand Earth and its living systems.
Much of LTER’s value lies in the “long-term” data.

At the leading edge of “open science” for many years
Data is open to current & future scientists (like you), managers, policy makers, the public, etc.

7 of 56

LTER: one of many data creation networks

Data collection in ecology/earth systems has grown exponentially.
Modern research draws on many networks.
Open research and data interoperability are key.

8 of 56

Want to make the most of LTER data you collect?

It is a grand contribution to science! Publish a paper on it!
When you do… share the data and metadata.
Open research and data sharing:

Helps Science!!!
FUN!!!
Required by NSF and other funding agencies
Required by ESA Journals
Required by AGU Journals

Don’t be Sméagol

9 of 56

Specific requirements for Jornada researchers

Generally, we follow the LTER Network Data Access policy to...

Submit data and metadata to Jornada IMs yearly
Publish data and metadata at the time research results are peer-reviewed and published, or no later than 2 years after collection.

Additionally we ask that you:

Keep project information up to date in our database (we’ll remind you).
Acknowledge Jornada support and data

Acknowledge Jornada support in papers, dissertations, and other publications.
Cite the Jornada data used in your research.

More details on our website….

10 of 56

Who can help you get research done?

John Anderson

JRN Research Site Manager

Approves new LTER projects
May help with experimental design
The Jornada “know-it-all”

Greg Maurer

JRN Information Manager

Data curation/archiving
Publishing LTER data
Data analysis/janitorial, good coding practices police

And, of course, any PI you work with at the Jornada (or another) LTER site.

11 of 56

There is a large Jornada IM team…

USDA: John Ragosta and Darren James
Programming & qa/qc: Geovany Ramirez
Deputy IMs: Shelly Valdovinos, Madeleine Soss
EDI Data Science Summer Fellow: Brianda Hernandez-Rosales

Contact us! jornada.data@nmsu.edu

12 of 56

Agenda

LTER and you the researcher
Data collection best practices
Metadata – what? why? how?
Advanced topics in data analysis & management
Publishing a dataset

13 of 56

How should you collect your data?

Think about:

Data type
Data structure
File format
How will you use it?

14 of 56

Scientific data types

Tabular

Variables and observations are arranged in rows and columns.
Most data can be represented this way.

Special cases

Imagery (Photos/remote sensing/UAV/Phenocams)
High frequency time series
Geospatial data
Genomic (or other “omics”) data
These may have unique structures and/or file formats.

15 of 56

Data structure

Depends on data type

(some data types have only 1 common structure)

For tabular data:

What variables should be collected?
What do rows and columns represent?

Best Practices:

Use the simplest possible data structure.
Organize data by observational unit during collection.
Maximize the information for each observation.

16 of 56

For tables, try “Tidy” data

“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data”

Wickham, H. 2014. Tidy data. Journal of Statistical Software 59(10). https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf

Wide form (or “untidy”) data.

Observations and variables in both rows and columns

Long form (“tidy”) data.

Columns represent variables

Rows represent observations
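
A minimal sketch of reshaping wide data into long ("tidy") form, using only the Python standard library. The table contents, the column names (`plot`, `year`, `cover_pct`), and the helper `wide_to_long` are hypothetical and only illustrate the idea; R's tidyr or Python's pandas offer equivalent reshaping tools.

```python
# Hypothetical wide-form table: one row per plot, one column per year.
wide_rows = [
    {"plot": "A1", "2019": 12.5, "2020": 14.0},
    {"plot": "B2", "2019": 8.1, "2020": 9.3},
]


def wide_to_long(rows, id_col, var_name, value_name):
    """Reshape wide rows so that each output row is a single observation."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_col:
                continue  # the identifier is repeated on every output row
            long_rows.append(
                {id_col: row[id_col], var_name: column, value_name: value}
            )
    return long_rows


long_rows = wide_to_long(
    wide_rows, id_col="plot", var_name="year", value_name="cover_pct"
)
for row in long_rows:
    print(row)
```

In the long form, each column holds one variable (plot, year, cover) and each row holds one observation, so adding another year means adding rows rather than columns.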

17 of 56

Some data “no-no’s”

Don’t aggregate multiple variables into one - be granular.
Make variable names descriptive.
Don’t forget a missing value indicator (and description).

Avoid zero, or blank space

[Figure: two example data tables, labeled “Not good” and “Better”]
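
A short sketch of why the missing value indicator matters, assuming a hypothetical data file in which `NA` is the documented missing-value code and `0` is a real measurement. The column names and the helper `parse_value` are made up for illustration.

```python
import csv
import io

# Hypothetical data file: "NA" marks a missing measurement, 0 is a real value.
raw_text = """plot,date,biomass_g
A1,2019-07-10,12.5
A1,2019-08-10,NA
B2,2019-07-10,0
"""

MISSING_CODE = "NA"  # assumed code; whatever you choose, describe it in metadata


def parse_value(text):
    """Return None for the missing-value code, otherwise a float."""
    return None if text == MISSING_CODE else float(text)


rows = list(csv.DictReader(io.StringIO(raw_text)))
biomass = [parse_value(row["biomass_g"]) for row in rows]

# Missing values are excluded from the mean; the true zero is kept.
observed = [value for value in biomass if value is not None]
print(biomass)
print(sum(observed) / len(observed))
```

Had the missing measurement been recorded as zero or left blank, there would be no reliable way to tell it apart from the real zero in the last row.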

18 of 56

Follow standards for dates and categorical vars

Use ISO date format: YYYY-MM-DD

Is 7/10/2019 July 10th or October 7th? It depends!

Be careful about categorical variables

C & T categories could mean control & treatment, or Creosote & Tarbush

Use species codes linked to a taxonomic authority!

USDA Plants Database (https://plants.usda.gov/)

BOER4 = Bouteloua eriopoda (Torr.) Torr.

ITIS taxon numbers are good as a reference, but not user friendly
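
A small sketch of both points above, using the Python standard library. The date string and the BOER4 code come from this slide; the lookup table `SPECIES_CODES` and the helper `unknown_codes` are hypothetical stand-ins for a real code list from a taxonomic authority.

```python
from datetime import datetime

ambiguous = "7/10/2019"

# The same text yields two different dates depending on the assumed convention.
as_month_first = datetime.strptime(ambiguous, "%m/%d/%Y").date()
as_day_first = datetime.strptime(ambiguous, "%d/%m/%Y").date()

# ISO format (YYYY-MM-DD) removes the ambiguity.
print(as_month_first.isoformat())
print(as_day_first.isoformat())

# Hypothetical lookup table; a real one would come from a taxonomic authority.
SPECIES_CODES = {"BOER4": "Bouteloua eriopoda (Torr.) Torr."}


def unknown_codes(codes):
    """Return the codes that are not defined in the lookup table."""
    return sorted(set(codes) - set(SPECIES_CODES))


print(unknown_codes(["BOER4", "C", "T"]))
```

Codes like C and T are flagged because nothing defines them; recording their meaning in the metadata resolves that.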

19 of 56

File formats

Some are specific to data type.
Some are more accessible and versatile than others.
When in doubt, try a delimited text file:
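
For example, a comma-delimited text file (CSV) can be written and read back with the Python standard library. This is only a sketch: the observations and column names are hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import csv
import io

# Hypothetical observations, one dictionary per row.
observations = [
    {"plot": "A1", "date": "2019-07-10", "species_code": "BOER4", "cover_pct": 12.5},
    {"plot": "B2", "date": "2019-07-10", "species_code": "BOER4", "cover_pct": 8.1},
]
fieldnames = ["plot", "date", "species_code", "cover_pct"]

# Write to an in-memory buffer; open("my_data.csv", "w", newline="") works the same way.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerows(observations)
csv_text = buffer.getvalue()
print(csv_text)

# Any tool that reads plain text can read it back (values return as strings).
read_back = list(csv.DictReader(io.StringIO(csv_text)))
print(read_back[0]["species_code"])
```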

20 of 56

File formats should encourage data re-use

Avoid:

tables within tables
multiple, linked sheets

Meet the complicated Excel spreadsheet (enemy of IMs):

21 of 56

Simple data structures and file formats….

work with a variety of tools
make data analysis and interpretation easier.
are easier to archive and publish.
are easier to describe with metadata.

22 of 56

Finally – protect the data you collect!

Backup all raw data immediately!!!

Jornada IMs can help if you submit data

Document data edits after collection (QA/QC).

Collect appropriate metadata right away!
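
One way to confirm that a backup copy matches the raw file is to compare checksums. The sketch below uses the Python standard library; the file name and contents are hypothetical, and a temporary directory stands in for your real data and backup locations.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    # Hypothetical raw data file, created here only so the example runs.
    raw_file = Path(workdir) / "raw_2019.csv"
    raw_file.write_text("plot,date,biomass_g\nA1,2019-07-10,12.5\n")

    backup_dir = Path(workdir) / "backup"
    backup_dir.mkdir()
    backup_file = Path(shutil.copy2(raw_file, backup_dir))

    # Identical digests mean the backup is a byte-for-byte copy.
    backup_ok = sha256_of(raw_file) == sha256_of(backup_file)

print(backup_ok)
```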

23 of 56

Agenda

LTER and you the researcher
Data collection best practices
Metadata – what? why? how?
Advanced topics in data analysis & management
Publishing a dataset

24 of 56

Challenge: Understanding data after it is collected

Without Metadata, the usable information content of data declines over time.

Michener et al. 1997. Ecological Applications

25 of 56

What is metadata?

Data about the data

What, where, when, who, how, why?
Allows re-use of the data
Ideally you collect metadata with the data

Describes and accompanies your data file(s)
Machine readable is a plus.
Descriptive content in a standardized format

[Figure: an EML metadata file alongside a data file]

26 of 56

What is being measured/observed?

Attributes of all variables in the data file.

Name, units, missing value
For categorical data: what are the possible values and their meanings?

What do the observations represent?

Field observations, lab results, sensor output…
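
A rough sketch of attribute metadata kept in machine-readable form and checked against a data file. Everything here is hypothetical (column names, units, codes, and the helper `check_against_metadata`); real Jornada metadata goes into the IM team's templates and is published as EML.

```python
import csv
import io

# Hypothetical attribute metadata: one entry per column in the data file.
attributes = {
    "plot": {"description": "Plot identifier"},
    "date": {"description": "Sampling date (YYYY-MM-DD)"},
    "treatment": {
        "description": "Experimental treatment",
        "codes": {"C": "control", "T": "treatment"},
    },
    "biomass_g": {
        "description": "Aboveground biomass",
        "unit": "gram",
        "missing_value": "NA",
    },
}

data_text = """plot,date,treatment,biomass_g
A1,2019-07-10,C,12.5
B2,2019-07-10,X,NA
"""


def check_against_metadata(rows, attributes):
    """List columns and category codes that the metadata does not describe."""
    problems = []
    for column in rows[0]:
        if column not in attributes:
            problems.append(f"undescribed column: {column}")
    for name, info in attributes.items():
        codes = info.get("codes")
        if codes is None:
            continue  # not a categorical variable
        # Data rows start on line 2 of the file (line 1 is the header).
        for line_number, row in enumerate(rows, start=2):
            if row[name] not in codes:
                problems.append(
                    f"line {line_number}: unknown {name} code {row[name]!r}"
                )
    return problems


rows = list(csv.DictReader(io.StringIO(data_text)))
problems = check_against_metadata(rows, attributes)
print(problems)
```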

27 of 56

Where and when were the data observed?

Geographic coordinates or bounds of each observational unit.

Include coordinate system and datum, preferably.
May best be kept in a GIS file (.kmz, .shp), but be sure you can tie it to your observations

Temporal range (start & end dates) and frequency.
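
If each observation carries a date and coordinates, the temporal range and geographic bounds can be summarized directly from the data. A minimal sketch with made-up coordinates and an assumed WGS84 datum:

```python
from datetime import date

# Hypothetical observations; coordinates are illustrative decimal degrees.
observations = [
    {"date": "2019-07-10", "lat": 32.62, "lon": -106.74},
    {"date": "2019-03-02", "lat": 32.58, "lon": -106.70},
    {"date": "2020-10-15", "lat": 32.60, "lon": -106.78},
]

dates = [date.fromisoformat(obs["date"]) for obs in observations]
lats = [obs["lat"] for obs in observations]
lons = [obs["lon"] for obs in observations]

coverage = {
    "begin_date": min(dates).isoformat(),
    "end_date": max(dates).isoformat(),
    "south": min(lats),
    "north": max(lats),
    "west": min(lons),
    "east": max(lons),
    "datum": "WGS84",  # assumption for this example; record the one you used
}
print(coverage)
```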

28 of 56

How were the data collected? By & for whom?

Description of the method of data collection or creation

Instruments, field or lab methods, data QA/QC, derived data method.

Who collected the data? (ORCID)
Who has the rights to use the data?
Who manages the data?

CC BY license

29 of 56

Other metadata?

Abstract: Why was the data collected?
Related publications and datasets

Papers using/citing your data
Papers describing a critical method
Data needed to derive your values (ancillary, synthesis & reanalysis)
Model training data

Anything else? Think decades down the road...

30 of 56

The metadata format is flexible, just…

[Figure: a metadata file (EML, or XML, or text file, etc.) alongside a data file]

Keep your metadata and data together!

31 of 56

Agenda

LTER and you the researcher
Data collection best practices
Metadata – what? why? how?
Advanced topics in data analysis & management
Publishing a dataset

32 of 56

Do you need advanced data management?

Your yellow notebook is full
Complex data analysis workflows
You use “big data”
Complex models (empirical, process-based, ML, etc.)
You write lots of code or use pivot tables to analyze your data

Sometimes research and data management get complicated!

33 of 56

Complex analysis workflows (an example)

How do you document and keep track of this?
What code and intermediate data should you archive/publish?

[Diagram: an example workflow in a my_project directory, with elements labeled JRN long-term data, your data files, ancillary data from NASA, data clean script, gapfill script, new data file, model training data, ML model, data viz scripts, metadata, and publication]

34 of 56

Complex analysis workflows

How do you document and keep track of this?
What code and intermediate data should you archive/publish?

Rules of thumb:

Preserve & publish the rawest data another scientist might use.
Publish data for new research results.
Include metadata with all data you publish.
Consider publishing analysis code & documentation.
Intermediate products can be described by code or metadata (usually).

35 of 56

Ultimate goal - reproducible research

Ask yourself these questions:

Could another researcher independently verify your results?
Could you?

Then....

Write documentation
Comment, organize, and publish analysis and viz code

Happy side effect: this makes life easier for you too.

36 of 56

Keeping good documentation

Starts in the field or lab - keep detailed notes as you collect data.
Make an electronic copy of handwritten notes.
Keep a “data analysis log” with your project.
Document like a software engineer.

/docs directory with each project
document each function or script (inputs & outputs)
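
A small illustration of documenting a function's inputs and outputs in the code itself. The function `fill_gaps` and its carry-forward rule are hypothetical; they are not the method behind any gap-filling script mentioned in these slides.

```python
def fill_gaps(values):
    """Fill missing values by carrying the last observation forward.

    Inputs:
        values: list of numbers in time order, with None marking a
            missing observation.

    Outputs:
        A new list of the same length in which each None is replaced by
        the most recent non-missing value. Leading None values stay None
        because there is nothing to carry forward yet.
    """
    filled = []
    last_seen = None
    for value in values:
        if value is not None:
            last_seen = value
        filled.append(last_seen)
    return filled


print(fill_gaps([None, 1.0, None, 2.0]))
```

Anyone who opens the script (including you, next year) can see what goes in and what comes out without reading the loop.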

37 of 56

Write good code!

Use functions, loops, and other language conventions (organize & reuse code)
Document inputs/outputs
Comment your code
Consider “literate computing” approaches.

R markdown files, Jupyter notebooks, MATLAB notebooks

Learn about coding!

https://software-carpentry.org

38 of 56

A Jupyter notebook (.ipynb file): a literate computing system

Renders in your browser (even if hosted on GitHub)

Executable R code (or Python, Julia, etc.)

Figure outputs are displayed inline with code and formatted text.

Statistical and other tabular outputs

Formatted text

39 of 56

Tracking how your project changes

Use a version control system for code.

Git, Mercurial, etc.

Version the documentation with the code.
Integrate with collaboration & project management systems.

GitHub, Bitbucket, etc.

40 of 56

Publishable parts of complex workflows

Critical code (data munging, statistics, viz) that led to a research result.
Model parameter files, model state, AND model outputs?
Links to, or archived versions of, ancillary data files

Links should be a DOI
This could occur in your metadata

Where and how should you publish them? ASK!

jornada.data@nmsu.edu

41 of 56

Agenda

LTER and you the researcher
Data collection best practices
Metadata – what? why? how?
Advanced topics in data analysis & management
Publishing a dataset

42 of 56

The data publishing process

Researcher collects quality data and metadata

[Figure: a my_project folder with an abstract, metadata, and data files]

Don’t forget metadata!

Submit dataset materials to IM yearly

Metadata templates, ezEML
Data will be securely archived

Review, QA/QC, & editing process via templates, email, and Zoom meetings

Publish to open data repository & cite the dataset

EDI for generalist data, or a specialized repo

portal.edirepository.org

43 of 56

What should you publish?

[Figure: possible contents of a published dataset, showing metadata (EML, XML, text, etc.), data files, and code (script.R, ipynb) in different combinations]

If you are unsure - ask!

jornada.data@nmsu.edu

44 of 56

IM Team has resources and time to help

See our For Researchers web page to find...

Metadata templates and contact info
Guidelines and requirements on submission and publishing of data

See the Jornada IM pages to find...

Metadata standards and suggestions
Full documentation on the Jornada IM system (in progress)

The IM team will help you...

QC your data file
Create a publishable metadata file (EML)
Publish it all to EDI (or other repos as appropriate)

45 of 56

Summary

Most environmental science research data is considered a public resource
Jornada LTER is committed to quality data and data re-use – you should be too!
Collecting quality data AND metadata are essential activities for researchers.
Documentation and versioning will help you down the road
The JRN IM team can help you package, publish, and cite your data (jornada.data@nmsu.edu & https://jornada-im.github.io)

46 of 56

47 of 56

48 of 56

Extra slides follow

49 of 56

Our data management priorities

Contribute to high quality, impactful ecology research at all stages of the data life cycle
Oversee, or directly handle, publication of all Jornada datasets

Ensure data meet FAIR data principles & the LTER Data Access Policy

Act as a source of data-related leadership and education

Committee & working group participation, open-source software development, outreach activity, and teaching.

Openness towards diverse stakeholders including scientists, managers, students, & the public.

50 of 56

Finding IM information

Information management

Dr. Greg Maurer

Lead IM

New Mexico State University

One click from front-page

Research approval
Data requirements
Forms & Metadata templates

One more click to

Detailed instructions
Standards & software tools
Full IM documentation

(some of these are works-in-progress)

https://jornada-im.github.io

https://lter.jornada.nmsu.edu

51 of 56

What we ask of JRN researchers

Submit data and metadata (Yearly)
Update us when your project changes
Publish your data

at publication or w/in 2 years of collection

Cite Jornada funding and data
WE ARE PAID TO HELP YOU WITH ALL OF THIS!

https://lter.jornada.nmsu.edu/for-researchers

52 of 56

Data discovery and access

Also:

Interactive data viewer
Spatial data catalog

https://lter.jornada.nmsu.edu/data-catalog/

53 of 56

Reminders: Steps to publish a JRN dataset

Researcher collects quality data and metadata

[Figure: a my_project folder with an abstract, metadata, and data files]

Don’t forget metadata!

Submit dataset materials to IM yearly

Metadata templates
Data will be securely archived

Review, QA/QC, & editing process via templates, email, and Zoom meetings

Publish to open data repository & cite the dataset

EDI for generalist data, or a specialized repo

portal.edirepository.org

54 of 56

Reminders: Regular IM events

R workgroup - Darren James ~ once per month

Topics vary, but center on data analysis with R

Data Therapy Thursdays – Greg Maurer

Open format, informal meetings on data access, management, publishing, analysis…

Workshops… Spring and Fall, at least

Scientific programming
Data management
Data analysis, statistics, visualization

55 of 56

Upcoming areas of improvement

Highlight the value of JRN’s “core” long-term datasets

What are our “core” data?
Better data discovery, access, & interpretation of data on the website

Get researchers (more) involved in publishing & citing data

Metadata templates and apps, more training…

More students (undergrad & grad) trained and participating in data management & data science activities at JRN

Any ideas? Want to help out?

Some of these are in response to the midterm review

56 of 56

Sharing and citing data helps keep the Jornada funded

ask questions: jornada.data@nmsu.edu