JRN data management
Data management orientation for new students, some nagging…
Greg Maurer (Lead IM)
Orientation to JRN data management
This will go a bit fast – you can always contact me at jornada.data@nmsu.edu. Also see the Jornada IM website at https://jornada-im.github.io
Modern expectations in scientific research
But be aware – these ideals are NOT fully realized
Agenda
Long Term Ecological Research (and you)
LTER: one of many data creation networks
Want to make the most of LTER data you collect?
Don’t be Sméagol
Specific requirements for Jornada researchers
Generally, we follow the LTER Network Data Access policy to...
Additionally we ask that you:
Who can help you get research done?
John Anderson
JRN Research Site Manager
Greg Maurer
JRN Information Manager
And, of course, any PI you work with at the Jornada (or another) LTER site.
There is a large Jornada IM team…
Contact us! jornada.data@nmsu.edu
How should you collect your data?
Think about:
Scientific data types
Data structure
Best Practices:
For tables, try “Tidy” data
“It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data”
Wickham, H. 2014. Tidy data. Journal of Statistical Software 59(10). https://www.jstatsoft.org/article/view/v059i10/v59i10.pdf
Wide form (or “untidy”) data.
Observations and variables in both rows and columns
Long form (“tidy”) data.
Columns represent variables
Rows represent observations
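The difference between the two layouts can be sketched in a few lines of Python. This is only an illustration using the standard library: the table contents, the column names (plot, date, species_a, species_b), and the helper wide_to_long are all hypothetical and not taken from any Jornada dataset.

```python
# Hypothetical wide-form table: one row per plot, one column per species count.
wide_rows = [
    {"plot": "P1", "date": "2021-05-01", "species_a": 12, "species_b": 3},
    {"plot": "P2", "date": "2021-05-01", "species_a": 7, "species_b": 0},
]


def wide_to_long(rows, id_columns, variable_name, value_name):
    """Turn each non-identifier column into its own row (one observation per row)."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column in id_columns:
                continue
            # Copy the identifying columns, then record which variable was measured.
            long_row = {key: row[key] for key in id_columns}
            long_row[variable_name] = column
            long_row[value_name] = value
            long_rows.append(long_row)
    return long_rows


long_rows = wide_to_long(wide_rows, ["plot", "date"], "species", "count")
for long_row in long_rows:
    print(long_row)
```

In the long form every row is a single observation (one species count in one plot on one date) and every column is a single variable, which is usually easier to filter, group, and join. Data-frame libraries provide equivalent reshaping functions.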
Some data “no-no’s”
Not good
Better
Follow standards for dates and categorical vars
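One way to check these standards before data leave your hands is a small validation pass. The sketch below assumes ISO 8601 (YYYY-MM-DD) as the date standard and a fixed list of allowed category labels; the treatment names and the check_record helper are hypothetical examples, not a Jornada requirement.

```python
from datetime import date

# Hypothetical controlled vocabulary for a categorical column.
ALLOWED_TREATMENTS = {"control", "drought", "irrigated"}


def check_record(record):
    """Return a list of problems found in one record (empty list means OK)."""
    problems = []
    try:
        # date.fromisoformat accepts ISO 8601 calendar dates such as 2021-05-01.
        date.fromisoformat(record["date"])
    except ValueError:
        problems.append(f"date not ISO 8601 (YYYY-MM-DD): {record['date']!r}")
    if record["treatment"] not in ALLOWED_TREATMENTS:
        problems.append(f"unknown treatment: {record['treatment']!r}")
    return problems


good = {"date": "2021-05-01", "treatment": "control"}
bad = {"date": "5/1/21", "treatment": "Control "}
print(check_record(good))
print(check_record(bad))
```

The second record fails twice: its date is ambiguous (May 1 or January 5?), and its category differs from the allowed label by capitalization and a trailing space, which software would treat as a separate category.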
File formats
File formats should encourage data re-use
Simple data structures and file formats…
Finally – protect the data you collect!
Backup all raw data immediately!!!
Document data edits after collection (QA/QC).
Collect appropriate metadata right away!
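A backup is only useful if the copy matches the original. As a minimal sketch, assuming a simple file copy is your backup method, the code below copies a raw file and compares SHA-256 checksums. The file names and the backup_and_verify helper are hypothetical, and the demonstration runs in a throwaway temporary directory.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(raw_file, backup_dir):
    """Copy raw_file into backup_dir and confirm the copy is byte-identical."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copy_path = backup_dir / Path(raw_file).name
    shutil.copy2(raw_file, copy_path)
    if sha256_of(raw_file) != sha256_of(copy_path):
        raise IOError(f"backup of {raw_file} does not match the original")
    return copy_path


# Demonstration with a throwaway file; the names here are hypothetical.
with tempfile.TemporaryDirectory() as workspace:
    raw = Path(workspace) / "raw_field_data.csv"
    raw.write_text("plot,date,count\nP1,2021-05-01,12\n")
    copy_path = backup_and_verify(raw, Path(workspace) / "backup")
    backup_matches = sha256_of(raw) == sha256_of(copy_path)
    print(copy_path.name, backup_matches)
```

Recording the checksum of each raw file at collection time also gives you a way to show later that the raw data were never edited, with all QA/QC changes made to a separate copy.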
Challenge: Understanding data after it is collected
Without Metadata, the usable information content of data declines over time.
Michener et al. 1997. Ecological Applications
What is metadata?
[Diagram: EML metadata paired with a data file of unlabeled numbers]
What is being measured/observed?
Where and when were the data observed?
How were the data collected? By and for whom?
CC BY license
Other metadata?
The metadata format is flexible, just…
[Diagram: metadata (EML, or XML, or text file, etc.) paired with a data file]
Keep your metadata and data together!
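A low-effort way to keep the two together is to write a metadata file next to every data file, under a matching name. The sketch below uses a plain JSON text file as the metadata format. It is not EML, and every field name and value shown is a hypothetical placeholder covering the what, where, when, how, and who questions.

```python
import csv
import json
import tempfile
from pathlib import Path


def write_with_metadata(data_path, rows, metadata):
    """Write rows to a CSV and a metadata file with a matching name beside it."""
    data_path = Path(data_path)
    with open(data_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    # e.g. plant_counts.csv -> plant_counts_metadata.json in the same folder
    metadata_path = data_path.with_name(data_path.stem + "_metadata.json")
    metadata_path.write_text(json.dumps(metadata, indent=2))
    return metadata_path


# Hypothetical placeholder content, for illustration only.
rows = [{"plot": "P1", "date": "2021-05-01", "count": 12}]
metadata = {
    "what": "placeholder: plant counts per plot",
    "where": "placeholder: site name and coordinates",
    "when": "placeholder: 2021-05-01",
    "how": "placeholder: field method used",
    "who": "placeholder: collector and project",
    "columns": {"plot": "plot identifier", "date": "ISO 8601 date", "count": "individuals counted"},
}

with tempfile.TemporaryDirectory() as workspace:
    data_path = Path(workspace) / "plant_counts.csv"
    metadata_path = write_with_metadata(data_path, rows, metadata)
    reloaded = json.loads(metadata_path.read_text())
    print(sorted(p.name for p in Path(workspace).iterdir()))
```

Because the two files share a base name and a folder, anyone who finds the data also finds the description needed to interpret it.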
Do you need advanced data management?
Sometimes research and data management get complicated!
Complex analysis workflows (an example)
[Diagram of my_project: your data files, ancillary data from NASA, JRN long-term data, data clean script, gapfill script, new data file, model training data, ML model, data viz scripts, run(), Make(), metadata (abstract, methods, etc.), publication]
Complex analysis workflows
Rules of thumb:
Ultimate goal - reproducible research
Ask yourself these questions:
Then...
Happy side effect: this makes life easier for you too.
Keeping good documentation
Write good code!
A Jupyter notebook (.ipynb file): a literate computing system
Renders in your browser (even if hosted on GitHub)
Executable R code (or Python, Julia, etc.)
Figure outputs are displayed inline with code and formatted text.
Statistical and other tabular outputs
Formatted text
Tracking how your project changes
Two version control systems (VCS):
Two collaboration websites:
Publishable parts of complex workflows
Where and how should you publish them? ASK!
The data publishing process
Don’t forget metadata!
my_project
Abstract
Data files
EDI for generalist data, or a specialized repo
portal.edirepository.org
What should you publish?
[Diagram: possible combinations of metadata (EML, XML, text, etc.), data files, and code (script.R, ipynb); ???]
IM Team has resources and time to help
[Diagram: EML metadata paired with a data file]
Summary
Further reading 1
Further reading 2
Extra slides follow
Our data management priorities
Openness towards diverse stakeholders including scientists, managers, students, & the public.
Finding IM information
Information management
Dr. Greg Maurer
Lead IM
New Mexico State University
One click from front-page
One more click to
(some of these are works-in-progress)
What we ask of JRN researchers
Data discovery and access
Also:
Reminders: Steps to publish a JRN dataset
Don’t forget metadata!
my_project
Abstract
Data files
EDI for generalist data, or a specialized repo
portal.edirepository.org
Reminders: Regular IM events
Upcoming areas of improvement
Some of these are in response to the midterm review
Sharing and citing data helps keep the Jornada funded
Ask questions: jornada.data@nmsu.edu