Tabular Data In Spreadsheets: OpenRefine & Python
Presentation template by SlidesCarnival
Meryl Brodsky & Michael Shensky, UT-Austin Libraries
Data & Donuts, September 13, 2024
Slides: https://tinyurl.com/49pnvkmt
Data & Donuts Overview
2
Tabular Data in Spreadsheets: OpenRefine and Python
Research Data Management Best Practices
Managing Research Code with Git and GitHub
Intro to R for Data Management
September 13
September 27
October 11
October 25
UTL Funding Opportunities
3
Info Session in the PCL Scholars Lab and on Zoom on Tuesday 9/17 from 1pm to 2pm
UTL Funding Opportunities
4
This paid two-semester long program is aimed at UT graduate students. We will select up to five (5) total Fellows who will complete a project that involves data, digital collections, digital media, or digital methods/platforms.
Fellows will receive:
� Applicants should read about all expectations and view the frequently asked questions before applying. Applications are welcome until September 15, 2024. Apply here.
Today, we’ll learn about
5
What’s Wrong with this Data?
Please respond in the Chat
6
Source: https://sketchplanations.com/chihuahua-syndrome
Best Practice - Build in Data Validation
7
Readme template: https://guides.lib.utexas.edu/ld.php?content_id=73027116
Best Practice - Tidy Data
8
Wickham, Hadley (2014). "Tidy Data" Journal of Statistical Software.
Date Problems
9
Null Problems
10
Null Values | Problems | Compatibility | Recommendation |
0 | Indistinguishable from a true zero | | Never Use |
Blank | Hard to distinguish missing, overlooked, spaces | R, Python, SQL | Best Option |
NA, na, | Can be the wrong data type | R | Good option |
N/A | Alternate form of NA, often not compatible | | Avoid |
None | Uncommon, Can be the wrong data type | Python | Avoid |
NULL | Can be the wrong data type | SQL | Good option |
Missing, - + | Uncommon. Can be the wrong data type | | Avoid |
White EP, et al. (2013). Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution. https://ojs.library.queensu.ca/index.php/IEE/article/view/4608
Best Practices
11
OpenRefine
OpenRefine is a an open source tool for working with messy data.
You can use OpenRefine to:
12
Source: https://openrefine.org/
Download OPENREFINE
http://openrefine.org/download.html
OpenRefine works best on these browsers:
13
Download the Petnames.TSV Dataset
14
https://github.com/jgolbeck/petnames
Download the Petnames.TSV Dataset
15
https://github.com/jgolbeck/petnames/blob/master/PetNames.tsv
CReate a Project
16
Create Project
17
starting a project
18
Use Facets to see and clean data
19
Fixing Errors and Clustering
20
Clustering
21
Sort Columns
22
Edit Cells and Common transformations
23
Tracking project history
24
Rename & Export
To rename the file click on the title. An editable dialog box will appear.
To export, click export in the upper hand conner, and select your preferred export format.
25
Save json scripts
26
Scripted approaches for Working with Tabular data
27
Scripted approaches for Working with Tabular data
28
Scripted approaches for Working with Tabular data
29
Python Packages for Working with Tabular Data
30
WHen to use which package?
31
Working with Tabular Data in Python
Google Colab notebook with examples for:
32
Workshop Feedback
Please help us to improve this workshop in the future by filling out a brief anonymous survey that will popup when the Zoom session closes.
33
OpenRefine Resources & Links
34
35
Questions?
Upcoming Workshops