1 of 35

Tabular Data in Spreadsheets: OpenRefine & Python

Meryl Brodsky & Michael Shensky, UT-Austin Libraries

Data & Donuts, September 13, 2024

Slides: https://tinyurl.com/49pnvkmt

2 of 35

Data & Donuts Overview

This is the 1st of 4 virtual Data & Donuts workshops in the Fall 2024 semester:

September 13: Tabular Data in Spreadsheets: OpenRefine and Python
September 27: Research Data Management Best Practices
October 11: Managing Research Code with Git and GitHub
October 25: Intro to R for Data Management

Zoom recordings & lecture slides will be posted online at https://guides.lib.utexas.edu/data-and-donuts

Sign up to receive Data & Donuts workshop event notifications at https://utlists.utexas.edu/sympa/subscribe/research-data-services

3 of 35

UTL Funding Opportunities

Info Session in the PCL Scholars Lab and on Zoom on Tuesday 9/17 from 1pm to 2pm

4 of 35

UTL Funding Opportunities

This paid, two-semester-long program is aimed at UT graduate students. We will select up to five (5) total Fellows who will complete a project that involves data, digital collections, digital media, or digital methods/platforms.

Fellows will receive:

project support through consultation with research librarians and staff experts
mentorship, cohort-building and related professional development opportunities
a one-time stipend of $3,000 to enable focus on accepted proposed project work

Applicants should read about all expectations and view the frequently asked questions before applying. Applications are welcome until September 15, 2024. Apply here.

5 of 35

Today, we’ll learn about

Formatting data in spreadsheets
Using OpenRefine to clean data
Using Python scripts to clean data

6 of 35

What’s Wrong with this Data?

Please respond in the Chat

Source: https://sketchplanations.com/chihuahua-syndrome

7 of 35

Best Practice - Build in Data Validation

If writing something down is essential, create a guide (Readme file) to train recorders
Use drop down menus whenever possible
Define the type & format of data in each cell

Readme template: https://guides.lib.utexas.edu/ld.php?content_id=73027116

Before you even begin your data collection, think about what type of data will be most useful to you for analysis, and how you will use it.

All of your data collecting should be documented in a Readme file. The Readme file should include:

How was this data created?

What types of data were collected?

What do the variables mean?

When was this data collected?

What processing or cleaning has been done?

We’ve included a link to a Readme template at the bottom of this slide. This will be helpful in training people on your team who are collecting the data.

If using a spreadsheet or a form, create a drop-down menu so that the amount of variation in response is limited.

If you can’t do that, then define the type of data you expect in each cell, whether it’s quantitative or qualitative, and what the format of the data should be, especially related to dates.
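If the data has already been collected, the same expectations can be checked with a short script. The sketch below is only an illustration: the column names (species, visit_date, weight_kg), the allowed species list, and the date format are all made up for this example and would come from your own Readme file.

```python
from datetime import datetime

# Hypothetical rules for an imaginary pet survey; take the real ones from your Readme.
ALLOWED_SPECIES = {"Dog", "Cat", "Bird"}
DATE_FORMAT = "%Y-%m-%d"  # year-month-day, e.g. 2024-09-13


def validate_row(row):
    """Return a list of problems found in one record (empty if none are found)."""
    problems = []
    if row["species"] not in ALLOWED_SPECIES:
        problems.append(f"species not in allowed list: {row['species']!r}")
    try:
        datetime.strptime(row["visit_date"], DATE_FORMAT)
    except ValueError:
        problems.append(f"visit_date is not year-month-day: {row['visit_date']!r}")
    try:
        if float(row["weight_kg"]) <= 0:
            problems.append(f"weight_kg must be positive: {row['weight_kg']!r}")
    except ValueError:
        problems.append(f"weight_kg is not a number: {row['weight_kg']!r}")
    return problems


good = {"species": "Dog", "visit_date": "2024-09-13", "weight_kg": "12.5"}
bad = {"species": "dog", "visit_date": "9/13/24", "weight_kg": "heavy"}

print(validate_row(good))
for problem in validate_row(bad):
    print(problem)
```

A drop-down menu prevents these problems at entry time; a check like this only reports them afterwards, so it complements the menu rather than replacing it.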

8 of 35

Best Practice - Tidy Data

Put all your variables in columns - the thing you’re measuring, like ‘weight’ or ‘temperature’
Put each observation in its own row
Don’t combine multiple pieces of information in one cell

Wickham, Hadley (2014). "Tidy Data" Journal of Statistical Software.

Here’s a best practice called Tidy Data. Tidy Data is a concept coined by Hadley Wickham. The idea of keeping data tidy was developed by the New Zealander as a set of best practices for data analysis. It sounds British. Here in America we might call it neat data. I like my data neat.

Keep one value in each cell.

Don’t use formatting to convey information, such as bold or a background color – you’ll want to be able to sort data by all of your variables, so you make those variables a column.

Make sure that variable names don’t contain stop words and that they are descriptive.

Don’t use special characters in the data. You need to let the data speak.

Don’t use multiple sheets or tabs for different years unless you are planning to analyze them separately. This way if you want to compare years you can.
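As a rough illustration of the points above, here is a sketch that turns a table with one column per year into one row per observation, so that "year" becomes a column you can sort and compare on. The sample data and the function name wide_to_long are invented for this example.

```python
import csv
import io

# Made-up "wide" table: one column per year, which hides the variable "year".
wide_csv = """site,2022,2023
Austin,14,17
Dallas,9,11
"""


def wide_to_long(text, id_column, variable_name, value_name):
    """Reshape one-column-per-year data into one row per observation."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for column, value in record.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row
            rows.append({
                id_column: record[id_column],
                variable_name: column,
                value_name: value,
            })
    return rows


tidy = wide_to_long(wide_csv, id_column="site", variable_name="year", value_name="count")
for row in tidy:
    print(row)
```

The same idea applies to separate sheets or tabs per year: stacking them into one table with a year column keeps every comparison possible.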

9 of 35

Date Problems

Formatting may “add” or “remove” specificity
Dates are integer-based: Excel stores each date as a serial day count
Excel may change the day/month order based on your region
Excel may use either the 1900 or the 1904 date system
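To make the integer-based point concrete, here is a minimal sketch of converting an Excel date serial to a calendar date under each date system. The function name excel_serial_to_date is made up for this example, and the sketch assumes whole-day serials of 61 or greater in the 1900 system, since Excel counts a 29 February 1900 that never existed.

```python
from datetime import date, timedelta


def excel_serial_to_date(serial, date_system=1900):
    """Convert a whole-number Excel date serial to a datetime.date.

    Assumes serial >= 61 under the 1900 system; earlier serials are
    off by one because Excel includes a nonexistent 29 February 1900.
    """
    if date_system == 1900:
        epoch = date(1899, 12, 30)
    elif date_system == 1904:
        epoch = date(1904, 1, 1)
    else:
        raise ValueError("date_system must be 1900 or 1904")
    return epoch + timedelta(days=serial)


# The workshop date stored under each system.
workshop_1900 = excel_serial_to_date(45548)
workshop_1904 = excel_serial_to_date(44086, date_system=1904)
print(workshop_1900, workshop_1904)

# Reading a 1900-system serial as if it were a 1904-system serial shifts the date.
wrong = excel_serial_to_date(45548, date_system=1904)
print(wrong, (wrong - workshop_1900).days)
```

This is why a date column copied between workbooks that use different date systems can silently shift by about four years.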

10 of 35

Null Problems

Null Values	Problems	Compatibility	Recommendation
0	Indistinguishable from a true zero		Never Use
Blank	Hard to distinguish missing, overlooked, spaces	R, Python, SQL	Best Option
NA, na,	Can be the wrong data type	R	Good option
N/A	Alternate form of NA, often not compatible		Avoid
None	Uncommon, Can be the wrong data type	Python	Avoid
NULL	Can be the wrong data type	SQL	Good option
Missing, - +	Uncommon. Can be the wrong data type		Avoid

White EP, et al. (2013). Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution. https://ojs.library.queensu.ca/index.php/IEE/article/view/4608
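A sketch of how the markers in the table might be standardized when a file is read with Python's csv module. The list of markers and the sample data are assumptions for illustration; note that 0 is deliberately left alone, because it cannot be told apart from a true zero.

```python
import csv
import io

# Markers treated as missing in this sketch; the list itself is an assumption.
NULL_MARKERS = {"", "na", "n/a", "none", "null", "missing", "-"}


def normalize_nulls(value):
    """Return None for a recognized missing-value marker, else the trimmed value."""
    cleaned = value.strip()
    return None if cleaned.lower() in NULL_MARKERS else cleaned


raw = "name,weight_kg\nBella,12.5\nMax,N/A\nLuna, \nCoco,0\n"
rows = [
    {key: normalize_nulls(value) for key, value in record.items()}
    for record in csv.DictReader(io.StringIO(raw))
]
for row in rows:
    print(row)
```

When rows like these are written back out with csv.DictWriter, None is written as an empty field, which matches the "Blank" recommendation in the table.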

11 of 35

Best Practices

Don’t touch the raw data!
Make a new file for any clean-up or analysis
Document any changes you made to the data
Export cleaned data to a text-based format, like CSV (comma-separated values)
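One way these practices might look in a script: the raw file is opened for reading only, the cleaned rows go to a new CSV file, and every change is written to a log. The file names and the clean_copy function are hypothetical, and trimming whitespace stands in for whatever clean-up your data needs.

```python
import csv
import tempfile
from pathlib import Path


def clean_copy(raw_path, clean_path, log_path):
    """Write a cleaned copy of raw_path and log each change; raw_path is only read."""
    changes = []
    with open(raw_path, newline="", encoding="utf-8") as src, \
         open(clean_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for line_number, row in enumerate(reader, start=2):  # line 1 is the header
            for column, value in row.items():
                trimmed = value.strip()
                if trimmed != value:
                    changes.append(f"line {line_number}, {column}: trimmed whitespace")
                    row[column] = trimmed
            writer.writerow(row)
    Path(log_path).write_text("\n".join(changes) + "\n", encoding="utf-8")
    return changes


# Demonstration in a temporary folder with a made-up raw file.
with tempfile.TemporaryDirectory() as folder:
    raw = Path(folder) / "pets_raw.csv"
    raw.write_text("name,species\nBella, Dog\nMax,Cat \n", encoding="utf-8")
    original = raw.read_text(encoding="utf-8")
    changes = clean_copy(raw, Path(folder) / "pets_clean.csv", Path(folder) / "changes.log")
    cleaned = (Path(folder) / "pets_clean.csv").read_text(encoding="utf-8")
    raw_unchanged = raw.read_text(encoding="utf-8") == original

print(changes)
print(cleaned)
```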

12 of 35

OpenRefine

OpenRefine is an open source tool for working with messy data.

You can use OpenRefine to:

Clean and visualize data
Transform it from one format into another
Extend it with web services and external data

Source: https://openrefine.org/

13 of 35

Download OpenRefine

http://openrefine.org/download.html

OpenRefine works best on these browsers:

Google Chrome
Chromium
Opera
Microsoft Edge
Safari

http://127.0.0.1:3333/

14 of 35

Download the PetNames.tsv Dataset

https://github.com/jgolbeck/petnames

15 of 35

Download the PetNames.tsv Dataset

https://github.com/jgolbeck/petnames/blob/master/PetNames.tsv

16 of 35

Create a Project

17 of 35

Create Project

18 of 35

Starting a Project

19 of 35

Use Facets to see and clean data

20 of 35

Fixing Errors and Clustering

So now you can see how many dogs, cats, and other animals there are. You can also see where there are errors.

You can see that it is not counting capital D Dog as being the same thing as small d dog.

And you can fix these right now if you like. You can go to small d dog, and make it a capital d, and you’ll see the number increase.

Highlight small d dog, and go to edit, make it a capital D, and then hit apply. Go ahead and do that.

You’ll see the number of Dogs increase at the top.

The other thing you’ll note is that there are now 68 types whereas before there were 69 types.

Okay great, but these errors are pretty straightforward to fix, and there aren’t that many types.

How would it work with a larger set? You could employ clustering, which groups like things together and asks you to pick the right one.

We’ll try that now. Go to Cluster in the Facet box.
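For comparison, the facet counts and the dog/Dog fix described above can be sketched in Python. The values here are made up to stand in for the species column; they are not taken from the pet names file.

```python
from collections import Counter

# Made-up values standing in for the species column of the pet names data.
species = ["Dog", "dog", "Cat", "Dog", "cat", "Cat", "Rabbit"]

# A text facet is essentially a count of each distinct value.
facet = Counter(species)
print(len(facet), facet.most_common())  # "Dog" and "dog" are separate types

# Equivalent of editing the facet values "dog" to "Dog" and "cat" to "Cat".
corrections = {"dog": "Dog", "cat": "Cat"}
fixed = [corrections.get(value, value) for value in species]
fixed_facet = Counter(fixed)
print(len(fixed_facet), fixed_facet.most_common())
```

As in OpenRefine, the count for the corrected value goes up and the number of distinct types goes down.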

21 of 35

Clustering

You can change it to Metaphone3, which is a phonetic keying function that groups values that sound alike. It finds about 10 matches, which is pretty good.

This makes some pretty good matches, but the Cats one should be in with Cat, and we can change that by typing it in the box.

The cow one doesn’t look right, and the rabbit and robot are not a match.

You can go through these and select the ones that look good, and then merge them. Or you can try some of the other keying functions to see if they do a better job.

Depending on what language you’re working in, different keying functions may work better.

Fingerprint seems to pick up few matches.

Cologne Phonetic seems to work for about 9 different names.

So, I am going with Metaphone3, and I’ll select the ones that look good, merge them, and recluster. Then I will close.
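To give a feel for what a keying function does, here is a simplified sketch of fingerprint-style clustering. It is an approximation written for this example, not OpenRefine's actual code: each value is reduced to a key (trimmed, lowercased, punctuation removed, words sorted and de-duplicated), and values that share a key form a cluster.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Simplified fingerprint key; an approximation, not OpenRefine's exact code."""
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, then sort and de-duplicate the remaining words.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values that share a key; keep only groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


values = ["Dog", "dog ", "DOG.", "Cat", "Cats", "Guinea Pig", "pig, guinea"]
clusters = cluster(values)
for key, members in clusters.items():
    print(key, "->", members)
```

Notice that this key does not bring "Cat" and "Cats" together, because their keys differ by a letter. Phonetic keying functions use keys based on how a value sounds instead, which is one reason different functions find different matches.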

22 of 35

Sort Columns

23 of 35

Edit Cells and Common Transformations

24 of 35

Tracking Project History

25 of 35

Rename & Export

To rename the file, click on the title. An editable dialog box will appear.

To export, click Export in the upper right-hand corner, and select your preferred export format.

26 of 35

Save JSON Scripts

27 of 35

Scripted Approaches for Working with Tabular Data

Pros

Reproducibility
Efficiency
Documentation
Code versioning
Code sharing

Cons

Might require new skills
Might not be well suited to very small or complex datasets

28 of 35

Scripted Approaches for Working with Tabular Data

What can you do with tabular data using a scripted process?

Access/load
Preview
Clean
Analyze
Visualize
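A minimal sketch of those five steps using only Python's standard library. The tab-separated sample is made up for this example; a real script would open a file such as a downloaded .tsv instead.

```python
import csv
import io
from collections import Counter

# Made-up tab-separated sample; a real script would open a .tsv file instead.
sample_tsv = "name\tspecies\nBella\tDog\nMax\tdog\nLuna\tCat\nCoco\t Dog\n"

# Access/load: parse the tab-separated text into a list of dictionaries.
rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# Preview: look at the column names and the first two records.
print(list(rows[0].keys()))
print(rows[:2])

# Clean: trim whitespace and standardize capitalization in the species column.
for row in rows:
    row["species"] = row["species"].strip().capitalize()

# Analyze: count how many pets there are of each species.
counts = Counter(row["species"] for row in rows)

# Visualize: a plain-text bar chart, so no plotting package is needed here.
for species, count in counts.most_common():
    print(f"{species:<5} {'#' * count}")
```

Because every step is written down, rerunning the script on an updated file repeats exactly the same clean-up.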

29 of 35

Scripted Approaches for Working with Tabular Data

Options

Python
R
Other scripting languages

Look beyond the standard library for packages that facilitate tabular data management

30 of 35

Python Packages for Working with Tabular Data

Python packages to consider

csv (https://docs.python.org/3/library/csv.html)
pandas (https://pandas.pydata.org/docs/reference/)
polars (https://docs.pola.rs/api/python/stable/reference/)
openpyxl (https://openpyxl.readthedocs.io/en/stable/)

31 of 35

When to Use Which Package?

Scenarios where each package is most useful:

csv: you have a small dataset and simple script
pandas: a larger dataset and more complex operations
polars: working with very large tabular datasets
openpyxl: you want to format tabular data in Excel

32 of 35

Working with Tabular Data in Python

Google Colab notebook with examples for:

Importing different packages for tabular data
Reading data from csv files
Saving tabular data in different formats
Cleaning data from a tabular dataset
Visualizing data from a tabular dataset
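This is not taken from the notebook; it is a small standalone sketch of one of the topics listed above, saving the same records in different formats, using only the standard library. The records and the helper name to_delimited are invented for this example.

```python
import csv
import io
import json

# Made-up records standing in for a cleaned tabular dataset.
rows = [
    {"name": "Bella", "species": "Dog"},
    {"name": "Max", "species": "Cat"},
]


def to_delimited(records, delimiter):
    """Serialize a list of dictionaries as delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(records[0].keys()),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_delimited(rows, ",")
tsv_text = to_delimited(rows, "\t")
json_text = json.dumps(rows, indent=2)

print(csv_text)
print(tsv_text)
print(json_text)
```

To save to disk instead of printing, the same writer can be pointed at a file opened with open(path, "w", newline="").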

https://bit.ly/python-nb-tabular-data-2024

33 of 35

Workshop Feedback

Please help us to improve this workshop in the future by filling out a brief anonymous survey that will pop up when the Zoom session closes.

34 of 35

OpenRefine Resources & Links

OpenRefine: http://openrefine.org/
Github repository: https://github.com/OpenRefine/OpenRefine
OpenRefine user manual: https://openrefine.org/docs
Reconcilable Data Sources (to link data): https://github.com/OpenRefine/OpenRefine/wiki/Reconcilable-Data-Sources
Using OpenRefine (e-book): https://search.lib.utexas.edu/permalink/01UTAU_INST/be14ds/alma991057973361606011
GREL (can transform your data in a few more complex ways): https://guides.library.illinois.edu/openrefine/grel

35 of 35

Contact us

Meryl Brodsky

meryl.brodsky@austin.utexas.edu

Michael Shensky

m.shensky@austin.utexas.edu

Questions?

Upcoming Workshops