1 of 35

Tabular Data In Spreadsheets: OpenRefine & Python

Presentation template by SlidesCarnival

Meryl Brodsky & Michael Shensky, UT-Austin Libraries

Data & Donuts, September 13, 2024

Slides: https://tinyurl.com/49pnvkmt

2 of 35

Data & Donuts Overview

2

  • This is the 1st of 4 virtual Data & Donuts workshops in the Fall 2024 semester

Tabular Data in Spreadsheets: OpenRefine and Python

Research Data Management Best Practices

Managing Research Code with Git and GitHub

Intro to R for Data Management

  • Zoom recordings & lecture slides will be posted online at
  • Sign up to receive Data & Donuts workshop event notifications at

September 13

September 27

October 11

October 25

3 of 35

UTL Funding Opportunities

3

Info Session in the PCL Scholars Lab and on Zoom on Tuesday 9/17 from 1pm to 2pm

4 of 35

UTL Funding Opportunities

4

This paid two-semester long program is aimed at UT graduate students. We will select up to five (5) total Fellows who will complete a project that involves data, digital collections, digital media, or digital methods/platforms.

Fellows will receive:

  • project support through consultation with research librarians and staff experts
  • mentorship, cohort-building and related professional development opportunities
  • a one-time stipend of $3,000 to enable focus on accepted proposed project work

� Applicants should read about all expectations and view the frequently asked questions before applying. Applications are welcome until September 15, 2024. Apply here.

5 of 35

Today, we’ll learn about

  • Formatting data in spreadsheets
  • Using OpenRefine to clean data
  • Using Python Scripts to clean data

5

6 of 35

What’s Wrong with this Data?

Please respond in the Chat

6

Source: https://sketchplanations.com/chihuahua-syndrome

7 of 35

Best Practice - Build in Data Validation

  • If writing something down is essential, create a guide (Readme file) to train recorders
  • Use drop down menus whenever possible
  • Define the type & format of data in each cell

7

Readme template: https://guides.lib.utexas.edu/ld.php?content_id=73027116

8 of 35

Best Practice - Tidy Data

  • Put all your variables in columns - the thing you’re measuring, like ‘weight’ or ‘temperature’
  • Put each observation in its own row
  • Don’t combine multiple pieces of information in one cell

8

Wickham, Hadley (2014). "Tidy Data" Journal of Statistical Software.

9 of 35

Date Problems

  • Formatting may “add” or “remove” specificity
  • Dates are integer-based
  • Excel may change order based on your region
  • Excel may use 1900 or 1904 date systems

9

10 of 35

Null Problems

10

Null Values

Problems

Compatibility

Recommendation

0

Indistinguishable from a true zero

Never Use

Blank

Hard to distinguish missing, overlooked, spaces

R, Python, SQL

Best Option

NA, na,

Can be the wrong data type

R

Good option

N/A

Alternate form of NA, often not compatible

Avoid

None

Uncommon, Can be the wrong data type

Python

Avoid

NULL

Can be the wrong data type

SQL

Good option

Missing, - +

Uncommon. Can be the wrong data type

Avoid

White EP, et al. (2013). Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution. https://ojs.library.queensu.ca/index.php/IEE/article/view/4608

11 of 35

Best Practices

  • Don’t touch the raw data!
  • Make a new file for any clean-up or analysis
  • Document any changes you made to the data
  • Export cleaned data to a text-based format, like CSV (comma-separated values)

11

12 of 35

OpenRefine

OpenRefine is a an open source tool for working with messy data.

You can use OpenRefine to:

    • Clean and visualize data
    • Transform it from one format into another
    • Extend it with web services and external data

12

Source: https://openrefine.org/

13 of 35

Download OPENREFINE

http://openrefine.org/download.html

OpenRefine works best on these browsers:

  • Google Chrome
  • Chromium
  • Opera
  • Microsoft Edge
  • Safari

13

14 of 35

Download the Petnames.TSV Dataset

14

https://github.com/jgolbeck/petnames

15 of 35

Download the Petnames.TSV Dataset

15

https://github.com/jgolbeck/petnames/blob/master/PetNames.tsv

16 of 35

CReate a Project

16

17 of 35

Create Project

17

18 of 35

starting a project

18

19 of 35

Use Facets to see and clean data

19

20 of 35

Fixing Errors and Clustering

20

21 of 35

Clustering

21

22 of 35

Sort Columns

22

23 of 35

Edit Cells and Common transformations

23

24 of 35

Tracking project history

24

25 of 35

Rename & Export

To rename the file click on the title. An editable dialog box will appear.

To export, click export in the upper hand conner, and select your preferred export format.

25

26 of 35

Save json scripts

26

27 of 35

Scripted approaches for Working with Tabular data

  • Pros
    • Reproducibility
    • Efficiency
    • Documentation
    • Code versioning
    • Code sharing
  • Cons
    • Might require new skills
    • Might not be well suited to very small or complex datasets

27

28 of 35

Scripted approaches for Working with Tabular data

  • What can you do with tabular data using a scripted process?
    • Access/load
    • Preview
    • Clean
    • Analyze
    • Visualize

28

29 of 35

Scripted approaches for Working with Tabular data

  • Options
    • Python
    • R
    • Other scripting languages

  • Look beyond the standard library for packages that facilitate tabular data management

29

30 of 35

Python Packages for Working with Tabular Data

30

31 of 35

WHen to use which package?

  • Scenarios where each package is most useful:
    • csv: you have a small dataset and simple script
    • pandas: a larger dataset and more complex operations
    • polars: working with very large tabular datasets
    • openpyxl: you want to format tabular data in Excel

31

32 of 35

Working with Tabular Data in Python

Google Colab notebook with examples for:

  • Importing different packages for tabular data
  • Reading data from csv files
  • Saving tabular data in different formats
  • Cleaning data from a tabular dataset
  • Visualizing data from a tabular dataset

32

33 of 35

Workshop Feedback

Please help us to improve this workshop in the future by filling out a brief anonymous survey that will popup when the Zoom session closes.

33

34 of 35

OpenRefine Resources & Links

34

35 of 35

35

Contact us

Meryl Brodsky

meryl.brodsky@austin.utexas.edu

Michael Shensky

m.shensky@austin.utexas.edu

Questions?

Upcoming Workshops