1 of 22

Python for Data Retrieval

and Visualization

Michael Shensky

Head of Research Data Services

m.shensky@austin.utexas.edu

Ian Goodale

European Studies and Linguistics Librarian

ian.goodale@austin.utexas.edu

2 of 22

What: Workshops covering research data practices and software

When: 12pm - 1:15pm on the dates listed below

Where: Zoom (all dates) / PCL Scholars Lab (select dates)

More info: https://guides.lib.utexas.edu/data-and-donuts

  • Join us for our Data & Donuts workshops in the Spring 2025 semester! Use QR code for online schedule.

  • Open Source GIS: From QGIS to Python (Fri, January 31)
  • Intro to Python for Data Management (Tue, February 11)
  • Python for Data Retrieval and Visualization (Wed, February 12)
  • Making Beautiful Plots in R’s ggplot2 (Thu, February 13)
  • Where and How to Publish Research Data (Fri, February 14)
  • How to Share Sensitive (Human) Data (Fri, February 28)

  • Sign up to receive Data & Donuts workshop event notifications at

3 of 22

Workshop Logistics

  • Feel free to ask questions and add comments in the chat
  • Workshop instruction will run from 12pm to 1pm, with time for questions until 1:15pm

4 of 22

Goals for this Workshop

  • Get acquainted with the process of importing and installing Python packages
  • Gain an understanding of what APIs are and how they can be used
  • Practice using Python to retrieve data online using APIs
  • Gain an understanding of cleaning and visualizing data using Python

5 of 22

What is Python?

  • Open source, interpreted programming language
  • Cross platform compatible (Windows, MacOS, & Linux)
  • Current version is 3.13.1 (as of Feb. 2025)
  • Widely used in a variety of fields
  • Large ecosystem of open source packages
  • Can be used for file management, analyzing data, editing data, visualizing data, and more!

6 of 22

What is an API?

  • API stands for: Application Programming Interface
  • An API is a set of definitions, protocols, and standards for interacting with a software application external to the script or application you have written
  • Some APIs are designed to facilitate interaction with locally installed software, while others are designed to allow you to communicate with software on a server

7 of 22

What is a REST API?

  • A REST API allows you to utilize web services using HTTP methods (GET, POST, PUT, DELETE) according to the client-server model
  • No client information is stored between requests
  • The computer running your Python code (the client) can use a REST API to execute a process that is made available at a defined REST URL endpoint
  • REST endpoints are described in an API’s documentation

8 of 22

Learning to Use a REST API

  • Some APIs are public, while others require an API key that lets the provider control how the API is used
  • APIs can be accessed either directly or through apps or interfaces made available for testing and learning

9 of 22

Recommendations for Using APIs

  • Learn how to read API documentation
  • Make sure you understand API usage limits and stay in compliance with terms of use
  • Consider using an API wrapper if one exists for the API you want to utilize
  • Be conscious of API updates and how they might impact your scripted workflows

10 of 22

Using a REST API and Python

  • You can utilize HTTP methods in a Python script using the requests package
    • It is not part of the Python standard library and must be installed (it is already installed by default in Google Colab)
    • The package documentation is available at https://requests.readthedocs.io/en/latest/
    • Makes it easy to use HTTP requests in your Python code
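A minimal sketch of how a request to a REST endpoint comes together with requests. The endpoint URL and query parameters here are hypothetical; building a prepared request (without sending it) shows how requests encodes parameters into the final URL:

```python
import requests

# Build (but do not send) a GET request to a hypothetical REST endpoint,
# to show how query parameters are encoded into the request URL.
req = requests.Request(
    "GET",
    "https://api.example.com/v1/records",       # hypothetical endpoint
    params={"q": "austin", "format": "json"},   # hypothetical parameters
).prepare()

print(req.url)  # -> https://api.example.com/v1/records?q=austin&format=json

# In a real workflow you would send the request and inspect the response:
# resp = requests.get("https://api.example.com/v1/records",
#                     params={"q": "austin", "format": "json"}, timeout=10)
# resp.raise_for_status()
# data = resp.json()
```

In practice, check the API's documentation for the real endpoint paths and parameter names, and use `timeout` and `raise_for_status()` so failures surface early.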

11 of 22

Why Use Python and APIs to Retrieve Data?

  • Efficiency: Accessing data with a scripted process can be quicker than manually downloading it
  • Ease of Use: Graphical user interfaces for data portals can sometimes be difficult to navigate and use
  • Reproducibility: A scripted process for accessing data allows you to reproduce your workflow later and allows others to replicate your work
  • Data Updates: If you are accessing frequently updated data from an external source, running a script to retrieve data at regular intervals can be useful
  • File Management: Data can get messy if you do not organize it after downloading it manually, but a script that downloads data can enforce naming conventions and a file organization structure

12 of 22

Working with Data Returned by an API Call Using Python

  • APIs commonly allow users to request that data be returned in XML or JSON format
  • There are Python packages that can facilitate working with XML and JSON data like:
    • json
    • xmltodict
  • Refer to API documentation to learn how the data you request will be structured
  • Using a print() statement can also be helpful for previewing the data you have requested
  • You can also save the data you have requested to a file (CSV, JSON, XML, TXT, etc.)
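The bullets above can be sketched with the standard-library `json` module (xmltodict works analogously for XML). The payload string here is hypothetical, standing in for the body of an API response:

```python
import json

# Hypothetical JSON payload, standing in for the body of an API response
raw = '{"results": [{"title": "Dataset A", "year": 2024}], "count": 1}'

# Parse the JSON text into Python objects (dicts, lists, etc.)
data = json.loads(raw)

# Preview the structure with print() before deciding how to process it
print(data["count"])                # -> 1
print(data["results"][0]["title"])  # -> Dataset A

# Save the parsed data to a file for later use
with open("response.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)
```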

13 of 22

Google Colab

  • Colab is a free Google service that allows you to create and run Jupyter Notebooks in the cloud
  • Allows you to write Python code and text notes in compartmentalized cells within a notebook
    • Code cells can be run individually to allow for previewing outputs and troubleshooting issues
  • Notebooks are stored in Google Drive and can access other files in Google Drive

https://research.google.com/colaboratory/faq.html

14 of 22

PRACTICE: Retrieving Data Using Python and APIs

  • Learn about helpful Python packages for data retrieval
  • Compare APIs
  • Preview retrieved data in different formats

Link to publicly shared Google Colab Notebook #1:

https://bit.ly/python-data-api-2025

Please save a copy of this script to your drive so you can edit it

15 of 22

Transition to Data Cleaning and Visualization

16 of 22

Key Technologies

  • Natural Language Toolkit (NLTK)
  • Pandas
  • Matplotlib
  • Wordcloud
  • Gensim (for topic modeling)

17 of 22

Natural Language Toolkit (NLTK)

  • A leading suite of libraries for building Python programs to work with human language data.
  • We’ll use NLTK for word tokenization and stopword removal
  • NLTK is often considered especially well-suited for academic purposes.
  • An online book that serves as a guide to the package is available at https://www.nltk.org/book/

18 of 22

Pandas

  • Pandas is an open source data manipulation and analysis library.
  • We’ll use pandas to put our text data into a DataFrame to prep it for visualization.
  • The DataFrame is the primary pandas data structure.
  • A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)
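A minimal sketch of loading text-derived data into a DataFrame, using hypothetical word counts in place of the workshop's cleaned corpus:

```python
import pandas as pd

# Hypothetical word counts, standing in for cleaned text data
counts = {"data": 12, "python": 9, "api": 7, "plot": 4}

# Build a DataFrame with one row per word and sort by frequency
df = pd.DataFrame({"word": list(counts), "count": list(counts.values())})
df = df.sort_values("count", ascending=False).reset_index(drop=True)
print(df)
```

Once the data is in a DataFrame, it is straightforward to sort, filter, and hand off to a plotting library.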

19 of 22

Matplotlib

  • Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
  • We’ll use it to transform our text data in the DataFrame into some simple static visualizations.
  • Visit https://matplotlib.org/ for curated examples, cheat sheets, documentation, and more.
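A minimal sketch of a static bar chart like the ones built in the workshop, using hypothetical word counts. The `Agg` backend renders to a file rather than a window, which matches how plots are produced in a notebook or script:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render straight to a file
import matplotlib.pyplot as plt

# Hypothetical word-frequency data
words = ["data", "python", "api", "plot"]
counts = [12, 9, 7, 4]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(words, counts)
ax.set_title("Word frequencies (hypothetical data)")
ax.set_xlabel("word")
ax.set_ylabel("count")
fig.tight_layout()
fig.savefig("word_counts.png")  # save the chart as an image
```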

20 of 22

Wordcloud and Gensim

  • Wordcloud is a simple word cloud generator written in Python.
  • Gensim is an open-source library for unsupervised topic modeling and other natural language processing functionalities.

21 of 22

PRACTICE: Cleaning and Visualizing Data

  • First, we’ll clean our text to ready it for visualization and analysis
  • Next, we’ll load our data into a dataframe in Pandas
  • Finally, we’ll visualize the text in two different types of graphs
  • If we have time, we’ll explore topic modeling as well

22 of 22

Wrap Up

Michael Shensky

Head of Research Data Services

m.shensky@austin.utexas.edu

Questions? Comments?

Spring 2025 Data & Donuts Workshop Recording and Materials

Next Data & Donuts workshop: Tomorrow!

Making Beautiful Plots in R’s ggplot2

12pm - 1:15pm on Zoom and in the PCL Scholars Lab

Ian Goodale

European Studies and Linguistics Librarian

ian.goodale@austin.utexas.edu