1 of 22

Python for Data Retrieval

and Visualization

Michael Shensky

Head of Research Data Services

m.shensky@austin.utexas.edu

Ian Goodale

European Studies and Linguistics Librarian

ian.goodale@austin.utexas.edu

2 of 22

What: Workshops covering research data practices and software

When: 12pm - 1:15pm on the dates listed below

Where: Zoom (all dates) / PCL Scholars Lab (select dates)

More info: https://guides.lib.utexas.edu/data-and-donuts

  • Join us for our Data & Donuts workshops in the Spring 2025 semester! Use QR code for online schedule.

  • Open Source GIS: From QGIS to Python (Fri, January 31)
  • Intro to Python for Data Management (Tue, February 11)
  • Python for Data Retrieval and Visualization (Wed, February 12)
  • Making Beautiful Plots in R’s ggplot2 (Thu, February 13)
  • Where and How to Publish Research Data (Fri, February 14)
  • How to Share Sensitive (Human) Data (Fri, February 28)

  • Sign up to receive Data & Donuts workshop event notifications at

3 of 22

Workshop Logistics

  • Feel free to ask questions and add comments in the chat
  • Workshop instruction will run from 12pm to 1pm, with time for questions until 1:15pm

4 of 22

Goals for this Workshop

  • Get acquainted with the process of importing and installing Python packages
  • Gain an understanding of what APIs are and how they can be used
  • Practice using Python to retrieve data online using APIs
  • Gain an understanding of cleaning and visualizing data using Python

5 of 22

What is Python?

  • Open source, interpreted programming language
  • Cross platform compatible (Windows, MacOS, & Linux)
  • Current version is 3.13.1 (as of Feb. 2025)
  • Widely used in a variety of fields
  • Large ecosystem of open source packages
  • Can be used for file management, analyzing data, editing data, visualizing data, and more!

6 of 22

What is an API?

  • API stands for: Application Programming Interface
  • An API is a set of definitions, protocols, and standards for interacting with a software application external to the script or application you have written
  • Some APIs are designed to facilitate interaction with locally installed software, while others are designed to allow you to communicate with software on a server

7 of 22

What is a REST API?

  • A REST API allows you to utilize web services using HTTP methods (GET, POST, PUT, DELETE) according to the client-server model
  • No client information is stored between requests
  • The computer running your Python code (the client) can use a REST API to execute a process that is made available at a defined REST URL endpoint
  • REST endpoints are described in an API’s documentation

8 of 22

Learning to Use a REST API

  • Some APIs are public, while others require an API key that lets the provider control how the API is used
  • APIs can be accessed either directly or through apps or interfaces made available for testing and learning

9 of 22

Recommendations for Using APIs

  • Learn how to read API documentation
  • Make sure you understand API usage limits and stay in compliance with terms of use
  • Consider using an API wrapper if one exists for the API you want to utilize
  • Be conscious of API updates and how they might impact your scripted workflows

10 of 22

Using a REST API and Python

  • You can utilize HTTP methods in a Python script using the requests package
    • It is not part of the Python standard library and must be installed (it is already installed by default in Google Colab)
    • The package documentation is available at https://requests.readthedocs.io/en/latest/
    • Makes it easy to use HTTP requests in your Python code
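A minimal sketch of how a request to a REST endpoint comes together with requests. The endpoint URL and query parameters here are hypothetical; building a prepared request (without sending it) shows how requests encodes parameters into the final URL:

```python
import requests

# Build (but do not send) a GET request to a hypothetical REST endpoint,
# to show how query parameters are encoded into the request URL.
req = requests.Request(
    "GET",
    "https://api.example.com/v1/records",       # hypothetical endpoint
    params={"q": "austin", "format": "json"},   # hypothetical parameters
).prepare()

print(req.url)  # -> https://api.example.com/v1/records?q=austin&format=json

# In a real workflow you would send the request and inspect the response:
# resp = requests.get("https://api.example.com/v1/records",
#                     params={"q": "austin", "format": "json"}, timeout=10)
# resp.raise_for_status()
# data = resp.json()
```

In practice, check the API's documentation for the real endpoint paths and parameter names, and use `timeout` and `raise_for_status()` so failures surface early.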

11 of 22

Why Use Python and APIs to Retrieve Data?

  • Efficiency: Accessing data with a scripted process can be quicker than manually downloading it
  • Ease of Use: Graphical user interfaces for data portals can sometimes be difficult to navigate and use
  • Reproducibility: A scripted process for accessing data allows you to reproduce your workflow later and allows others to replicate your work
  • Data Updates: If you are accessing frequently updated data from an external source, running a script to retrieve data at regular intervals can be useful
  • File Management: Data can get messy if you do not organize it after downloading it manually, but a script that downloads data can enforce naming conventions and a file organization structure

12 of 22

Working with Data Returned by an API Call Using Python

  • APIs commonly allow users to request that data be returned in XML or JSON format
  • There are Python packages that can facilitate working with XML and JSON data like:
    • json
    • xmltodict
  • Refer to API documentation to learn how the data you request will be structured
  • Using a print() statement can also be helpful for previewing the data you have requested
  • You can also save the data you have requested to a file (CSV, JSON, XML, TXT, etc.)
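The bullets above can be sketched with the standard-library `json` module (xmltodict works analogously for XML). The payload string here is hypothetical, standing in for the body of an API response:

```python
import json

# Hypothetical JSON payload, standing in for the body of an API response
raw = '{"results": [{"title": "Dataset A", "year": 2024}], "count": 1}'

# Parse the JSON text into Python objects (dicts, lists, etc.)
data = json.loads(raw)

# Preview the structure with print() before deciding how to process it
print(data["count"])                # -> 1
print(data["results"][0]["title"])  # -> Dataset A

# Save the parsed data to a file for later use
with open("response.json", "w", encoding="utf-8") as f:
    json.dump(data, f, indent=2)
```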

13 of 22

Google Colab

  • Colab is a free Google service that allows you to create and run Jupyter Notebooks in the cloud
  • Allows you to write Python code and text notes in compartmentalized cells within a notebook
    • Code cells can be run individually to allow for previewing outputs and troubleshooting issues
  • Notebooks are stored in Google Drive and can access other files in Google Drive

https://research.google.com/colaboratory/faq.html

14 of 22

PRACTICE: Retrieving Data Using Python and APIs

  • Learn about helpful Python packages for data retrieval
  • Compare APIs
  • Preview retrieved data in different formats

Link to publicly shared Google Colab Notebook #1:

https://bit.ly/python-data-api-2025

Please save a copy of this script to your drive so you can edit it

15 of 22

Transition to Data Cleaning and Visualization

16 of 22

Key Technologies

  • Natural Language Toolkit (NLTK)
  • Pandas
  • Matplotlib
  • Wordcloud
  • Gensim (for topic modeling)

17 of 22

Natural Language Toolkit (NLTK)

  • A leading suite of libraries for building Python programs to work with human language data.
  • We’ll use NLTK for word tokenization and stopword removal
  • NLTK is often considered especially well-suited for academic purposes.
  • An online book that serves as a guide to the package is available at https://www.nltk.org/book/

18 of 22

Pandas

  • Pandas is an open source data manipulation and analysis library.
  • We’ll use pandas to put our text data into a DataFrame to prep it for visualization.
  • The DataFrame is the primary pandas data structure.
  • A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns)
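A minimal sketch of loading text-derived data into a DataFrame, using hypothetical word counts in place of the workshop's cleaned corpus:

```python
import pandas as pd

# Hypothetical word counts, standing in for cleaned text data
counts = {"data": 12, "python": 9, "api": 7, "plot": 4}

# Build a DataFrame with one row per word and sort by frequency
df = pd.DataFrame({"word": list(counts), "count": list(counts.values())})
df = df.sort_values("count", ascending=False).reset_index(drop=True)
print(df)
```

Once the data is in a DataFrame, it is straightforward to sort, filter, and hand off to a plotting library.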

19 of 22

Matplotlib

  • Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
  • We’ll use it to transform our text data in the DataFrame into some simple static visualizations.
  • Visit https://matplotlib.org/ for curated examples, cheat sheets, documentation, and more.
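A minimal sketch of a static bar chart like the ones built in the workshop, using hypothetical word counts. The `Agg` backend renders to a file rather than a window, which matches how plots are produced in a notebook or script:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render straight to a file
import matplotlib.pyplot as plt

# Hypothetical word-frequency data
words = ["data", "python", "api", "plot"]
counts = [12, 9, 7, 4]

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(words, counts)
ax.set_title("Word frequencies (hypothetical data)")
ax.set_xlabel("word")
ax.set_ylabel("count")
fig.tight_layout()
fig.savefig("word_counts.png")  # save the chart as an image
```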

20 of 22

Wordcloud and Gensim

  • Wordcloud is a simple word cloud generator written in Python.
  • Gensim is an open-source library for unsupervised topic modeling and other natural language processing functionalities.

21 of 22

PRACTICE: Cleaning and Visualizing Data

  • First, we’ll clean our text to ready it for visualization and analysis
  • Next, we’ll load our data into a dataframe in Pandas
  • Finally, we’ll visualize the text in two different types of graphs
  • If we have time, we’ll explore topic modeling as well

22 of 22

Wrap Up

Michael Shensky

Head of Research Data Services

m.shensky@austin.utexas.edu

Questions? Comments?

Spring 2025 Data & Donuts Workshop Recording and Materials

Next Data & Donuts workshop: Tomorrow!

Making Beautiful Plots in R’s ggplot2

12pm - 1:15pm on Zoom and in the PCL Scholars Lab

Ian Goodale

European Studies and Linguistics Librarian

ian.goodale@austin.utexas.edu