1 of 60

Pandas

Lecture 03


CMSC 320

 

INTRODUCTION TO DATA SCIENCE

– FARDINA FATHMIUL ALAM

           (fardina@umd.edu)

“A Python Library for Manipulating Tabular Data”

2 of 60

Importance

  • Python: It's a user-friendly language for data analysis.
  • Git: Helps manage code and data changes when working with a team.
  • Pandas: Simplifies data cleaning and manipulation.
  • Databases: Needed for storing and retrieving data efficiently.


These skills are essential for effective data analysis, collaboration, and handling data in real-world projects.

3 of 60

Topics to Cover

  • What is Pandas
  • How to install and import Pandas
  • Basic Pandas Workflow
  • Some Basic Operations using Pandas

4 of 60

Basic Concept Review: Table / Tabular Data


In tabular data, information is organized into rows and columns.

5 of 60

Tabular data is crucial for both Pandas and SQL

It provides a structured and organized way to represent, manipulate, and visualize data.

  • Both pandas and SQL are commonly used for working with tabular data, making it essential for tasks such as data analysis, manipulation, and visualization.


6 of 60

There are some operations common to both SQL and Pandas:

select, join, aggregates, etc.


7 of 60

1. SELECT/SLICING

Select only some of the rows, or some of the columns, or a combination

ID  age   wgt_kg  hgt_cm
1   12.2  42.3    145.1
2   11.0  40.8    143.8
3   15.6  65.3    165.3
4   35.1  84.2    185.8

Only columns ID and age:

ID  age
1   12.2
2   11.0
3   15.6
4   35.1

Only rows with wgt > 41:

ID  age   wgt_kg  hgt_cm
1   12.2  42.3    145.1
3   15.6  65.3    165.3
4   35.1  84.2    185.8

Both (columns ID and age, rows with wgt > 41):

ID  age
1   12.2
3   15.6
4   35.1
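
As a sketch, this is roughly how those selections look in pandas, assuming the table above is loaded in a DataFrame df with these column names:

df[['ID', 'age']]                      # only columns ID and age
df[df['wgt_kg'] > 41]                  # only rows with wgt > 41
df[df['wgt_kg'] > 41][['ID', 'age']]   # both combined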

8 of 60

2. AGGREGATE/REDUCE

Combine values across a column into a single value

ID  age   wgt_kg  hgt_cm
1   12.2  42.3    145.1
2   11.0  40.8    143.8
3   15.6  65.3    165.3
4   35.1  84.2    185.8

SUM:  age = 73.9,  wgt_kg = 232.6,  hgt_cm = 640.0
MAX:  age = 35.1,  wgt_kg = 84.2,   hgt_cm = 185.8

SUM(wgt_kg^2 - hgt_cm) = 14167.66

What about ID/Index column?

Usually not meaningful to aggregate across it

May need to explicitly add an ID column
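
A rough pandas equivalent of the aggregates above, assuming the table is in a DataFrame df:

df[['age', 'wgt_kg', 'hgt_cm']].sum()      # 73.9, 232.6, 640.0
df[['age', 'wgt_kg', 'hgt_cm']].max()      # 35.1, 84.2, 185.8
(df['wgt_kg'] ** 2 - df['hgt_cm']).sum()   # SUM(wgt_kg^2 - hgt_cm) = 14167.66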

9 of 60

3. MAP

Apply a function to every row, possibly creating more or fewer columns

ID  Address
1   College Park, MD, 20742
2   Washington, DC, 20001
3   Silver Spring, MD, 20901

Variations that allow one row to generate multiple rows in the output (sometimes called “flatmap”)

ID  City           State  Zipcode
1   College Park   MD     20742
2   Washington     DC     20001
3   Silver Spring  MD     20901
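
In pandas, a map like this can be sketched with a vectorized string split, assuming the addresses sit in a DataFrame df with an 'Address' column:

# split 'Address' into three new columns; expand=True returns one column per piece
df[['City', 'State', 'Zipcode']] = df['Address'].str.split(', ', expand=True)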

10 of 60

4. GROUP BY

Group tuples together by column/dimension

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

By ‘A’:

A = foo:
ID  B  C
1   3  6.6
3   4  3.1
4   3  8.0
7   4  2.3
8   3  8.0

A = bar:
ID  B  C
2   2  4.7
5   1  1.2
6   2  2.5
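
A minimal pandas sketch of this grouping (the DataFrame below mirrors the table above):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8],
                   'A': ['foo', 'bar', 'foo', 'foo', 'bar', 'bar', 'foo', 'foo'],
                   'B': [3, 2, 4, 3, 1, 2, 4, 3],
                   'C': [6.6, 4.7, 3.1, 8.0, 1.2, 2.5, 2.3, 8.0]})

# iterating over a groupby yields one (key, sub-table) pair per distinct value of A
for key, group in df.groupby('A'):
    print(key)
    print(group[['ID', 'B', 'C']])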

11 of 60

4. GROUP BY

Group tuples together by column/dimension

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

By ‘B’:

B = 1:
ID  A    C
5   bar  1.2

B = 2:
ID  A    C
2   bar  4.7
6   bar  2.5

B = 3:
ID  A    C
1   foo  6.6
4   foo  8.0
8   foo  8.0

B = 4:
ID  A    C
3   foo  3.1
7   foo  2.3

12 of 60

4. GROUP BY

Group tuples together by column/dimension

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

By ‘A’, ‘B’:

A = bar, B = 1:
ID  C
5   1.2

A = bar, B = 2:
ID  C
2   4.7
6   2.5

A = foo, B = 3:
ID  C
1   6.6
4   8.0
8   8.0

A = foo, B = 4:
ID  C
3   3.1
7   2.3
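
Grouping by multiple columns works the same way; each group key becomes a tuple. A sketch using the same df as in the earlier grouping example:

for key, group in df.groupby(['A', 'B']):
    print(key)                 # e.g. ('bar', 1), ('bar', 2), ('foo', 3), ('foo', 4)
    print(group[['ID', 'C']])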

13 of 60

5. GROUP BY AGGREGATE

Compute one aggregate per group

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

Group by ‘B’, then Sum on C within each group:

B = 1 (row 5):         Sum(C) = 1.2
B = 2 (rows 2, 6):     Sum(C) = 4.7 + 2.5 = 7.2
B = 3 (rows 1, 4, 8):  Sum(C) = 6.6 + 8.0 + 8.0 = 22.6
B = 4 (rows 3, 7):     Sum(C) = 3.1 + 2.3 = 5.4

14 of 60

5. GROUP BY AGGREGATE

Final result usually seen as a table

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

Group by ‘B’, Sum on C:

B  SUM(C)
1  1.2
2  7.2
3  22.6
4  5.4
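
With df holding the table above, the whole split-apply-combine pipeline is one line in pandas:

df.groupby('B')['C'].sum()
# B
# 1     1.2
# 2     7.2
# 3    22.6
# 4     5.4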

15 of 60

6. UNION / INTERSECTION / DIFFERENCE

Set operations – only if the two tables have identical attributes/columns

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0

∪

ID  A    B  C
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

=

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

Similarly, Intersection (returns the rows that are common to both tables) and Set Difference, EXCEPT/MINUS (returns the rows that are in one table but not in the other), manipulate tables as sets.

The way IDs are treated (whether as unique or non-unique identifiers) can lead to different behaviors in these operations
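
As a sketch, assuming pandas is imported as pd and df1 and df2 are DataFrames with identical columns, these set operations can be expressed like this:

union = pd.concat([df1, df2]).drop_duplicates()           # all rows from both, once each
intersection = pd.merge(df1, df2, how='inner')            # rows present in both
difference = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)  # rows in df1 only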

16 of 60

7. MERGE OR JOIN

Combine rows/tuples across two tables if they have the same key

ID  A    B
1   foo  3
2   bar  2
3   foo  4
4   foo  3

ID  C
1   1.2
2   2.5
3   2.3
5   8.0

Joined on ID:

ID  A    B  C
1   foo  3  1.2
2   bar  2  2.5
3   foo  4  2.3

What about IDs not present in both tables?

Often need to keep them around

Can “pad” with NaN

17 of 60

7. MERGE OR JOIN

Combine rows/tuples across two tables if they have the same key

Outer joins can be used to “pad” IDs that don’t appear in both tables

Three variants: LEFT, RIGHT, FULL

SQL Terminology – pandas has these operations as well

ID  A    B
1   foo  3
2   bar  2
3   foo  4
4   foo  3

ID  C
1   1.2
2   2.5
3   2.3
5   8.0

Full outer join on ID:

ID  A    B    C
1   foo  3    1.2
2   bar  2    2.5
3   foo  4    2.3
4   foo  3    NaN
5   NaN  NaN  8.0
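
A minimal sketch of both joins in pandas, using the two tables above:

import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3, 4],
                     'A': ['foo', 'bar', 'foo', 'foo'],
                     'B': [3, 2, 4, 3]})
right = pd.DataFrame({'ID': [1, 2, 3, 5],
                      'C': [1.2, 2.5, 2.3, 8.0]})

pd.merge(left, right, on='ID')                # inner join: keeps IDs 1, 2, 3
pd.merge(left, right, on='ID', how='outer')   # full outer join: pads IDs 4 and 5 with NaN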

18 of 60

Next

Pandas


Class Colab:

Dataset:

19 of 60

What is Pandas

  • An open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools.
  • It's built on top of NumPy and is particularly useful for working with structured (tabular) data.
  • Key Data Structures
    1. Series
    2. DataFrame

Why Pandas?

  • Easy handling of missing data.
  • Powerful grouping and aggregation functions.
  • Quick data filtering, sorting, and merging.
  • Excellent support for reading/writing data (CSV, Excel, SQL, JSON).

"Pandas": Short for "Python Data Analysis Library," used for data analysis.

Written by Wes McKinney: Started in 2008 to get a high-performance, flexible tool to perform quantitative analysis on financial data.

20 of 60

Pandas First Steps: install and import

  • Pandas is an easy package to install. Open up your terminal program (shell or cmd) and install it using either of the following commands:

  • For jupyter notebook users, you can run this cell:
    • The ! at the beginning runs the command as if it were typed in a terminal.

  • To import pandas we usually import it with a shorter name since it's used so much:


$ conda install pandas

OR

$ pip install pandas

!pip install pandas

import pandas as pd

After installing pandas, it's good practice to check the installed version to ensure the installation succeeded:

print(pd.__version__)

“pd” alias (in Python): an alternate name for referring to the same thing.

21 of 60

Core components of Pandas:

Series & DataFrames


Basic Pandas Workflow……..

22 of 60

Pandas: Data Table Representation

Data in Pandas is stored in two fundamental data types:

  • Series is the data type that stores a single column of data.
  • DataFrame is the data type that stores an entire dataset.


Note: Both behave very similarly to NumPy arrays, so you can use them as input to anything in NumPy, SciPy, or scikit-learn. The main difference from NumPy arrays is indexing.

23 of 60

  1. Series

One-dimensional labeled array (like a column).

  • Similar to NumPy array but with axis labels/index.

Create a Series
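
A minimal sketch of creating a Series, both with the default integer index and with explicit labels:

import pandas as pd

s = pd.Series([12.2, 11.0, 15.6, 35.1])                              # index 0, 1, 2, 3 by default
s = pd.Series([12.2, 11.0, 15.6, 35.1], index=['a', 'b', 'c', 'd'])  # labeled index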

24 of 60

Series

A one-dimensional array of data with an index

  • Built on top of numpy.ndarray (it behaves like one, though it is no longer a subclass in modern pandas)
  • Has two components: an index and a value for each element (a concept similar to a dictionary)
  • Data: any type
  • Index: most commonly integers, starting at 0 (the default).
      • Index labels need not be ordered
      • We almost NEVER access a Series by an index
      • Duplicate index labels are possible but reduce functionality and may produce inaccurate results.


25 of 60

Series Functions

Pandas can do anything with a Series that you can do with an array

    • Sum, count, show unique values, etc.
    • If you have a basic thing you want to do with a series, look it up!
  • apply()
    • apply() takes a function as an argument and returns a Series with that function applied to every element
  • You can also convert a series into a series of booleans by applying a logical condition, which will be useful later (see the sketch below)
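
A short sketch of these Series operations, assuming s is defined as on the earlier slide:

s = pd.Series([12.2, 11.0, 15.6, 35.1])
s.sum()                    # 73.9
s.unique()                 # distinct values as an array
s.apply(lambda x: x * 2)   # new Series with the function applied to every element
s > 12                     # Series of booleans: True where the condition holds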


26 of 60

2. Dataframe

Two-dimensional labeled data structure/table (like a spreadsheet or SQL table).

Most commonly used Pandas object.

Create a DataFrame
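
A minimal sketch of building a DataFrame from a dict of columns (the column names and values here are illustrative):

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['NY', 'LA', 'Chicago']})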

27 of 60

Dataframes

A collection of Series organized as columns.

  • Each column can have a different type (integers, floats, strings, etc.)
  • Mutable size: insert and delete columns
  • Can perform arithmetic operations on rows and columns
  • We rarely access rows via an index (less common).
  • We DO access columns via their names


[Diagram: a DataFrame, with rows running down and columns across]

28 of 60

Dataframe Functions

Important functions:

  • How to access a column
  • How to filter a column
  • Apply (LATER TOPIC)
  • GroupBy


29 of 60

Basic Operations: Reading Data

# Read CSV file

df = pd.read_csv('data.csv')

# Read Excel file

df = pd.read_excel('data.xlsx') [reads both XLS and XLSX, using the file path as input]

# Read Text file

df = pd.read_csv("diabetes.txt", sep="\s") [the sep argument sets the separator; the regex "\s" matches a single whitespace character (use "\s+" for one or more)]

# Read Json file

df = pd.read_json("diabetes.json")

30 of 60

Basic Operations: Outputting data

# Into a CSV file

df.to_csv("diabetes_out.csv", index=False)

For a Series s: s.to_csv("filename.csv")

(Note: avoid df = df.to_csv(...); when given a path, to_csv writes the file and returns None, so the assignment would overwrite df.)

# Into an Excel file

df.to_excel("diabetes_out.xlsx", index=False)

# Into a text file

df.to_csv('diabetes_out.txt', header=df.columns, index=None, sep=' ')

# Into a JSON file

df.to_json("diabetes_out.json")

31 of 60

Example: Loading a DataFrame from files


32 of 60

Reading data from a CSV file

  • With CSV files, all you need is a single line to load in the data:


df = pd.read_csv('dataset.csv')

Load a CSV file into a Pandas DataFrame:

*** If the data contains only one column, passing squeeze=True made pandas return a Series; this argument is deprecated/removed in newer pandas, where you call .squeeze("columns") on the result instead.

33 of 60

Viewing and understanding DataFrames using pandas

After reading tabular data as a DataFrame, you will want a first glimpse of the data. You can either view a small sample of the dataset or a summary of the data in the form of summary statistics.

Example:

Use DataFrame.head() and DataFrame.tail() to view the top (first) and bottom (last) few rows of the frame respectively.

  • Default value is 5, unless specified otherwise.

34 of 60

Basic Operations: Exploring Data

# Display First 5 rows

df.head()

# Display Last 3 rows

df.tail(n=3)

[specify the number of rows through the n argument (the default value is 5)].

# Data types

df.dtypes

# Summary statistics

df.describe()

Prints the summary statistics of all numeric columns: count, mean, standard deviation, range (min/max), and quartiles.

# Shape (rows, columns)

df.shape

# Column names

df.columns

df.info() # Overview of columns and types

Get the number of rows and columns

df.shape[0] # Get the number of rows only

df.shape[1] # Get the number of columns only

35 of 60

Basic Operations: Selecting Data

# Select column

df['Age']

# Select multiple columns

df[['Name', 'City']]

# Select rows by index

df.iloc[0] # First row

df.iloc[1:3] # Rows 1-2

# Select rows by condition

df[df['Age'] > 30]

36 of 60

Example: How to access a column

dataframe['column_name']: Access a column by its name.


E.g.: df['age']

where df is your DataFrame

37 of 60

Arithmetic Operations

Can perform arithmetic operations (e.g., addition, subtraction, multiplication, division) on entire columns or between columns.

E.g.:

df['new_column'] = df['column1'] + df['column2']

df['new_column'] = df['column1'] - df['column2']

E.g.: # Modify a column

df['Age'] = df['Age'] + 1


38 of 60

Applying Functions Directly to the dataframe

You can apply different functions directly to a DataFrame column:

Syntax: dataframe[column_name].function()


39 of 60

Applying Functions Directly to the dataframe

Aggregation operations are used to compute summary statistics or single values from multiple values in a dataset. E.g:


E.g.: df["age"].sum()

Try some other aggregation operation!

40 of 60

Filtering in Pandas

Focus on specific subsets of your data. Two types:

1. Column Filtering (Selecting Columns):

Select specific columns, or filter columns based on names/conditions. Method: basic selection, df[['col1']] or df[['col1', 'col2']].

  • Preparing data for modeling (selecting features).
  • Reducing memory usage by dropping unused columns.

2. Row Filtering (Selecting Rows):

Filter rows based on conditions applied to column values. Method: boolean indexing (the most common one).

  • Analyzing subsets (e.g., specific time periods, categories).
  • Cleaning data (removing duplicates, outliers).

41 of 60

1. Filtering Columns: selecting columns of a Pandas DataFrame

Keep only the columns you need, making it easier to work with relevant subsets of data.

# Sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NY', 'LA', 'Chicago']
})

# Method 1: Select specific columns
df_cols = df[['name', 'age']]  # Keeps only 'name' and 'age'

42 of 60

2. Filtering Rows: How to Filter Rows Based on Condition(s)

Filter rows where a column meets a condition:

E.g.: Filter rows where age is greater than 25:

dataframe[boolean condition]

df[df['age'] > 25]

Known as “Boolean Indexing”: use a boolean condition to filter data, e.g. filtered_df = df[df['Value'] > 20]

Q: What about multiple conditions?
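
A: combine conditions with & (and) and | (or), wrapping each condition in parentheses. A sketch using the sample DataFrame from the previous slide:

df[(df['age'] > 25) & (df['city'] == 'Chicago')]   # both conditions must hold
df[(df['age'] > 25) | (df['city'] == 'NY')]        # either condition may hold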

43 of 60

Count Unique Values

Basic Usage: Count occurrences of each unique value in a column:

Returns the frequency of each unique value in a Series, sorted by default in descending order.

Use value_counts() after filtering when you want to see/count how many times each unique value appears in a Series (a single column).

You can also count unique values in a filtered subset.
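
A short sketch, assuming a 'city' column as in the earlier sample DataFrame:

df['city'].value_counts()                   # frequency of each unique city
df[df['age'] > 25]['city'].value_counts()   # counts within a filtered subset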

44 of 60

Key Aggregation Methods Supported by Pandas

E.g: Get Statistics

Method          Action                                        Works on
.sum()          Sum of values                                 DataFrame/Series
.mean()         Mean (average)                                DataFrame/Series
.count()        Count of non-NA values                        DataFrame/Series
.size()         Count of all rows (including NA)              GroupBy only
.min()/.max()   Minimum / Maximum                             DataFrame/Series
.std()/.var()   Standard deviation / Variance                 DataFrame/Series
.describe()     Summary statistics (count, mean, std, etc.)   DataFrame/Series

45 of 60

Applying Aggregation Functions Directly to a DataFrame

We can use different key statistical methods directly on a DataFrame or Series.

(Next Topic)

46 of 60

More Advanced: Filtering Data & Apply Statistical Functions

We can filter rows in a DataFrame based on a condition and then apply statistical functions to a specific column:


df[df['condition']][column_name].statistics_function()

  • df[df['condition']] filters the DataFrame to only the rows that satisfy the condition inside the brackets.
  • [column_name] selects the specific column from the filtered DataFrame, producing a Series containing only the values from that column.
  • .statistics_function() applies a statistical function to that Series to compute a summary statistic.

47 of 60

More Advanced: Filtering Data & Apply Statistical Functions

We can filter rows in a DataFrame based on a condition and then apply statistical functions to a specific column:


df[df['condition']][column_name].statistics_function()

Try: Calculate the mean of the 'Number' column where the 'Age' is greater than 25
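
A sketch of that exercise, assuming the DataFrame has 'Age' and 'Number' columns:

df[df['Age'] > 25]['Number'].mean()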

48 of 60

Grouping Data (Groupby)

Purpose: The groupby() operation splits data into groups based on a column/criteria, applies a function (like sum, mean, etc.) to each group, and combines the results.

  • This process is commonly referred to as "split-apply-combine."


df.groupby('condition')[target_column_name].statistics_function()

  • Groups the DataFrame based on the values in the 'condition' column, splitting it into one group per unique value.
  • Selects a specific column from each group, producing a Series containing that column's values within each group.
  • Applies a statistical function to each group of values in the selected column.

[Diagram: splitting on the grouping column → apply → combine]

49 of 60

Common Usage:

  • # Group by a single column

grouped = df.groupby('column_name')

  • # Group by multiple columns

grouped = df.groupby(['col1', 'col2'])

df.groupby('condition')[target_column_name].statistics_function()

Example:

  • # Group by and apply aggregation

df.groupby('category')['value'].mean()

  • df.groupby('condition') - Groups the DataFrame by the column 'condition'
  • [column_name] - Selects a specific column ('value' in the example above) from each group. [OPTIONAL]
  • .statistics_function() - Applies an aggregation function (like mean(), sum(), count(), etc.)


50 of 60

Example: Grouping Data (Groupby)


1. Group by “order”

2. Apply “sum” to each group

3. Combine the results

51 of 60

Group By - More Examples (1)

  • Group the rows of the DataFrame into two categories based on whether the Values column is greater than or equal to 15 or not, and calculate the sum of the Values for each group (see the sketch below).
  • Output:
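
A sketch of one way to do this: group on the boolean condition itself (assumes a 'Values' column):

df.groupby(df['Values'] >= 15)['Values'].sum()
# False -> sum of rows with Values < 15, True -> sum of rows with Values >= 15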

52 of 60

Group By - More Complicated Examples (2)

  • Find the maximum Values for each Category, but only include rows where Values are less than 25.
  • Step 1: Filter rows where `Values` < 25

  • Step 2: Group by `Category` and calculate the maximum `Values`
  • Output:
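
A sketch of those two steps, assuming 'Category' and 'Values' columns:

filtered = df[df['Values'] < 25]               # Step 1: keep rows with Values < 25
filtered.groupby('Category')['Values'].max()   # Step 2: max Values per Category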

53 of 60

Pandas Label/Position-based indexing and slicing: loc, iloc


https://pandas.pydata.org/docs/user_guide/indexing.html

loc: Used for label-based indexing.

Select rows and columns by label (the actual name of the row or column).

  • Allows you to access a group of rows and columns by labels or a boolean array.
  • Example: df.loc[row_label, column_label]:
    • df.loc[1, 'Age'] # Selects the value in row 1 and column 'Age'

iloc: Used for integer-location based indexing.

Select rows and columns by position (using row/column numbers, starting from 0).

  • Allows you to access rows and columns by their integer position (i.e., index positions).
  • Example: df.iloc[row_index, column_index]
    • df.iloc[1, 1] # Selects the value in the second row, second column

54 of 60

Pandas Label/Position-based indexing and slicing: loc, iloc


# Access rows with labels from index 2 to 4 and columns 'A' and 'B'

df.loc[2:4, ['A', 'B']]

# Access rows from index 2 to 4 and columns at positions 0 and 1

df.iloc[2:5, [0, 1]]

https://pandas.pydata.org/docs/user_guide/indexing.html

55 of 60

Example: loc, iloc


https://pandas.pydata.org/docs/user_guide/indexing.html
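
A minimal sketch contrasting the two (the data and index labels here are illustrative):

import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [1.5, 2.5, 3.5]}, index=[2, 3, 4])

df.loc[2, 'A']       # label-based: row labeled 2, column 'A' -> 10
df.loc[2:4, ['A']]   # label slicing includes both endpoints (labels 2, 3, 4)
df.iloc[0, 1]        # position-based: first row, second column -> 1.5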

56 of 60

When in doubt, visualize

You can visualize what we just did by plotting the dataframes.

  • Install matplotlib

conda install matplotlib
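
A quick sketch, assuming df holds the numeric columns used earlier in the lecture:

import matplotlib.pyplot as plt

df['age'].plot(kind='hist')                       # distribution of one column
df.plot(x='wgt_kg', y='hgt_cm', kind='scatter')   # relationship between two columns
plt.show()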


57 of 60

Pandas: Advantages

  • A powerful python library for data tasks like analysis and cleaning.
  • Simplifies data representation for better understanding.
  • Cleans messy datasets for readability and relevance.
  • Increases productivity by minimizing boilerplate code.
  • Offers extensive features for easy data analysis.


58 of 60

References


59 of 60

More References

CheatSheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf


60 of 60

The End