1 of 60

Pandas

Lecture 03


CMSC 320

 

INTRODUCTION TO DATA SCIENCE

– FARDINA FATHMIUL ALAM

           (fardina@umd.edu)

“A Python Library for Manipulating Tabular Data”

2 of 60

Importance

  • Python: It's a user-friendly language for data analysis.
  • Git: Helps manage code and data changes when working with a team.
  • Pandas: Simplifies data cleaning and manipulation.
  • Databases: Needed for storing and retrieving data efficiently.


These skills are essential for effective data analysis, collaboration, and handling data in real-world projects.

3 of 60

Topics to Cover

  • What is Pandas
  • How to install and import Pandas
  • Basic Pandas Workflow
  • Some Basic Operations using Pandas

4 of 60

Basic Concept Review: Table / Tabular Data


In tabular data, information is organized into rows and columns.

5 of 60

Tabular data is crucial for both Pandas and SQL

It provides a structured and organized way to represent, manipulate, and visualize data.

  • Both pandas and SQL are commonly used for working with tabular data, making it essential for tasks such as data analysis, manipulation, and visualization.


6 of 60

There are some operations common to both SQL and Pandas:

select, join, aggregates, etc.


7 of 60

1. SELECT/SLICING

Select only some of the rows, or some of the columns, or a combination

ID  age   wgt_kg  hgt_cm
1   12.2  42.3    145.1
2   11.0  40.8    143.8
3   15.6  65.3    165.3
4   35.1  84.2    185.8

Only columns ID and age:

ID  age
1   12.2
2   11.0
3   15.6
4   35.1

Only rows with wgt > 41:

ID  age   wgt_kg  hgt_cm
1   12.2  42.3    145.1
3   15.6  65.3    165.3
4   35.1  84.2    185.8

Both (columns ID and age, rows with wgt > 41):

ID  age
1   12.2
3   15.6
4   35.1
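
As a sketch, this is roughly how those selections look in pandas, assuming the table above is loaded in a DataFrame df with these column names:

df[['ID', 'age']]                      # only columns ID and age
df[df['wgt_kg'] > 41]                  # only rows with wgt > 41
df[df['wgt_kg'] > 41][['ID', 'age']]   # both combined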

8 of 60

2. AGGREGATE/REDUCE

Combine values across a column into a single value

ID  age   wgt_kg  hgt_cm
1   12.2  42.3    145.1
2   11.0  40.8    143.8
3   15.6  65.3    165.3
4   35.1  84.2    185.8

SUM:  age = 73.9,  wgt_kg = 232.6,  hgt_cm = 640.0
MAX:  age = 35.1,  wgt_kg = 84.2,   hgt_cm = 185.8

SUM(wgt_kg^2 - hgt_cm) = 14167.66

What about ID/Index column?

Usually not meaningful to aggregate across it

May need to explicitly add an ID column
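
A rough pandas equivalent of the aggregates above, assuming the table is in a DataFrame df:

df[['age', 'wgt_kg', 'hgt_cm']].sum()      # 73.9, 232.6, 640.0
df[['age', 'wgt_kg', 'hgt_cm']].max()      # 35.1, 84.2, 185.8
(df['wgt_kg'] ** 2 - df['hgt_cm']).sum()   # SUM(wgt_kg^2 - hgt_cm) = 14167.66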

9 of 60

3. MAP

Apply a function to every row, possibly creating more or fewer columns

ID  Address
1   College Park, MD, 20742
2   Washington, DC, 20001
3   Silver Spring, MD, 20901

Variations that allow one row to generate multiple rows in the output (sometimes called “flatmap”)

ID  City           State  Zipcode
1   College Park   MD     20742
2   Washington     DC     20001
3   Silver Spring  MD     20901
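
In pandas, a map like this can be sketched with a vectorized string split, assuming the addresses sit in a DataFrame df with an 'Address' column:

# split 'Address' into three new columns; expand=True returns one column per piece
df[['City', 'State', 'Zipcode']] = df['Address'].str.split(', ', expand=True)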

10 of 60

4. GROUP BY

Group tuples together by column/dimension

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

By ‘A’:

A = foo:
ID  B  C
1   3  6.6
3   4  3.1
4   3  8.0
7   4  2.3
8   3  8.0

A = bar:
ID  B  C
2   2  4.7
5   1  1.2
6   2  2.5
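
A minimal pandas sketch of this grouping (the DataFrame below mirrors the table above):

import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4, 5, 6, 7, 8],
                   'A': ['foo', 'bar', 'foo', 'foo', 'bar', 'bar', 'foo', 'foo'],
                   'B': [3, 2, 4, 3, 1, 2, 4, 3],
                   'C': [6.6, 4.7, 3.1, 8.0, 1.2, 2.5, 2.3, 8.0]})

# iterating over a groupby yields one (key, sub-table) pair per distinct value of A
for key, group in df.groupby('A'):
    print(key)
    print(group[['ID', 'B', 'C']])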

11 of 60

4. GROUP BY

Group tuples together by column/dimension

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

By ‘B’:

B = 1:
ID  A    C
5   bar  1.2

B = 2:
ID  A    C
2   bar  4.7
6   bar  2.5

B = 3:
ID  A    C
1   foo  6.6
4   foo  8.0
8   foo  8.0

B = 4:
ID  A    C
3   foo  3.1
7   foo  2.3

12 of 60

4. GROUP BY

Group tuples together by column/dimension

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

By ‘A’, ‘B’:

A = bar, B = 1:
ID  C
5   1.2

A = bar, B = 2:
ID  C
2   4.7
6   2.5

A = foo, B = 3:
ID  C
1   6.6
4   8.0
8   8.0

A = foo, B = 4:
ID  C
3   3.1
7   2.3
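
Grouping by multiple columns works the same way; each group key becomes a tuple. A sketch using the same df as in the earlier grouping example:

for key, group in df.groupby(['A', 'B']):
    print(key)                 # e.g. ('bar', 1), ('bar', 2), ('foo', 3), ('foo', 4)
    print(group[['ID', 'C']])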

13 of 60

5. GROUP BY AGGREGATE

Compute one aggregate per group

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

Group by ‘B’, then Sum on C within each group:

B = 1 (row 5):         Sum(C) = 1.2
B = 2 (rows 2, 6):     Sum(C) = 4.7 + 2.5 = 7.2
B = 3 (rows 1, 4, 8):  Sum(C) = 6.6 + 8.0 + 8.0 = 22.6
B = 4 (rows 3, 7):     Sum(C) = 3.1 + 2.3 = 5.4

14 of 60

5. GROUP BY AGGREGATE

Final result usually seen as a table

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

Group by ‘B’, Sum on C:

B  SUM(C)
1  1.2
2  7.2
3  22.6
4  5.4
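
With df holding the table above, the whole split-apply-combine pipeline is one line in pandas:

df.groupby('B')['C'].sum()
# B
# 1     1.2
# 2     7.2
# 3    22.6
# 4     5.4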

15 of 60

6. UNION / INTERSECTION / DIFFERENCE

Set operations – only if the two tables have identical attributes/columns

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0

∪

ID  A    B  C
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

=

ID  A    B  C
1   foo  3  6.6
2   bar  2  4.7
3   foo  4  3.1
4   foo  3  8.0
5   bar  1  1.2
6   bar  2  2.5
7   foo  4  2.3
8   foo  3  8.0

Similarly, Intersection (returns the rows that are common to both tables) and Set Difference, EXCEPT/MINUS (returns the rows that are in one table but not in the other), manipulate tables as sets.

The way IDs are treated (whether as unique or non-unique identifiers) can lead to different behaviors in these operations
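
As a sketch, assuming pandas is imported as pd and df1 and df2 are DataFrames with identical columns, these set operations can be expressed like this:

union = pd.concat([df1, df2]).drop_duplicates()           # all rows from both, once each
intersection = pd.merge(df1, df2, how='inner')            # rows present in both
difference = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)  # rows in df1 only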

16 of 60

7. MERGE OR JOIN

Combine rows/tuples across two tables if they have the same key

ID  A    B
1   foo  3
2   bar  2
3   foo  4
4   foo  3

ID  C
1   1.2
2   2.5
3   2.3
5   8.0

Joined on ID:

ID  A    B  C
1   foo  3  1.2
2   bar  2  2.5
3   foo  4  2.3

What about IDs not present in both tables?

Often need to keep them around

Can “pad” with NaN

17 of 60

7. MERGE OR JOIN

Combine rows/tuples across two tables if they have the same key

Outer joins can be used to “pad” IDs that don’t appear in both tables

Three variants: LEFT, RIGHT, FULL

SQL Terminology – pandas has these operations as well

ID  A    B
1   foo  3
2   bar  2
3   foo  4
4   foo  3

ID  C
1   1.2
2   2.5
3   2.3
5   8.0

Full outer join on ID:

ID  A    B    C
1   foo  3    1.2
2   bar  2    2.5
3   foo  4    2.3
4   foo  3    NaN
5   NaN  NaN  8.0
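
A minimal sketch of both joins in pandas, using the two tables above:

import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3, 4],
                     'A': ['foo', 'bar', 'foo', 'foo'],
                     'B': [3, 2, 4, 3]})
right = pd.DataFrame({'ID': [1, 2, 3, 5],
                      'C': [1.2, 2.5, 2.3, 8.0]})

pd.merge(left, right, on='ID')                # inner join: keeps IDs 1, 2, 3
pd.merge(left, right, on='ID', how='outer')   # full outer join: pads IDs 4 and 5 with NaN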

18 of 60

Next

Pandas


Class Colab:

Dataset:

19 of 60

What is Pandas

  • An open-source Python library that provides high-performance, easy-to-use data structures and data analysis tools.
  • It's built on top of NumPy and is particularly useful for working with structured (tabular) data.
  • Key Data Structures
    1. Series
    2. DataFrame

Why Pandas?

  • Easy handling of missing data.
  • Powerful grouping and aggregation functions.
  • Quick data filtering, sorting, and merging.
  • Excellent support for reading/writing data (CSV, Excel, SQL, JSON).

"Pandas": Short for "Python Data Analysis Library," used for data analysis.

Written by Wes McKinney: Started in 2008 to get a high-performance, flexible tool to perform quantitative analysis on financial data.

20 of 60

Pandas First Steps: install and import

  • Pandas is an easy package to install. Open up your terminal program (shell or cmd) and install it using either of the following commands:

  • For jupyter notebook users, you can run this cell:
    • The ! at the beginning runs the command as if it were typed in a terminal.

  • To import pandas we usually import it with a shorter name since it's used so much:


$ conda install pandas

OR

$ pip install pandas

!pip install pandas

import pandas as pd

After installing pandas, it's good practice to check the installed version to ensure the installation succeeded:

print(pd.__version__)

“pd” alias (in Python): an alternate name for referring to the same thing.

21 of 60

Core components of Pandas:

Series & DataFrames


Basic Pandas Workflow……..

22 of 60

Pandas: Data Table Representation

Data in Pandas is stored in two fundamental data types:

  • Series is the data type that stores a single column of data.
  • DataFrame is the data type that stores an entire dataset.


Note: Both behave very similarly to NumPy arrays, so you can use them as input to anything in NumPy, SciPy, or scikit-learn. The main difference from NumPy arrays is indexing.

23 of 60

  1. Series

One-dimensional labeled array (like a column).

  • Similar to NumPy array but with axis labels/index.

Create a Series
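
A minimal sketch of creating a Series, both with the default integer index and with explicit labels:

import pandas as pd

s = pd.Series([12.2, 11.0, 15.6, 35.1])                              # index 0, 1, 2, 3 by default
s = pd.Series([12.2, 11.0, 15.6, 35.1], index=['a', 'b', 'c', 'd'])  # labeled index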

24 of 60

Series

A one-dimensional array of data with an index

  • Built on top of numpy.ndarray (it behaves like one, though it is no longer a subclass in modern pandas)
  • Has two components: an index and a value for each element (a concept similar to a dictionary)
  • Data: any type
  • Index: most commonly integers, starting at 0 (the default).
      • Index labels need not be ordered
      • We almost NEVER access a Series by an index
      • Duplicate index labels are possible but reduce functionality and may produce inaccurate results.


25 of 60

Series Functions

Pandas can do anything with a Series that you can do with an array

    • Sum, count, show unique values, etc.
    • If you have a basic thing you want to do with a series, look it up!
  • apply()
    • apply() takes a function as an argument and returns a Series with that function applied to every element
  • You can also convert a series into a series of booleans by applying a logical condition, which will be useful later (see the sketch below)
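
A short sketch of these Series operations, assuming s is defined as on the earlier slide:

s = pd.Series([12.2, 11.0, 15.6, 35.1])
s.sum()                    # 73.9
s.unique()                 # distinct values as an array
s.apply(lambda x: x * 2)   # new Series with the function applied to every element
s > 12                     # Series of booleans: True where the condition holds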


26 of 60

2. Dataframe

Two-dimensional labeled data structure/table (like a spreadsheet or SQL table).

Most commonly used Pandas object.

Create a DataFrame
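
A minimal sketch of building a DataFrame from a dict of columns (the column names and values here are illustrative):

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35],
                   'City': ['NY', 'LA', 'Chicago']})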

27 of 60

Dataframes

A collection of Series organized as columns.

  • Each column can have a different type (integers, floats, strings, etc.)
  • Mutable size: insert and delete columns
  • Can perform arithmetic operations on rows and columns
  • We rarely access rows via an index (less common).
  • We DO access columns via their names


[Diagram: a DataFrame, with rows running down and columns across]

28 of 60

Dataframe Functions

Important functions:

  • How to access a column
  • How to filter a column
  • Apply (LATER TOPIC)
  • GroupBy


29 of 60

Basic Operations: Reading Data

# Read CSV file

df = pd.read_csv('data.csv')

# Read Excel file

df = pd.read_excel('data.xlsx') [reads both XLS and XLSX, using the file path as input]

# Read Text file

df = pd.read_csv("diabetes.txt", sep="\s") [the sep argument sets the separator; the regex "\s" matches a single whitespace character (use "\s+" for one or more)]

# Read Json file

df = pd.read_json("diabetes.json")

30 of 60

Basic Operations: Outputting data

# Into a CSV file

df.to_csv("diabetes_out.csv", index=False)

For a Series s: s.to_csv("filename.csv")

(Note: avoid df = df.to_csv(...); when given a path, to_csv writes the file and returns None, so the assignment would overwrite df.)

# Into an Excel file

df.to_excel("diabetes_out.xlsx", index=False)

# Into a text file

df.to_csv('diabetes_out.txt', header=df.columns, index=None, sep=' ')

# Into a JSON file

df.to_json("diabetes_out.json")

31 of 60

Example: Loading a DataFrame from files


32 of 60

Reading data from a CSV file

  • With CSV files, all you need is a single line to load in the data:


df = pd.read_csv('dataset.csv')

Load a CSV file into a Pandas DataFrame:

*** If the data contains only one column, passing squeeze=True made pandas return a Series; this argument is deprecated/removed in newer pandas, where you call .squeeze("columns") on the result instead.

33 of 60

Viewing and understanding DataFrames using pandas

After reading tabular data as a DataFrame, you will want a first glimpse of the data. You can either view a small sample of the dataset or a summary of the data in the form of summary statistics.

Example:

Use DataFrame.head() and DataFrame.tail() to view the top (first) and bottom (last) few rows of the frame respectively.

  • Default value is 5, unless specified otherwise.

34 of 60

Basic Operations: Exploring Data

# Display First 5 rows

df.head()

# Display Last 3 rows

df.tail(n=3)

[specify the number of rows through the n argument (the default value is 5)].

# Data types

df.dtypes

# Summary statistics

df.describe()

Prints the summary statistics of all numeric columns: count, mean, standard deviation, range (min/max), and quartiles.

# Shape (rows, columns)

df.shape

# Column names

df.columns

df.info() # Overview of columns and types

Get the number of rows and columns

df.shape[0] # Get the number of rows only

df.shape[1] # Get the number of columns only

35 of 60

Basic Operations: Selecting Data

# Select column

df['Age']

# Select multiple columns

df[['Name', 'City']]

# Select rows by index

df.iloc[0] # First row

df.iloc[1:3] # Rows 1-2

# Select rows by condition

df[df['Age'] > 30]

36 of 60

Example: How to access a column

dataframe['column_name']: Access a column by its name.


E.g.: df['age']

where df is your DataFrame

37 of 60

Arithmetic Operations

Can perform arithmetic operations (e.g., addition, subtraction, multiplication, division) on entire columns or between columns.

E.g.:

df['new_column'] = df['column1'] + df['column2']

df['new_column'] = df['column1'] - df['column2']

E.g.: # Modify a column

df['Age'] = df['Age'] + 1


38 of 60

Applying Functions Directly to the dataframe

You can apply different functions directly to a DataFrame column:

Syntax: dataframe[column_name].function()


39 of 60

Applying Functions Directly to the dataframe

Aggregation operations are used to compute summary statistics or single values from multiple values in a dataset. E.g:


E.g.: df["age"].sum()

Try some other aggregation operation!

40 of 60

Filtering in Pandas

Focus on specific subsets of your data. Two types:

1. Column Filtering (Selecting Columns):

Select specific columns, or filter columns based on names/conditions. Method: basic selection, df[['col1']] or df[['col1', 'col2']].

  • Preparing data for modeling (selecting features).
  • Reducing memory usage by dropping unused columns.

2. Row Filtering (Selecting Rows):

Filter rows based on conditions applied to column values. Method: boolean indexing (the most common one).

  • Analyzing subsets (e.g., specific time periods, categories).
  • Cleaning data (removing duplicates, outliers).

41 of 60

1. Filtering Columns: selecting columns of a Pandas DataFrame

Keep only the columns you need, making it easier to work with relevant subsets of data.

# Sample DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['NY', 'LA', 'Chicago']
})

# Method 1: Select specific columns
df_cols = df[['name', 'age']]  # Keeps only 'name' and 'age'

42 of 60

2. Filtering Rows: How to Filter Rows Based on Condition(s)

Filter rows where a column meets a condition:

E.g.: Filter rows where age is greater than 25:

dataframe[boolean condition]

df[df['age'] > 25]

Known as “Boolean Indexing”: use a boolean condition to filter data, e.g. filtered_df = df[df['Value'] > 20]

Q: What about multiple conditions?
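
A: combine conditions with & (and) and | (or), wrapping each condition in parentheses. A sketch using the sample DataFrame from the previous slide:

df[(df['age'] > 25) & (df['city'] == 'Chicago')]   # both conditions must hold
df[(df['age'] > 25) | (df['city'] == 'NY')]        # either condition may hold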

43 of 60

Count Unique Values

Basic Usage: Count occurrences of each unique value in a column:

Returns the frequency of each unique value in a Series, sorted by default in descending order.

Use value_counts() after filtering when you want to see/count how many times each unique value appears in a Series (a single column).

You can also count unique values in a filtered subset.
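
A short sketch, assuming a 'city' column as in the earlier sample DataFrame:

df['city'].value_counts()                   # frequency of each unique city
df[df['age'] > 25]['city'].value_counts()   # counts within a filtered subset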

44 of 60

Key Aggregation Methods Supported by Pandas

E.g: Get Statistics

Method          Action                                        Works on
.sum()          Sum of values                                 DataFrame/Series
.mean()         Mean (average)                                DataFrame/Series
.count()        Count of non-NA values                        DataFrame/Series
.size()         Count of all rows (including NA)              GroupBy only
.min()/.max()   Minimum / Maximum                             DataFrame/Series
.std()/.var()   Standard deviation / Variance                 DataFrame/Series
.describe()     Summary statistics (count, mean, std, etc.)   DataFrame/Series

45 of 60

Applying Aggregation Functions Directly to a DataFrame

We can use different key statistical methods directly on a DataFrame or Series.

(Next Topic)

46 of 60

More Advanced: Filtering Data & Apply Statistical Functions

We can filter rows in a DataFrame based on a condition and then apply statistical functions to a specific column:


df[df['condition']][column_name].statistics_function()

  • df[df['condition']] filters the DataFrame to only the rows that satisfy the condition inside the brackets.
  • [column_name] selects the specific column from the filtered DataFrame, producing a Series containing only the values from that column.
  • .statistics_function() applies a statistical function to that Series to compute a summary statistic.

47 of 60

More Advanced: Filtering Data & Apply Statistical Functions

We can filter rows in a DataFrame based on a condition and then apply statistical functions to a specific column:


df[df['condition']][column_name].statistics_function()

Try: Calculate the mean of the 'Number' column where the 'Age' is greater than 25
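
A sketch of that exercise, assuming the DataFrame has 'Age' and 'Number' columns:

df[df['Age'] > 25]['Number'].mean()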

48 of 60

Grouping Data (Groupby)

Purpose: The groupby() operation splits data into groups based on a column/criteria, applies a function (like sum, mean, etc.) to each group, and combines the results.

  • This process is commonly referred to as "split-apply-combine."


df.groupby('condition')[target_column_name].statistics_function()

  • Groups the DataFrame based on the values in the 'condition' column, splitting it into one group per unique value.
  • Selects a specific column from each group, producing a Series containing that column's values within each group.
  • Applies a statistical function to each group of values in the selected column.

[Diagram: splitting on the grouping column → apply → combine]

49 of 60

Common Usage:

  • # Group by a single column

grouped = df.groupby('column_name')

  • # Group by multiple columns

grouped = df.groupby(['col1', 'col2'])

df.groupby('condition')[target_column_name].statistics_function()

Example:

  • # Group by and apply aggregation

df.groupby('category')['value'].mean()

  • df.groupby('condition') - Groups the DataFrame by the column 'condition'
  • [column_name] - Selects a specific column ('value' in the example above) from each group. [OPTIONAL]
  • .statistics_function() - Applies an aggregation function (like mean(), sum(), count(), etc.)


50 of 60

Example: Grouping Data (Groupby)


1. Group by “order”

2. Apply “sum” to each group

3. Combine the results

51 of 60

Group By - More Examples (1)

  • Group the rows of the DataFrame into two categories based on whether the Values column is greater than or equal to 15 or not, and calculate the sum of the Values for each group (see the sketch below).
  • Output:
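
A sketch of one way to do this: group on the boolean condition itself (assumes a 'Values' column):

df.groupby(df['Values'] >= 15)['Values'].sum()
# False -> sum of rows with Values < 15, True -> sum of rows with Values >= 15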

52 of 60

Group By - More Complicated Examples (2)

  • Find the maximum Values for each Category, but only include rows where Values are less than 25.
  • Step 1: Filter rows where `Values` < 25

  • Step 2: Group by `Category` and calculate the maximum `Values`
  • Output:
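
A sketch of those two steps, assuming 'Category' and 'Values' columns:

filtered = df[df['Values'] < 25]               # Step 1: keep rows with Values < 25
filtered.groupby('Category')['Values'].max()   # Step 2: max Values per Category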

53 of 60

Pandas Label/Position-based indexing and slicing: loc, iloc


https://pandas.pydata.org/docs/user_guide/indexing.html

loc: Used for label-based indexing.

Select rows and columns by label (the actual name of the row or column).

  • Allows you to access a group of rows and columns by labels or a boolean array.
  • Example: df.loc[row_label, column_label]:
    • df.loc[1, 'Age'] # Selects the value in row 1 and column 'Age'

iloc: Used for integer-location based indexing.

Select rows and columns by position (using row/column numbers, starting from 0).

  • Allows you to access rows and columns by their integer position (i.e., index positions).
  • Example: df.iloc[row_index, column_index]
    • df.iloc[1, 1] # Selects the value in the second row, second column

54 of 60

Pandas Label/Position-based indexing and slicing: loc, iloc


# Access rows with labels from index 2 to 4 and columns 'A' and 'B'

df.loc[2:4, ['A', 'B']]

# Access rows from index 2 to 4 and columns at positions 0 and 1

df.iloc[2:5, [0, 1]]

https://pandas.pydata.org/docs/user_guide/indexing.html

55 of 60

Example: loc, iloc


https://pandas.pydata.org/docs/user_guide/indexing.html
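
A minimal sketch contrasting the two (the data and index labels here are illustrative):

import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30], 'B': [1.5, 2.5, 3.5]}, index=[2, 3, 4])

df.loc[2, 'A']       # label-based: row labeled 2, column 'A' -> 10
df.loc[2:4, ['A']]   # label slicing includes both endpoints (labels 2, 3, 4)
df.iloc[0, 1]        # position-based: first row, second column -> 1.5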

56 of 60

When in doubt, visualize

You can visualize what we just did by plotting the dataframes.

  • Install matplotlib

conda install matplotlib
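
A quick sketch, assuming df holds the numeric columns used earlier in the lecture:

import matplotlib.pyplot as plt

df['age'].plot(kind='hist')                       # distribution of one column
df.plot(x='wgt_kg', y='hgt_cm', kind='scatter')   # relationship between two columns
plt.show()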


57 of 60

Pandas: Advantages

  • A powerful python library for data tasks like analysis and cleaning.
  • Simplifies data representation for better understanding.
  • Cleans messy datasets for readability and relevance.
  • Increases productivity by minimizing boilerplate code.
  • Offers extensive features for easy data analysis.


58 of 60

References


59 of 60

More References

CheatSheet: https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PandasPythonForDataScience.pdf


60 of 60

The End