Exam Point of View Preparation

Series – 8 Marks, DataFrames – 27 Marks

Series

Important Topics

  • Introduction to Pandas
  • What is Series? Give one example.
  • Differences between Series and DataFrame
  • Series Attributes – Theory and Outputs
  • Series Slicing
  • Learn Series Operations

DataFrames

Important Topics

  • What is DataFrame? Give one example.
  • DataFrame Properties (Rows, Columns, Indexes, Default Values, etc.)
  • Creation of DataFrames using various methods
  • DataFrames Output
  • DataFrame Attributes – Theory and Outputs

DataFrames Operations

loc, iloc, at, iat functions

Accessing(Displaying) /Modifying/Adding – Single Column

Accessing(Displaying) /Modifying/Adding – Single Row

Accessing(Displaying) /Modifying/Adding – Multiple Columns

Accessing(Displaying) /Modifying/Adding – Multiple Rows

Accessing(Displaying) /Modifying/Adding – Single/Multiple Columns & Single/Multiple Rows

Accessing(Displaying) /Modifying/Adding – Single value

(“print” means the statement has to be written inside a print( ) statement)

Display first rows (head), last rows (tail)

Rename column heading, Rename row label

Remove a row, remove a column

Binary Operations:

addition, subtraction, multiplication, division

ITERATION – iterrows( ) & iteritems( ) (iteritems( ) is named items( ) in newer versions of Pandas)

Iterating over a DataFrame

SERIES

Pandas or Python Pandas is Python’s library for data analysis. Pandas has derived its name from “panel data system”, which is an econometrics term for multi-dimensional, structured data sets. Pandas has become a popular choice for data analysis.

Data analysis refers to the process of evaluating big data sets using analytical and statistical tools so as to discover useful information and conclusions to support business decision-making. The main author of Pandas is Wes McKinney.

Using Pandas

Pandas is an open source, BSD-licensed library built for the Python programming language.

Pandas offers high-performance, easy-to-use data structures and data analysis tools.

We need to import pandas:

import pandas (or) import pandas as <identifier>

Ex: import pandas as pd

If we use numpy arrays, import numpy as np

Note: Pandas uses NumPy as its support library and hence many datatypes, constants and functions of NumPy are frequently used with Pandas.

Wrongly Given Import Statements:

Import Pandas (I and P are in capitals; Python is case-sensitive)

import pandas as S.No (. or any other special symbol is not allowed)

import pandas as if (keywords should not be given)

import pandas as 7rno (should not start with a digit)

(The identifier can contain alphabets, digits and underscore only. It should not contain any space or special symbol, should not be a keyword, and should not start with a digit.)
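The identifier rules above can be checked with Python itself. The following is only a minimal sketch: the helper name is_valid_alias is hypothetical and introduced for illustration. Note that str.isidentifier( ) also accepts non-English letters, so it is slightly more permissive than the rule stated above.

```python
import keyword

def is_valid_alias(name):
    """Return True if `name` could be used after `as` in an import statement."""
    # isidentifier() rejects spaces, special symbols and a leading digit;
    # iskeyword() rejects reserved words such as `if`.
    return name.isidentifier() and not keyword.iskeyword(name)

candidates = ["pd", "my_pd2", "S.No", "if", "7rno"]
results = {name: is_valid_alias(name) for name in candidates}

for name, ok in results.items():
    print("import pandas as", name, "->", "valid" if ok else "invalid")
```

Only pd and my_pd2 pass both checks; the other three fail for the reasons listed in the slide.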

PANDAS INTRODUCTION

Why Pandas?

  • Pandas is the most popular library in the scientific python ecosystem for doing data analysis.
  • It can work with many different data types (integer, float, etc.) and can read or write data in many different file formats.
  • It can calculate in all the possible ways data is organized i.e., across rows and down columns.
  • It can easily select subsets of data from bulky data sets and even combine multiple datasets together. It has functionality to find and fill missing data.
  • It allows you to apply operations to independent groups within the data.
  • It supports reshaping of data into different forms.
  • It supports advanced time-series functionality (time series forecasting is the use of a model to predict future values based on previously observed values)
  • It supports visualization by integrating libraries such as matplotlib and seaborn.

Pandas is the best at handling huge tabular data sets comprising different data formats.

Pandas Data Structure:

Data structures refer to specialized ways of storing data so as to apply a specific type of functionality on them.

We can think of Pandas data structures as enhanced versions of NumPy structured arrays in which the rows and columns can be identified and accessed with labels rather than simple integer indices.

Out of many data structures of Pandas, two basic data structures – Series and DataFrame – are universally popular for their dependability.

(Older versions of Pandas also supported a Panel data structure, but it is not in the syllabus)

PANDAS DATA STRUCTURES

| Property | Series | DataFrame |
|---|---|---|
| Dimensions | 1-Dimensional | 2-Dimensional |
| Type of data | Homogeneous, i.e., all the elements must be of the same type in a Series object | Heterogeneous, i.e., a DataFrame object can have elements of different data types |
| Value mutability | Value-mutable, i.e., the values of its elements can change | Value-mutable, i.e., the values of its elements can change |
| Size mutability | Size-immutable, i.e., the size of a Series object, once created, cannot change. If we want to add/drop an element, internally a new Series object will be created | Size-mutable, i.e., the size of a DataFrame object, once created, can change in place. That is, you can add/drop elements in an existing DataFrame object |
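A small sketch of two of the differences listed above. The names and values used here are made up for illustration only.

```python
import pandas as pd

# A Series stores all its elements with one common dtype:
# the integers 1 and 3 are stored as floats alongside 2.5.
s = pd.Series([1, 2.5, 3])
print(s.dtype)                      # float64

# A DataFrame can hold a different dtype in each column.
df = pd.DataFrame({"RNo": [1, 2], "SName": ["Asha", "Ravi"]})
print(df.dtypes)

# Adding a column changes the size of the existing DataFrame object.
df["Marks"] = [55, 62]
print(df.shape)                     # (2, 3)
```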

Examples of Series & DataFrame

[Figure: a Series with Index 1, 2, 3, 4 and Data 'A', 'B', 'C', 'D', shown alongside a DataFrame]

A Series is a Pandas data structure that represents a one dimensional array of indexed data.

It represents a 1-D array-like object containing an array of data of any NumPy data type and an associated array of data labels, called its index.

A Series type object has two main components:

* An array of actual data

* An associated array of indexes or data labels.

Both components are one-dimensional arrays with the same length. The index is used to access individual values.

Ex:
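A Series with index 1, 2, 3, 4 and data 'A', 'B', 'C', 'D' could be created as sketched below. This is an illustrative example, assuming pandas is imported as pd.

```python
import pandas as pd

# Data array and index array must have the same length.
s = pd.Series(["A", "B", "C", "D"], index=[1, 2, 3, 4])

print(s)
print(s.index)     # the array of data labels
print(s.values)    # the array of actual data
print(s[3])        # the value stored at index label 3, i.e. 'C'
```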

SERIES

SERIES ATTRIBUTES

Series Object Attributes: When we create a Series type object, all information related to it is available through attributes.

Syntax: <Series object>.<attribute name>

| Attribute | Description |
|---|---|
| index | The index (axis labels) of the Series |
| index.name | Name of the index. Can be used to assign a new name to the index. |
| values | Return Series as ndarray or ndarray-like (data) depending on the dtype |
| dtype | Return the dtype object of the underlying data (datatype) |
| shape | Return a tuple of the shape of the underlying data |
| nbytes | Return the number of bytes in the underlying data |
| ndim | Return the number of dimensions of the underlying data |
| size | Return the number of elements in the underlying data |
| itemsize | Return the size of the dtype of the item of the underlying data. Note: In newer versions, itemsize has been removed. |
| hasnans | Return True if there are any NaN values; otherwise return False |
| empty | Return True if the Series object is empty, False otherwise. |
| name | Return or assign name to Series object |

Series Attributes

Consider the following Series Object:

>>> Marks=[34,33,np.nan,38,40]

>>> Exams=["CT1","CT2","CT3","CT4","CT5"]

>>> S=pd.Series(Marks,index=Exams)

>>> S

CT1 34.0

CT2 33.0

CT3 NaN

CT4 38.0

CT5 40.0

dtype: float64

 (i) index :

>>> S.index

Index(['CT1', 'CT2', 'CT3', 'CT4', 'CT5'], dtype='object')

(ii) index.name:

>>> S.index.name="Test"

>>>S.index.name

'Test'

>>> S.index

Index(['CT1', 'CT2', 'CT3', 'CT4', 'CT5'], dtype='object', name='Test')

 

(iii) values:

>>> S.values

array([34., 33., nan, 38., 40.])

 

(iv) dtype:

>>> S.dtype

dtype('float64')

>>> S

CT1 34.0

CT2 33.0

CT3 NaN

CT4 38.0

CT5 40.0

dtype: float64

(v) shape:

>>> S.shape

(5,)

 

(vi) nbytes:

>>> S.nbytes #5 elements X 8 bytes for float64

40

(vii) ndim:

>>> S.ndim # Series is One Dimensional

1

(viii) size:

>>> S.size # 5 elements

5

(ix) itemsize:

In new versions, this property was removed.

AttributeError: 'Series' object has no attribute 'itemsize'
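Where itemsize is not available on the Series, the same information can usually be read from its dtype. A minimal sketch, assuming the Series S created above (a float64 Series backed by NumPy):

```python
import numpy as np
import pandas as pd

S = pd.Series([34, 33, np.nan, 38, 40],
              index=["CT1", "CT2", "CT3", "CT4", "CT5"])

# The NumPy dtype object still exposes the per-item size in bytes.
item_bytes = S.dtype.itemsize
print(item_bytes)                        # 8 for float64

# nbytes is the number of elements multiplied by the item size.
print(S.size * item_bytes, S.nbytes)     # 40 40
```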

(x) hasnans:

>>> S.hasnans

True

(xi) empty:

>>> S.empty

False

(xii) name

>>> S.name="MySeries"

>>>S.name

'MySeries'

>>>S

CT1 34.0

CT2 33.0

CT3 NaN

CT4 38.0

CT5 40.0

Name: MySeries, dtype: float64

Other example related to index:

>>> S3=pd.Series(data=np.arange(5,25,4))

>>> S3.index

RangeIndex(start=0, stop=5, step=1)

>>> a=np.arange(9,13)

>>> S4=pd.Series(index=a,data=a*2)

>>> S4.index

Index([9, 10, 11, 12], dtype='int32')

 

Some functions

| Function | Use |
|---|---|
| len( ) | To get the total number of elements (including NaN values) |
| count( ) | To get the count of non-NaN values in a Series object |
| type( ) | To know the data type of an object |

>>> len(S)

5

>>> S.count()

4

>>> type(S)

<class 'pandas.core.series.Series'>

>>>S4

9 18

10 20

11 22

12 24

dtype: int32

Note: In the same statement, we can work with two or more attributes, either of the same Series or of different Series.

>>>S1=pd.Series([10,20,30])

>>>S1

0 10

1 20

2 30

dtype: int64

>>>S2=pd.Series(['a','e','i','o','u'],index=[100,200,

300,400,500])

>>>S2

100 a

200 e

300 i

400 o

500 u

dtype: object

>>>S1.shape,S2.shape

((3,), (5,))

>>>print(S1.shape,S2.shape)

(3,) (5,)

>>>S1.ndim,S2.nbytes

(1, 40)

>>>print(S1.ndim,S2.nbytes)

1 40

Student’s Task

>>>Veg=pd.Series(['Onion','Carrot','Beetroot',

'Potato'],[30,70,50,20])

>>> Veg

30 Onion

70 Carrot

50 Beetroot

20 Potato

dtype: object

For the above Series “Veg”, work with all the 12 attributes and 3 functions. Also write the outputs.

SLICING

We can access the indexes and the data of a Series separately, and we can also access individual elements and slices.

Let us take some example Series Objects.

>>>S1=pd.Series(data=[5,6,7,8,9,10,11,12],

index=['May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])

>>> S2=pd.Series(data=[75,72,89],

index=['Raj','Kamal','Nani'])

>>> S3=pd.Series([87,99,52],index=[11,12,13])

Accessing a Series Object and its Elements (Series Slices)

(a) Accessing Individual Elements: With index value or with its position.

Syntax:<Series Object name>[<valid index>]

>>> S1['Jul']

7

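Note: (1) If the Series object has duplicate indexes, then giving an index with the Series object will return all the entries with that index.

(2) If the indexes are of string type, a position number also works inside the square brackets. If the indexes are integers, the number is treated as an index label, so a number that is not an index label gives a KeyError. (Newer versions of Pandas show a warning when position numbers are used this way and recommend <Series Object>.iloc[<position>] instead.)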

>>> S1[2]

7

>>> S3[11]

87

>>> S3[0]

KeyError

Note: >>> S1[0]=13 will change the value of “May” to 13 instead of 5.
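To make it explicit whether a label or a position is meant, the loc and iloc indexers can also be used on a Series. A minimal sketch using the S1 and S3 objects defined above:

```python
import pandas as pd

S1 = pd.Series(data=[5, 6, 7, 8, 9, 10, 11, 12],
               index=['May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
S3 = pd.Series([87, 99, 52], index=[11, 12, 13])

jul_by_label = S1.loc['Jul']      # access by index label
jul_by_position = S1.iloc[2]      # access by position (0, 1, 2, ...)
print(jul_by_label, jul_by_position)        # 7 7

first_by_label = S3.loc[11]       # 11 is an index label here
first_by_position = S3.iloc[0]    # position 0 works even with integer labels
print(first_by_label, first_by_position)    # 87 87
```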

(b) Extracting Slices from Series Object:

Slicing with numbers takes place position-wise and not index-wise in a Series object.

Ex: S=pd.Series(data=[21,22,23,24,25,26,27,28,29,30,31,32],

index=['Jan','Feb','Mar','Apr','May',

'Jun','Jul','Aug','Sep','Oct','Nov','Dec'])

All individual elements have position numbers starting from 0 onwards.

Syntax: <object>[start:end:step]

(the end value is excluded)

A slice of a Series object is also a Pandas Series type object.
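The sketch below compares a position-based slice with a label-based slice on the Series S defined above. It uses the iloc and loc indexers to make the difference explicit. With labels, the end label is included.

```python
import pandas as pd

months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
S = pd.Series(data=range(21, 33), index=months)

by_position = S.iloc[1:5]          # positions 1, 2, 3, 4 (end excluded)
by_label = S.loc['Feb':'May']      # labels Feb to May (end included)

print(type(by_position))               # a slice is again a Series
print(list(by_position.index))         # ['Feb', 'Mar', 'Apr', 'May']
print(by_position.equals(by_label))    # True
```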

>>>S[1:5] #position wise, not index wise

Feb 22

Mar 23

Apr 24

May 25

dtype: int64 

>>>S[10:5]

Series([], dtype: int64) 

>>>S[15:20]

Series([], dtype: int64) 

>>>S[2:10:2]

Mar 23

May 25

Jul 27

Sep 29

dtype: int64

>>>S[7:2:-1]

Aug 28

Jul 27

Jun 26

May 25

Apr 24

dtype: int64

 

>>>S[10:1:-2]

Nov 31

Sep 29

Jul 27

May 25

Mar 23

dtype: int64

>>>S[:]

Jan 21

Feb 22

Mar 23

Apr 24

May 25

Jun 26

Jul 27

Aug 28

Sep 29

Oct 30

Nov 31

Dec 32

dtype: int64

 

>>>S[:5]

Jan 21

Feb 22

Mar 23

Apr 24

May 25

dtype: int64

>>>S[ :8:2]

Jan 21

Mar 23

May 25

Jul 27

dtype: int64

>>>S[ : :1]

Jan 21

Feb 22

Mar 23

Apr 24

May 25

Jun 26

Jul 27

Aug 28

Sep 29

Oct 30

Nov 31

Dec 32

dtype: int64

>>>S[ : :3]

Jan 21

Apr 24

Jul 27

Oct 30

dtype: int64

>>>S[::-2]

Dec 32

Oct 30

Aug 28

Jun 26

Apr 24

Feb 22

dtype: int64

>>>S[ : :-1] #slice with values reversed

Dec 32

Nov 31

Oct 30

Sep 29

Aug 28

Jul 27

Jun 26

May 25

Apr 24

Mar 23

Feb 22

Jan 21

dtype: int64

>>>S[ :5:-2]

Dec 32

Oct 30

Aug 28

dtype: int64

 

>>>S[ :5:2]

Jan 21

Mar 23

May 25

dtype: int64

 

>>>S[-8:7]

May 25

Jun 26

Jul 27

dtype: int64

>>>S[-10:10:2]

Mar 23

May 25

Jul 27

Sep 29

dtype: int64 

>>>S[2::3]

Mar 23

Jun 26

Sep 29

Dec 32

dtype: int64 

>>>S[10::-3]

Nov 31

Aug 28

May 25

Feb 22

dtype: int64

Student’s Task

Consider the following Series “E”, and answer the questions based on Slicing.

E=pd.Series(['CT1','CT2','CT3','CT4','T1','PT1','PT2','PT3','PT4','PT5',

'PT6','PT7','PB1','PB2','Pra','Board'],index=[101,102,103,104,

105,106, 107,108,109,110,111,112,113,114,115,116])

>>>E[105:110]

Series([], dtype: object)

(a) E[2:10]

(b) E[3:12:2]

(c) E[-12:10:3]

(d) E[4:11:2]

(e) E[4:11:-2]

(f) E[10:2:-3]

(g) E[10:2:3]

(h) E[10: ]

(i) E[10: :-1]

(j) E[::-3]

 

OPERATIONS ON SERIES

Operations on Series Object

 

(a) Modifying Elements of Series Object:

Syntax: <SeriesObject>[<index>]=<new data value>

Above assignment will change the data value of the given index in the Series object.

<SeriesObject>[start:stop]=<new data value>

Above assignment will replace all the values falling in given slice.

>>> S2=pd.Series(data=[75,72,89],index=['Raj','Kamal','Nani'])

>>> S2

Raj 75

Kamal 72

Nani 89

dtype: int64

>>> S2["Raj"]=94

>>> S2[1]=99

>>> S2

Raj 94

Kamal 99

Nani 89

dtype: int64

>>> S1[1:6]=25 #the values at positions 1 to 5 become 25

Renaming Indexes:

Syntax:<Object>.index=<new index array>

>>> S3=pd.Series([87,99,52],index=[11,12,13])

>>> S3

11 87

12 99

13 52

dtype: int64

>>> S3.index=['First','Second','Third']

>>> S3

First 87

Second 99

Third 52

dtype: int64

 

>>> S3.index=['One','Two']

ValueError
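The ValueError above occurs because the new index array must have exactly as many labels as the Series has elements. A minimal sketch of catching it:

```python
import pandas as pd

S3 = pd.Series([87, 99, 52], index=[11, 12, 13])

try:
    S3.index = ['One', 'Two']        # 2 labels for 3 elements
    outcome = "index replaced"
except ValueError as err:
    outcome = "ValueError"
    print("ValueError:", err)

# The failed assignment leaves the original index untouched.
print(list(S3.index))                # [11, 12, 13]
```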

head( ) & tail( ) functions:

head( ) function is used to fetch first n rows from a Pandas object and tail( ) function returns last n rows from a Pandas object.

Syntax:

<pandas object>.head([n])

<pandas object>.tail([n])

Note: If you do not provide any value for n, the head( ) and tail( ) will return first 5 and last 5 rows.

>>> S1

May 5

Jun 6

Jul 7

Aug 8

Sep 9

Oct 10

Nov 11

Dec 12

dtype: int64

>>> S1.head(3)

May 5

Jun 6

Jul 7

dtype: int64

 

>>> S1.head()

May 5

Jun 6

Jul 7

Aug 8

Sep 9

dtype: int64

 

>>> S1.head(77)

May 5

Jun 6

Jul 7

Aug 8

Sep 9

Oct 10

Nov 11

Dec 12

dtype: int64 

>>> S1.head(-2)

May 5

Jun 6

Jul 7

Aug 8

Sep 9

Oct 10

dtype: int64

>>> S1.tail(3)

Oct 10

Nov 11

Dec 12

dtype: int64

 

>>> S1.tail()

Aug 8

Sep 9

Oct 10

Nov 11

Dec 12

dtype: int64

>>> S1.tail(22)

May 5

Jun 6

Jul 7

Aug 8

Sep 9

Oct 10

Nov 11

Dec 12

dtype: int64

 

>>> S1.tail(-3)

Aug 8

Sep 9

Oct 10

Nov 11

Dec 12

dtype: int64
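A negative n means "all except". head(-n) leaves out the last n rows and tail(-n) leaves out the first n rows. The sketch below checks this against ordinary position slices, using the S1 object shown above.

```python
import pandas as pd

S1 = pd.Series(data=[5, 6, 7, 8, 9, 10, 11, 12],
               index=['May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

# head(-2): everything except the last 2 rows
same_head = S1.head(-2).equals(S1.iloc[:-2])

# tail(-3): everything except the first 3 rows
same_tail = S1.tail(-3).equals(S1.iloc[3:])

# An n larger than the length simply returns all the rows.
all_rows = len(S1.head(77))

print(same_head, same_tail, all_rows)    # True True 8
```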

DATAFRAMES

A DataFrame is a Pandas data structure, which stores data in two-dimensional way. It is an ordered collection of columns where columns may store different types of data e.g., numeric or floating point or string or Boolean type, etc.

Characteristics:

  • It has two indexes/axes.
  • Row index (axis=0) & Column index (axis=1).
  • Row index is known as index,
  • Column index is known as column name.
  • Indexes can be of numbers or letters or strings.
  • Different columns can have data of different types.
  • Value is mutable (i.e., its values can change)
  • We can add/delete rows/columns in a DataFrame, i.e., it is size-mutable.

DATAFRAME - INTRODUCTION

DATAFRAMES - CREATION

CREATING A DATAFRAME

Before creation, we need to import two modules.

import pandas (or) import pandas as pd

import numpy (or) import numpy as np

(In the place of pd or np, we can use any valid identifier)

Syntax:

<dataFrameObject>=pandas.DataFrame(

<a 2D datastructure>, [columns=<column sequence>],

[index=<index sequence>]).

 

We can create using:

  • Two-dimensional dictionaries ie dictionaries having lists or dictionaries or ndarrays or Series objects, etc.
  • Two-dimensional ndarrays (NumPy array)
  • Series type object
  • Another DataFrame object

A DataFrame is displayed in the same way as other variables and objects.

(i) Creating a DataFrame using a 2-D Dictionary:

A 2-D dictionary is a dictionary having items as (key:value), where value part is a data structure of any type i.e., another dictionary, an ndarray, a series object, a list, etc.

Value part of all the keys should have similar structure.

(a) Creating a dataframe from a 2D dictionary having values as lists:

>>>dict={'RNo':[51,52,53,54],'SName': ['Lahari','Chanakya','Harish','Neha'], 'Marks':[55,62,52,75]}

df=pd.DataFrame(dict)

 

Program to create a dataframe using 2-D Dictionary having values as lists:

import pandas as pd

dict={'RNo':[51,52,53,54],'SName':

['Lahari','Chanakya','Harish','Neha'],

'Marks':[55,62,52,75]}

df=pd.DataFrame(dict)

print(df)

output

By default, its index will be assigned 0 (zero) onwards.

Note : As per the text book, the output columns will be placed in ascending order, i.e., “Marks”, then “RNo”, then “SName”, but in practice the columns are displayed in the order in which they were entered.

Specifying Own Index:

>>>df=pd.DataFrame(dict,index=['First','Second','Third','Fourth'])

Note: If the number of index labels given does not match the number of rows, then “ValueError” will occur.

 Example : Given a dictionary that stores “Mother Tongue” & “Population” as column names, create a DataFrame with the state names as index. Note: Population is in crores.

Program:

import pandas as pd

dict={'Mother Tongue':['Telugu','Tamil','Hindi'],

'Population':[6,8,12]}

df=pd.DataFrame(dict,index=['AP','TN','Maharashtra'])

print(df)

(c) Creating a dataframe from a 2D dictionary having values as dictionary objects:

#Outer keys (RNo, SName, Marks) become the columns; inner keys (First, Second, ...) become the row indexes

dict={'RNo':{'First':51,'Second':52,'Third':53,'Fourth':54},'SName':{'First':'Lahari','Second':

'Chanakya','Third':'Harish','Fourth':'Neha'},'Marks':{'First':55,'Second':62,'Third':52,'Fourth':75}}

df=pd.DataFrame(dict)

#Here the outer keys (First, Second, ...) become the columns and RNo, SName, Marks become the row indexes

dict={'First':{'RNo':51,'SName':'Lahari','Marks':55},

'Second':{'RNo':52,'SName':'Chanakya','Marks':62},

'Third':{'RNo':53,'SName':'Harish','Marks':52},

'Fourth':{'RNo':54,'SName':'Neha','Marks':75}}

df=pd.DataFrame(dict)

Special Condition:

The inner dictionaries of a 2D dictionary may have dissimilar keys. A DataFrame can still be created from such non-matching inner keys.

All the inner keys become indexes, and NaN values are added wherever an inner dictionary does not have a key.

Program:

import pandas as pd

C1={'Qty':95,'Half Yearly':89}

C2={'Half Yearly':94,'Annual':97}

Marks={'Student 1':C1,'Student 2':C2}

df=pd.DataFrame(Marks)

print(df)

OUTPUT
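As a rough check of what the above program produces, the sketch below builds the same DataFrame and inspects where the NaN values fall. The order in which the index labels appear may differ between Pandas versions, so the labels are sorted before printing.

```python
import pandas as pd

C1 = {'Qty': 95, 'Half Yearly': 89}
C2 = {'Half Yearly': 94, 'Annual': 97}
Marks = {'Student 1': C1, 'Student 2': C2}
df = pd.DataFrame(Marks)

# All inner keys become row indexes.
labels = sorted(df.index)
print(labels)            # ['Annual', 'Half Yearly', 'Qty']

# True marks the positions filled with NaN.
missing = df.isna()
print(missing)
```

'Annual' is missing for Student 1 and 'Qty' is missing for Student 2, so those two cells hold NaN, while 'Half Yearly' has a value in both columns.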

(ii) Creating a Dataframe Object from a List of Dictionaries/Lists:

(a) Creating a Dataframe using a list having List of dictionaries :

  If we pass a 2D list having dictionaries as its elements (list of dictionaries) to pandas.DataFrame() function, it will create a DataFrame object such that the inner dictionary keys will become the columns and inner dictionary’s values will make rows.

Ex:

import pandas as pd

dict1={'RNo':51,'SName':'Lahari','Marks':55}

dict2={'RNo':52,'SName':'Chanakya','Marks':62}

dict3={'RNo':53,'SName':'Harish','Marks':52}

dict4={'RNo':54,'SName':'Neha','Marks':75}

students=[dict1,dict2,dict3,dict4]

df=pd.DataFrame(students)

print(df)

Note : We can also include indexes as follows:

df=pd.DataFrame(students,index=['First','Second','Third','Fourth'])

Note: If we do not give the same column name in every row, “NaN” values will come for the missing entries.

Program:

import pandas as pd

dict1={'RNo':51,'SName':'Lahari','Marks':55}

dict2={'RNo':52,'Name':'Chanakya','Marks':62}

dict3={'RNo':53,'Name':'Harish','Marks':52}

dict4={'RNo':54,'SName':'Neha','Marks':75}

students=[dict1,dict2,dict3,dict4]

df=pd.DataFrame(students,index=['First','Second','Third','Fourth'])

print(df)

OUTPUT

(b) Creating using a list having List of lists:

lists=[[10,20,40],['A','B','C','D'],[33.5,55.75,2.5]]

df=pd.DataFrame(lists)

Inserting Rows & Column Names:

import pandas as pd

lists=[[51,'Lahari',55],[52,'Chanakya',62],[53,'Harish',52]]

#each inner list is a row

df=pd.DataFrame(lists,columns=['RNo','SName','Marks'],index=['First','Second','Third'])

print(df)

(iii) Creating a dataframe Object from a 2-D ndarray:

We can pass a two-dimensional NumPy array (i.e., having shape (<m>,<n>)) to DataFrame( ) to create a dataframe object.

 Consider the program to create np array:

import numpy as np

import pandas as pd

narr=np.array([[10,20,30],[40,50,60]],np.int32)

print(narr)

Output

[[10 20 30]

 [40 50 60]]

Program:

import numpy as np

import pandas as pd

narr=np.array([[10,20,30],[40,50,60]],np.int32)

mydf=pd.DataFrame(narr)

print(mydf)

OUTPUT

narr=np.array([[10.7,20.5],[40,50],[25.2,55]])

mydf=pd.DataFrame(narr,columns=["One","Two"],index=['A','B','C'])

print(mydf)

We can specify either columns or index or both the sequences.

Note : If the rows of the ndarray differ in length, i.e., if the number of elements in each row differs, then just a single column will be created in the dataframe object and the type of the column will be considered as object.

Example:

narr=np.array([[10.7,20.5,30.2],[40,50],[25,55,11,45]], dtype="object")

print(narr)

Output

[list([10.7, 20.5, 30.2]) list([40, 50]) list([25, 55, 11, 45])]

Program:

narr=np.array([[10.7,20.5,30.2],[40,50],[25,55,11,45]], dtype="object")

mydf=pd.DataFrame(narr)

Output
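The sketch below repeats this program and checks the resulting shapes. It is only a quick confirmation of the note above, namely that the ragged array has one dimension and the DataFrame gets a single column of object type.

```python
import numpy as np
import pandas as pd

# Rows of different lengths: NumPy stores three list objects in a 1-D array.
narr = np.array([[10.7, 20.5, 30.2], [40, 50], [25, 55, 11, 45]],
                dtype="object")
mydf = pd.DataFrame(narr)

print(narr.shape)        # (3,)
print(mydf.shape)        # (3, 1) : three rows, one column
print(mydf.dtypes)       # the single column has dtype object
print(mydf.iloc[1, 0])   # [40, 50] : each cell holds a whole list
```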

(iv) Creating a dataframe Object from a 2D Dictionary with Values as Series Objects:

import pandas as pd

RN=pd.Series([11,12,13,14])

SN=pd.Series(['Rajesh','Likhith','Navya','Bhavya'])

M=pd.Series([56,75,91,82])

studict={'RNo':RN,'SName':SN,'Marks':M}

mydf=pd.DataFrame(studict)

print(mydf)

Output

(v) Creating a dataframe Object from another DataFrame Object:

Program:

import pandas as pd

dict={'RNo':[51,52,53,54],'SName':['Lahari','Chanakya',

'Harish','Neha'],'Marks':[55,62,52,75]}

df=pd.DataFrame(dict)

dfnew=pd.DataFrame(df)

print(dfnew)

OUTPUT

(new DataFrame created from existing DataFrame)

DATAFRAME - ATTRIBUTES

DATAFRAME ATTRIBUTES

All information related to a DataFrame such as its size, datatype, etc is available through its attributes.

Syntax to use a specific attribute:

<DataFrame object>.<attribute name>

| Attribute | Description |
|---|---|
| index | The index (row labels) of the DataFrame |
| columns | The column labels of the DataFrame |
| axes | It returns axis 0, i.e., index, and axis 1, i.e., columns, of the DataFrame |
| dtypes | Return the data types of data in the DataFrame |
| size | Return an int representing the number of elements in this object |
| shape | Return a tuple representing the dimensionality of the DataFrame, i.e., (no. of rows, no. of columns) |
| values | Return a NumPy representation of the DataFrame |
| empty | Indicates whether the DataFrame is empty |
| ndim | Return an int representing the number of axes/array dimensions |
| T | Transpose |

Example of a DataFrame DF:

Retrieving various properties of a DataFrame Object:

>>>df.index #above example

Index(['First', 'Second', 'Third', 'Fourth'], dtype='object')

>>>df.index #for default indexes

RangeIndex(start=0, stop=4, step=1)

>>> df.columns

Index(['RNo', 'SName', 'Marks'], dtype='object')

>>>df.axes

[Index(['First', 'Second', 'Third', 'Fourth'], dtype='object'), Index(['RNo', 'SName', 'Marks'], dtype='object')]

>>>df.dtypes

RNo int64

SName object

Marks int64

dtype: object

>>>df.size#4 rows X 3 columns

12

>>>df.shape #(no.of rows, no.of columns)

(4, 3)

>>>df.values# Numpy representation

[ [51 'Lahari' 55]

[52 'Chanakya' 62]

[53 'Harish' 52]

[54 'Neha' 75] ]

>>>df.empty #if DataFrame is empty, gives True

False

>>>df.ndim # As DataFrame is a 2 Dimensional

2

>>>df.T #Transpose. Rows will become columns and vice versa.
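A short sketch of the effect of T on the example DataFrame. The dictionary is named data here (an assumption, to avoid reusing the built-in name dict).

```python
import pandas as pd

data = {'RNo': [51, 52, 53, 54],
        'SName': ['Lahari', 'Chanakya', 'Harish', 'Neha'],
        'Marks': [55, 62, 52, 75]}
df = pd.DataFrame(data, index=['First', 'Second', 'Third', 'Fourth'])

t = df.T
print(df.shape, t.shape)    # (4, 3) (3, 4)
print(list(t.index))        # old column labels are now the row labels
print(list(t.columns))      # old row labels are now the column labels
```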

| Function | Description |
|---|---|
| len(<DF Object>) | Return the number of rows in a dataframe |
| <DF Object>.count( ) | If we pass no argument or 0 (default is 0), it returns the count of non-NA values for each column; if it is 1, it returns the count of non-NA values for each row. |

OTHERS

>>>len(df)

4

>>>df.count( ) #df.count(0) or df.count(axis='index')

RNo 4

SName 4

Marks 4

dtype: int64

>>>df.count(1) # df.count(axis='columns')

First 3

Second 3

Third 3

Fourth 3

dtype: int64

>>>df.shape[0]# to get number of rows

4

>>>df.shape[1]# to get number of columns

3

OPERATIONS ON DATAFRAMES

SELECTING/ACCESSING DATA

&

MODIFYING, ADDING DATA

Create the following DataFrame using any method

df

import pandas as pd

dict={'Eng':[68,72,66],'Tel':[55,84,90],'Mat':[60,70,65],'Soc':[80,90,85]}

df=pd.DataFrame(dict,index=['Raj','Pavan','Mohan'])

print(df)

Selecting/Accessing a subset from a DataFrame using Row/Column Names using loc function:

To access row(s) and/or a combination of rows and columns, we can use loc function.

Syntax:

<DataFrame Object>.loc[<startrow>:<endrow>, <startcolumn>:<endcolumn>]

Note: With loc, Both start label and end label are included when given as start:end

Selecting/Accessing a subset from a DataFrame using Row/Column Names using iloc function:

With this function, we can extract a subset from a dataframe using the numeric row and column index/position. iloc means integer location.

Syntax:

<DF Object>.iloc[<start row index>:<end row index>, <start col index>:<end column index>]

Note: With iloc, like slices end index/position is excluded when given as start:end.
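The difference between the two notes (end label included with loc, end position excluded with iloc) can be seen in the sketch below. It uses the df created above. The dictionary is named marks here (an assumption, to avoid reusing the built-in name dict).

```python
import pandas as pd

marks = {'Eng': [68, 72, 66], 'Tel': [55, 84, 90],
         'Mat': [60, 70, 65], 'Soc': [80, 90, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

# loc: labels, both 'Pavan' and 'Tel' are included -> 2 rows x 2 columns
by_label = df.loc['Raj':'Pavan', 'Eng':'Tel']

# iloc: positions, the end positions are excluded -> 1 row x 1 column
by_position = df.iloc[0:1, 0:1]

print(by_label.shape, by_position.shape)        # (2, 2) (1, 1)

# To get the same 2 x 2 block with iloc, the end positions must be one more.
print(df.iloc[0:2, 0:2].equals(by_label))       # True
```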

.at function: Access a single value for a row/column label pair by labels.

Syntax:<DF Object>.at[<row label>,<col label>]

 

.iat function: Access a single value for a row/column pair by integer index position.

 

Syntax:

<DF Object>.iat[<row index no>,<col index no>]

 

SINGLE COLUMN

SELECTING/ACCESSING a column: 

Syntax:<DataFrame object> [<column name>]

(or)<DataFrame object>.<column name>

>>>df['Eng']

Raj 68

Pavan 72

Mohan 66

Name: Eng, dtype: int64 

>>>df.Eng

Raj 68

Pavan 72

Mohan 66

Name: Eng, dtype: int64

MODIFYING a Column:

Note: Assigning values to a new column label that does not exist will create a new column at the end. If the column already exists in the DataFrame then the assignment statement will update the values of the already existing column, for example:

df['Eng']=[40,50,60]

df['Tel']=55

df.Mat=70,80,90

df.Soc=100

Note : If we give the following for a column that does not exist,

>>> df.corporate=11,12,13 or

>>> df.corporate=[11,12,13],

no error will be raised (Pandas only shows a warning), but no column will be added to the DataFrame.
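A minimal sketch of this behaviour, using the df created earlier. The warning that Pandas shows is silenced here only to keep the output short. The dictionary is named marks (an assumption, to avoid reusing the built-in name dict).

```python
import warnings
import pandas as pd

marks = {'Eng': [68, 72, 66], 'Tel': [55, 84, 90],
         'Mat': [60, 70, 65], 'Soc': [80, 90, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.corporate = [11, 12, 13]     # dot notation with a new name

added_by_dot = 'corporate' in df.columns
print(added_by_dot)                 # False : no column was created

df['corporate'] = [11, 12, 13]      # square brackets do create the column
added_by_brackets = 'corporate' in df.columns
print(added_by_brackets)            # True
```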

ADDING a Column:

>>>df['Hin']=[89,78,76]

SELECTING/ACCESSING a column (loc):

 

>>>df.loc[:,'Eng']

Raj 68

Pavan 72

Mohan 66

Name: Eng, dtype: int64

MODIFYING a Column (loc):

>>>df.loc[:,'Eng']=[10,20,30]

# df.loc[:,'Eng']=10,20,30

>>>df.loc[:,'Mat']=100

ADDING a Column (loc):

>>>df.loc[:,'IP']=[10,20,30]

>>>df.loc[:,'Hin']=50

SELECTING/ACCESSING a column (iloc):

>>>df.iloc[:,1]

Raj 55

Pavan 84

Mohan 90

Name: Tel, dtype: int64

>>>df.iloc[:,[1]]

Tel

Raj 55

Pavan 84

Mohan 90

MODIFYING a Column (iloc):

>>>df.iloc[:,1]=[40,50,60]

>>>df.iloc[:,3]=70

>>> df.iloc[:,1:3]=[[1,2],[3,4],[5,6]] #modifies columns 1 and 2 together

Note: We cannot add a column using iloc.

If you try to add a new column using iloc, “IndexError” will come.

Ex:

>>>df.iloc[:,4]=95

IndexError : iloc cannot enlarge its target object
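A sketch of the same situation in a program: the attempt with iloc is caught, and the column is then added with square brackets instead. The dictionary name marks and the column name 'IP' are used only for illustration.

```python
import pandas as pd

marks = {'Eng': [68, 72, 66], 'Tel': [55, 84, 90],
         'Mat': [60, 70, 65], 'Soc': [80, 90, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

try:
    df.iloc[:, 4] = 95          # column position 4 does not exist
    outcome = "column added"
except IndexError as err:
    outcome = "IndexError"
    print("IndexError:", err)

df['IP'] = 95                   # adding by column label works
print(df.shape)                 # (3, 5)
```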

MULTIPLE COLUMNS

SELECTING/ACCESSING multiple columns:

<DataFrame object>[ [<column name>,<column name>,…..] ]

>>>df[['Tel','Soc','Mat']]

MODIFYING multiple Columns values:

>>>df[['Tel','Soc','Mat']]=10,20,30

# df[['Tel','Soc','Mat']]=[10,20,30]

>>> df[['Tel','Soc','Mat']]=[[1,2,3],[4,5,6],[7,8,9]]

SELECTING/ACCESSING multiple columns (loc):

>>> df.loc[:,'Eng':'Mat']

Note: All columns between start and end columns are listed.

 

>>> df.loc[:,'Tel':]

>>>df.loc[:,'Mat':'Eng']

Empty DataFrame

Columns: []

Index: [Raj, Pavan, Mohan]

 

>>>df.loc[:,['Soc','Tel','Eng']]

MODIFYING multiple Columns values (loc):

>>>df.loc[:,'Eng':'Mat']=50,60,70

>>>df.loc[:,['Soc','Tel','Eng']]=10,20,30

SELECTING/ACCESSING multiple columns (iloc):

>>> df.iloc[:,1:3] #Excluding column 3

>>> df.iloc[:,[2,0]]

>>> df.iloc[:,1:]

>>> df.iloc[:,2:0]

Empty DataFrame

Columns: []

Index: [Raj, Pavan, Mohan]

>>> df.iloc[:,[2,0,1]]

MODIFYING multiple Columns values (iloc):

>>>df.iloc[:,1:3]=[25,35]

SINGLE ROW

SELECTING/ACCESSING one row (loc):

Just give the row name/label.

>>>df.loc['Pavan']

# df.loc['Pavan',] or df.loc['Pavan',:]

Eng 72

Tel 84

Mat 70

Soc 90

Name: Pavan, dtype: int64

 

>>> df.loc['Kiran']

KeyError: 'Kiran'

MODIFYING one row (loc):

>>>df.loc["Raj"]=91,92,93,94

#df.loc[“Raj”,:] = [91,92,93,94]

>>>df.loc["Pavan"]=100

>>> df.loc['Mohan',:]=601,602,603

ValueError: could not broadcast input array from shape (3,) into shape (4,)

ADDING one row (loc):

>>>df.loc['Kumar']=91,92,93,94

Note: If we try to add a row with fewer values than the number of columns in the DataFrame, it results in a ValueError, with the error message: ValueError: cannot set a row with mismatched columns.

Similarly, if we try to add a column with fewer values than the number of rows in the DataFrame, it results in a ValueError, with the error message: ValueError: Length of values does not match length of index.
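Both errors can be reproduced with the sketch below, which uses the df created earlier. The exact wording of the messages may vary slightly between Pandas versions, so they are collected and printed rather than compared. The dictionary is named marks (an assumption, to avoid reusing the built-in name dict).

```python
import pandas as pd

marks = {'Eng': [68, 72, 66], 'Tel': [55, 84, 90],
         'Mat': [60, 70, 65], 'Soc': [80, 90, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

errors = []

try:
    df.loc['Kumar'] = [91, 92, 93]      # 3 values for 4 columns
except ValueError as err:
    errors.append(str(err))

try:
    df['Hin'] = [89, 78]                # 2 values for 3 rows
except ValueError as err:
    errors.append(str(err))

for message in errors:
    print("ValueError:", message)

# With the right number of values, the row is added.
df.loc['Kumar'] = [91, 92, 93, 94]
print(df.shape)                         # (4, 4)
```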

SELECTING/ACCESSING one row (iloc):

>>>df.iloc[1] #df.iloc[1,] or df.iloc[1,:]

Eng 72

Tel 84

Mat 70

Soc 90

Name: Pavan, dtype: int64

 

>>> df.iloc[4]

IndexError: single positional indexer is out-of-bounds

MODIFYING one row (iloc):

>>>df.iloc[2]=75

>>>df.iloc[1]=81,82,83,84

# df.iloc[1]=[81,82,83,84]

# df.iloc[1,:]=[81,82,83,84]

Note: We cannot add a row using iloc.

If you try to add a new row using iloc, “IndexError” will come.

Ex:

>>>df.iloc[3]=91,92,93,94

IndexError : iloc cannot enlarge its target object

>>> df.iloc[[2,0]]=[[100,200,300,400],[11,22,33,44]]

MULTIPLE ROWS

SELECTING/ACCESSING multiple rows (loc):

>>>df.loc['Raj':'Mohan']

# df.loc['Raj':'Mohan', ] or df.loc['Raj':'Mohan', :]

>>>df.loc['Pavan':'Mohan']

>>>df.loc[['Mohan','Raj']]

>>>df.loc['Pavan':'Raj']

Empty DataFrame

Columns: [Eng, Tel, Mat, Soc]

Index: [ ]

MODIFYING multiple rows (loc):

>>> df.loc[['Mohan','Raj']]=[[1,2,3,4],[5,6,7,8]]

SELECTING/ACCESSING multiple rows (iloc):

>>> df.iloc[0:3] # df.iloc[0:3,] or df.iloc[0:3,:]

>>>df.iloc[0:2]

>>>df.iloc[1:10]

>>>df.iloc[1:1]

Empty DataFrame

Columns: [Eng, Tel, Mat, Soc]

Index: [ ]

>>>df.iloc[[2,1]] #df.iloc[[2,1], ] or df.iloc[[2,1], : ]

MODIFYING multiple rows (iloc):

>>>df.iloc[0:2]=[[1,2,3,4],[5,6,7,8]]

Selecting/Modifying All Rows (slice):

 

>>>df[ : ] #displays all the rows

>>>df[ : ] = 10 #sets every value in the DataFrame to 10

RANGE OF COLUMNS

FROM A RANGE OF ROWS 

SELECTING/ACCESSING range of columns from a range of rows (loc):

<DF Object>.loc[<startrow>:<endrow>,

<startcolumn>:<endcolumn>]

>>> df.loc['Pavan':'Mohan','Tel':'Soc']

>>>df.loc['Mohan':'Raj','Eng':'Soc']

Empty DataFrame

Columns: [Eng, Tel, Mat, Soc]

Index: []

 

>>>df.loc['Raj':'Pavan','Mat':'Eng']

Empty DataFrame

Columns: []

Index: [Raj, Pavan]

MODIFYING range of columns from a range of rows (loc):

>>>df.loc['Pavan':'Mohan','Tel':'Soc']=[[1,2,3],[4,5,6]]

SELECTING/ACCESSING range of columns from a range of rows (iloc):

>>> df.iloc[1:3,0:2] #Rows 1,2 & Columns 0,1

>>>df.iloc[[1,2],[2,0,1]]

>>> df.iloc[2:2,0:2]

Empty DataFrame

Columns: [Eng, Tel]

Index: []

 

>>> df.iloc[1:3,2:0]

Empty DataFrame

Columns: []

Index: [Pavan, Mohan]

 

>>> df.iloc[[1,3],0:2]

IndexError: positional indexers are out-of-bounds

MODIFYING range of columns from a range of rows (iloc):

>>>df.iloc[0:2,1:4]=[[21,22,23],[31,32,33]]

SINGLE VALUE

SELECTING/ACCESSING a single value:

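Give either the row label or the numeric row position in square brackets after the column.

Syntax:<DF Object>.<column>[<row name or row numeric index>]

Ex: >>>df.Eng['Pavan']

72

MODIFYING a single value:

>>>df.Eng['Pavan']=200 will change the value to 200

>>> df.Tel[0]=500

Note: This two-step (chained) form may show a warning in newer versions of Pandas and is not guaranteed to update the DataFrame. The loc, iloc, at and iat forms are more reliable for modifying a value.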

80 of 107

SELECTING/ACCESSING a single value (loc):

>>>df.loc['Pavan','Mat']

100

MODIFYING a single value (loc):

Specify the row label and the column name, then assign the new value.

 

>>>df.loc['Pavan','Mat']=100

SELECTING/ACCESSING a single value (iloc):

>>>df.iloc[2,3]

85

MODIFYING a single value (iloc):

>>>df.iloc[2,3]=500
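A minimal sketch of single-value access with loc and iloc. The marks are assumed, but they were chosen so that the two lookups return the values shown above (100 and 85).

```python
import pandas as pd

# Assumed marks for illustration; only the labels come from the notes.
marks = {'Eng': [56, 72, 81], 'Tel': [67, 88, 74],
         'Mat': [60, 100, 65], 'Soc': [90, 78, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

by_label = df.loc['Pavan', 'Mat']   # row label, column label
by_position = df.iloc[2, 3]         # row position 2 (Mohan), column position 3 (Soc)

# Modify single cells on a copy so the original values stay visible.
changed = df.copy()
changed.loc['Pavan', 'Mat'] = 99
changed.iloc[2, 3] = 500

print(by_label)      # 100
print(by_position)   # 85
print(changed)
```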

81 of 107

.at function: Access a single value for a row/column pair by label.

Syntax:<DF Object>.at[<row label>,<col label>]

>>> df.at['Raj','Mat']

60

 

>>> df.at['Raj','Mat']=150 will change the value to 150

>>> df.at['Kiran','Soc']

KeyError: 'Kiran'

>>> df.at['Raj','IP']

KeyError: 'IP'

82 of 107

.iat function: Access a single value for a row/column pair by integer position.

Syntax:<DF Object>.iat[<row index no>,<col index no>]

>>> df.iat[2,2]

65

 

# df.iat[2,3]=30 will change the value to 30
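The at and iat examples can be sketched together. The marks are assumed, chosen so that the lookups return the values shown above (60 and 65); 'Kiran' and 'IP' are labels that do not exist in the DataFrame.

```python
import pandas as pd

# Assumed marks for illustration; only the labels come from the notes.
marks = {'Eng': [56, 72, 81], 'Tel': [67, 88, 74],
         'Mat': [60, 100, 65], 'Soc': [90, 78, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

by_label = df.at['Raj', 'Mat']   # labels only
by_position = df.iat[2, 2]       # integer positions only

# Reading a label that does not exist raises KeyError.
try:
    df.at['Kiran', 'Soc']
    missing_row = False
except KeyError:
    missing_row = True

try:
    df.at['Raj', 'IP']
    missing_col = False
except KeyError:
    missing_col = True

# Modify single cells on a copy.
changed = df.copy()
changed.at['Raj', 'Mat'] = 150
changed.iat[2, 3] = 30

print(by_label, by_position)       # 60 65
print(missing_row, missing_col)    # True True
print(changed)
```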

83 of 107

84 of 107

HEAD & TAIL FUNCTIONS

head(n): To display the first n rows in the DataFrame. Default value of n is 5.

tail(n): To display the last n rows in the DataFrame. Default value of n is 5.

Create the following DataFrame “MyDF”.

Execute the following commands:

MyDF.head(3)

MyDF.head( )

MyDF.head(15)

MyDF.head(-3)

MyDF.tail(3)

MyDF.tail( )

MyDF.tail(777)

MyDF.tail(-3)
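The commands above can be checked with a small sketch. The contents of MyDF are not reproduced in these notes, so a hypothetical 7-row DataFrame with one assumed column named Value stands in for it.

```python
import pandas as pd

# Hypothetical 7-row stand-in for MyDF; the column name and values are assumed.
MyDF = pd.DataFrame({'Value': [1, 2, 3, 4, 5, 6, 7]})

print(len(MyDF.head(3)))                  # 3
print(len(MyDF.head()))                   # 5 (default n)
print(len(MyDF.head(15)))                 # 7 (n larger than the row count gives all rows)
print(MyDF.head(-3)['Value'].tolist())    # [1, 2, 3, 4]  all rows except the last 3

print(MyDF.tail(3)['Value'].tolist())     # [5, 6, 7]
print(len(MyDF.tail()))                   # 5 (default n)
print(len(MyDF.tail(777)))                # 7
print(MyDF.tail(-3)['Value'].tolist())    # [4, 5, 6, 7]  all rows except the first 3
```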

85 of 107

Create the following DataFrame in any method

df

86 of 107

RENAMING ROWS/COLUMNS

To change the name of any row/column individually, we can use the rename( ) function.

rename( ) function by default does not make changes in the original dataframe. It creates a new dataframe with the changes and the original dataframe remains unchanged.

Syntax:

<DF>.rename(index={<names dictionary>}, columns={<names dictionary>}, inplace=False)

Renaming Row Indexes:

>>>df.rename(index={'Raj':'Mr.Rajesh','Mohan':'Mohan Garu'},inplace=True)

Renaming Column Indexes (Column Labels):

>>> df.rename(columns={'Eng':'English', 'Mat':'Maths'},inplace=True)

87 of 107

Another Example:

dict={'RNo':[51,52,53],'SName':['Suresh','Naresh','Bhavesh']}

df=pd.DataFrame(dict, index=['First','Second','Third'])

>>>df.rename(index={'Second':'Two'}, columns={'RNo':'RollNo'},inplace=True)

Note : Without “inplace=True”, rename( ) only displays (returns) a renamed copy when the command is executed; the original DataFrame is not modified. To modify the original DataFrame, add “inplace=True”.
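A small sketch of the difference between the two forms of rename( ), using the RNo/SName example above. The variable name data is used here in place of dict only to avoid hiding Python's built-in dict type.

```python
import pandas as pd

data = {'RNo': [51, 52, 53], 'SName': ['Suresh', 'Naresh', 'Bhavesh']}
df = pd.DataFrame(data, index=['First', 'Second', 'Third'])

# Without inplace=True a renamed copy is returned; df itself is unchanged.
renamed = df.rename(index={'Second': 'Two'}, columns={'RNo': 'RollNo'})
print(list(df.columns))        # ['RNo', 'SName']
print(list(renamed.columns))   # ['RollNo', 'SName']

# With inplace=True df itself is changed and the call returns None.
result = df.rename(index={'Second': 'Two'}, columns={'RNo': 'RollNo'},
                   inplace=True)
print(result)                  # None
print(list(df.index))          # ['First', 'Two', 'Third']
```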

88 of 107

Create the following DataFrame in any method

df

89 of 107

ASSIGN FUNCTION

<DF object>=<DF object>.assign(<column name>=<values for column>)

 

>>> df=df.assign(Mat=[10,11,12])

>>>df=df.assign(IP=[81,82,83])

>>>df=df.assign(Tel=77)

>>>df=df.assign(New=[55,56])

ValueError: Length of values (2) does not match length of index (3)
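The assign( ) calls above can be sketched on a 3-row DataFrame. The marks are assumed values; only the labels come from these notes. assign( ) returns a new DataFrame, which is why the result is stored back in a variable.

```python
import pandas as pd

# Assumed marks for illustration; only the labels come from the notes.
marks = {'Eng': [56, 72, 81], 'Tel': [67, 88, 74],
         'Mat': [60, 100, 65], 'Soc': [90, 78, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

with_ip = df.assign(IP=[81, 82, 83])    # adds a new column IP
new_mat = df.assign(Mat=[10, 11, 12])   # replaces the existing column Mat
all_77 = df.assign(Tel=77)              # a single value is repeated for every row

# A list whose length differs from the number of rows is rejected.
try:
    df.assign(New=[55, 56])
    length_error = False
except ValueError:
    length_error = True

print(list(with_ip.columns))   # ['Eng', 'Tel', 'Mat', 'Soc', 'IP']
print(list(df.columns))        # ['Eng', 'Tel', 'Mat', 'Soc']  (df is unchanged)
print(all_77['Tel'].tolist())  # [77, 77, 77]
print(length_error)            # True
```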

90 of 107

DELETING ROWS/COLUMNS

Two ways to delete rows and columns

– drop( ) method and del statement

We can use the DataFrame.drop() method to delete rows and columns from a DataFrame. We need to specify the labels to be dropped and the axis from which they need to be dropped. To delete a row, the parameter axis is assigned the value 0, and to delete a column, the parameter axis is assigned the value 1. The del statement deletes a single column only.

 

(i) Delete row(s) using drop( ) function:

Syntax:<DF>.drop(index or sequence of indexes)

 

>>> df.drop('Pavan',axis=0,inplace=True)

#df.drop('Pavan',inplace=True)

#df=df.drop('Pavan',axis=0)

# Default axis is 0, so no need to give

91 of 107

>>> df.drop(['Raj','Pavan'],inplace=True)

Note: If the DataFrame has more than one row with the same label, the DataFrame.drop() method will delete all the matching rows from it.

Other examples:

df.drop(range(2,15,3)) # drops the rows labelled 2, 5, 8, 11 and 14

df.drop([2,4,6,8,12])

The argument to drop( ) should be either an index label or a sequence of index labels.

(ii) Delete a column, using drop( ) function:

>>> df.drop('Tel',axis=1,inplace=True)

>>>df.drop(['Soc','Eng'],axis=1,inplace=True)

92 of 107

(iii) Delete a column, using the del statement:

Syntax: del <DF object>[<column name>]

>>> del df['Mat']
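A minimal sketch of the three ways of deleting shown above, using assumed marks (only the labels come from these notes). The small DataFrame dup, with the repeated row label 'A', is a hypothetical example added to illustrate the note about duplicate labels.

```python
import pandas as pd

# Assumed marks for illustration; only the labels come from the notes.
marks = {'Eng': [56, 72, 81], 'Tel': [67, 88, 74],
         'Mat': [60, 100, 65], 'Soc': [90, 78, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

without_pavan = df.drop('Pavan')                  # axis=0 is the default: drops a row
without_cols = df.drop(['Soc', 'Eng'], axis=1)    # axis=1 drops columns

# del removes one column from the DataFrame itself, so work on a copy.
trimmed = df.copy()
del trimmed['Mat']

# Hypothetical DataFrame with a repeated row label: every matching row is dropped.
dup = pd.DataFrame({'X': [1, 2, 3]}, index=['A', 'B', 'A'])
dup_dropped = dup.drop('A')

print(list(without_pavan.index))   # ['Raj', 'Mohan']
print(list(without_cols.columns))  # ['Tel', 'Mat']
print(list(trimmed.columns))       # ['Eng', 'Tel', 'Soc']
print(list(dup_dropped.index))     # ['B']
```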

93 of 107

94 of 107

ITERATION

95 of 107

ITERATION (Pandas 2 Chapter)

Iterating Over a DataFrame:

 

>>> dict={'Teachers':[20,10],'Students':[200,150],

'Ratio':[10,15]}

>>>DF=pd.DataFrame(dict,index=['Private','Govt'])

96 of 107

iterrows( ) : This method iterates over dataframe row wise where each horizontal subset is in the form of (row-index,Series) where Series contains all column values for that row-index.

 

Example Program: Using iterrows( ) to extract data from dataframe row wise.

 

import pandas as pd

dict={'Teachers':[20,10],'Students':[200,150],

'Ratio':[10,15]}

DF=pd.DataFrame(dict,index=['Private','Govt'])

for (row,rowSeries) in DF.iterrows():
    print("Row index:", row)
    print("Containing: ")
    print(rowSeries)

OUTPUT

Row index: Private
Containing:
Teachers     20
Students    200
Ratio        10
Name: Private, dtype: int64
Row index: Govt
Containing:
Teachers     10
Students    150
Ratio        15
Name: Govt, dtype: int64

97 of 107

Example : Using iterrows( ) to extract row-wise Series objects

 

import pandas as pd

dict={'Teachers':[20,10],'Students':[200,150],

'Ratio':[10,15]}

DF=pd.DataFrame(dict,index=['Private','Govt'])

for (row,rowSeries) in DF.iterrows():
    print("Row index:",row)
    print("Containing: ")
    i=0
    for val in rowSeries:
        print("At",i,"position: ",val)
        i=i+1

OUTPUT

Row index: Private

Containing:

At 0 position: 20

At 1 position: 200

At 2 position: 10

Row index: Govt

Containing:

At 0 position: 10

At 1 position: 150

At 2 position: 15

98 of 107

Write a program to print the DataFrame DF, one row at a time

 

import pandas as pd

dict={'Teachers':[20,10],'Students':[200,150],

'Ratio':[10,15]}

DF=pd.DataFrame(dict,index=['Private','Govt'])

for i,j in DF.iterrows():
    print(i)
    print(j)
    print("____________")

OUTPUT

Private
Teachers     20
Students    200
Ratio        10
Name: Private, dtype: int64
____________
Govt
Teachers     10
Students    150
Ratio        15
Name: Govt, dtype: int64
____________

99 of 107

Printing individual columns from a row:

When accessing the rows of a DataFrame using iterrows( ), you can print an individual column value from each row by using rowSeries[<column name>]. That is, after the line

for r, Row in df.iterrows( ):

you can print an individual column value as:

Row[<column name>]

Write a program to print only the values from Teachers column, for each row

import pandas as pd

dict={'Teachers':[20,10],'Students':[200,150],'Ratio':[10,15]}

DF=pd.DataFrame(dict,index=['Private','Govt'])

for row,rowSeries in DF.iterrows():
    print(rowSeries['Teachers'])
    print("------")

OUTPUT

20

------

10

------

 

100 of 107

iteritems( ): This method iterates over the dataframe column wise, where each vertical subset is in the form of (col-index,Series) and the Series contains all row values for that column-index.

Note: iteritems( ) was deprecated in pandas 1.5 and removed in pandas 2.0; use items( ) instead.

Example : Using items( ) (formerly iteritems( )) to extract data from dataframe column wise.

import pandas as pd

dict={'Teachers':[20,10],'Students':[200,150],

'Ratio':[10,15]}

DF=pd.DataFrame(dict,index=['Private','Govt'])

for (col,colSeries) in DF.items(): # iteritems( )
    print("Column index:",col)
    print("Containing: ")
    print(colSeries)

OUTPUT

Column index: Teachers
Containing:
Private    20
Govt       10
Name: Teachers, dtype: int64
Column index: Students
Containing:
Private    200
Govt       150
Name: Students, dtype: int64
Column index: Ratio
Containing:
Private    10
Govt       15
Name: Ratio, dtype: int64

101 of 107

Example : Using items( ) (formerly iteritems( )) to extract dataframe column wise series object

 

import pandas as pd

dict={'Teachers':[20,10],'Students':[200,150],

'Ratio':[10,15]}

DF=pd.DataFrame(dict,index=['Private','Govt'])

for (col,colSeries) in DF.items(): #iteritems( )
    print("Column index:",col)
    print("Containing: ")
    i=0
    for val in colSeries:
        print("At row ",i,":",val)
        i=i+1

OUTPUT

Column index: Teachers

Containing:

At row 0 : 20

At row 1 : 10

Column index: Students

Containing:

At row 0 : 200

At row 1 : 150

Column index: Ratio

Containing:

At row 0 : 10

At row 1 : 15

102 of 107

Write a program to print the DataFrame DF, one column at a time

import pandas as pd

dict={'Teachers':[20,10],'Students':[200,150],

'Ratio':[10,15]}

DF=pd.DataFrame(dict,index=['Private','Govt'])

for i,j in DF.items(): #iteritems( )
    print(i)
    print(j)
    print("____________")

OUTPUT

Teachers
Private    20
Govt       10
Name: Teachers, dtype: int64
____________
Students
Private    200
Govt       150
Name: Students, dtype: int64
____________
Ratio
Private    10
Govt       15
Name: Ratio, dtype: int64
____________

103 of 107

104 of 107

Binary Operations:

addition, subtraction, multiplication, division

import pandas as pd

dict1={'A':[11,17,23],'B':[13,19,25],'C':[15,21,27]}

DF1=pd.DataFrame(dict1)

dict2={'A':[12,18,24],'B':[14,20,26],'C':[16,22,28]}

DF2=pd.DataFrame(dict2)

dict3={'A':[1,3,5],'B':[2,4,6]}

DF3=pd.DataFrame(dict3)

dict4={'A':[7,9],'B':[8,10]}

DF4=pd.DataFrame(dict4)

105 of 107

Addition : [ Using +, add( ), radd( ) ]

 

Note : DF1.add(DF2) is equal to DF1+DF2

DF1.radd(DF2) is equal to DF2+DF1

radd( ) means reverse addition

 

>>>DF1+DF2 #DF1.add(DF2)

>>>DF1+DF3

>>>DF1+DF4

>>>DF3+DF4

>>>DF3.add(DF4)
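The additions above can be run with the four DataFrames defined earlier. This sketch shows how pandas matches values by row label and column label before adding; where a label exists in only one of the two DataFrames, the result is NaN. The variable names same_shape, missing_col and missing_row are chosen here for illustration.

```python
import pandas as pd

DF1 = pd.DataFrame({'A': [11, 17, 23], 'B': [13, 19, 25], 'C': [15, 21, 27]})
DF2 = pd.DataFrame({'A': [12, 18, 24], 'B': [14, 20, 26], 'C': [16, 22, 28]})
DF3 = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]})
DF4 = pd.DataFrame({'A': [7, 9], 'B': [8, 10]})

same_shape = DF1 + DF2    # every cell has a partner
missing_col = DF1 + DF3   # DF3 has no column C, so column C becomes NaN
missing_row = DF3 + DF4   # DF4 has no row 2, so row 2 becomes NaN

print(same_shape)
print(missing_col)
print(missing_row)

# add( ) and radd( ) give the same results as the + operator.
print(DF1.add(DF2).equals(DF1 + DF2))    # True
print(DF1.radd(DF2).equals(DF2 + DF1))   # True
```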

106 of 107

Subtraction: [ Using -, sub( ), rsub( ) ]

 Note : DF1.sub(DF2) is equal to DF1-DF2

DF1.rsub(DF2) is equal to DF2-DF1

rsub( ) means reverse subtraction

>>>DF1-DF2

>>>DF2-DF1

>>>DF1-DF3

>>>DF3-DF1

>>>DF3-DF4 #DF3.sub(DF4)

>>>DF4-DF3 #DF3.rsub(DF4)
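A short sketch of sub( ) and rsub( ) with the DataFrames defined earlier. Unlike addition, the order of the operands matters in subtraction, which is where rsub( ) is useful. The variable names forward and backward are chosen here for illustration.

```python
import pandas as pd

DF1 = pd.DataFrame({'A': [11, 17, 23], 'B': [13, 19, 25], 'C': [15, 21, 27]})
DF2 = pd.DataFrame({'A': [12, 18, 24], 'B': [14, 20, 26], 'C': [16, 22, 28]})
DF3 = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]})
DF4 = pd.DataFrame({'A': [7, 9], 'B': [8, 10]})

forward = DF3.sub(DF4)     # same as DF3 - DF4
backward = DF3.rsub(DF4)   # same as DF4 - DF3

print(DF1 - DF2)   # every value is -1
print(forward)     # rows 0 and 1 are -6, row 2 is NaN (DF4 has no row 2)
print(backward)    # rows 0 and 1 are 6, row 2 is NaN

print(forward.equals(DF3 - DF4))    # True
print(backward.equals(DF4 - DF3))   # True
```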

107 of 107

Multiplication: [ Using *, mul( ), rmul( ) ] 

Note : DF1.mul(DF2) is equal to DF1*DF2

DF1.rmul(DF2) is equal to DF2*DF1

rmul( ) means reverse multiplication

>>>DF1*DF2

>>>DF1*DF3

Division: [ Using /, div( ), rdiv( ) ]

 

Note : DF1.div(DF2) is equal to DF1/DF2

DF1.rdiv(DF2) is equal to DF2/DF1

rdiv( ) means reverse division.

 

>>>DF1/DF2

>>>DF2/DF1

>>>DF2/DF3
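The multiplication and division examples can be sketched in the same way with the DataFrames defined earlier. Division always produces floating-point results, and a column missing from one DataFrame again gives NaN. The variable names product and quotient are chosen here for illustration.

```python
import pandas as pd

DF1 = pd.DataFrame({'A': [11, 17, 23], 'B': [13, 19, 25], 'C': [15, 21, 27]})
DF2 = pd.DataFrame({'A': [12, 18, 24], 'B': [14, 20, 26], 'C': [16, 22, 28]})
DF3 = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]})

product = DF1 * DF2    # same as DF1.mul(DF2)
quotient = DF2 / DF3   # same as DF2.div(DF3); DF3 has no column C

print(product)
print(quotient)

# The reverse methods swap the two operands.
print(DF1.rmul(DF2).equals(DF2 * DF1))   # True
print(DF1.rdiv(DF2).equals(DF2 / DF1))   # True
```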