Exam Point of View Preparation
Series – 8 Marks, DataFrames – 27 Marks
Series
Important Topics
DataFrames
Important Topics
DataFrame Properties (Rows, Columns, Indexes, Default Values….)
DataFrames Output
DataFrames Operations
loc, iloc,at, iat functions
Accessing(Displaying) /Modifying/Adding – Single Column
Accessing(Displaying) /Modifying/Adding – Single Row
Accessing(Displaying) /Modifying/Adding – Multiple Columns
Accessing(Displaying) /Modifying/Adding – Multiple Rows
Accessing(Displaying) /Modifying/Adding – Single/Multiple Columns & Single/Multiple Rows
Accessing(Displaying) /Modifying/Adding – Single value
(print means you need to keep the statement within print statement)
Display first rows (head), last rows (tail)
Rename column heading, Rename row label
Remove a row, remove a column
Binary Operations:
addition, subtraction, multiplication, division
ITERATION – iterrows( ) & iteritems( ) (iteritems( ) is named items( ) in pandas 2.0 and later)
iterating Over a Data Frame
SERIES
Pandas (or Python Pandas) is Python’s library for data analysis. Pandas has derived its name from “panel data system”, which is an econometrics term for multidimensional, structured data sets. Pandas has become a popular choice for data analysis.
Data analysis refers to the process of evaluating large data sets using analytical and statistical tools in order to discover useful information and conclusions that support business decision-making. The main author of Pandas is Wes McKinney.
Using Pandas
Pandas is an open-source, BSD-licensed library built for the Python programming language.
Pandas offers high-performance, easy-to-use data structures and data analysis tools.
We need to import pandas:
import pandas (or) import pandas as <identifier>
Ex: import pandas as pd
If we use numpy arrays, import numpy as np
Note: Pandas uses NumPy as its support library and hence many datatypes, constants and functions of NumPy are frequently used with Pandas.
Wrongly Given Import Statements:
Import Pandas (I and P are capitals; Python is case-sensitive)
import pandas as S.No (. or any other special symbol is not allowed)
import pandas as if (a keyword cannot be used as an identifier)
import pandas as 7rno (an identifier cannot start with a digit)
(An identifier can contain letters, digits and underscores only.
It must not contain a space or a special symbol, and it must not be a keyword.
It must not start with a digit.)
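The identifier rules above can be checked in code. The following is a minimal sketch using Python's built-in str.isidentifier() and the standard keyword module; the helper name is_valid_alias is a hypothetical name chosen for illustration and is not part of pandas.

```python
import keyword

def is_valid_alias(name):
    """Return True if `name` could be used after `import pandas as`."""
    # A usable alias is a valid identifier that is not a reserved keyword.
    return name.isidentifier() and not keyword.iskeyword(name)

# Candidates taken from the examples in the notes, plus two extra ones.
for candidate in ["pd", "my_pd2", "S.No", "if", "7rno", "roll no"]:
    print(candidate, "->", is_valid_alias(candidate))
```

Note that Python also accepts many non-English letters in identifiers, so this check is slightly more permissive than the rule stated in the notes.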
PANDAS INTRODUCTION
Why Pandas?
Pandas is the best at handling huge tabular data sets comprising different data formats.
Pandas Data Structure:
Data structures refer to a specialized way of storing data so as to apply a specific type of functionality on them.
We can think of Pandas data structures as enhanced versions of NumPy structured arrays in which the rows and columns can be identified and accessed with labels rather than simple integer indices.
Out of the many data structures of Pandas, two basic data structures, Series and DataFrame, are universally popular for their dependability.
(Pandas also supports Panel Data Structure, but it is not in syllabus)
PANDAS DATA STRUCTURES
Property | Series | DataFrame |
Dimensions | 1-Dimensional | 2-Dimensional |
Type of Data | Homogeneous, i.e., all the elements must be of the same type in a Series object | Heterogeneous, i.e., a DataFrame object can have elements of different data types |
Value Mutability | Value-mutable, i.e., the values of its elements can change | Value-mutable, i.e., the values of its elements can change |
Size Mutability | Size-immutable, i.e., the size of a Series object, once created, cannot change. If we want to add/drop an element, internally a new Series object will be created | Size-mutable, i.e., the size of a DataFrame object, once created, can change in place. That is, you can add/drop elements in an existing DataFrame object. |
Examples of Series & DataFrame
Series:
Index | Data |
1 | 'A' |
2 | 'B' |
3 | 'C' |
4 | 'D' |
DataFrame:
A Series is a Pandas data structure that represents a one dimensional array of indexed data.
It represents 1 D array like object containing an array of data for any NumPy data type and an associated array of data labels, called its index.
A Series type object has two main components:
* An array of actual data
* An associated array of indexes or data labels.
Both components are one-dimensional arrays with the same length. The index is used to access individual values.
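The two components can be inspected directly. This is a small sketch with made-up data (the values and labels below are assumptions chosen only for illustration):

```python
import pandas as pd

# Hypothetical data: four values with four string labels.
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

print(s.index)                          # the array of labels
print(s.values)                         # the array of actual data
print(len(s.index) == len(s.values))    # both arrays have the same length
print(s["c"])                           # the label is used to reach one value
```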
Ex:
SERIES
SERIES ATTRIBUTES
Series Object Attributes: When we create a Series type object, all information related to it is available through attributes. Syntax: <Series object>.<attribute name>
Attribute | Description |
index | The index(axis labels) of the Series |
index.name | Name of the index. Can be used to assign new name to index. |
values | Return Series as ndarray or ndarray-like (data) depending on the dtype |
dtype | Return the dtype object of the underlying data (datatype) |
shape | Return a tuple of the shape of the underlying data |
nbytes | Return the number of bytes in the underlying data |
ndim | Return the number of dimensions of the underlying data |
size | Return the number of elements in the underlying data |
itemsize | Return the size of the dtype of the item of the underlying data Note : In newer versions, they have removed the Itemsize |
hasnans | Return True if there are any NaN values; otherwise return False |
empty | Return True if the Series object is empty, false otherwise. |
name | Return or assign name to Series object |
Series Attributes
Consider the following Series Object:
>>> Marks=[34,33,np.nan,38,40]
>>> Exams=["CT1","CT2","CT3","CT4","CT5"]
>>> S=pd.Series(Marks,index=Exams)
>>> S
CT1 34.0
CT2 33.0
CT3 NaN
CT4 38.0
CT5 40.0
dtype: float64
(i) index :
>>> S.index
Index(['CT1', 'CT2', 'CT3', 'CT4', 'CT5'], dtype='object')
(ii) index.name:
>>> S.index.name="Test"
>>> S.index.name
'Test'
>>> S.index
Index(['CT1', 'CT2', 'CT3', 'CT4', 'CT5'], dtype='object', name='Test')
(iii) values:
>>> S.values
array([34., 33., nan, 38., 40.])
(iv) dtype:
>>> S.dtype
dtype('float64')
>>> S
CT1 34.0
CT2 33.0
CT3 NaN
CT4 38.0
CT5 40.0
dtype: float64
(v) shape:
>>> S.shape
(5,)
(vi) nbytes:
>>> S.nbytes # 5 elements x 8 bytes per float64 value
40
(vii) ndim:
>>> S.ndim # Series is One Dimensional
1
(viii) size:
>>> S.size # 5 elements
5
(ix) itemsize:
In newer versions of pandas this attribute has been removed:
>>> S.itemsize
AttributeError: 'Series' object has no attribute 'itemsize'
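Although itemsize is no longer available on a Series, the same number can still be read from the dtype. The sketch below assumes a NumPy-backed float64 Series like S above and checks that nbytes equals size multiplied by the per-item size:

```python
import numpy as np
import pandas as pd

S = pd.Series([34, 33, np.nan, 38, 40],
              index=["CT1", "CT2", "CT3", "CT4", "CT5"])

# The NumPy dtype object still exposes the size of one element in bytes.
bytes_per_item = S.dtype.itemsize

print(bytes_per_item)             # size of one float64 value
print(S.size * bytes_per_item)    # elements x bytes per element
print(S.nbytes)                   # same total, reported by the Series
```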
(x) hasnans:
>>> S.hasnans
True
(xi) empty:
>>> S.empty
False
(xii) name
>>> S.name="MySeries"
>>>S.name
'MySeries'
>>>S
CT1 34.0
CT2 33.0
CT3 NaN
CT4 38.0
CT5 40.0
Name: MySeries, dtype: float64
Other example related to index:
>>> S3=pd.Series(data=np.arange(5,25,4))
>>> S3.index
RangeIndex(start=0, stop=5, step=1)
>>> a=np.arange(9,13)
>>> S4=pd.Series(index=a,data=a*2)
>>> S4.index
Index([9, 10, 11, 12], dtype='int32')
Some functions
Function | Use |
len( ) | To get total number of elements (including NaN values) |
count( ) | To get the count of non-NaN values in a series object |
type( ) | To know the data type of an object |
>>> len(S)
5
>>> S.count()
4
>>> type(S)
<class 'pandas.core.series.Series'>
>>>S4
9 18
10 20
11 22
12 24
dtype: int32
Note: In a single statement, we can use two or more attributes of the same Series object or of different Series objects.
>>>S1=pd.Series([10,20,30])
>>>S1
0 10
1 20
2 30
dtype: int64
>>>S2=pd.Series(['a','e','i','o','u'],index=[100,200,
300,400,500])
>>>S2
100 a
200 e
300 i
400 o
500 u
dtype: object
>>>S1.shape,S2.shape
((3,), (5,))
>>>print(S1.shape,S2.shape)
(3,) (5,)
>>>S1.ndim,S2.nbytes
(1, 40)
>>>print(S1.ndim,S2.nbytes)
1 40
Student’s Task
>>>Veg=pd.Series(['Onion','Carrot','Beetroot',
'Potato'],[30,70,50,20])
>>> Veg
30 Onion
70 Carrot
50 Beetroot
20 Potato
dtype: object
For the above Series “Veg”, work with all 12 attributes and the 3 functions. Also write the outputs.
SLICING
We can access Series indexes separately, data separately, also can access individual elements and slices.
Let us take some example Series Objects.
>>>S1=pd.Series(data=[5,6,7,8,9,10,11,12],
index=['May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
>>> S2=pd.Series(data=[75,72,89],
index=['Raj','Kamal','Nani'])
>>> S3=pd.Series([87,99,52],index=[11,12,13])
Accessing a Series Objects and its Elements (Series Slices)
(a) Accessing Individual Elements: With index value or with its position.
Syntax:<Series Object name>[<valid index>]
>>> S1['Jul']
7
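Note: (1) If the Series object has duplicate indexes, then giving an index with the Series object will return all the entries with that index.
(2) If the indexes are of string type, a position value also works inside [ ] in older versions of pandas. If the indexes are integers, the number is always treated as a label, so a missing label raises KeyError. Recent pandas versions deprecate the positional fallback (a FutureWarning is shown), so <Series Object name>.iloc[<position>] is the safer way to access an element by position.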
>>> S1[2]
7
>>> S3[11]
87
>>> S3[0]
KeyError
Note: >>S1[0]=13 will change the value of
“May” as 13 instead of 5.
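To avoid any doubt about whether a number means a label or a position, the explicit accessors loc (label) and iloc (position) can be used. A minimal sketch with the S1 and S3 objects defined above:

```python
import pandas as pd

S1 = pd.Series(data=[5, 6, 7, 8, 9, 10, 11, 12],
               index=["May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])
S3 = pd.Series([87, 99, 52], index=[11, 12, 13])

print(S1.loc["Jul"])    # access by label
print(S1.iloc[2])       # access by position (third element)
print(S3.loc[11])       # 11 is a label here
print(S3.iloc[0])       # first position, whatever its label is

try:
    S3[0]               # 0 is looked up as a label, and S3 has no label 0
except KeyError:
    print("KeyError: 0 is not a label of S3")
```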
(b) Extracting Slices from Series Object:
Slicing takes place position wise and not the index wise in a series object.
Ex: S=pd.Series(data=[21,22,23,24,25,26,27,28,29,30,31,32],
index=['Jan','Feb','Mar','Apr','May',
'Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
All individual elements have position numbers starting from 0.
Syntax: <object>[start:end:step]
(the end value is excluded)
The slice of a Series object is also a pandas Series object.
>>>S[1:5]
#position wise, not index wise
Feb 22
Mar 23
Apr 24
May 25
dtype: int64
>>>S[10:5]
Series([], dtype: int64)
>>>S[15:20]
Series([], dtype: int64)
>>>S[2:10:2]
Mar 23
May 25
Jul 27
Sep 29
dtype: int64
>>>S[7:2:-1]
Aug 28
Jul 27
Jun 26
May 25
Apr 24
dtype: int64
>>>S[10:1:-2]
Nov 31
Sep 29
Jul 27
May 25
Mar 23
dtype: int64
>>>S[:]
Jan 21
Feb 22
Mar 23
Apr 24
May 25
Jun 26
Jul 27
Aug 28
Sep 29
Oct 30
Nov 31
Dec 32
dtype: int64
>>>S[:5]
Jan 21
Feb 22
Mar 23
Apr 24
May 25
dtype: int64
>>>S[ :8:2]
Jan 21
Mar 23
May 25
Jul 27
dtype: int64
>>>S[ : :1]
Jan 21
Feb 22
Mar 23
Apr 24
May 25
Jun 26
Jul 27
Aug 28
Sep 29
Oct 30
Nov 31
Dec 32
dtype: int64
>>>S[ : :3]
Jan 21
Apr 24
Jul 27
Oct 30
dtype: int64
>>>S[::-2]
Dec 32
Oct 30
Aug 28
Jun 26
Apr 24
Feb 22
dtype: int64
>>>S[ : :-1]
#slice with values reversed
Dec 32
Nov 31
Oct 30
Sep 29
Aug 28
Jul 27
Jun 26
May 25
Apr 24
Mar 23
Feb 22
Jan 21
dtype: int64
>>>S[ :5:-2]
Dec 32
Oct 30
Aug 28
dtype: int64
>>>S[ :5:2]
Jan 21
Mar 23
May 25
dtype: int64
>>>S[-8:7]
May 25
Jun 26
Jul 27
dtype: int64
>>>S[-10:10:2]
Mar 23
May 25
Jul 27
Sep 29
dtype: int64
>>>S[2::3]
Mar 23
Jun 26
Sep 29
Dec 32
dtype: int64
>>>S[10::-3]
Nov 31
Aug 28
May 25
Feb 22
dtype: int64
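The slices above are positional. The same rows can be requested explicitly by position with iloc or by label with loc; the sketch below compares the two, keeping in mind that a label slice includes its end label while a position slice excludes its end position.

```python
import pandas as pd

S = pd.Series(data=[21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32],
              index=["Jan", "Feb", "Mar", "Apr", "May", "Jun",
                     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])

by_position = S.iloc[1:5]          # positions 1 to 4; position 5 is excluded
by_label = S.loc["Feb":"May"]      # labels Feb to May; May is included

print(by_position.equals(by_label))
print(list(S.iloc[10::-3].index))  # negative step walks backwards from position 10
```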
Student’s Task
Consider the following Series “E”, and answer the questions based on slicing.
E=pd.Series(['CT1','CT2','CT3','CT4','T1','PT1','PT2','PT3','PT4','PT5',
'PT6','PT7','PB1','PB2','Pra','Board'],index=[101,102,103,104,
105,106, 107,108,109,110,111,112,113,114,115,116])
>>>E[105:110]
Series([], dtype: object)
(a) E[2:10]
(b) E[3:12:2]
(c) E[-12:10:3]
(d) E[4:11:2]
(e) E[4:11:-2]
(f) E[10:2:-3]
(g) E[10:2:3]
(h) E[10: ]
(i) E[10: :-1]
(j) E[::-3]
OPERATIONS ON SERIES
Operations on Series Object
(a) Modifying Elements of Series Object:
Syntax: <SeriesObject>[<index>]=<new data value>
Above assignment will change the data value of the given index in the Series object.
<SeriesObject>[start:stop]=<new data value>
Above assignment will replace all the values falling in given slice.
>>> S2=pd.Series(data=[75,72,89],index=['Raj','Kamal','Nani'])
>>> S2
Raj 75
Kamal 72
Nani 89
dtype: int64
>>> S2["Raj"]=94
>>> S2[1]=99
>>> S2
Raj 94
Kamal 99
Nani 89
dtype: int64
>>> S1[1:6]=25
Renaming Indexes:
Syntax:<Object>.index=<new index array>
>>> S3=pd.Series([87,99,52],index=[11,12,13])
>>> S3
11 87
12 99
13 52
dtype: int64
>>> S3.index=['First','Second','Third']
>>> S3
First 87
Second 99
Third 52
dtype: int64
>>> S3.index=['One','Two']
ValueError
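The modification and renaming steps above can be tried together. This is a small sketch using the S3 object from the notes; the new values 100 and 0 are arbitrary choices for illustration.

```python
import pandas as pd

S3 = pd.Series([87, 99, 52], index=[11, 12, 13])

S3[13] = 100          # change one value, addressed by its label
S3.iloc[0:2] = 0      # change every value in a positional slice

S3.index = ["First", "Second", "Third"]   # same length as the Series: accepted

length_error = False
try:
    S3.index = ["One", "Two"]             # only two labels for three values
except ValueError as err:
    length_error = True
    print("ValueError:", err)

print(S3)
```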
head( ) & tail( )function:
head( ) function is used to fetch first n rows from a Pandas object and tail( ) function returns last n rows from a Pandas object.
Syntax:
<pandas object>.head([n])
<pandas object>.tail([n])
Note: If you do not provide any value for n, the head( ) and tail( ) will return first 5 and last 5 rows.
>>> S1
May 5
Jun 6
Jul 7
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12
dtype: int64
>>> S1.head(3)
May 5
Jun 6
Jul 7
dtype: int64
>>> S1.head()
May 5
Jun 6
Jul 7
Aug 8
Sep 9
dtype: int64
>>> S1.head(77)
May 5
Jun 6
Jul 7
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12
dtype: int64
>>> S1.head(-2)
May 5
Jun 6
Jul 7
Aug 8
Sep 9
Oct 10
dtype: int64
>>> S1.tail(3)
Oct 10
Nov 11
Dec 12
dtype: int64
>>> S1.tail()
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12
dtype: int64
>>> S1.tail(22)
May 5
Jun 6
Jul 7
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12
dtype: int64
>>> S1.tail(-3)
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12
dtype: int64
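A negative n can be read as "everything except": head(-n) leaves out the last n rows and tail(-n) leaves out the first n rows. The sketch below checks this against the equivalent positional slices for S1.

```python
import pandas as pd

S1 = pd.Series(data=[5, 6, 7, 8, 9, 10, 11, 12],
               index=["May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])

# head(-2): all rows except the last two
print(S1.head(-2).equals(S1.iloc[:-2]))
# tail(-3): all rows except the first three
print(S1.tail(-3).equals(S1.iloc[3:]))
# asking for more rows than exist simply returns the whole Series
print(len(S1.head(77)))
```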
DATAFRAMES
A DataFrame is a Pandas data structure, which stores data in two-dimensional way. It is an ordered collection of columns where columns may store different types of data e.g., numeric or floating point or string or Boolean type, etc.
Characteristics:
DATAFRAME - INTRODUCTION
DATAFRAMES - CREATION
CREATING A DATAFRAME
Before creation, we need to import two modules.
import pandas (or) import pandas as pd
import numpy (or) import numpy as np
(In the place of pd or np, we can use any valid identifier)
Syntax:
<dataFrameObject>=pandas.DataFrame(<a 2D data structure>, [columns=<column sequence>], [index=<index sequence>])
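We can create a DataFrame using a 2-D dictionary, a list of dictionaries or lists, a 2-D ndarray, a dictionary of Series objects, or another DataFrame object.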
Displaying a DataFrame is same as the way we display other variables and objects.
(i) Creating a DataFrame using a 2-D Dictionary:
A 2-D dictionary is a dictionary having items as (key:value), where value part is a data structure of any type i.e., another dictionary, an ndarray, a series object, a list, etc.
Value part of all the keys should have similar structure.
(a) Creating a dataframe from a 2D dictionary having values as lists:
>>>dict={'RNo':[51,52,53,54],'SName': ['Lahari','Chanakya','Harish','Neha'], 'Marks':[55,62,52,75]}
df=pd.DataFrame(dict)
Program to create a dataframe using 2-D Dictionary having values as lists:
import pandas as pd
dict={'RNo':[51,52,53,54],'SName':
['Lahari','Chanakya','Harish','Neha'],
'Marks':[55,62,52,75]}
df=pd.DataFrame(dict)
print(df)
output
By default, its index will be assigned 0 (zero) onwards.
Note: As per the textbook, the output columns are placed in ascending order, i.e., “Marks”, then “RNo”, then “SName”. In practice, the columns are displayed in the order in which they were entered.
Specifying Own Index:
>>>df=pd.DataFrame(dict,index=['First','Second','Third','Fourth'])
Note: If the number of index labels does not match the number of rows of data, a “ValueError” will occur.
Example: Given a dictionary that stores “Mother Tongue” and “Population” as column names, create a DataFrame with state names as the index. Note: Population is in crores.
Program:
import pandas as pd
dict={'Mother Tongue':['Telugu','Tamil','Hindi'],
'Population':[6,8,12]}
df=pd.DataFrame(dict,index=['AP','TN','Maharashtra'])
print(df)
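The index rule can be verified with a short sketch. It reuses the state data from the example; the variable name data is used instead of dict only to avoid hiding Python's built-in dict type, and the two-label index is a deliberately wrong input.

```python
import pandas as pd

data = {"Mother Tongue": ["Telugu", "Tamil", "Hindi"],
        "Population": [6, 8, 12]}

# Three rows of data and three index labels: accepted.
df = pd.DataFrame(data, index=["AP", "TN", "Maharashtra"])
print(df)

# Three rows of data but only two index labels: rejected.
mismatch_raised = False
try:
    pd.DataFrame(data, index=["AP", "TN"])
except ValueError as err:
    mismatch_raised = True
    print("ValueError:", err)
```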
(b) Creating a dataframe from a 2D dictionary having values as dictionary objects:
dict={'RNo':{'First':51,'Second':52,'Third':53,'Fourth':54},
'SName':{'First':'Lahari','Second':'Chanakya','Third':'Harish','Fourth':'Neha'},
'Marks':{'First':55,'Second':62,'Third':52,'Fourth':75}}
df=pd.DataFrame(dict)
dict={'First':{'RNo':51,'SName':'Lahari','Marks':55},
'Second':{'RNo':52,'SName':'Chanakya','Marks':62},
'Third':{'RNo':53,'SName':'Harish','Marks':52},
'Fourth':{'RNo':54,'SName':'Neha','Marks':75}}
df=pd.DataFrame(dict)
Special Condition:
The inner dictionaries of a 2D dictionary may have dissimilar keys. A DataFrame can still be created with such non-matching inner keys.
All the inner keys become indexes, and NaN values are added wherever an inner dictionary does not have a key.
Program:
import pandas as pd
C1={'Qty':95,'Half Yearly':89}
C2={'Half Yearly':94,'Annual':97}
Marks={'Student 1':C1,'Student 2':C2}
df=pd.DataFrame(Marks)
print(df)
OUTPUT
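The NaN filling for non-matching inner keys can be confirmed with isna(). This sketch reuses the C1 and C2 dictionaries from the program above:

```python
import pandas as pd

C1 = {"Qty": 95, "Half Yearly": 89}
C2 = {"Half Yearly": 94, "Annual": 97}
df = pd.DataFrame({"Student 1": C1, "Student 2": C2})

print(df)
# True marks a cell that was filled with NaN because the inner key was missing.
print(df.isna())
# Number of missing cells in each column.
print(df.isna().sum())
```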
(ii) Creating a Dataframe Object from a List of Dictionaries/Lists:
(a) Creating a Dataframe using a list having List of dictionaries :
If we pass a 2D list having dictionaries as its elements (list of dictionaries) to pandas.DataFrame() function, it will create a DataFrame object such that the inner dictionary keys will become the columns and inner dictionary’s values will make rows.
Ex:
import pandas as pd
dict1={'RNo':51,'SName':'Lahari','Marks':55}
dict2={'RNo':52,'SName':'Chanakya','Marks':62}
dict3={'RNo':53,'SName':'Harish','Marks':52}
dict4={'RNo':54,'SName':'Neha','Marks':75}
students=[dict1,dict2,dict3,dict4]
df=pd.DataFrame(students)
print(df)
Note : We can also include indexes as follows:
df=pd.DataFrame(students,index=['First','Second','Third','Fourth'])
Note: If the same column names (keys) are not given in every dictionary, “NaN” values will appear for the missing entries.
Program:
import pandas as pd
dict1={'RNo':51,'SName':'Lahari','Marks':55}
dict2={'RNo':52,'Name':'Chanakya','Marks':62}
dict3={'RNo':53,'Name':'Harish','Marks':52}
dict4={'RNo':54,'SName':'Neha','Marks':75}
students=[dict1,dict2,dict3,dict4]
df=pd.DataFrame(students,index=['First','Second','Third','Fourth'])
print(df)
OUTPUT
(b) Creating using a list having List of lists:
lists=[[10,20,40],['A','B','C','D'],[33.5,55.75,2.5]]
df=pd.DataFrame(lists)
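In the lists example above the inner lists have different lengths. A minimal sketch to see how the DataFrame is built in that case: each inner list becomes a row, and the cells that have no value are reported as missing by isna().

```python
import pandas as pd

lists = [[10, 20, 40], ['A', 'B', 'C', 'D'], [33.5, 55.75, 2.5]]
df = pd.DataFrame(lists)     # each inner list becomes one row

print(df.shape)              # (rows, columns); columns follow the longest list
print(df.isna())             # True marks a cell that had no value
```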
Inserting Rows & Column Names:
import pandas as pd
lists=[[51,'Lahari',55],[52,'Chanakya',62],[53,'Harish',52]]
#each inner list is a row
df=pd.DataFrame(lists,columns=['RNo','SName','Marks'],index=['First','Second','Third'])
print(df)
(iii) Creating a dataframe Object from a 2-D ndarray:
We can pass a two-dimensional NumPy array (i.e., one having shape (<m>,<n>)) to DataFrame( ) to create a dataframe object.
Consider the program to create np array:
import numpy as np
import pandas as pd
narr=np.array([[10,20,30],[40,50,60]],np.int32)
print(narr)
Program:
import numpy as np
import pandas as pd
narr=np.array([[10,20,30],[40,50,60]],np.int32)
mydf=pd.DataFrame(narr)
print(mydf)
Output of print(narr):
[[10 20 30]
 [40 50 60]]
OUTPUT
narr=np.array([[10.7,20.5],[40,50],[25.2,55]])
mydf=pd.DataFrame(narr,columns=["One","Two"],index=['A','B','C'])
print(mydf)
We can specify either columns or index or both the sequences.
Note: If the rows of the ndarray differ in length, i.e., if the number of elements in each row differs, then just a single column is created in the dataframe object and the type of that column is object.
Example:
narr=np.array([[10.7,20.5,30.2],[40,50],[25,55,11,45]], dtype="object")
print(narr)
Output
[list([10.7, 20.5, 30.2]) list([40, 50]) list([25, 55, 11, 45])]
Program:
narr=np.array([[10.7,20.5,30.2],[40,50],[25,55,11,45]], dtype="object")
mydf=pd.DataFrame(narr)
Output
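The single-column result for rows of unequal length can be checked through the shapes of the array and of the DataFrame. This is a sketch under the assumption that the array is created with dtype=object, as in the program above:

```python
import numpy as np
import pandas as pd

# dtype=object is needed because the rows have unequal lengths.
narr = np.array([[10.7, 20.5, 30.2], [40, 50], [25, 55, 11, 45]], dtype=object)
mydf = pd.DataFrame(narr)

print(narr.shape)    # one-dimensional array holding three list objects
print(mydf.shape)    # three rows, a single column
print(mydf.dtypes)   # the column holds Python objects (the lists)
```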
(iv) Creating a dataframe Object from a 2D Dictionary with Values as Series Objects:
import pandas as pd
RN=pd.Series([11,12,13,14])
SN=pd.Series(['Rajesh','Likhith','Navya','Bhavya'])
M=pd.Series([56,75,91,82])
studict={'RNo':RN,'SName':SN,'Marks':M}
mydf=pd.DataFrame(studict)
print(mydf)
Output
(v) Creating a dataframe Object from another DataFrame Object:
Program:
import pandas as pd
dict={'RNo':[51,52,53,54],'SName':['Lahari','Chanakya',
'Harish','Neha'],'Marks':[55,62,52,75]}
df=pd.DataFrame(dict)
dfnew=pd.DataFrame(df)
print(dfnew)
OUTPUT
(new DataFrame created from existing DataFrame)
DATAFRAME - ATTRIBUTES
DATAFRAME ATTRIBUTES
All information related to a DataFrame such as its size, datatype, etc is available through its attributes.
Syntax to use a specific attribute:
<DataFrame object>.<attribute name>
Attribute | Description |
index | The index (row labels) of the DataFrame |
columns | The column labels of the DataFrame |
axes | It returns axis 0 i.e., index and axis 1 i.e., columns of the DataFrame |
dtypes | Return the data types of data in the DataFrame |
size | Return an int representing the number of elements in this object |
shape | Return a tuple representing the dimensionality of the DataFrame, i.e., (no. of rows, no. of columns) |
values | Return a NumPy representation of the DataFrame |
empty | Indicates whether the DataFrame is empty |
ndim | Return an int representing the number of axes/array dimensions. |
T | Transpose |
Example of a DataFrame DF:
Retrieving various properties of a DataFrame Object:
>>>df.index
Index(['First', 'Second', 'Third', 'Fourth'], dtype='object')
(for default indexes)
>>>df.index #the same 4-row DataFrame created without an index argument
RangeIndex(start=0, stop=4, step=1)
>>> df.columns
Index(['RNo', 'SName', 'Marks'], dtype='object')
>>>df.axes
[Index(['First', 'Second', 'Third', 'Fourth'], dtype='object'), Index(['RNo', 'SName', 'Marks'], dtype='object')]
>>>df.dtypes
RNo int64
SName object
Marks int64
dtype: object
>>>df.size#4 rows X 3 columns
12
>>>df.shape #(no.of rows, no.of columns)
(4, 3)
>>>df.values# Numpy representation
[ [51 'Lahari' 55]
[52 'Chanakya' 62]
[53 'Harish' 52]
[54 'Neha' 75] ]
>>>df.empty
#if DataFrame is empty, gives True
False
>>>df.ndim # As DataFrame is a 2 Dimensional
2
>>>df.T
#Transpose. Rows will become columns and vice versa.
Function | Description |
len(<DF Object>) | Return the number of rows in a dataframe |
<DF Object>.count( ) | If we pass no argument or 0 (the default), it returns the count of non-NA values for each column; if we pass 1, it returns the count of non-NA values for each row. |
OTHERS
>>>len(df)
4
>>>df.count( )
#df.count(0) or df.count(axis='index')
RNo 4
SName 4
Marks 4
dtype: int64
>>>df.count(1) # df.count(axis='columns')
First 3
Second 3
Third 3
Fourth 3
dtype: int64
>>>df.shape[0]# to get number of rows
4
>>>df.shape[1]# to get number of columns
3
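The difference between len( ), size and count( ) shows up only when some values are missing. The sketch below uses the same student data but, as an assumption made for illustration, replaces one mark with NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"RNo": [51, 52, 53, 54],
                   "SName": ["Lahari", "Chanakya", "Harish", "Neha"],
                   "Marks": [55, np.nan, 52, 75]},     # one mark left missing on purpose
                  index=["First", "Second", "Third", "Fourth"])

print(df.shape, df.size, df.ndim)   # size counts every cell, including NaN
print(len(df))                      # number of rows
print(df.count())                   # non-NA values in each column
print(df.count(axis=1))             # non-NA values in each row
```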
OPERATIONS ON DATAFRAMES
SELECTING/ACCESSING DATA
&
MODIFYING, ADDING DATA
Create the following DataFrame in any method
df
import pandas as pd
dict={'Eng':[68,72,66],'Tel':[55,84,90],'Mat':[60,70,65],'Soc':[80,90,85]}
df=pd.DataFrame(dict,index=['Raj','Pavan','Mohan'])
print(df)
Selecting/Accessing a subset from a DataFrame using Row/Column Names using loc function:
To access row(s) and/or a combination of rows and columns, we can use loc function.
Syntax:
<DataFrame Object>.loc[<startrow>:<endrow>, <startcolumn>:<endcolumn>]
Note: With loc, Both start label and end label are included when given as start:end
Selecting/Accessing a subset from a DataFrame using Row/Column positions using iloc function:
With this function, we can extract a subset from a dataframe using the numeric index/position of the rows and columns. iloc means integer location.
Syntax:
<DF Object>.iloc[<start row index>:<end row index>, <start col index>:<end column index>]
Note: With iloc, as with ordinary slices, the end index/position is excluded when given as start:end.
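The difference between the two end rules can be seen by selecting the same block both ways. A minimal sketch with the marks DataFrame from this section (the dictionary is named marks here rather than dict, to avoid hiding the built-in type):

```python
import pandas as pd

marks = {'Eng': [68, 72, 66], 'Tel': [55, 84, 90],
         'Mat': [60, 70, 65], 'Soc': [80, 90, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

by_label = df.loc['Raj':'Pavan', 'Eng':'Mat']   # end labels are included
by_position = df.iloc[0:2, 0:3]                 # end positions are excluded

print(by_label)
print(by_label.equals(by_position))             # both select the same 2 x 3 block
```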
.at function: Access a single value for a row/column label pair by labels.
Syntax:<DF Object>.at[<row label>,<col label>]
.iat function: Access a single value for a row/column pair by integer position.
Syntax:
<DF Object>.iat[<row index no>,<col index no>]
SINGLE COLUMN
SELECTING/ACCESSING a column:
Syntax:<DataFrame object> [<column name>]
(or)<DataFrame object>.<column name>
>>>df['Eng']
Raj 68
Pavan 72
Mohan 66
Name: Eng, dtype: int64
>>>df.Eng
Raj 68
Pavan 72
Mohan 66
Name: Eng, dtype: int64
MODIFYING a Column:
Note: Assigning values to a new column label that does not exist will create a new column at the end. If the column already exists in the DataFrame then the assignment statement will update the values of the already existing column, for example:
df['Eng']=[40,50,60]
df['Tel']=55
df.Mat=70,80,90
df.Soc=100
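Note: If we give the following for a column name that does not exist,
>>> df.corporate=11,12,13 or
>>> df.corporate=[11,12,13],
no error is raised (pandas only shows a UserWarning), but no column is stored in the DataFrame; only an ordinary Python attribute is set on the object. To add a column, use df['corporate']=[11,12,13].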
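The behaviour in the note can be checked by looking at df.columns before and after. In this sketch the warning that pandas shows for the attribute form is silenced with the standard warnings module so that the output stays short:

```python
import warnings
import pandas as pd

marks = {'Eng': [68, 72, 66], 'Tel': [55, 84, 90],
         'Mat': [60, 70, 65], 'Soc': [80, 90, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")       # hide the warning shown for this form
    df.corporate = [11, 12, 13]           # sets a plain attribute, not a column

after_attribute = 'corporate' in df.columns
print(after_attribute)

df['corporate'] = [11, 12, 13]            # square brackets do create the column
after_brackets = 'corporate' in df.columns
print(after_brackets)
```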
ADDING a Column:
>>>df['Hin']=[89,78,76]
SELECTING/ACCESSING a column (loc):
>>>df.loc[:,'Eng']
Raj 68
Pavan 72
Mohan 66
Name: Eng, dtype: int64
MODIFYING a Column (loc):
>>>df.loc[:,'Eng']=[10,20,30]
# df.loc[:,'Eng']=10,20,30
>>>df.loc[:,'Mat']=100
ADDING a Column (loc):
>>>df.loc[:,'IP']=[10,20,30]
>>>df.loc[:,'Hin']=50
SELECTING/ACCESSING a column (iloc):
>>>df.iloc[:,1]
Raj 55
Pavan 84
Mohan 90
Name: Tel, dtype: int64
>>>df.iloc[:,[1]]
Tel
Raj 55
Pavan 84
Mohan 90
MODIFYING a Column (iloc):
>>>df.iloc[:,1]=[40,50,60]
>>>df.iloc[:,3]=70
Note: We cannot add a column using iloc.
If we try to add a new column using iloc, an “IndexError” will occur.
Ex:
>>>df.iloc[:,4]=95
IndexError : iloc cannot enlarge its target object
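A short sketch of the same situation, catching the error and then adding the column by label instead. The column name Hin and the value 95 are arbitrary choices for illustration.

```python
import pandas as pd

marks = {'Eng': [68, 72, 66], 'Tel': [55, 84, 90],
         'Mat': [60, 70, 65], 'Soc': [80, 90, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

enlarge_failed = False
try:
    df.iloc[:, 4] = 95        # positions 0 to 3 exist; position 4 does not
except IndexError as err:
    enlarge_failed = True
    print("IndexError:", err)

df['Hin'] = 95                # a new label, on the other hand, adds a column
print(df.columns)
```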
>>> df.iloc[:,1:3]=[[1,2],[3,4],[5,6]]
MULTIPLE COLUMNS
SELECTING/ACCESSING multiple column:
<DataFrame object>[ [<column name>,<column name>,…..] ]
>>>df[['Tel','Soc','Mat']]
MODIFYING multiple Columns values:
>>>df[['Tel','Soc','Mat']]=10,20,30
# df[['Tel','Soc','Mat']]=[10,20,30]
>>> df[['Tel','Soc','Mat']]=[[1,2,3],[4,5,6],[7,8,9]]
SELECTING/ACCESSING multiple columns (loc):
>>> df.loc[:,'Eng':'Mat']
Note: All columns between start and end columns are listed.
>>> df.loc[:,'Tel':]
>>>df.loc[:,'Mat':'Eng']
Empty DataFrame
Columns: []
Index: [Raj, Pavan, Mohan]
>>>df.loc[:,['Soc','Tel','Eng']]
MODIFYING multiple Columns values (loc):
>>>df.loc[:,'Eng':'Mat']=50,60,70
>>>df.loc[:,['Soc','Tel','Eng']]=10,20,30
SELECTING/ACCESSING multiple columns (iloc):
>>> df.iloc[:,1:3] #Excluding column 3
>>> df.iloc[:,[2,0]]
>>> df.iloc[:,1:]
>>> df.iloc[:,2:0]
Empty DataFrame
Columns: []
Index: [Raj, Pavan, Mohan]
>>> df.iloc[:,[2,0,1]]
MODIFYING multiple Columns values (iloc):
>>>df.iloc[:,1:3]=[25,35]
SINGLE ROW
SELECTING/ACCESSING one row (loc):
Just give the row name/label.
>>>df.loc['Pavan']
# df.loc['Pavan',] or df.loc['Pavan',:]
Eng 72
Tel 84
Mat 70
Soc 90
Name: Pavan, dtype: int64
>>> df.loc['Kiran']
KeyError: 'Kiran'
MODIFYING one row (loc):
>>>df.loc["Raj"]=91,92,93,94
#df.loc["Raj",:] = [91,92,93,94]
>>>df.loc["Pavan"]=100
>>> df.loc['Mohan',:]=601,602,603
ValueError: could not broadcast input array from shape (3,) into shape (4,)
ADDING one row (loc):
>>>df.loc['Kumar']=91,92,93,94
Note: If we try to add a row with lesser values than the number of columns in the DataFrame, it results in a ValueError, with the error message: ValueError: Cannot set a row with mismatched columns.
Similarly, if we try to add a column with lesser values than the number of rows in the DataFrame, it results in a ValueError, with the error message: ValueError: Length of values does not match length of index.
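Both length rules can be tried with try/except. In this sketch the row label Kiran and the column label Hin are hypothetical, and the short value lists are deliberately wrong inputs:

```python
import pandas as pd

marks = {'Eng': [68, 72, 66], 'Tel': [55, 84, 90],
         'Mat': [60, 70, 65], 'Soc': [80, 90, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

df.loc['Kumar'] = [91, 92, 93, 94]     # four values for four columns: row is added

row_error = False
try:
    df.loc['Kiran'] = [1, 2]           # two values for four columns
except ValueError as err:
    row_error = True
    print("ValueError:", err)

column_error = False
try:
    df['Hin'] = [89, 78]               # two values for four rows
except ValueError as err:
    column_error = True
    print("ValueError:", err)
```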
SELECTING/ACCESSING one row (iloc):
>>>df.iloc[1] #df.iloc[1,] or df.iloc[1,:]
Eng 72
Tel 84
Mat 70
Soc 90
Name: Pavan, dtype: int64
>>> df.iloc[4]
IndexError: single positional indexer is out-of-bounds
MODIFYING one row (iloc):
>>>df.iloc[2]=75
>>>df.iloc[1]=81,82,83,84
# df.iloc[1]=[81,82,83,84]
# df.iloc[1,:]=[81,82,83,84]
Note: We cannot add a row using iloc.
If we try to add a new row using iloc, an “IndexError” will occur.
Ex:
>>>df.iloc[3]=91,92,93,94 #df has only row positions 0, 1 and 2
IndexError : iloc cannot enlarge its target object
>>> df.iloc[[2,0]]=[[100,200,300,400],[11,22,33,44]]
MULTIPLE ROWS
SELECTING/ACCESSING multiple rows (loc):
>>>df.loc['Raj':'Mohan']
# df.loc['Raj':'Mohan', ] or df.loc['Raj':'Mohan', :]
>>>df.loc['Pavan':'Mohan']
>>>df.loc[['Mohan','Raj']]
>>>df.loc['Pavan':'Raj']
Empty DataFrame
Columns: [Eng, Tel, Mat, Soc]
Index: [ ]
MODIFYING multiple rows (loc):
>>> df.loc[['Mohan','Raj']]=[[1,2,3,4],[5,6,7,8]]
SELECTING/ACCESSING multiple rows (iloc):
>>> df.iloc[0:3] # df.iloc[0:3,] or df.iloc[0:3,:]
>>>df.iloc[0:2]
>>>df.iloc[1:10]
>>>df.iloc[1:1]
Empty DataFrame
Columns: [Eng, Tel, Mat, Soc]
Index: [ ]
>>>df.iloc[[2,1]] #df.iloc[[2,1], ] or df.iloc[[2,1], : ]
MODIFYING multiple rows (iloc):
>>>df.iloc[0:2]=[[1,2,3,4],[5,6,7,8]]
Selecting/Modifying All Rows (slice):
>>>df[ : ]
>>>df[ : ] = 10
RANGE OF COLUMNS
FROM A RANGE OF ROWS
SELECTING/ACCESSING range of columns from a range of rows (loc):
<DF Object>.loc[<startrow>:<endrow>,
<startcolumn>:<endcolumn>]
>>> df.loc['Pavan':'Mohan','Tel':'Soc']
>>>df.loc['Mohan':'Raj','Eng':'Soc']
Empty DataFrame
Columns: [Eng, Tel, Mat, Soc]
Index: []
>>>df.loc['Raj':'Pavan','Mat':'Eng']
Empty DataFrame
Columns: []
Index: [Raj, Pavan]
MODIFYING range of columns from a range of rows (loc):
>>>df.loc['Pavan':'Mohan','Tel':'Soc']=[[1,2,3],[4,5,6]]
SELECTING/ACCESSING range of columns from a range of rows (iloc):
>>> df.iloc[1:3,0:2] #Rows 1,2 & Columns 0,1
>>>df.iloc[[1,2],[2,0,1]]
>>> df.iloc[2:2,0:2]
Empty DataFrame
Columns: [Eng, Tel]
Index: []
>>> df.iloc[1:3,2:0]
Empty DataFrame
Columns: []
Index: [Pavan, Mohan]
>>> df.iloc[[1,3],0:2]
IndexError: positional indexers are out-of-bounds
MODIFYING range of columns from a range of rows (iloc):
>>>df.iloc[0:2,1:4]=[[21,22,23],[31,32,33]]
SINGLE VALUE
SELECTING/ACCESSING a single value:
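Give either the name of the row or its numeric index in square brackets.
Syntax:<DF Object>.<column>[<row name or row numeric index>]
Ex: >>>df.Eng['Pavan']
72
MODIFYING a single value:
>>>df.Eng['Pavan']=200 will change the value to 200
>>> df.Tel[0]=500
Note: This form is a chained assignment. Under Copy-on-Write (the default from pandas 3.0) it does not update the DataFrame, so the loc, iloc, at and iat forms are the dependable way to modify a single value.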
SELECTING/ACCESSING a single value (loc):
>>>df.loc['Pavan','Mat']
70
MODIFYING a single value (loc):
Specify the row label and the column name, then assign the new value.
>>>df.loc['Pavan','Mat']=100
SELECTING/ACCESSING a single value (iloc):
>>>df.iloc[2,3]
85
MODIFYING a single value (iloc):
>>>df.iloc[2,3]=500
.at function: Access a single value for a row/column label pair by labels.
Syntax:<DF Object>.at[<row label>,<col label>]
>>> df.at['Raj','Mat']
60
>>> df.at['Raj','Mat']=150 will change the value to 150
>>> df.at['Kiran','Soc']
KeyError: 'Kiran'
>>> df.at['Raj','IP']
KeyError: 'IP'
.iat function: Access a single value for a row/column pair by integer position.
Syntax:
<DF Object>.iat[<row index no>,<col index no>]
>>> df.iat[2,2]
65
# df.iat[2,3]=30 will change the value to 30
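The single-value accessors can be compared side by side. A minimal sketch with the marks DataFrame, reading and then changing one cell with each of loc, at and iat; the label Kiran is a row that does not exist:

```python
import pandas as pd

marks = {'Eng': [68, 72, 66], 'Tel': [55, 84, 90],
         'Mat': [60, 70, 65], 'Soc': [80, 90, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

# Reading: label pair with loc/at, position pair with iloc/iat.
print(df.loc['Pavan', 'Mat'], df.at['Pavan', 'Mat'])
print(df.iloc[1, 2], df.iat[1, 2])

# Writing one cell with each accessor.
df.loc['Pavan', 'Eng'] = 200
df.at['Raj', 'Mat'] = 150
df.iat[2, 3] = 30

missing_label = False
try:
    df.at['Kiran', 'Soc']      # reading a row label that does not exist
except KeyError:
    missing_label = True
    print("KeyError: 'Kiran'")

print(df)
```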
HEAD & TAIL FUNCTIONS
head(n): To display the first n rows in the DataFrame. Default value of n is 5.
tail(n): To display the last n rows in the DataFrame. Default value of n is 5.
Create the following DataFrame “MyDF”.
Execute the following commands:
MyDF.head(3)
MyDF.head( )
MyDF.head(15)
MyDF.head(-3)
MyDF.tail(3)
MyDF.tail( )
MyDF.tail(777)
MyDF.tail(-3)
Create the following DataFrame in any method
df
RENAMING ROWS/COLUMNS
To change the name of any row/column individually, we can use the rename( ) function.
rename( ) function by default does not make changes in the original dataframe. It creates a new dataframe with the changes and the original dataframe remains unchanged.
Syntax:
<DF>.rename(index={<names dictionary>},
columns={<names dictionary>}, inplace=False)
Renaming Row Indexes:
>>>df.rename(index={'Raj':'Mr.Rajesh','Mohan':'Mohan Garu'},inplace=True)
Renaming Column Indexes (Column Labels):
>>> df.rename(columns={'Eng':'English', 'Mat':'Maths'},inplace=True)
Another Example:
dict={'RNo':[51,52,53],'SName':['Suresh','Naresh','Bhavesh']}
df=pd.DataFrame(dict, index=['First','Second','Third'])
>>>df.rename(index={'Second':'Two'}, columns={'RNo':'RollNo'},inplace=True)
Note: If we do not add “inplace=True”, the command only displays the modified names when it is executed; it does not actually modify the DataFrame. To modify the DataFrame itself, we need to add “inplace=True”.
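The effect of inplace can be seen by comparing the original and the returned DataFrame. A short sketch with the marks DataFrame:

```python
import pandas as pd

marks = {'Eng': [68, 72, 66], 'Tel': [55, 84, 90],
         'Mat': [60, 70, 65], 'Soc': [80, 90, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

# Without inplace=True a new DataFrame is returned; df keeps its old names.
renamed = df.rename(index={'Raj': 'Mr.Rajesh'}, columns={'Eng': 'English'})
print(list(renamed.columns), list(renamed.index))
print(list(df.columns), list(df.index))

# With inplace=True the original DataFrame itself is changed.
df.rename(columns={'Mat': 'Maths'}, inplace=True)
print(list(df.columns))
```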
Create the following DataFrame in any method
df
ASSIGN FUNCTION
<DF object>=<DF object>.assign(<column name>=<values for column>)
>>> df=df.assign(Mat=[10,11,12])
>>>df=df.assign(IP=[81,82,83])
>>>df=df.assign(Tel=77)
>>>df=df.assign(New=[55,56])
ValueError: Length of values (2) does not match length of index (4)
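assign( ) returns a new DataFrame, which is why the result is stored back into df in the statements above. The sketch below keeps the result in a separate name (new_df, chosen for illustration) to show that the original is left unchanged:

```python
import pandas as pd

marks = {'Eng': [68, 72, 66], 'Tel': [55, 84, 90],
         'Mat': [60, 70, 65], 'Soc': [80, 90, 85]}
df = pd.DataFrame(marks, index=['Raj', 'Pavan', 'Mohan'])

# IP is a new column; Tel already exists, so its values are replaced.
new_df = df.assign(IP=[81, 82, 83], Tel=77)

print(new_df)
print(df)      # the original still has the old Tel values and no IP column
```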
DELETING ROWS/COLUMNS
Two ways to delete data from a DataFrame: the drop( ) function (rows or columns) and the del statement (columns only).
We can use the DataFrame.drop() method to delete rows and columns from a DataFrame. We need to specify the names of the labels to be dropped and the axis from which they need to be dropped. To delete a row, the parameter axis is assigned the value 0 and for deleting a column, the parameter axis is assigned the value 1.
(i) Delete row(s) using drop( ) function:
Syntax:<DF>.drop(index or sequence of indexes)
>>> df.drop('Pavan',axis=0,inplace=True)
#df.drop('Pavan',inplace=True)
#df=df.drop('Pavan',axis=0)
# Default axis is 0, so no need to give
>>> df.drop(['Raj','Pavan'],inplace=True)
Note: If the DataFrame has more than one row with the same label, the DataFrame.drop() method will delete all the matching rows from it.
(Other examples:
df.drop(range(2,15,3)) – 2,5,8,11,14
df.drop([2,4,6,8,12])
Argument to drop( ) should be either an index, or a sequence containing indexes.)
(ii) Delete a column, using drop( ) function:
>>> df.drop('Tel',axis=1,inplace=True)
>>>df.drop(['Soc','Eng'],axis=1,inplace=True)
(iii) Delete a column, using the del statement (del is a Python statement, not a function, and it always modifies the DataFrame itself):
Syntax: del <DF object>[<column name>]
>>> del df['Mat']
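A small sketch comparing the two ways of deleting columns. The marks table is hypothetical; only the column names follow the examples above.

```python
import pandas as pd

# Hypothetical marks table; column names follow the examples above.
df = pd.DataFrame({'Eng': [60, 70], 'Mat': [65, 75],
                   'Soc': [50, 55], 'Tel': [77, 77]},
                  index=['Raj', 'Pavan'])

smaller = df.drop(['Soc', 'Eng'], axis=1)   # returns a copy without the two columns
print(list(smaller.columns))
print(list(df.columns))                     # df still has all four columns

del df['Mat']                               # del changes df itself
print(list(df.columns))
```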
ITERATION (Pandas 2 Chapter)
Iterating Over a DataFrame:
>>> dict={'Teachers':[20,10],'Students':[200,150],
'Ratio':[10,15]}
>>>DF=pd.DataFrame(dict,index=['Private','Govt'])
iterrows( ) : This method iterates over dataframe row wise where each horizontal subset is in the form of (row-index,Series) where Series contains all column values for that row-index.
Example Program: Using iterrows( ) to extract data from dataframe row wise.
import pandas as pd
dict={'Teachers':[20,10],'Students':[200,150],
      'Ratio':[10,15]}
DF=pd.DataFrame(dict,index=['Private','Govt'])
for (row,rowSeries) in DF.iterrows():
    print("Row index:", row)
    print("Containing: ")
    print(rowSeries)
OUTPUT
Row index: Private
Containing:
Teachers     20
Students    200
Ratio        10
Name: Private, dtype: int64
Row index: Govt
Containing:
Teachers     10
Students    150
Ratio        15
Name: Govt, dtype: int64
Example : Using iterrows( ) to extract row-wise Series objects
import pandas as pd
dict={'Teachers':[20,10],'Students':[200,150],
      'Ratio':[10,15]}
DF=pd.DataFrame(dict,index=['Private','Govt'])
for (row,rowSeries) in DF.iterrows():
    print("Row index:",row)
    print("Containing: ")
    i=0
    for val in rowSeries:
        print("At",i,"position: ",val)
        i=i+1
OUTPUT
Row index: Private
Containing:
At 0 position:  20
At 1 position:  200
At 2 position:  10
Row index: Govt
Containing:
At 0 position:  10
At 1 position:  150
At 2 position:  15
Write a program to print the DataFrame DF, one row at a time
import pandas as pd
dict={'Teachers':[20,10],'Students':[200,150],
      'Ratio':[10,15]}
DF=pd.DataFrame(dict,index=['Private','Govt'])
for i,j in DF.iterrows():
    print(i)
    print(j)
    print("____________")
OUTPUT
Private
Teachers     20
Students    200
Ratio        10
Name: Private, dtype: int64
____________
Govt
Teachers     10
Students    150
Ratio        15
Name: Govt, dtype: int64
____________
Printing individual columns from a row:
When rows of a DataFrame are accessed using iterrows( ), an individual column value of the current row can be obtained as rowSeries[<column name>]. That is, after the line
for r, Row in df.iterrows( ):
you can print an individual column value as:
    Row[<column name>]
Write a program to print only the values from Teachers column, for each row
import pandas as pd
dict={'Teachers':[20,10],'Students':[200,150],'Ratio':[10,15]}
DF=pd.DataFrame(dict,index=['Private','Govt'])
for row,rowSeries in DF.iterrows():
    print(rowSeries['Teachers'])
    print("------")
OUTPUT
20
------
10
------
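Individual values taken from a row can also be used in a calculation. The sketch below works on the same data and divides Students by Teachers for each row; the names data and computed are introduced only for this illustration.

```python
import pandas as pd

data = {'Teachers': [20, 10], 'Students': [200, 150], 'Ratio': [10, 15]}
DF = pd.DataFrame(data, index=['Private', 'Govt'])

computed = {}
for row, rowSeries in DF.iterrows():
    # rowSeries['Students'] and rowSeries['Teachers'] are single values of this row
    computed[row] = rowSeries['Students'] / rowSeries['Teachers']
    print(row, ":", computed[row], "students per teacher")
```

The computed values (10.0 and 15.0) agree with the Ratio column of the DataFrame.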
iteritems( ): This method iterates over the dataframe column wise, where each vertical subset is in the form of (col-index, Series) and the Series contains all row values for that column-index.
Note: iteritems( ) was deprecated in pandas 1.5 and removed in pandas 2.0. In present versions, use items( ) instead; it works the same way.
Example : Using iteritems( ) to extract data from dataframe column wise.
import pandas as pd
dict={'Teachers':[20,10],'Students':[200,150],
      'Ratio':[10,15]}
DF=pd.DataFrame(dict,index=['Private','Govt'])
for (col,colSeries) in DF.items(): # iteritems( )
    print("Column index:",col)
    print("Containing: ")
    print(colSeries)
OUTPUT
Column index: Teachers
Containing:
Private    20
Govt       10
Name: Teachers, dtype: int64
Column index: Students
Containing:
Private    200
Govt       150
Name: Students, dtype: int64
Column index: Ratio
Containing:
Private    10
Govt       15
Name: Ratio, dtype: int64
Example : Using iteritems( ) to extract dataframe column wise series object
import pandas as pd
dict={'Teachers':[20,10],'Students':[200,150],
      'Ratio':[10,15]}
DF=pd.DataFrame(dict,index=['Private','Govt'])
for (col,colSeries) in DF.items(): #iteritems( )
    print("Column index:",col)
    print("Containing: ")
    i=0
    for val in colSeries:
        print("At row ",i,":",val)
        i=i+1
OUTPUT
Column index: Teachers
Containing:
At row  0 : 20
At row  1 : 10
Column index: Students
Containing:
At row  0 : 200
At row  1 : 150
Column index: Ratio
Containing:
At row  0 : 10
At row  1 : 15
Write a program to print the DataFrame DF, one column at a time
import pandas as pd
dict={'Teachers':[20,10],'Students':[200,150],
      'Ratio':[10,15]}
DF=pd.DataFrame(dict,index=['Private','Govt'])
for i,j in DF.items(): #iteritems( )
    print(i)
    print(j)
    print("____________")
OUTPUT
Teachers
Private    20
Govt       10
Name: Teachers, dtype: int64
____________
Students
Private    200
Govt       150
Name: Students, dtype: int64
____________
Ratio
Private    10
Govt       15
Name: Ratio, dtype: int64
____________
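Since every column arrives as a Series, Series methods can be applied inside the loop. The sketch below, on the same data, adds up each column with sum( ); the names data and totals are introduced only for this illustration.

```python
import pandas as pd

data = {'Teachers': [20, 10], 'Students': [200, 150], 'Ratio': [10, 15]}
DF = pd.DataFrame(data, index=['Private', 'Govt'])

totals = {}
for col, colSeries in DF.items():
    # colSeries holds all row values of the current column
    totals[col] = colSeries.sum()
    print(col, "total:", totals[col])
```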
Binary Operations:
addition, subtraction, multiplication, division
import pandas as pd
dict1={'A':[11,17,23],'B':[13,19,25],'C':[15,21,27]}
DF1=pd.DataFrame(dict1)
dict2={'A':[12,18,24],'B':[14,20,26],'C':[16,22,28]}
DF2=pd.DataFrame(dict2)
dict3={'A':[1,3,5],'B':[2,4,6]}
DF3=pd.DataFrame(dict3)
dict4={'A':[7,9],'B':[8,10]}
DF4=pd.DataFrame(dict4)
Addition : [ Using +, add( ), radd( ) ]
Note : DF1.add(DF2) is equal to DF1+DF2
DF1.radd(DF2) is equal to DF2+DF1
radd( ) means reverse addition
>>>DF1+DF2 #DF1.add(DF2)
>>>DF1+DF3
>>>DF1+DF4
>>>DF3+DF4 #DF3.add(DF4)
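When the two DataFrames do not have the same rows and columns, pandas matches values by row label and column label, and every cell that exists in only one of the DataFrames becomes NaN. A minimal sketch with DF1, DF3 and DF4 as defined above; the names sum13 and sum14 are introduced only for this illustration.

```python
import pandas as pd

DF1 = pd.DataFrame({'A': [11, 17, 23], 'B': [13, 19, 25], 'C': [15, 21, 27]})
DF3 = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]})
DF4 = pd.DataFrame({'A': [7, 9], 'B': [8, 10]})

# DF3 has no column C, so column C of the sum is NaN.
sum13 = DF1 + DF3
print(sum13)

# DF4 has neither column C nor row 2, so those cells are NaN as well.
sum14 = DF1 + DF4
print(sum14)

# add() gives the same result as +, and radd() only swaps the operands.
print(sum14.equals(DF1.add(DF4)))
print(DF1.radd(DF4).equals(DF4 + DF1))
```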
Subtraction: [ Using -, sub( ), rsub( ) ]
Note : DF1.sub(DF2) is equal to DF1-DF2
DF1.rsub(DF2) is equal to DF2-DF1
rsub( ) means reverse subtraction
>>>DF1-DF2
>>>DF2-DF1
>>>DF1-DF3
>>>DF3-DF1
>>>DF3-DF4 #DF3.sub(DF4)
>>>DF4-DF3 #DF3.rsub(DF4)
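Unlike addition, the order of the operands matters in subtraction, so sub( ) and rsub( ) give results of opposite sign. A short sketch with DF3 and DF4 as defined above; the names diff and rdiff are introduced only for this illustration.

```python
import pandas as pd

DF3 = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]})
DF4 = pd.DataFrame({'A': [7, 9], 'B': [8, 10]})

diff = DF3 - DF4          # same as DF3.sub(DF4)
rdiff = DF3.rsub(DF4)     # same as DF4 - DF3

print(diff)    # rows 0 and 1 hold -6.0 in both columns; row 2 is NaN
print(rdiff)   # rows 0 and 1 hold 6.0 in both columns; row 2 is NaN
```

Row 2 is NaN in both results because DF4 has no row with label 2.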
Multiplication: [ Using *, mul( ), rmul( ) ]
Note : DF1.mul(DF2) is equal to DF1*DF2
DF1.rmul(DF2) is equal to DF2*DF1
rmul( ) means reverse multiplication
>>>DF1*DF2
>>>DF1*DF3
Division: [ Using /, div( ), rdiv( ) ]
Note : DF1.div(DF2) is equal to DF1/DF2
DF1.rdiv(DF2) is equal to DF2/DF1
rdiv( ) means reverse division.
>>>DF1/DF2
>>>DF2/DF1
>>>DF2/DF3
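Division always gives float results, and a column present in only one DataFrame becomes NaN. A brief sketch with DF2 and DF3 as defined above; the names quotient and reverse are introduced only for this illustration.

```python
import pandas as pd

DF2 = pd.DataFrame({'A': [12, 18, 24], 'B': [14, 20, 26], 'C': [16, 22, 28]})
DF3 = pd.DataFrame({'A': [1, 3, 5], 'B': [2, 4, 6]})

quotient = DF2 / DF3        # same as DF2.div(DF3)
reverse = DF2.rdiv(DF3)     # same as DF3 / DF2

print(quotient)   # column A holds 12.0, 6.0, 4.8; column C is NaN
print(reverse)
```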