SERIES PPT
XII – IP – PYTHON
2022-23 Python Syllabus
Data Handling using Pandas and Data Visualization
(25 Marks)
Introduction to Python libraries- Pandas, Matplotlib.
Data structures in Pandas - Series and Data Frames.
Series: Creation of Series from – ndarray, dictionary, scalar value; mathematical operations;
Head and Tail functions; Selection, Indexing and Slicing.
Data Frames: Creation - from dictionary of Series, list of dictionaries, Text/CSV files; display; iteration; Operations on rows and columns: add, select, delete, rename; Head and Tail functions; Indexing using Labels, Boolean Indexing;
Importing/Exporting Data between CSV files and Data Frames.
Data Visualization: Purpose of plotting; drawing and saving the following types of plots using Matplotlib – line plot, bar graph, histogram.
Customizing plots: adding label, title, and legend in plots.
PANDAS INTRODUCTION
Pandas (or Python Pandas) is Python’s library for data analysis. Pandas derives its name from “panel data”, which is an econometrics term for multidimensional, structured data sets. Pandas has become a popular choice for data analysis.
Data analysis refers to the process of evaluating large data sets using analytical and statistical tools to discover useful information and draw conclusions that support business decision-making.
The main author of Pandas is Wes McKinney.
Pandas is an open-source, BSD-licensed library built for the Python programming language.
Pandas offers high-performance, easy-to-use data structures and data analysis tools.
We need to import pandas before using it: import pandas (or) import pandas as <identifier>
Ex: import pandas as pd
If we use NumPy arrays, we also need: import numpy as np
Why Pandas?
Pandas is well suited to handling large tabular data sets whose columns hold different data types.
Pandas Data Structure:
A data structure is a particular way of storing and organizing data in a computer to suit a specific purpose so that it can be accessed and worked with in appropriate ways.
We can think of Pandas data structures as enhanced versions of NumPy structured arrays in which the rows and columns can be identified and accessed with labels rather than simple integer indices.
Out of the many data structures of Pandas, two basic data structures, Series and DataFrame, are universally popular for their dependability. (Older versions of Pandas also had a Panel data structure; it was removed in Pandas 0.25 and is not in the syllabus.)
Property | Series | DataFrame |
Dimensions | 1 Dimensional | 2-Dimensional |
Type of Data | Homogeneous, i.e., all the elements of a Series object share a single dtype (mixed values are stored with dtype object) | Heterogeneous, i.e., a DataFrame object can have elements of different data types |
Mutability (value) | Value-mutable, i.e., the values of its elements can change | Value-mutable, i.e., the values of its elements can change |
Mutability (size) | Size-immutable, i.e., the size of a Series object, once created, cannot change. If we want to add/drop an element, internally a new Series object will be created | Size-mutable, i.e., the size of a DataFrame object, once created, can change in place. That is, you can add/drop elements in an existing DataFrame object. |
[Figures: an example Series (index 1, 2, 3, 4; data ‘A’, ‘B’, ‘C’, ‘D’) and an example DataFrame]
A Series is a Pandas data structure that represents a one dimensional array of indexed data.
It is a 1-D array-like object containing an array of data of any NumPy data type and an associated array of data labels, called its index.
A Series type object has two main components:
* An array of actual data
* An associated array of indexes or data labels.
Both components are one-dimensional arrays with the same length. The index is used to access individual values.
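The two components described above can be inspected directly. The following is a minimal sketch; the labels and values are made-up sample data chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data: four values with the labels 1 to 4.
s = pd.Series(['A', 'B', 'C', 'D'], index=[1, 2, 3, 4])

labels = list(s.index)     # the index component (data labels)
data = list(s.values)      # the data component (actual values)

print(labels)
print(data)
print(s[3])                # the value stored under the label 3
```

Both lists have the same length, and a label from the index is what we use to look up a single value.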
SERIES
CREATING SERIES OBJECTS
A Series object can be created in many ways using the pandas library’s Series( ) function. First import the pandas and numpy modules with import statements.
If we import pandas as pd, we can use pd.Series( ) instead of pandas.Series( ).
(i) Creating an empty Series object:
Syntax:
<Series Object>=pandas.Series( ) #S Upper case
Ex: S=pd.Series( )
Creates an empty Series S with no values. The default datatype is float64 in older Pandas versions and object from Pandas 2.0 onwards.
(ii) Creating non-empty Series objects:
Specify arguments for data and indexes.
Syntax:
<Series object>=pd.Series(data,index=idx)
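Here, idx is a sequence of index labels (one label per data value), and data is the data part of the Series object. The data can be one of the following: (a) a Python sequence, (b) an ndarray, (c) a Python dictionary, (d) a scalar value.
Note: If we do not give an index, then by default the index consists of the integers 0 through N-1 (N is the length of data).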
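As a small illustration of the note above, the sketch below builds the same data once with the default index and once with explicit labels. The variable names and labels are arbitrary choices for this example.

```python
import pandas as pd

marks = [78, 45, 87]

s_default = pd.Series(marks)                          # default index 0, 1, 2
s_labelled = pd.Series(marks, index=['a', 'b', 'c'])  # explicit labels

print(list(s_default.index))
print(list(s_labelled.index))
print(s_labelled['b'])
```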
(a) Specify data as Python Sequence:
Syntax:
<Series Object>=pd.Series(<any Python sequence>)
It will return an object of Series type.
Using Lists:
>>> S=pd.Series([78,45,87])
>>> S
0 78
1 45
2 87
dtype: int64
>>> S=pd.Series([15.9,23.7])
>>> S
0 15.9
1 23.7
dtype: float64
>>> S=pd.Series([2.5,7.,9.2])
>>> S
0 2.5
1 7.0
2 9.2
dtype: float64
Note: It has taken 7. as 7.0
>>> S=pd.Series(["Welcome","To","my","School"])
>>> S
0 Welcome
1 To
2 my
3 School
dtype: object
>>> S=pd.Series([10,20.5])
>>> S
0 10.0
1 20.5
dtype: float64
>>> S=pd.Series([2.7,5,"Welcome"])
>>> S
0 2.7
1 5
2 Welcome
dtype: object
Note: Left column displays index and right column displays values.
Using Tuples:
>>> S=pd.Series((15,20,25))
>>> S
0 15
1 20
2 25
dtype: int64
>>> S=pd.Series((15.5,17))
>>> S
0 15.5
1 17.0
dtype: float64
>>> S=pd.Series((10,15.5,"Welcome to World"))
>>> S
0 10
1 15.5
2 Welcome to World
dtype: object
String (a string is treated as a single scalar value, not as a sequence of characters):
>>> S=pd.Series("Welcome to World")
>>> S
0 Welcome to World
dtype: object
range() function: It generates a sequence
Ex: range(7) generates a sequence [0,1,2,3,4,5,6]
>>> S=pd.Series(range(7))
>>> S
0 0
1 1
2 2
3 3
4 4
5 5
6 6
dtype: int64
Program to create a Series object using the Python sequence
[10,15.9,"Welcome Friends"].
Solution:
import pandas as pd
S1=pd.Series([10,15.9,"Welcome Friends"])
print("Series Object is : ")
print(S1)
Output
Series Object is :
0 10
1 15.9
2 Welcome Friends
dtype: object
(b) Specify data as an ndarray:
NumPy contains the function arange( ) with the following syntax:
arange(start, stop, step)
Ex: np.arange(20,30,3) generates [20 23 26 29]
np.arange(20,30,2.5) generates [20. 22.5 25. 27.5]
# the stop value is excluded
Program to create a Series object using an ndarray which uses the arange function (NumPy array) to generate sequences between 20 and 30.
import pandas as pd
import numpy as np
nda1=np.arange(20,30,3)
nda2=np.arange(20,30,2.5)
print("Numpy array 1",nda1)
print("Numpy array 2",nda2)
S1=pd.Series(nda1)
S2=pd.Series(nda2)
print("Series 1\n",S1)
print("Series 2\n",S2)
# We can also pass the array directly: S1=pd.Series(np.arange(20,30,3))
Output (the integer dtype shown is platform dependent; many systems print int64 instead of int32)
Numpy array 1 [20 23 26 29]
Numpy array 2 [20. 22.5 25. 27.5]
Series 1
0 20
1 23
2 26
3 29
dtype: int32
Series 2
0 20.0
1 22.5
2 25.0
3 27.5
dtype: float64
NumPy contains the function linspace( ) with the following syntax:
linspace(start, stop, num), where num is the number of evenly spaced values to generate; the stop value is included
Ex: np.linspace(20,30,6) generates [20. 22. 24. 26. 28. 30.]
Program to create a Series object using an ndarray which uses the linspace function (NumPy array) to generate sequences between 20 and 30.
import pandas as pd
import numpy as np
nda1=np.linspace(20,30,4)
nda2=np.linspace(20,30,6)
S1=pd.Series(nda1)
S2=pd.Series(nda2)
print("Series 1\n",S1)
print("Series 2\n",S2)
Output
Series 1
0 20.000000
1 23.333333
2 26.666667
3 30.000000
dtype: float64
Series 2
0 20.0
1 22.0
2 24.0
3 26.0
4 28.0
5 30.0
dtype: float64
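To contrast the two functions, here is a short sketch using the same range for both. The step and count values are chosen only so that the difference at the end point is easy to see.

```python
import numpy as np
import pandas as pd

# arange takes a step; the stop value (30) is excluded.
s_step = pd.Series(np.arange(20, 30, 5))

# linspace takes a count of values; the stop value (30) is included.
s_count = pd.Series(np.linspace(20, 30, 3))

print(s_step.tolist())
print(s_count.tolist())
```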
NumPy contains the function tile( ) for repeating a list a given number of times.
Ex: np.tile([5,10],3) generates [5 10 5 10 5 10]
Program to create a Series object using an ndarray that is created by tiling the list [5,10] 3 times.
import numpy as np
import pandas as pd
S=pd.Series(np.tile([5,10],3))
print(S)
Output (the dtype may be int32 on some platforms)
0 5
1 10
2 5
3 10
4 5
5 10
dtype: int64
(c) Specify data as a Python Dictionary:
Keys of the dictionary object become the index of the Series, and values of the dictionary become the data of the Series object. In current Pandas versions (with Python 3.7 or later) the index keeps the order in which the keys were inserted; in older versions the order was not guaranteed.
Program to create a Series object using a dictionary that stores the section-wise toppers' averages for class X in a school.
import pandas as pd
Stu={'A':89.5,'B':92.34,'C':91.5}
S=pd.Series(Stu)
print(S)
Output
A 89.50
B 92.34
C 91.50
dtype: float64
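When a dictionary is combined with an explicit index argument, the index selects and orders the keys. The sketch below reuses the dictionary from the program above; the extra label 'D' is an assumption added to show what happens to a key that is absent.

```python
import pandas as pd

stu = {'A': 89.5, 'B': 92.34, 'C': 91.5}

# Only the listed labels are kept, in the listed order.
# 'D' is not a key of the dictionary, so its value becomes NaN.
s = pd.Series(stu, index=['C', 'A', 'D'])

print(s)
```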
(d) Specifying data as scalar value:
The data can be a single (scalar) value. If data is a scalar value, the index argument to the Series( ) function should be provided so that the value can be repeated; without an index, a Series with a single element (index 0) is created.
The scalar value (given as data) will be repeated to match the length of index.
The index argument has to be a sequence of numbers or labels of any type.
>>> Marks=pd.Series(92)
>>> Marks
0 92
dtype: int64
>>> Marks=pd.Series(95,index=[11,12,13])
>>> Marks
11 95
12 95
13 95
dtype: int64
>>> Unknown=pd.Series('I don\'t know',index=['Un1','Un2'])
>>> Unknown
Un1 I don't know
Un2 I don't know
dtype: object
>>> Capital=pd.Series('Delhi',index=['State 1', 'State 2','State 3'])
>>> Capital
State 1 Delhi
State 2 Delhi
State 3 Delhi
dtype: object
>>> prizes=pd.Series(12,index=range(1,5))
>>> prizes
1 12
2 12
3 12
4 12
dtype: int64
>>> cer=pd.Series("Welcome",index=range(1,10,3))
>>> cer
1 Welcome
4 Welcome
7 Welcome
dtype: object
Program to create a Series object that stores the initial budget allocated (75000/- each ) for the four quarters of the year: Q1, Q2, Q3, Q4.
import pandas as pd
S=pd.Series(75000,index=['Q1','Q2','Q3','Q4'])
print(S)
Output
Q1 75000
Q2 75000
Q3 75000
Q4 75000
dtype: int64
Creating Series Objects – Additional Functionality:
(i) Specifying/Adding NaN values in a Series Object:
When we need to create a Series object of a certain size but do not have complete data, we can fill the missing data with a NaN (Not a Number) value. The legal empty value NaN is defined in the NumPy module; we can use np.nan to specify a missing value (the alias np.NaN seen in older material was removed in NumPy 2.0), or use None.
>>> import numpy as np
>>> S=pd.Series([10,"Hai",np.nan,2.3,np.nan])
>>> S
0 10
1 Hai
2 NaN
3 2.3
4 NaN
dtype: object
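The following sketch shows how missing values behave in a purely numeric Series, where None is converted to NaN. The data values are arbitrary; isna( ) and count( ) are standard Pandas methods.

```python
import numpy as np
import pandas as pd

# None and np.nan both end up as NaN in a numeric Series.
s = pd.Series([10, None, 2.3, np.nan])

missing = s.isna()        # True where a value is missing
present = s.count()       # number of non-NaN values

print(s.dtype)
print(missing.tolist())
print(present)
```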
(ii) Specifying index(es) as well as data with Series( ):
Both values and indexes are sequences. None is taken by default, if you skip these parameters.
Syntax:<Series Object> = pandas.Series(data=None,index=None)
>>> stu=["Kamal","Mahesh","Jhansi"]
>>> marks=[76,82,79]
>>> S=pd.Series(data=marks,index=stu)
>>> S
Kamal 76
Mahesh 82
Jhansi 79
dtype: int64
We can also use a loop (here, a list comprehension) to define the index sequence.
>>> S1=pd.Series(range(1,20,4),index=[vowel for vowel in 'aeiou'])
>>> S1
a 1
e 5
i 9
o 13
u 17
dtype: int64
Note: If we specify indexes explicitly using an index sequence, the number of indexes must equal the number of values in the data array; providing too few or too many indexes raises a ValueError.
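A minimal sketch of the length rule in the note above: three values with only two labels. The names reuse the earlier example, and the try/except is there only to display the error instead of stopping the program.

```python
import pandas as pd

raised = False
try:
    # Three data values but only two index labels.
    pd.Series([76, 82, 79], index=['Kamal', 'Mahesh'])
except ValueError as err:
    raised = True
    print("ValueError:", err)
```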
(iii) Specify Data Type along with data and index:
<Series Object> = pandas.Series(data=None, index=None, dtype=None)
None is the default value taken for any parameter for which no value is provided.
If we do not specify a datatype, the nearest datatype that can store the given values is chosen. We can specify our own datatype by passing a NumPy datatype in the dtype argument.
>>> stu=["Kamal","Mahesh","Jhansi"]
>>> marks=[76,82,79]
>>> S=pd.Series(data=marks,index=stu,dtype=np.float64)
>>> S
Kamal 76.0
Mahesh 82.0
Jhansi 79.0
dtype: float64
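Besides fixing the datatype at creation time, an existing Series can be converted with astype( ). This is a small sketch using the same marks; astype( ) returns a new Series and leaves the original as it is.

```python
import pandas as pd

marks = pd.Series([76, 82, 79],
                  index=["Kamal", "Mahesh", "Jhansi"],
                  dtype="float64")      # dtype chosen at creation

as_int = marks.astype("int64")          # converted copy

print(marks.dtype)
print(as_int.dtype)
```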
(iv) Using a Mathematical Function/Expression to Create Data Array in Series( ):
<Series Object>=pandas.Series(index=None, data=<function/expression>)
>>> a=[5,10,15,20]
>>> S=pd.Series(data=a*2) # the Python list a is replicated 2 times
>>> S
0 5
1 10
2 15
3 20
4 5
5 10
6 15
7 20
dtype: int64
>>> S=pd.Series(index=a,data=a*2)
ValueError: Length of values (8) does not match length of index (4)
>>> m=np.arange(9,13)
>>> m
array([ 9, 10, 11, 12])
>>> S2=pd.Series(index=m,data=m*2)
>>> S2
9 18
10 20
11 22
12 24
dtype: int32
>>> S3=pd.Series(index=m,data=m**2)
>>> S3
9 81
10 100
11 121
12 144
dtype: int32
Indices need not be unique in Pandas Series Object. This will only cause an error if/when you perform an operation that requires unique indices.
>>> val=[10.5,12,"Welcome"]
>>> S=pd.Series(data=val,index=['a','b','a'])
>>> S
a 10.5
b 12
a Welcome
dtype: object
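The sketch below shows what a lookup returns when a label is repeated, using the same data as above. The is_unique check is one simple way to find out whether an index has duplicates.

```python
import pandas as pd

s = pd.Series([10.5, 12, "Welcome"], index=['a', 'b', 'a'])

both = s['a']                 # duplicate label: a Series with two entries
single = s['b']               # unique label: a single value
unique = s.index.is_unique    # False because 'a' appears twice

print(both)
print(single)
print(unique)
```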
Series Object Attributes: When we create a Series type object, all information related to it is available through attributes.
Syntax: <Series object>.<attribute name>
Some common attributes:
Attribute | Description |
index | The index(axis labels) of the Series |
values | Return Series as ndarray or ndarray-like (data) depending on the dtype |
dtype | Return the dtype object of the underlying data (datatype) |
shape | Return a tuple of the shape of the underlying data |
nbytes | Return the number of bytes in the underlying data |
ndim | Return the number of dimensions of the underlying data |
size | Return the number of elements in the underlying data |
itemsize | Return the size in bytes of the dtype of one item (not available on Series in current Pandas versions; use <Series object>.dtype.itemsize instead) |
hasnans | Return True if there are any NaN values; otherwise return False |
empty | Return True if the Series object is empty, False otherwise. |
Consider the following Series Object:
>>> Marks=[34,33,np.nan,38,40]
>>> Exams=["CT1","CT2","CT3","CT4","CT5"]
>>> S=pd.Series(Marks,index=Exams)
>>> S
CT1 34.0
CT2 33.0
CT3 NaN
CT4 38.0
CT5 40.0
dtype: float64
(i) index :
>>> S.index
Index(['CT1', 'CT2', 'CT3', 'CT4', 'CT5'], dtype='object')
(ii) values:
>>> S.values
array([34., 33., nan, 38., 40.])
(iii) dtype:
>>> S.dtype
dtype('float64')
(iv) shape:
>>> S.shape
(5,)
(v) nbytes:
>>> S.nbytes # 5 elements x 8 bytes for float64
40
(vi) ndim:
>>> S.ndim # Series is One Dimensional
1
(vii) size:
>>> S.size # 5 elements
5
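(viii) itemsize:
>>> S.itemsize # not available in current Pandas versions
AttributeError: 'Series' object has no attribute 'itemsize'
>>> S.dtype.itemsize # size in bytes of one float64 item
8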
(ix) hasnans:
>>> S.hasnans
True
(x) empty:
>>> S.empty
False
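The attributes above can be checked together in a short script. This sketch rebuilds the same Series and reads the item size from the dtype, which is an assumption about how to replace the missing itemsize attribute in current versions.

```python
import numpy as np
import pandas as pd

s = pd.Series([34, 33, np.nan, 38, 40],
              index=["CT1", "CT2", "CT3", "CT4", "CT5"])

item_bytes = s.dtype.itemsize     # bytes used by one float64 value
total_bytes = s.nbytes            # bytes used by all the values

print(s.shape, s.ndim, s.size)
print(item_bytes, total_bytes)
print(s.hasnans, s.empty)
```

The total number of bytes equals the number of elements multiplied by the size of one item.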
Other example related to index:
>>> S3=pd.Series(data=np.arange(5,25,4))
>>> S3.index
RangeIndex(start=0, stop=5, step=1)
>>> a=np.arange(9,13)
>>> S4=pd.Series(index=a,data=a*2)
>>> S4.index
Index([9, 10, 11, 12], dtype='int64') # Pandas versions before 2.0 display Int64Index([9, 10, 11, 12], dtype='int64')
Some functions
Function | Use |
len( ) | To get total number of elements (including NaN values) |
count( ) | To get the count of non-NaN values in a series object |
type( ) | To know the data type of an object |
>>> len(S)
5
>>> S.count()
4
>>> type(S)
<class 'pandas.core.series.Series'>
Accessing a Series Object and its Elements
We can access the index and the data of a Series separately, and we can also access individual elements and slices.
Let us take some example Series objects.
>>> S1=pd.Series(data=[5,6,7,8,9,10,11,12],
index=['May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'])
>>> S2=pd.Series(data=[75,72,89],
index=['Raj','Kamal','Nani'])
>>> S3=pd.Series([87,99,52],index=[11,12,13])
(a) Accessing Individual Elements: With index value or with its position.
Syntax:<Series Object name>[<valid index>]
>>> S1['Jul']
7
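Note: (1) If the Series object has duplicate indexes, then giving such an index with the Series object returns all the entries with that index.
(2) If the indexes are of string type, a position value also works (a fallback that newer Pandas versions deprecate in favour of <Series object>.iloc[<position>]). With integer indexes, the number is treated as a label, so a missing label raises KeyError.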
>>> S1[2]
7
>>> S3[11]
87
>>> S3[0]
KeyError
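To avoid the ambiguity between labels and positions, Pandas provides two explicit accessors: loc for labels and iloc for positions. The sketch below applies them to shortened versions of the example objects.

```python
import pandas as pd

s1 = pd.Series([5, 6, 7], index=['May', 'Jun', 'Jul'])
s3 = pd.Series([87, 99, 52], index=[11, 12, 13])

by_label = s1.loc['Jul']      # access by index label
by_position = s1.iloc[2]      # access by position (0-based)

first_by_label = s3.loc[11]   # 11 is a label here
first_by_position = s3.iloc[0]

print(by_label, by_position)
print(first_by_label, first_by_position)
```

With iloc, position 0 always means the first element, even when the index labels are integers such as 11, 12, 13.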
(b) Extracting Slices from Series Object:
With integer start and end values, slicing takes place position-wise and not index-wise in a Series object.
All individual elements have position numbers starting from 0.
Syntax: <object>[start:end:step]
(the end value is excluded)
A slice of a Series object is also a Pandas Series object.
>>> S1[1:5]
Jun 6
Jul 7
Aug 8
Sep 9
dtype: int64
>>> S1[10:12] # position-wise, not index-wise
Series([], dtype: int64)
>>> S2[::-1] #slice with values reversed
Nani 89
Kamal 72
Raj 75
dtype: int64
>>> S3
11 87
12 99
13 52
dtype: int64
>>> S1
May 5
Jun 6
Jul 7
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12
dtype: int64
>>> S1[0:2:2]
May 5
dtype: int64
>>> S1[2:6:3]
Jul 7
Oct 10
dtype: int64
>>> S1[1:9:2]
Jun 6
Aug 8
Oct 10
Dec 12
dtype: int64
>>> S1[0::2]
May 5
Jul 7
Sep 9
Nov 11
dtype: int64
>>> S1[::-2]
Dec 12
Oct 10
Aug 8
Jun 6
dtype: int64
>>> S1[21:2:1]
Series([], dtype: int64)
>>> S1[6:1:-2]
Nov 11
Sep 9
Jul 7
dtype: int64
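Slices can also be written with labels through loc. The sketch below compares a positional slice with a label slice on the same S1 data; note that a label slice includes its end label, unlike a positional slice.

```python
import pandas as pd

s1 = pd.Series([5, 6, 7, 8, 9, 10, 11, 12],
               index=['May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

by_position = s1.iloc[1:3]        # positions 1 and 2; position 3 is excluded
by_label = s1.loc['Jun':'Aug']    # 'Jun' to 'Aug'; the end label is included

print(by_position.tolist())
print(by_label.tolist())
```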
Operations on Series Object
(a) Modifying Elements of Series Object:
Syntax: <SeriesObject>[<index>]=<new data value>
Above assignment will change the data value of the given index in the Series object.
<SeriesObject>[start:stop]=<new data value>
Above assignment will replace all the values falling in given slice.
>>> S2=pd.Series(data=[75,72,89],index=['Raj','Kamal','Nani'])
>>> S2
Raj 75
Kamal 72
Nani 89
dtype: int64
>>> S2["Raj"]=94
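>>> S2.iloc[1]=99 # set by position; plain S2[1]=99 relies on a positional fallback that newer Pandas versions deprecate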
>>> S2
Raj 94
Kamal 99
Nani 89
dtype: int64
>>> S1[1:6]=25
Renaming Indexes:
Syntax:<Object>.index=<new index array>
>>> S3=pd.Series([87,99,52],index=[11,12,13])
>>> S3
11 87
12 99
13 52
dtype: int64
>>> S3.index=['First','Second','Third']
>>> S3
First 87
Second 99
Third 52
dtype: int64
>>> S3.index=['One','Two']
ValueError
head( ) and tail( ) functions:
The head( ) function fetches the first n rows from a Pandas object and the tail( ) function returns the last n rows from a Pandas object.
Syntax:
<pandas object>.head([n])
<pandas object>.tail([n])
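Note: If you do not provide any value for n, head( ) and tail( ) return the first 5 and the last 5 rows. A negative n returns all rows except the last n (for head) or all rows except the first n (for tail).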
>>> S1
May 5
Jun 6
Jul 7
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12
dtype: int64
>>> S1.head(3)
May 5
Jun 6
Jul 7
dtype: int64
>>> S1.head()
May 5
Jun 6
Jul 7
Aug 8
Sep 9
dtype: int64
>>> S1.head(77)
May 5
Jun 6
Jul 7
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12
dtype: int64
>>> S1.head(-2)
May 5
Jun 6
Jul 7
Aug 8
Sep 9
Oct 10
dtype: int64
>>> S1.tail(3)
Oct 10
Nov 11
Dec 12
dtype: int64
>>> S1.tail()
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12
dtype: int64
>>> S1.tail(22)
May 5
Jun 6
Jul 7
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12
dtype: int64
>>> S1.tail(-3)
Aug 8
Sep 9
Oct 10
Nov 11
Dec 12
dtype: int64
Vector operations on Series Object:
Vector operations mean that if we apply a function or expression, it is applied individually to each item of the object. As Series objects are built upon NumPy arrays (ndarrays), they also support vectorised operations just like ndarrays.
>>> S=pd.Series([2,3,4,5])
>>> S
0 2
1 3
2 4
3 5
dtype: int64
>>> S+2
0 4
1 5
2 6
3 7
dtype: int64
>>> S-1
0 1
1 2
2 3
3 4
dtype: int64
>>> S*3
0 6
1 9
2 12
3 15
dtype: int64
>>> S/2
0 1.0
1 1.5
2 2.0
3 2.5
dtype: float64
>>> S>3
0 False
1 False
2 True
3 True
dtype: bool
>>> S5=pd.Series([2,3,4,5])
>>> S6=S5**2
>>> S6
0 4
1 9
2 16
3 25
dtype: int64
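NumPy functions can be applied to a whole Series in the same vectorised way. The sketch below also compares a vectorised expression with an explicit loop to show that both give the same values; the data is chosen only for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 9, 16])

roots = np.sqrt(s)                 # applied to every element, result is a Series
doubled = s * 2                    # vectorised expression
looped = [x * 2 for x in s]        # the same work written as a loop

print(roots.tolist())
print(doubled.tolist())
```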
Arithmetic on Series Object
We can do arithmetic such as addition, subtraction and division with two Series objects; the result is calculated on the corresponding items of the two objects given in the expression.
The operation is performed only on matching indexes; for non-matching indexes, it produces NaN (Not a Number).
If the data items at two matching indexes are not compatible for the operation, an error is raised (for example, a TypeError when adding a number and a string).
>>> Ob4=pd.Series(["Welcome","to","World"])
>>> Ob5=pd.Series(["I","am","Human"])
>>> Ob1+Ob4 # Ob1 is a Series of integers (its definition is not shown here)
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Note: When we perform arithmetic operations on two Series objects, the data is first aligned on the basis of matching indexes (this is called data alignment in Pandas) and then the arithmetic is performed; for non-overlapping indexes, the result is NaN (Not a Number).
We can store the result of object arithmetic in another object, which will also be a Series object.
>>> Ob3=Ob1+Ob2
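Data alignment is easiest to see with two small Series whose indexes only partly overlap. In this sketch the objects ob1 and ob2 are made-up examples (not the Ob1 and Ob2 of the slides), and add( ) with fill_value is shown as one possible way to avoid NaN for the non-matching labels.

```python
import pandas as pd

ob1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
ob2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = ob1 + ob2                      # 'a' and 'd' have no partner: NaN
filled = ob1.add(ob2, fill_value=0)    # a missing partner is treated as 0

print(total)
print(filled)
```

Only the labels 'b' and 'c' exist in both objects, so only those two positions get a real sum in total.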
Filtering Entries:
We can filter entries from a Series object using expressions of Boolean type (i.e., expressions that result in the Boolean values True/False).
When we apply a comparison operator directly on a Pandas Series object, it works as a vectorized operation and applies the check to each individual element of the Series object.
Syntax: <Series Object>[<Boolean Expression on Series Object>]
Ex: >>> S=pd.Series([5,10,20,25,30])
[Figure: the Series object, the result of a vectorized comparison on it, and the filtered result]
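The filtering steps can be sketched as follows, using the Series from the example. The threshold 15 is an arbitrary value chosen for this illustration.

```python
import pandas as pd

s = pd.Series([5, 10, 20, 25, 30])

mask = s > 15          # vectorised comparison: one True/False per element
filtered = s[mask]     # keeps only the entries where the mask is True

print(mask.tolist())
print(filtered)
```

The filtered result keeps the original index labels of the selected entries.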
Sorting Series Values
We can sort a Series object on the basis of its values or its indexes.
Sorting on the Basis of Values:
Syntax:
<Series object>.sort_values([ascending=True/False])
The argument ascending is optional and if skipped, it takes the value True by default.
>>> S=pd.Series([2500,1200,1700,-500,700])
>>> S.sort_values(ascending=True)
# or >>> S.sort_values( )
>>> S.sort_values(ascending=False)
# will display in descending order
Note : To make the sorted values permanent in the Series object, use “inplace=True”.
Ex:
>>> S.sort_values(ascending=True,inplace=True)
# or >>> S.sort_values(inplace=True)
# will sort the Series in ascending order permanently.
Sorting on the Basis of Indexes: sort_index()
Syntax: <Series object>.sort_index([ascending=True/False])
The argument ascending is optional and if skipped, it takes the value True by default.
Ex: Obj=pd.Series([2500,-500,3500,1500],index=['C','B','D','A'])
Note: To make the sorted order permanent in the Series object, use “inplace=True”.
>>> Obj.sort_index(ascending=False,inplace=True)
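The sketch below runs both kinds of sorting on the example data from this section. Without inplace=True, each call returns a new sorted Series and the original stays in its original order.

```python
import pandas as pd

s = pd.Series([2500, 1200, 1700, -500, 700])
ascending = s.sort_values()                     # new Series, smallest first
descending = s.sort_values(ascending=False)     # new Series, largest first

obj = pd.Series([2500, -500, 3500, 1500], index=['C', 'B', 'D', 'A'])
by_index = obj.sort_index()                     # ordered by label: A, B, C, D

print(ascending.tolist())
print(descending.tolist())
print(by_index)
```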
Difference between NumPy Arrays and Series Objects
ndarrays | Series Objects |
We can perform vectorised operations only if the shapes of the two ndarrays match (or can be broadcast to a common shape), otherwise an error (ValueError) is raised | In case of vectorised operations, the data of two Series objects is aligned as per matching indexes and the operation is performed on them; for non-matching indexes, NaN is returned. |
The indexes are always numeric starting from 0 onwards | Series objects can have any type of indexes, including numbers (not necessarily starting from 0), letters, labels, strings, etc. |
Reindexing: To create a similar object with a different order of the same indexes.
Syntax: <Series Object>=<Object>.reindex(<sequence with new order of indexes>)
>>> Obj2=Obj1.reindex(['C','A','B','D'])
>>> Obj3=Obj1.reindex(['D','B','Mar','Apr'])
With this, the same data values and their indexes are stored in the new object in the order given to reindex( ). Any index in the new sequence that is not present in the original object gets the value NaN.
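A minimal sketch of reindexing follows. The slides do not show how Obj1 was defined, so the data used for obj1 here is an assumption borrowed from the sorting example.

```python
import pandas as pd

# Assumed data for illustration; the original Obj1 is not shown.
obj1 = pd.Series([2500, -500, 3500, 1500], index=['C', 'B', 'D', 'A'])

obj2 = obj1.reindex(['A', 'B', 'C', 'D'])          # same labels, new order
obj3 = obj1.reindex(['D', 'B', 'Mar', 'Apr'])      # 'Mar' and 'Apr' are new labels

print(obj2)
print(obj3)
```

The labels 'Mar' and 'Apr' do not exist in obj1, so their values in obj3 are NaN, and the original obj1 is left unchanged.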
Dropping Entries from an Axis
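To remove an entry from a Series object, use drop( ). By default, drop( ) returns a new Series and leaves the original unchanged; with inplace=True, the original object itself is modified.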
Syntax: <Series Object>.drop(<index to be removed>)
>>> Obj.drop('C',inplace=True)
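The difference between the two ways of calling drop( ) can be sketched as follows, again using the example data from the sorting section.

```python
import pandas as pd

obj = pd.Series([2500, -500, 3500, 1500], index=['C', 'B', 'D', 'A'])

smaller = obj.drop('C')        # returns a new Series; obj still contains 'C'
still_has_c = 'C' in obj.index

obj.drop('B', inplace=True)    # modifies obj itself and returns None

print(smaller)
print(obj)
```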