1 of 25

Data Handling using NumPy

Chapter 5: Understanding Data

By: Bhoop Singh, PGT CS

JNV Rangareddy

2 of 25

What is Data?

Data refers to unorganized facts that can be processed to generate meaningful results or information. Data is a collection of characters, numbers, and other symbols that represent the values of some situations or variables, e.g.:

  • Name, age, gender or contact details of a person
  • Transaction data generated through banking, ticketing, shopping, etc., whether online or offline
  • Images, graphics, animations, audio, video
  • Documents and web pages
  • Online posts, comments and messages
  • Signals generated by sensors
  • Satellite data including meteorological data, communication data, earth observation data, etc.

3 of 25

Importance of Data (used for making decisions)

  • Meteorological offices continuously monitor satellite data for any upcoming cyclone or heavy rain.
  • The dynamic pricing used by airlines and railways is another example, where the price is decided based on the relationship between demand and supply.
  • Cab booking apps increase or decrease the price based on the demand for cabs at a particular time.
  • Certain restaurants offer discounted prices (called happy hours). They decide when and how much discount to offer by analyzing sales data for different time periods.
  • The electronic voting machines are used for recording the votes cast. Subsequently, the voting data from all the machines are accumulated to declare election results in a short time as compared to manual counting of ballot papers.
  • Scientists record data while doing experiments to calculate and compare results.
  • Pharmaceutical companies record data while trying out a new medicine to see its effectiveness.
  • Libraries maintain data about books in the library and the membership of the library.
  • Search engines give us results after analyzing large volumes of data available on websites across the World Wide Web.
  • Weather alerts are generated by analyzing data received from various satellites.

4 of 25

Types of Data

  1. Structured Data:
    • It is organized
    • Can be recorded in a well defined format
    • Usually stored in a tabular format, having rows and columns
    • Each column represents a particular parameter called Attribute/Characteristics/Variable
    • Each row represents data of an observation for different attributes.
    • e.g.:

Model No | Product Name    | Price | Discount (%) | Stock
ABC1     | Water Bottle    | 126   | 8            | 13
ABC2     | Melamine Plates | 320   | 5            | 45
ABC3     | Dinner Set      | 4200  | 10           | 8
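The table above can be sketched in NumPy as a structured array, where each field is an attribute (column) and each record is one observation (row). This is only an illustration: the field names (`model_no`, `discount_pct`, etc.) and the discounted-price calculation are assumptions added here, not part of the original table.

```python
import numpy as np

# Each field (column) is an attribute; each record (row) is one observation.
# Field names are chosen for this sketch only.
product_dtype = np.dtype([
    ("model_no", "U10"),
    ("product_name", "U30"),
    ("price", "i4"),
    ("discount_pct", "i4"),
    ("stock", "i4"),
])

products = np.array(
    [
        ("ABC1", "Water Bottle", 126, 8, 13),
        ("ABC2", "Melamine Plates", 320, 5, 45),
        ("ABC3", "Dinner Set", 4200, 10, 8),
    ],
    dtype=product_dtype,
)

# Selecting a field returns the whole column (one attribute).
prices = products["price"]

# Selecting an index returns the whole row (one observation).
first_row = products[0]

# A column-wise calculation: price after discount for every product.
discounted_prices = products["price"] * (100 - products["discount_pct"]) / 100

print(prices)
print(first_row)
print(discounted_prices)
```

Because the data has a fixed, well-defined format, a whole column can be processed in a single expression.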

5 of 25

Types of Data

  2. Unstructured Data:
    • Doesn’t have a recognizable structure
    • It is unorganized and raw
    • It can be textual or non-textual
    • It is not in the form of a traditional tabular structure
    • It is sometimes described with the help of its metadata (i.e. data about data), e.g. the different parts of an email such as subject, recipient, main body, attachment, etc.
    • Some examples:
      1. Various types of news items such as pictures, text, graphs, etc. in a newspaper.
      2. Data of an email including subject, recipient, body, attachments etc.
      3. Web pages consisting of text and multimedia elements.
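As a small illustration of the email example, the sketch below uses Python's standard `email` module to separate the metadata of a message (sender, recipient, subject) from its free-text body. The message itself and the addresses in it are made up for this example.

```python
from email import message_from_string

# A made-up email: header lines, a blank line, then the free-text body.
raw_email = """\
From: teacher@example.com
To: student@example.com
Subject: Holiday homework

Please complete the exercises of Chapter 5 before Monday.
"""

message = message_from_string(raw_email)

# The header fields are metadata: data about the message.
metadata = {
    "sender": message["From"],
    "recipient": message["To"],
    "subject": message["Subject"],
}

# The body is unstructured text with no fixed format.
body = message.get_payload()

print(metadata)
print(body)
```

The metadata fields can be stored in rows and columns, while the body itself remains unstructured text.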

6 of 25

Data Collection

Data collection means identifying data that is already available or collecting it from appropriate sources.

  • Hospitals are collecting data about patients for improving their services.
  • Shopping malls are collecting data about the items being purchased by people. On analyzing such data, suppose it appears that bed sheets and groceries are frequently bought together. Hence, the shop owner may decide to display bed sheets near the grocery section in the mall to increase the sales.
  • A political analyst may look at the posts and messages on a social media platform and analyze them to gauge public opinion before an election.
  • Organizations like World Bank and International Monetary Fund (IMF) are collecting data related to various economic parameters from different countries for making economic forecasts.

7 of 25

Data Storage

  • Data storage is the process of storing data on storage devices so that it can be retrieved later.
  • Due to advancements in technology, a large amount of data is being generated.
  • Data can be stored on hard disks, solid state drives, CDs/DVDs, tape drives, etc.
  • Data like images, documents, audio, video, etc. are stored as files in computers. Likewise, school or hospital data are stored in data files.
  • By using computers, we can add, modify or delete data in these files, or process these data files to get results.
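The operations mentioned above (store, add, modify, delete) can be sketched with a small CSV data file. This is a minimal sketch using Python's standard `csv` module; the file name `students.csv`, the column names and the student records are all invented for the example.

```python
import csv
import os
import tempfile

FIELDS = ["roll_no", "name", "marks"]  # hypothetical columns


def read_rows(path):
    """Read every record of the data file into a list of dictionaries."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def write_rows(path, rows):
    """Write all records back to the data file, replacing its contents."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)


# The file is created in a temporary folder so the sketch leaves no clutter.
path = os.path.join(tempfile.mkdtemp(), "students.csv")

# Store: write the initial records to the file.
write_rows(path, [
    {"roll_no": "1", "name": "Asha", "marks": "78"},
    {"roll_no": "2", "name": "Ravi", "marks": "64"},
])

rows = read_rows(path)

# Add a new record.
rows.append({"roll_no": "3", "name": "Meena", "marks": "91"})

# Modify an existing record.
for row in rows:
    if row["roll_no"] == "2":
        row["marks"] = "70"

# Delete a record.
rows = [row for row in rows if row["roll_no"] != "1"]

write_rows(path, rows)

# Retrieve: the stored data can be read back later.
stored = read_rows(path)
print(stored)
```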

8 of 25

What is Information?

Data on its own has very little meaning or value. However, when data is arranged in a meaningful manner, it becomes INFORMATION.

So, Information is organized or classified data, which has some meaningful values for the receiver. Information is the processed data on which decisions and actions are based.

9 of 25

Data Processing

An operation performed upon raw facts (data), such as collection, recording, organization, storage or alteration, to convert them into useful information.

Data processing usually involves three basic activities, known as the Data Processing Cycle:

  1. INPUT: Data Collection, Data Encoding, Data Transmission, Data Communication
  2. PROCESSING: Classification, Storing, Calculation
  3. OUTPUT: Decoding, Communication, Retrieval

10 of 25

INPUT

Input is the process through which the collected data is transformed into a form that the computer can understand. It is a very important step because correct output depends completely on the input data.

Activities carried out in Data Input are:

  1. Data Collection: Gathering raw facts and preparing them for the input process.
  2. Data Encoding: Process of converting raw facts into a form that is easier to process.
  3. Data Transmission: Sending input data to the processor and carrying it across various components.
  4. Data Communication: Set of activities which allow the data to be sent from one data processing system to another.
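Data encoding can be illustrated with a small NumPy sketch in which text answers are converted into integer codes that are easier to process. The survey answers are invented for this example, and `np.unique` is just one possible way of encoding.

```python
import numpy as np

# Raw facts as collected (made-up survey answers).
raw_answers = np.array(["Yes", "No", "Yes", "Yes", "No"])

# Encoding: each distinct answer gets an integer code.
# categories holds the sorted distinct values; codes holds, for each answer,
# the position of its value in categories.
categories, codes = np.unique(raw_answers, return_inverse=True)

print(categories)  # the distinct values
print(codes)       # the encoded data

# The codes can be turned back into the original values when needed.
restored = categories[codes]
print(restored)
```

Numeric codes take less space and are quicker to compare and count than the original text.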

11 of 25

PROCESSING

Processing transforms raw data into information by applying data manipulation techniques.

Some of the Data Manipulation Techniques are:

  1. Classification: The data is classified into different groups and subgroups so that each item can be handled properly.
  2. Storing: The data is arranged in an order so that it can be accessed quickly when required.
  3. Calculation: Operations are performed on the numeric data to get the required results.
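The three techniques above can be sketched with NumPy on a small set of marks. The marks and the pass mark of 40 are assumptions made only for this illustration.

```python
import numpy as np

marks = np.array([72, 35, 88, 51, 29, 64])  # made-up marks
PASS_MARK = 40                              # assumed pass mark

# Classification: put each value into a group.
passed = marks[marks >= PASS_MARK]
failed = marks[marks < PASS_MARK]

# Storing in order: arrange the values so they can be looked up quickly.
ordered = np.sort(marks)

# Calculation: operate on the numeric data to get the required results.
total = marks.sum()
average = marks.mean()

print(passed, failed)
print(ordered)
print(total, average)
```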

12 of 25

OUTPUT

After the processing step, output is generated. Output is where the processed data is presented to users as useful information.

Activities carried out in output are:

  1. Decoding: Encoded data is converted into a form that is easier to understand.
  2. Communication: Generated output is sent to different places to be used by individuals.
  3. Retrieval: Output stored on the storage media can be retrieved at any time.

13 of 25

STORAGE STAGE

An additional step in the Data Processing Cycle is the storage stage: the information that has been processed is stored for future use.

14 of 25

Example of Data Processing

INPUT: Student details like name, address, qualification, marks, mobile number, photo and signature, center choice, and online fee payment details like credit/debit card, net banking or other mode of payment, etc.

PROCESSING: The filled-in details are checked for correctness of the data received, eligibility as per the advertisement, whether the fees have been paid, and whether the photo and signature have been uploaded. Then a roll number is generated and the applicant is added to the list of eligible applicants.

OUTPUT: Examination admit card specifying the roll number, center address, and date and time of the test.
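The same input, processing and output steps can be sketched in Python. Everything here is hypothetical: the function `process_application`, the field names, the starting roll number 1001 and the sample applicants are invented to mirror the example, and a real admission system would perform many more checks.

```python
def process_application(application, roll_number):
    """PROCESSING: check the details and return an admit card, or None."""
    required = ["name", "qualification", "center_choice"]
    details_complete = all(application.get(field) for field in required)
    accepted = (
        details_complete
        and application.get("fee_paid", False)
        and application.get("photo_uploaded", False)
        and application.get("signature_uploaded", False)
    )
    if not accepted:
        return None
    # OUTPUT: the admit card handed back to the applicant.
    return {
        "roll_number": roll_number,
        "name": application["name"],
        "center": application["center_choice"],
    }


# INPUT: details filled in by two made-up applicants.
applications = [
    {"name": "Asha", "qualification": "Class XII", "center_choice": "Hyderabad",
     "fee_paid": True, "photo_uploaded": True, "signature_uploaded": True},
    {"name": "Ravi", "qualification": "Class XII", "center_choice": "Hyderabad",
     "fee_paid": False, "photo_uploaded": True, "signature_uploaded": True},
]

admit_cards = []
next_roll_number = 1001  # assumed starting roll number
for application in applications:
    card = process_application(application, next_roll_number)
    if card is not None:
        admit_cards.append(card)
        next_roll_number += 1

print(admit_cards)
```

Only the first applicant receives an admit card, because the second has not paid the fee.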

15 of 25

Special Techniques for Data Processing

  1. Measures of Central Tendency: These tell us what is normal or ‘average’ for a set of data. A measure of central tendency is a single value that gives some idea about the data. They include:
    1. Mean: The average score of the data set.
    2. Mode: The most frequently occurring value.
    3. Median: The middle score after the scores have been arranged in numerical order.
  2. Measures of Variability: These describe how spread out the data is. They refer to the spread or variation of the values around the mean. They are also called measures of dispersion, and they indicate the degree of diversity in a data set and the differences within the group. They are:
    • Range: A single number representing the spread of the data (maximum value minus minimum value).
    • Standard Deviation: A number representing how far, on average, the scores lie from the mean.
    • Variance: A number indicating how spread out the data is (the square of the standard deviation).

16 of 25

Measures of Central Tendency

1. Mean:

It is the average of the numeric values of an attribute. For given n values x1, x2, x3, ..., xn, the mean is computed as

$$\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

E.g.: Assume that the heights (in cm) of students in a class are [90, 102, 110, 115, 85, 90, 100, 110, 110]. The mean or average height of the class is

$$\bar{x} = \frac{90 + 102 + 110 + 115 + 85 + 90 + 100 + 110 + 110}{9} = \frac{912}{9} \approx 101.33 \text{ cm}$$
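The mean of the heights can be checked with NumPy. The sketch below computes it both by hand (sum divided by count) and with `np.mean`; the variable names are chosen for this example.

```python
import numpy as np

heights = np.array([90, 102, 110, 115, 85, 90, 100, 110, 110])

# By hand: add all the values and divide by how many there are.
mean_manual = heights.sum() / heights.size

# The same calculation with NumPy's built-in function.
mean_numpy = np.mean(heights)

print(heights.sum())          # 912
print(round(mean_numpy, 2))   # 101.33
```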

17 of 25

Measures of Central Tendency

2. Mode:

The value that appears the most number of times in the given data of an attribute/variable is called the Mode. It is computed on the basis of the frequency of occurrence of distinct values in the given data. A data set has no mode if each value occurs only once. There may be multiple modes in the data if more than one value has the same highest frequency. The mode can be found for numeric as well as non-numeric data.

Example: For the data [90, 102, 110, 115, 85, 90, 100, 110, 110], the mode is 110, since it occurs three times.
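NumPy has no dedicated mode function, so one possible sketch counts the distinct values with `np.unique`. The helper name `modes` is invented here, and it assumes a non-empty data set. It returns a list so that no mode and multiple modes can both be represented.

```python
import numpy as np


def modes(values):
    """Return the list of most frequent values (empty if there is no mode)."""
    unique_values, counts = np.unique(values, return_counts=True)
    highest = counts.max()
    if highest == 1:
        # Every value occurs only once, so the data has no mode.
        return []
    # Keep every value whose frequency equals the highest frequency.
    return unique_values[counts == highest].tolist()


heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]
print(modes(heights))                                    # [110]
print(modes(["red", "blue", "red", "blue", "green"]))    # two modes
print(modes([1, 2, 3]))                                  # no mode
```

The second call shows that the mode also works for non-numeric data.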

18 of 25

Measures of Central Tendency

3. Median:

It is computed for a single attribute/variable at a time. When all the values are sorted in ascending or descending order, the middle value is called the Median. When there is an odd number of values, the median is the value at the middle position. If the list has an even number of values, the median is the average of the two middle values.

Example: For the data [90, 102, 110, 115, 85, 90, 100, 110, 110], the median is found as follows.

Arrange the values in ascending order: 85, 90, 90, 100, 102, 110, 110, 110, 115

There are 9 values, so the median is the 5th value: 102
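The two cases (odd and even number of values) can be sketched as follows. The helper name `median` is chosen for this example; `np.median` gives the same results.

```python
import numpy as np


def median(values):
    """Middle value of the sorted data (average of two middles if even)."""
    ordered = np.sort(values)
    n = ordered.size
    middle = n // 2
    if n % 2 == 1:
        # Odd count: the value at the middle position.
        return float(ordered[middle])
    # Even count: the average of the two middle values.
    return float((ordered[middle - 1] + ordered[middle]) / 2)


heights = [90, 102, 110, 115, 85, 90, 100, 110, 110]
print(median(heights))               # 102.0
print(median([85, 90, 100, 102]))    # 95.0
print(np.median(heights))            # 102.0
```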

19 of 25

Measures of Variability

1. Range:

It is the difference between the maximum and minimum values of the data. Range can be calculated only for numerical data. It is a measure of dispersion and tells us about the coverage or spread of the data values, for example the difference in salaries of employees, marks of students, prices of toys, etc.

Example: For the data [90, 102, 110, 115, 85, 90, 100, 110, 110], the minimum value is 85 cm and the maximum value is 115 cm. Hence the range is 115 - 85 = 30 cm.
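A short NumPy check of the range, using the maximum and minimum directly and NumPy's `np.ptp` ("peak to peak") function:

```python
import numpy as np

heights = np.array([90, 102, 110, 115, 85, 90, 100, 110, 110])

# Range = maximum value - minimum value.
value_range = heights.max() - heights.min()

# np.ptp computes the same difference in one call.
value_range_ptp = np.ptp(heights)

print(heights.min(), heights.max())  # 85 115
print(value_range)                   # 30
```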

20 of 25

Measures of Variability

2. Standard Deviation:

It refers to the differences within the group or set of data of a variable. It is calculated as the positive square root of the average of the squared differences of each value from the mean of the data. A smaller standard deviation means the data are less spread out, whereas a larger standard deviation means the data are more spread out.

For given n values x1, x2, x3, ..., xn and their mean x̄, the standard deviation is represented as σ (sigma) and is computed as

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$$

21 of 25

2. Standard Deviation (contd.)

Example: Compute SD for the data [90,102,110,115,85,90,100,110,110]

Mean of the data = 101.33
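The remaining steps of this example can be worked out with NumPy. The sketch follows the definition given for standard deviation (square root of the average squared difference from the mean, dividing by n) and compares it with `np.std`, whose default setting also divides by n.

```python
import numpy as np

heights = np.array([90, 102, 110, 115, 85, 90, 100, 110, 110])

# Step 1: mean of the data.
mean = heights.mean()

# Step 2: squared difference of each value from the mean.
squared_diffs = (heights - mean) ** 2

# Step 3: average of the squared differences, then the positive square root.
sd_manual = np.sqrt(squared_diffs.sum() / heights.size)

# NumPy's built-in function (default ddof=0 divides by n).
sd_numpy = np.std(heights)

print(round(mean, 2))                  # 101.33
print(round(squared_diffs.sum(), 2))   # 938.0
print(round(sd_numpy, 2))              # 10.21
```

So the sum of squared differences is 938, and the standard deviation is the square root of 938 / 9, which is about 10.21 cm.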

22 of 25

Measures of Variability

3. Variance:

Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from its average value.

Variance is calculated by taking the difference between each number in the data set and the mean, then squaring the differences to make them positive, and finally dividing the sum of the squares by the number of values in the data set.

$$\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$$

where x_i is each value, x̄ is the mean and n is the number of values in the data set.

23 of 25

3. Variance (contd.) :
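As a worked illustration, the variance of the same height data used in the earlier examples can be computed with NumPy. This sketch divides by the number of values, as described above, which is also the default behaviour of `np.var`.

```python
import numpy as np

heights = np.array([90, 102, 110, 115, 85, 90, 100, 110, 110])

# Differences from the mean, squared so that they are all positive.
squared_diffs = (heights - heights.mean()) ** 2

# Sum of the squares divided by the number of values.
variance_manual = squared_diffs.sum() / heights.size

# NumPy's built-in function (default ddof=0 divides by n).
variance_numpy = np.var(heights)

print(round(variance_numpy, 2))           # 104.22
print(round(np.sqrt(variance_numpy), 2))  # 10.21, the standard deviation
```

The variance is 938 / 9, about 104.22, and its square root is the standard deviation.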

24 of 25


25 of 25