Data Handling using NumPy
Chapter 5: Understanding Data
By: Bhoop Singh, PGT CS
JNV Rangareddy
What is Data?
Data refers to unorganized facts that can be processed to generate meaningful results or information. Data is a collection of characters, numbers, and other symbols that represent the values of some situations or variables.
Importance of Data (used for making decisions)
Types of Data
Model No | Product Name | Price | Discount(%) | Stock |
ABC1 | Water Bottle | 126 | 8 | 13 |
ABC2 | Melamine Plates | 320 | 5 | 45 |
ABC3 | Dinner Set | 4200 | 10 | 8 |
Data Collection
Data collection means identifying already available data or collecting from the appropriate sources.
Data Storage
What is Information?
Data on its own has very little meaning or value. However, when data is arranged in a meaningful manner, it becomes INFORMATION.
So, Information is organized or classified data, which has some meaningful values for the receiver. Information is the processed data on which decisions and actions are based.
Data Processing
An operation performed upon raw facts (data), such as collection, recording, organization, storage or alteration, to convert them into useful information.
Data processing usually involves three basic activities, known as the data processing cycle:
INPUT
PROCESSING
OUTPUT
INPUT
INPUT is the process through which collected data is transformed into a form that the computer can understand. It is a very important step because correct output depends completely on the input data.
Activities carried out in Data Input are:
Data Collection
Data Encoding
Data Transmission
Data Communication
PROCESSING
Processing transforms raw data into information by applying data manipulation techniques.
Some of the Data Manipulation Techniques are:
Classification
Storing
Calculation
OUTPUT
After the processing step, output is generated. Output is where processed data is presented to users as useful information.
Activities carried out in output are:
Decoding
Communication
Retrieval
STORAGE STAGE
An additional step in the data processing cycle is storage: the information that has been processed is stored for future use.
Example of Data Processing
INPUT
Student details such as name, address, qualification, marks, mobile number, photo and signature, center choice, and online fee payment details (credit/debit card, net banking or other mode of payment), etc.
PROCESSING
The filled-in details are checked for correctness of the data received: whether the applicant is eligible as per the advertisement, whether the fees are paid, and whether the photo and signature are uploaded. Then a roll number is generated and the applicant is added to the list of eligible applicants.
OUTPUT
Examination admit card specifying roll number, center address, and date and time of the test.
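The three stages in this example can be sketched in a few lines of Python. This is only an illustration: the field names, the cut-off `MIN_MARKS`, the starting roll number and the helper functions `is_eligible` and `process` are all hypothetical and are not part of the original example.

```python
# INPUT: one filled-in application form (hypothetical field names and values)
applicant = {
    "name": "A. Student",
    "marks": 72,
    "fee_paid": True,
    "photo_uploaded": True,
    "signature_uploaded": True,
    "center_choice": "Center 1",
}

MIN_MARKS = 50  # assumed eligibility cut-off, for illustration only
eligible_applicants = []


def is_eligible(record):
    """PROCESSING: check eligibility, fee payment and uploads."""
    return (
        record["marks"] >= MIN_MARKS
        and record["fee_paid"]
        and record["photo_uploaded"]
        and record["signature_uploaded"]
    )


def process(record):
    """Return an admit card (OUTPUT) for an eligible applicant, else None."""
    if not is_eligible(record):
        return None
    # Generate a roll number and add the applicant to the eligible list
    roll_number = 1001 + len(eligible_applicants)
    eligible_applicants.append({**record, "roll_number": roll_number})
    # OUTPUT: the details printed on the admit card
    return {
        "roll_number": roll_number,
        "name": record["name"],
        "center": record["center_choice"],
    }


admit_card = process(applicant)
print(admit_card)
```

An applicant who fails any check gets no admit card, which mirrors the point that correct output depends on the input data.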
Special Techniques for Data Processing
Measures of Central Tendency
1. Mean :
It is the average of the numeric values of an attribute. For given n values x1, x2, x3, ..., xn, the mean is computed as
x̄ = (x1 + x2 + x3 + ... + xn) / n
Eg: Assume that the heights (in cm) of students in a class are [90, 102, 110, 115, 85, 90, 100, 110, 110]. The mean or average height of the class is
x̄ = (90 + 102 + 110 + 115 + 85 + 90 + 100 + 110 + 110) / 9 = 912 / 9 = 101.33 cm
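The same mean can be checked with NumPy. The sketch below assumes the heights are stored in an array named `heights` (a name chosen here for illustration) and compares the formula written out by hand with `np.mean`.

```python
import numpy as np

# Heights (in cm) from the example
heights = np.array([90, 102, 110, 115, 85, 90, 100, 110, 110])

# Mean by the formula: sum of the values divided by the number of values
manual_mean = heights.sum() / heights.size

# Mean using NumPy's built-in function
numpy_mean = np.mean(heights)

print(round(float(manual_mean), 2))  # 101.33
print(round(float(numpy_mean), 2))   # 101.33
```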
2. Mode :
The value that appears the greatest number of times in the given data of an attribute/variable is called the Mode. It is computed on the basis of the frequency of occurrence of distinct values in the given data. A data set has no mode if each value occurs only once. There may be multiple modes in the data if more than one value has the same highest frequency. Mode can be found for numeric as well as non-numeric data.
Example: For the data [90, 102, 110, 115, 85, 90, 100, 110, 110] the mode is 110, because it occurs three times, more often than any other value.
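NumPy does not provide a dedicated mode function, so one possible approach is to count each distinct value with `np.unique(..., return_counts=True)` and keep the values with the highest count. The helper name `find_modes` is hypothetical; it returns an empty array when every value occurs only once, following the rule stated above.

```python
import numpy as np


def find_modes(data):
    """Return all values that share the highest frequency (empty if no mode)."""
    values, counts = np.unique(np.asarray(data), return_counts=True)
    highest = counts.max()
    if highest == 1:
        # Every value occurs only once, so the data has no mode
        return values[:0]
    return values[counts == highest]


heights = np.array([90, 102, 110, 115, 85, 90, 100, 110, 110])
print(find_modes(heights).tolist())                  # [110]

# Works for non-numeric data as well
print(find_modes(["red", "blue", "red"]).tolist())   # ['red']
```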
3. Median :
It is computed for a single attribute/variable at a time. When all the values are sorted in ascending or descending order, the middle value is called the Median. When there is an odd number of values, the median is the value at the middle position. If the list has an even number of values, the median is the average of the two middle values.
Example: For the data [90, 102, 110, 115, 85, 90, 100, 110, 110] the median is found as follows.
Arrange values in ascending order: 85, 90, 90, 100, 102, 110, 110, 110, 115
There are 9 values, so the median is the 5th value.
Median: 102
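The sorting step and the middle-value rule can be reproduced with `np.sort` and `np.median`. The second data set (four values) is an assumed example added only to show the even-count case.

```python
import numpy as np

heights = np.array([90, 102, 110, 115, 85, 90, 100, 110, 110])

# Step 1: arrange the values in ascending order
sorted_heights = np.sort(heights)
print(sorted_heights.tolist())  # [85, 90, 90, 100, 102, 110, 110, 110, 115]

# Step 2: odd number of values, so take the value at the middle position
n = sorted_heights.size
middle_value = sorted_heights[n // 2]
print(int(middle_value))        # 102

# np.median does both steps and also handles an even number of values
median_odd = np.median(heights)
median_even = np.median([85, 90, 100, 110])  # average of 90 and 100
print(float(median_odd))        # 102.0
print(float(median_even))       # 95.0
```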
Measures of Variability
1. Range :
It is the difference between the maximum and minimum values of the data. Range can be calculated only for numerical data. It is a measure of dispersion and tells about the coverage/spread of data values, for example the spread in salaries of employees, marks of a student, or prices of toys.
Example: For the data [90, 102, 110, 115, 85, 90, 100, 110, 110], minimum value is 85 cm and maximum value is 115 cm. Hence range is 115–85 = 30 cm.
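In NumPy the range can be computed either directly from `max()` and `min()` or with `np.ptp` ("peak to peak"). A short sketch using the same heights:

```python
import numpy as np

heights = np.array([90, 102, 110, 115, 85, 90, 100, 110, 110])

# Range = maximum value - minimum value
data_range = heights.max() - heights.min()
print(int(data_range))        # 30

# np.ptp computes the same difference in one call
ptp_range = np.ptp(heights)
print(int(ptp_range))         # 30
```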
2. Standard Deviation :
It refers to differences within the group or set of data of a variable. It is calculated as the positive square root of the average of the squared differences of each value from the mean value of the data. A smaller value of standard deviation means the data are less spread, whereas a larger value means the data are more spread.
For given n values x1, x2, x3, ..., xn and their mean x̄, the standard deviation is represented as σ (sigma) and is computed as
σ = √( ((x1 − x̄)² + (x2 − x̄)² + ... + (xn − x̄)²) / n )
2. Standard Deviation (contd.)
Example: Compute SD for the data [90,102,110,115,85,90,100,110,110]
Mean of the data = 101.33
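The remaining steps of this computation can be sketched in NumPy. The code follows the definition given earlier (square root of the average squared difference from the mean, dividing by n), which is also what `np.std` computes by default. Variable names are chosen here for illustration.

```python
import numpy as np

heights = np.array([90, 102, 110, 115, 85, 90, 100, 110, 110])

# Step 1: mean of the data
mean = heights.mean()

# Step 2: squared difference of each value from the mean
squared_deviations = (heights - mean) ** 2
print(round(float(squared_deviations.sum()), 2))  # 938.0

# Step 3: average of the squared differences, then the positive square root
sigma_manual = np.sqrt(squared_deviations.sum() / heights.size)
print(round(float(sigma_manual), 2))              # 10.21

# np.std divides by n by default, so it matches the steps above
sigma_numpy = np.std(heights)
print(round(float(sigma_numpy), 2))               # 10.21
```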
3. Variance :
Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of numbers is spread out from its average value.
Variance is calculated by taking the differences between each number in the data set and the mean, then squaring the differences to make them positive, and finally dividing the sum of the squares by the number of values in the data set.
σ² = ((x1 − x̄)² + (x2 − x̄)² + ... + (xn − x̄)²) / n
where x1, x2, ..., xn are the data values, x̄ is their mean, and n is the number of values in the data set.
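As a quick check, the variance of the heights data used in the earlier examples can be computed step by step and compared with `np.var`. This sketch assumes division by the number of values n, as described above, which is also NumPy's default.

```python
import numpy as np

heights = np.array([90, 102, 110, 115, 85, 90, 100, 110, 110])

# Differences from the mean, squared, then averaged over the n values
mean = heights.mean()
variance_manual = ((heights - mean) ** 2).sum() / heights.size
print(round(float(variance_manual), 2))  # 104.22

# np.var also divides by n by default
variance_numpy = np.var(heights)
print(round(float(variance_numpy), 2))   # 104.22

# Variance is the square of the standard deviation
std_squared = np.std(heights) ** 2
print(round(float(std_squared), 2))      # 104.22
```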