1 of 262

Unit-I DATA MANAGEMENT

INTRODUCTION:

In the early days of computers and the Internet, far less data was generated than today. It could easily be stored and managed by users and business enterprises on a single computer, because the data never exceeded the extent of about 19 terabytes; now, in this era, roughly 2.5 quintillion bytes of data are generated every day.

 

2 of 262

  • Why is Data Analytics important?
  • Data Analytics has a key role in improving your business: it is used to gather hidden insights and interesting patterns in data, generate reports, perform market analysis, and improve business requirements.
  • What is the role of Data Analytics?
  • Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with respect to business requirements.

3 of 262

  • Generate Reports – Reports are generated from the data and are passed on to the respective teams and individuals, who take further actions to grow the business.
  • Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.
  • Improve Business Requirements – Analysis of data helps the business improve according to customer requirements and enhances the customer experience.

4 of 262

What are the tools used in Data Analytics?

With the increasing demand for Data Analytics in the market, many tools with various functionalities have emerged for this purpose. Ranging from open-source to commercial user-friendly products, the top tools in the data analytics market are as follows.

R programming

Python

Tableau Public

QlikView

SAS

Microsoft Excel

RapidMiner

KNIME

OpenRefine

Apache Spark

5 of 262

  • Data architecture and design:
  • Data architecture in Information Technology is composed of the models, policies, rules, or standards that govern which data is collected and how it is stored, arranged, integrated, and put to use in data systems and in organizations.
  • A data architecture should set data standards for all its data systems as a vision or a model of the eventual interactions between those data systems.

6 of 262

  •  
  • The Data Architect breaks the subject down by going through three traditional architectural processes:
  • Conceptual model: A business-level model that uses the Entity Relationship (ER) model to describe entities, their attributes, and the relationships between them.
  • Logical model: A model in which the problem is represented in logical terms such as rows and columns of data, classes, XML tags, and other DBMS constructs.
  • Physical model: The physical model holds the database design, such as which type of database technology is suitable for the architecture.
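  • To make these three levels concrete, here is a minimal illustrative sketch (not from the original material, written in Python with SQLite): the conceptual model is the ER statement “a customer has an id, a name, and a city”, the logical model is shown as a class with row/column-style attributes, and the physical model is the same entity realized in a concrete database technology.

    # Illustrative sketch: one "Customer" entity at the logical and physical levels.
    import sqlite3
    from dataclasses import dataclass

    @dataclass
    class Customer:              # logical model: the entity and its attributes
        customer_id: int
        name: str
        city: str

    # physical model: the same entity realized in a specific database technology
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE customer (
            customer_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL,
            city        TEXT
        )
    """)
    conn.execute("INSERT INTO customer VALUES (?, ?, ?)", (1, "Asha", "Hyderabad"))
    print(conn.execute("SELECT * FROM customer").fetchall())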

7 of 262

  • The data architecture is formed by dividing it into these three essential models, which are then combined:

8 of 262

  • Factors that influence Data Architecture:
  • Various constraints and influences have an effect on data architecture design. These include enterprise requirements, technology drivers, economics, business policies, and data processing needs.
  • Enterprise requirements:
  • These generally include elements such as economical and effective system expansion, acceptable performance levels (especially system access speed), transaction reliability, and transparent data management.

9 of 262

  • Technology drivers:
  • These are usually suggested by the completed data architecture and database architecture designs.
  • In addition, some technology drivers will derive from existing organizational integration frameworks and standards, organizational economics, and existing site resources (e.g. previously purchased software licensing).
  • Economics:
  • These are also important factors that must be considered during the data architecture phase. It is possible that some solutions, while optimal in principle, may not be potential candidates due to their cost.

10 of 262

  • Business policies:
  • Business policies that drive data architecture design include internal organizational policies, rules of regulatory bodies, professional standards, and applicable governmental laws, which can vary by agency.
  • These policies and rules help describe the manner in which the enterprise wishes to process its data.
  • Data processing needs:
  • These include accurate and reproducible transactions performed in high volumes, data warehousing for the support of management information systems (and potential data mining), repetitive periodic reporting, ad hoc reporting, and support of various organizational initiatives as required (e.g., annual budgets, new product development).

11 of 262

  • The general approach is based on designing the architecture at three levels of specification:
    • The Logical Level
    • The Physical Level
    • The Implementation Level
  • Understand various sources of the data:
  • Data can be generated from two types of sources, namely primary and secondary sources.
  • Data collection is the process of acquiring, collecting, extracting, and storing a voluminous amount of data, which may be in structured or unstructured form such as text, video, audio, XML files, records, or image files, for use in later stages of data analysis.

12 of 262

  • The actual data is then further divided mainly into two types known as:
    • Primary data
    • Secondary data

13 of 262

  • Primary data:
    • Data that is raw, original, and extracted directly from official sources is known as primary data. This type of data is collected directly by techniques such as questionnaires, interviews, and surveys. The data collected must match the demands and requirements of the target audience on which the analysis is performed; otherwise it becomes a burden in data processing.
  • A few methods of collecting primary data:
  • Interview method:
    • Data is collected by interviewing the target audience; the person conducting the interview is called the interviewer, and the person who answers is the interviewee.

14 of 262

  • Survey method:
    • The survey method is the research process in which a list of relevant questions is asked and the answers are noted down in the form of text, audio, or video.
  • Observation method:
    • The observation method is a method of data collection in which the researcher keenly observes the behaviour and practices of the target audience using some data collecting tool and stores the observed data in the form of text, audio, video, or any raw formats.

15 of 262

  • Experimental method:
    • The experimental method is the process of collecting data through performing experiments, research, and investigation.
    • The most frequently used experiment methods are CRD, RBD, LSD, FD.
  • CRD- Completely Randomized design is a simple experimental design used in data analytics which is based on randomization and replication. It is mostly used for comparing the experiments.
  • RBD- Randomized Block Design is an experimental design in which the experiment is divided into small units called blocks.

16 of 262

  • LSD – Latin Square Design is an experimental design that is similar to CRD and RBD blocks but contains rows and columns.
  • FD – Factorial Design is an experimental design in which each experiment has two or more factors, each with several possible values; trials are run over the combinations of these factor levels. This design allows the experimenter to test two or more variables simultaneously, as in the sketch below.
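  • Below is a small illustrative sketch (Python standard library only; the factors and levels are made up) that enumerates the trials of a full factorial design:

    # Enumerating the runs of a 2 x 3 factorial design (FD).
    from itertools import product

    temperature = ["low", "high"]      # factor 1 levels (assumed for illustration)
    fertilizer = ["A", "B", "C"]       # factor 2 levels (assumed for illustration)

    trials = list(product(temperature, fertilizer))
    for run, (temp, fert) in enumerate(trials, start=1):
        print(f"run {run}: temperature={temp}, fertilizer={fert}")
    # A 2 x 3 factorial design yields 6 combinations; replication and
    # randomization of the run order are layered on top of this.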

17 of 262

  • Secondary data:
  • Secondary data is data which has already been collected and is reused for some valid purpose. This type of data is derived from previously recorded primary data and has two types of sources, namely internal sources and external sources.
  • Internal source:
  • These types of data can easily be found within the organization, such as market records, sales records, transactions, customer data, accounting resources, etc. The cost and time consumed in obtaining internal data are low.
    • Accounting resources- These give a great deal of information which can be used by the marketing researcher. They give information about internal factors.
    • Sales force reports- These give information about the sales of a product; the information is gathered from outside the organization by the sales force.
    • Internal experts- These are the people heading the various departments. They can give an idea of how a particular thing is working.
    • Miscellaneous reports- This is information obtained from operational reports. If the data available within the organization is unsuitable or inadequate, the marketer should extend the search to external secondary data sources.

18 of 262

  • External source:
  • Data which cannot be found within the organization and is obtained through external third-party resources is external source data. The cost and time consumption are higher because external sources contain a huge amount of data. Examples of external sources are government publications, news publications, the Registrar General of India, the Planning Commission, the International Labour Bureau, syndicate services, and other non-governmental publications.
  • Government Publications-
    • Government sources provide an extremely rich pool of data for researchers. In addition, much of this data is available free of cost on internet websites. There are a number of government agencies generating data.

19 of 262

  • Central Statistical Organization-
    • This organization publishes the national accounts statistics. It contains estimates of national income for several years, growth rate, and rate of major economic activities. Annual survey of Industries is also published by the CSO.
  • Director General of Commercial Intelligence-
    • This office operates from Kolkata. It gives information about foreign trade i.e. import and export. These figures are provided region-wise and country-wise.
  • Ministry of Commerce and Industries-
    • This ministry through the office of economic advisor provides information on wholesale price index. These indices may be related to a number of sectors like food, fuel, power, food grains etc.

20 of 262

  • Planning Commission-
    • It provides the basic statistics of Indian Economy.
  • Reserve Bank of India-
    • This provides information on Banking Savings and investment. RBI also prepares currency and finance reports.
  • Labour Bureau-
    • It provides information on skilled, unskilled, white collared jobs etc.
  • National Sample Survey-
    • This is done by the Ministry of Planning and it provides social, economic, demographic, industrial and agricultural statistics.
  • Department of Economic Affairs-
    • It conducts economic survey and it also generates information on income, consumption, expenditure, investment, savings and foreign trade.
  • State Statistical Abstract-
    • This gives information on various types of activities related to the state like - commercial activities, education, occupation etc.
  • Non-Government Publications-
    • These include publications of various industrial and trade associations, such as the Indian Cotton Mills Association and various chambers of commerce.

21 of 262

  • The Bombay Stock Exchange-
    • It publishes a directory containing financial accounts, key profitability ratios, and other relevant information.
    • Other non-government sources include various associations of press media and the Export Promotion Council.
  • Syndicate Services-
  • These services are provided by certain organizations which collect and tabulate the marketing information on a regular basis for a number of clients who are the subscribers to these services.

22 of 262

  • In collecting data from households, they use three approaches:
  • Survey- They conduct surveys on lifestyle, sociographics, and general topics.
  • Mail Diary Panel- This may relate to two fields: purchases and media.
  • Electronic Scanner Services- These are used to generate data on volume. They collect data for institutions from:
      • Wholesalers
      • Retailers, and
      • Industrial Firms

23 of 262

  • Importance of Syndicate Services:
  • Syndicate services are becoming popular because the constraints on decision making are changing and more specific decision making is needed in the light of a changing environment. Also, syndicate services are able to provide information to industries at a low unit cost.
  • Disadvantages of Syndicate Services:
  • The information provided is not exclusive. A number of research agencies provide customized services which suit the requirements of each individual organization.
  • International Organization-
  • These include:
  • The International Labour Organization (ILO):
    • It publishes data on the total and active population, employment, unemployment, wages and consumer prices.
  • The Organization for Economic Co-operation and development (OECD):
    • It publishes data on foreign trade, industry, food, transport, and science and technology.
  • The International Monetary Fund (IMF):
    • It publishes reports on national and international foreign exchange regulations.

24 of 262

  • Other sources:
  • Sensor data: With the advancement of IoT devices, the sensors in these devices collect data which can be used for sensor data analytics to track the performance and usage of products.
  • Satellite data: Satellites and surveillance cameras collect terabytes of images and data on a daily basis, which can be used to extract useful information.
  • Web traffic: Thanks to fast and cheap internet access, data in many formats uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data on the most frequently searched keywords and queries.

25 of 262

  • Data Management:
  • Data management is the practice of collecting, keeping, and using data securely, efficiently, and cost-effectively.
  • The goal of data management is to help people, organizations, and connected things optimize the use of data within the bounds of policy and regulation so that they can make decisions and take actions that maximize the benefit to the organization.

26 of 262

  • Create, access, and update data across a diverse data tier
  • Store data across multiple clouds and on premises
  • Provide high availability and disaster recovery
  • Use data in a growing variety of apps, analytics, and algorithms
  • Ensure data privacy and security
  • Archive and destroy data in accordance with retention schedules and compliance requirements

27 of 262

  • What is Cloud Computing?
  • Cloud computing refers to storing and accessing data over the internet. It doesn’t store any data on the hard disk of your personal computer; in cloud computing, you access data from a remote server.
  • Service models of cloud computing are the reference models on which cloud computing is based. They can be categorized into three basic service models, as listed below:
  • INFRASTRUCTURE as a SERVICE (IaaS)
  • IaaS provides access to fundamental resources such as physical machines, virtual machines, virtual storage, etc.
  • PLATFORM as a SERVICE (PaaS)
  • PaaS provides the runtime environment for applications, development and deployment tools, etc.
  • SOFTWARE as a SERVICE (SaaS)
  • The SaaS model allows end users to use software applications as a service.

28 of 262

  • Amazon S3 Features
    • Low cost and Easy to Use − Using Amazon S3, the user can store a large amount of data at very low charges.
    • Secure − Amazon S3 supports data transfer over SSL and the data gets encrypted automatically once it is uploaded. The user has complete control over their data by configuring bucket policies using AWS IAM.
    • Scalable − Using Amazon S3, there need not be any worry about storage concerns. We can store as much data as we have and access it anytime.
    • Higher performance − Amazon S3 is integrated with Amazon CloudFront, that distributes content to the end users with low latency and provides high data transfer speeds without any minimum usage commitments.
    • Integrated with AWS services − Amazon S3 is integrated with AWS services including Amazon CloudFront, Amazon CloudWatch, Amazon Kinesis, Amazon RDS, Amazon Route 53, Amazon VPC, AWS Lambda, Amazon EBS, Amazon DynamoDB, etc.
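  • As a minimal sketch of how S3 storage is used in practice, the snippet below uploads and reads back an object with the boto3 SDK. The bucket name is hypothetical, and valid AWS credentials plus an existing bucket are assumed.

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-example-bucket"      # hypothetical bucket, assumed to exist

    # upload a local file, then read the object back
    s3.upload_file("report.csv", bucket, "reports/report.csv")
    obj = s3.get_object(Bucket=bucket, Key="reports/report.csv")
    print(obj["Body"].read()[:100])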

29 of 262

  • Data Quality:
  • What is Data Quality?
  • There are many definitions of data quality; in general, data quality is the assessment of how usable the data is and how well it fits its serving context.
  • Why is Data Quality Important?
  • Enhancing data quality is a critical concern because data is the core of all activities within an organization; poor data quality leads to inaccurate reporting, which results in inaccurate decisions and, ultimately, economic damage.

30 of 262

  • Data Accuracy: Data are accurate when data values stored in the database correspond to real-world values.
  • Data Uniqueness: A measure of unwanted duplication existing within or across systems for a particular field, record, or data set.
  • Data Consistency: The degree to which the data does not violate semantic rules defined over the dataset.
  • Data Completeness: The degree to which values are present in a data collection.
  • Data Timeliness: The extent to which the age of the data is appropriate for the task at hand.
  • Other factors can also be taken into consideration, such as availability, ease of manipulation, and believability.
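  • These dimensions can be measured directly on a dataset. The sketch below (made-up data and column names, using pandas) computes simple completeness, uniqueness, and accuracy scores:

    import pandas as pd

    df = pd.DataFrame({
        "customer_id": [1, 2, 2, 4],
        "age": [34, None, 29, 131],   # None -> incomplete, 131 -> implausible
    })

    completeness = 1 - df["age"].isna().mean()              # share of non-missing values
    uniqueness = 1 - df["customer_id"].duplicated().mean()  # share of non-duplicate ids
    accuracy = df["age"].between(0, 120).mean()             # share of plausible ages (missing counts as failing)

    print(f"completeness={completeness:.2f}, uniqueness={uniqueness:.2f}, accuracy={accuracy:.2f}")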

31 of 262

  • OUTLIERS:
    • An outlier is a point or an observation that deviates significantly from the other observations.
    • Outlier is a commonly used term among analysts and data scientists, as outliers need close attention; otherwise they can result in wildly wrong estimations. Simply speaking, an outlier is an observation that appears far away from, and diverges from, the overall pattern in a sample.
    • Reasons for outliers: experimental errors or “special circumstances”.
    • There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise.

32 of 262

  • Types of Outliers:
  • Outlier can be of two types:
  • Univariate: These outliers can be found when we look at the distribution of a single variable.
  • Multivariate: Multivariate outliers are outliers in an n-dimensional space.

33 of 262

  • Impact of Outliers on a dataset:
  • Outliers can drastically change the results of the data analysis and statistical modelling. There are numerous unfavourable impacts of outliers in the data set:
    • It increases the error variance and reduces the power of statistical tests
    • If the outliers are non-randomly distributed, they can decrease normality
    • They can bias or influence estimates that may be of substantive interest

34 of 262

  • Detect Outliers:
  • The most commonly used method to detect outliers is visualization, using methods such as box plots, histograms, and scatter plots.
  • Outlier treatments are of three types:
  • Retention:
    • There is no rigid mathematical definition of what constitutes an outlier; determining whether or not an observation is an outlier is ultimately a subjective exercise. There are various methods of outlier detection: some are graphical, such as normal probability plots; others are model-based; box plots are a hybrid.

35 of 262

  • Exclusion:
    • Depending on the purpose of the study, it is necessary to decide whether and which outliers will be removed/excluded from the data, since they could highly bias the final results of the analysis.
  •  
  • Rejection:
    • Rejection of outliers is more acceptable in areas of practice where the underlying model of the process being measured and the usual distribution of measurement error are confidently known.
    • An outlier resulting from an instrument reading error may be excluded but it is desirable that the reading is at least verified.

36 of 262

  • Other treatment methods
  • The outliers package in R can be used to detect and treat outliers in data.
  • Outlier detection from graphical representation: scatter plots and box plots.
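  • A minimal Python sketch of the IQR rule that box plots use to flag outliers (an alternative to the R package mentioned above; the data are made up):

    import pandas as pd

    values = pd.Series([12, 13, 14, 13, 15, 14, 13, 98])     # 98 is the suspicious point

    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

    outliers = values[(values < lower) | (values > upper)]
    print(outliers)   # flags 98; retention, exclusion, or rejection is then a judgment call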

37 of 262

  • Missing Data Treatment:
  • Missing Values
  • Missing data in the training data set can reduce the power / fit of a model or can lead to a biased model, because we have not analyzed the behavior and relationship with other variables correctly. It can lead to wrong predictions or classifications.

38 of 262

  • Data Pre-processing:
  • Preprocessing in Data Mining: Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.

39 of 262

40 of 262

  • Steps Involved in Data Preprocessing:
  • 1. Data Cleaning:
  • The data can have many irrelevant and missing parts. To handle this part, data cleaning is done. It involves handling of missing data, noisy data etc.
  • Missing Data:
  • This situation arises when some values are missing in the data. It can be handled in various ways, some of which are:
    • Ignore the tuples:
  • This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.

41 of 262

    • Fill the Missing values:
  • There are various ways to do this task. You can choose to fill the missing values manually, by attribute mean or the most probable value.
  •  
  • (b). Noisy Data:
  • Noisy data is meaningless data that cannot be interpreted by machines. It can be generated due to faulty data collection, data entry errors, etc. It can be handled in the following ways:
  • Binning Method:
  • This method works on sorted data in order to smooth it. Binning, also called discretization, is a technique for reducing the cardinality of continuous and discrete data (the cardinality of a dimension is the total number of its unique values). Binning groups related values together in bins to reduce the number of distinct values, as in the sketch below.
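  • A small illustrative sketch of binning with pandas (made-up ages): pd.cut produces equal-width bins and pd.qcut produces roughly equal-frequency bins.

    import pandas as pd

    ages = pd.Series([22, 25, 31, 35, 41, 47, 52, 58, 63, 70])

    equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
    equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

    print(pd.DataFrame({"age": ages, "width_bin": equal_width, "freq_bin": equal_freq}))
    # Smoothing would then replace each age by its bin mean or a bin boundary.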

42 of 262

  • Regression:
  • Here data can be made smooth by fitting it to a regression function. The regression used may be linear (having one independent variable) or multiple (having multiple independent variables).
  • Clustering:
  • This approach groups similar data into clusters. Outliers may go undetected, or they may fall outside the clusters.
  • Data Transformation:
  • This step is taken in order to transform the data into forms appropriate for the mining process. It involves the following ways:
    • Normalization:
  • Normalization is a technique often applied as part of data preparation in data analytics and machine learning; it rescales numeric values to a common range.
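  • A minimal sketch of min–max normalization, which rescales a numeric column to the [0, 1] range using x' = (x − min) / (max − min); the values are made up.

    import pandas as pd

    income = pd.Series([20_000, 35_000, 50_000, 80_000, 120_000])
    normalized = (income - income.min()) / (income.max() - income.min())
    print(normalized)   # 0.0 for the minimum value, 1.0 for the maximum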

43 of 262

    • Attribute Selection:
  • In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
    • Discretization:
  • Discretization is the process through which we can transform continuous variables, models or functions into a discrete form.
    • Concept Hierarchy Generation:
  • Here attributes are converted from a lower level to a higher level in a hierarchy. For example, the attribute “city” can be converted to “country”.

44 of 262

  • Data Reduction:
  • Data mining is a technique used to handle huge amounts of data; when working with such volumes, analysis becomes harder. To address this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
  • The various steps to data reduction are:
    • Data Cube Aggregation:
  • Aggregation operation is applied to data for the construction of the data cube.
  •  
    • Attribute Subset Selection:
  • Only the highly relevant attributes should be used; the rest can be discarded. For performing attribute selection, one can use the level of significance and the p-value of the attribute: attributes with a p-value greater than the significance level can be discarded.

45 of 262

    • Numerosity Reduction:
  • This enables storing a model of the data instead of the whole data, for example regression models.
    • Dimensionality Reduction:
  • This reduces the size of data through encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy.
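  • As an illustrative sketch of (lossy) dimensionality reduction, the snippet below uses PCA from scikit-learn to project four correlated, randomly generated features down to two components:

    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    base = rng.normal(size=(100, 2))
    X = np.hstack([base, base + rng.normal(scale=0.05, size=(100, 2))])   # 4 correlated columns

    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                       # (100, 2)
    print(pca.explained_variance_ratio_.sum())   # share of variance kept after reduction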

46 of 262

  • Data Processing:
  • Data processing occurs when data is collected and translated into usable information. Usually performed by a data scientist or a team of data scientists, data processing must be done correctly so as not to negatively affect the end product, or data output.

Six stages of data processing

  • Data collection
  • Collecting data is the first step in data processing. Data is pulled from available sources, including data lakes and data warehouses.

47 of 262

  • Data preparation

Once the data is collected, it enters the data preparation stage. Data preparation, often referred to as “pre-processing”, is the stage at which raw data is cleaned up and organized for the following stage of data processing.

  • Data input

The clean data is then entered into its destination (perhaps a CRM like Salesforce or a data warehouse like Redshift), and translated into a language that it can understand. Data input is the first stage in which raw data begins to take the form of usable information.

48 of 262

  • Processing
  • During this stage, the data inputted to the computer in the previous stage is actually processed for interpretation.
  • Data output/interpretation
  • The output/interpretation stage is the stage at which data finally becomes usable to non-data scientists. It is translated, readable, and often in the form of graphs, videos, images, plain text, etc.
  • Data storage
  • The final stage of data processing is storage. After all of the data is processed, it is then stored for future use. While some information may be put to use immediately, much of it will serve a purpose later on. When data is properly stored, it can be quickly and easily accessed by members of the organization when needed.

49 of 262

***END OF UNIT 1***

50 of 262

UNIT -II

    • 2.1 Introduction to Data Analytics:
    • As an enormous amount of data gets generated, the need to extract useful insights is a must for a business enterprise. Data Analytics has a key role in improving your business. Here are 4 main factors which signify the need for Data Analytics:
  • 2.1.1 Factors for Data Analytics:

  • Gather Hidden Insights: Hidden insights from data are gathered and then analyzed with respect to business requirements (surveys, interviews, observations, etc.).

  • Generate Reports: Reports are generated from the data and are passed on to the respective teams and individuals, who take further actions to grow the business.

  • Perform Market Analysis: Market Analysis can be performed to understand the strengths and the weaknesses of competitors.

51 of 262

  • Improve Business Requirements: Analysis of data helps the business improve according to customer requirements and enhances the customer experience.

  • Data Analytics refers to the techniques to analyze data to enhance productivity and business gain.
  • Data is extracted from various sources and is cleaned and categorized to analyze different behavioral patterns.
  • The techniques and the tools used vary according to the organization or individual

52 of 262

  • Data analysts translate numbers into plain English.

  • A Data Analyst delivers value to their companies by taking information about specific topics and then interpreting, analyzing, and presenting findings in comprehensive reports.

  • So, if you have the capability to collect data from various sources, analyze the data, gather hidden insights, and generate reports, then you can become a Data Analyst.

53 of 262

54 of 262

  • In general, data analytics also involves a degree of human knowledge; each type of analytics requires some amount of human input in prediction.

  • Descriptive analytics requires the highest human input, while predictive analytics requires less human input.

  • In the case of prescriptive analytics, no human input is required, since all the data is predicted.

55 of 262

56 of 262

  • 1. Understand the problem: Understanding the business problems, defining the organizational goals, and planning a lucrative solution is the first step in the analytics process.
  • E-commerce companies often encounter issues such as predicting the return of items, giving relevant product recommendations, cancellation of orders, identifying frauds, optimizing vehicle routing, etc.

  • 2. Data Collection: Next, you need to collect transactional business data and customer-related information from the past few years to address the problems your business is facing.
  • The data can have information about the total units that were sold for a product, the sales and profit that were made, and also when the order was placed.
  • Past data plays a crucial role in shaping the future of a business.

57 of 262

  • 3. Data Cleaning: Now, all the data you collect will often be disorderly, messy, and contain unwanted missing values.
  • Such data is not suitable or relevant for performing data analysis.
  • Hence, you need to clean the data to remove unwanted, redundant, and missing values to make it ready for analysis.

  • 4. Data Exploration and Analysis: After you gather the right data, the next vital step is to perform exploratory data analysis (EDA).
  • You can use data visualization and business intelligence tools, data mining techniques, and predictive modeling to analyze, visualize, and predict future outcomes from this data.
  • Applying these methods can tell you the impact and relationship of a certain feature as compared to other variables. 

58 of 262

  • 5. Interpret the Results: The final step is to interpret the results and validate whether the outcomes meet your expectations. You can find hidden patterns and future trends. This will help you gain insights that support appropriate data-driven decision making.
  •  
  • What are the tools used in Data Analytics?
  • With the increasing demand for Data Analytics in the market, many tools with various functionalities have emerged for this purpose. Ranging from open-source to commercial user-friendly products, the top tools in the data analytics market are as follows.

59 of 262

60 of 262

    • R programming – This tool is the leading analytics tool used for statistics and data modeling. R compiles and runs on various platforms such as UNIX, Windows, and macOS. It also provides tools to automatically install all packages as per user requirements.
    • Python – Python is an open-source, object-oriented programming language that is easy to read, write, and maintain. It provides various machine learning and visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras, etc. It can also work with data from many platforms, such as a SQL Server database, a MongoDB database, or JSON.

61 of 262

    • Tableau Public – This is a free software that connects to any data source such as Excel, corporate Data Warehouse, etc. It then creates visualizations, maps, dashboards etc with real-time updates on the web.
    • QlikView – This tool offers in-memory data processing with the results delivered to the end-users quickly. It also offers data association and data visualization with data being compressed to almost 10% of its original size.
    • SAS – A programming language and environment for data manipulation and analytics, this tool is easily accessible and can analyze data from different sources.
    • Microsoft Excel – This tool is one of the most widely used tools for data analytics. Mostly used for clients’ internal data, it summarizes data with pivot tables and previews.
    • RapidMiner – A powerful, integrated platform that can connect to many data source types, such as Access, Excel, Microsoft SQL Server, Teradata, Oracle, Sybase, etc. This tool is mostly used for predictive analytics, such as data mining, text analytics, and machine learning.

62 of 262

    • KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics platform, which allows you to analyze and model data. With the benefit of visual programming, KNIME provides a platform for reporting and integration through its modular data pipeline concept.
    • OpenRefine – Also known as GoogleRefine, this data cleaning software will help you clean up data for analysis. It is used for cleaning messy data, the transformation of data and parsing data from websites.
    • Apache Spark – One of the largest large-scale data processing engines, this tool executes applications in Hadoop clusters 100 times faster in memory and 10 times faster on disk. It is also popular for data pipelines and machine learning model development.

63 of 262

  • Data Analytics Applications:
  • Data analytics is used in almost every sector of business, let’s discuss a few of them:
  • Retail: Data analytics helps retailers understand their customer needs and buying habits to predict trends, recommend new products, and boost their business. They optimize the supply chain, and retail operations at every step of the customer journey.
  • Healthcare: Healthcare industries analyse patient data to provide lifesaving diagnoses and treatment options. Data analytics help in discovering new drug development methods as well.

64 of 262

  • Manufacturing: Using data analytics, manufacturing sectors can discover new cost-saving opportunities. They can solve complex supply chain issues, labour constraints, and equipment breakdowns.
  • Banking sector: Banking and financial institutions use analytics to find probable loan defaulters and the customer churn rate. It also helps in detecting fraudulent transactions immediately.

65 of 262

  • Logistics: Logistics companies use data analytics to develop new business models and optimize routes. This, in turn, ensures that the delivery reaches on time in a cost-efficient manner.
  • Cluster computing:
    • Cluster computing is a collection of tightly or loosely connected computers that work together so that they act as a single entity.
    • The connected computers execute operations all together thus creating the idea of a single system.
    • The clusters are generally connected through fast local area networks (LANs)

66 of 262

67 of 262

68 of 262

    • Cluster computing provides a relatively inexpensive, unconventional alternative to large server or mainframe computer solutions.
    • It meets the demand for content criticality and processing services in a faster way.
    • Many organizations and IT companies are implementing cluster computing to augment their scalability, availability, processing speed, and resource management at economical prices.
    • It ensures that computational power is always available. It provides a single general strategy for the implementation and application of parallel high-performance systems, independent of particular hardware vendors and their product decisions.

69 of 262

  • Apache Spark:

70 of 262

71 of 262

  • Apache Spark is a lightning-fast cluster computing technology, designed for fast computation. It is based on Hadoop MapReduce and it extends the MapReduce model to efficiently use it for more types of computations, which includes interactive queries and stream processing.
  • The main feature of Spark is its in-memory cluster computing that increases the processing speed of an application.
  • Spark is designed to cover a wide range of workloads such as batch applications, iterative algorithms, interactive queries and streaming.
  • Apart from supporting all these workloads in a single system, it reduces the management burden of maintaining separate tools.
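  • A minimal PySpark sketch (it assumes the pyspark package is installed and runs Spark in local mode): it builds a SparkSession, creates a small DataFrame, and runs a query on it.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"],
    )
    df.filter(df.age > 30).show()   # executed as a Spark job, even on this tiny dataset
    spark.stop()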

72 of 262

  • Evolution of Apache Spark
  • Spark is one of Hadoop’s sub-projects, developed in 2009 in UC Berkeley’s AMPLab by Matei Zaharia. It was open-sourced in 2010 under a BSD license, donated to the Apache Software Foundation in 2013, and became a top-level Apache project in February 2014.
  •  
  • Features of Apache Spark:
  • Apache Spark has the following features.
  • Speed − Spark helps to run an application in a Hadoop cluster up to 100 times faster in memory, and 10 times faster when running on disk. This is possible by reducing the number of read/write operations to disk; it stores the intermediate processing data in memory.

73 of 262

  • Supports multiple languages − Spark provides built-in APIs in Java, Scala, or Python. Therefore, you can write applications in different languages. Spark comes up with 80 high-level operators for interactive querying.
  • Advanced Analytics − Spark not only supports ‘Map’ and ‘reduce’. It also supports SQL queries, Streaming data, Machine learning (ML), and Graph algorithms.

74 of 262

  • Spark Built on Hadoop: The following diagram shows three ways in which Spark can be built with Hadoop components.

75 of 262

  • There are three ways of Spark deployment, as explained below.
  • Standalone − Spark Standalone deployment means Spark occupies the place on top of HDFS (Hadoop Distributed File System) and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.
  • Hadoop YARN − Hadoop YARN deployment means, simply, that Spark runs on YARN (Yet Another Resource Negotiator) without any pre-installation or root access required. It helps to integrate Spark into the Hadoop ecosystem or Hadoop stack and allows other components to run on top of the stack.
  • Spark in MapReduce (SIMR) − Spark in MapReduce is used to launch Spark jobs in addition to standalone deployment. With SIMR, the user can start Spark and use its shell without any administrative access.

76 of 262

  • Components of Spark
  • The following illustration depicts the different components of Spark.

77 of 262

  • Apache Spark Core
  • Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It provides in-memory computing and the ability to reference datasets in external storage systems.
  •  
  • Spark SQL
  • Spark SQL is a component on top of Spark Core that introduces a new data abstraction called SchemaRDD, which provides support for structured and semi-structured data.
  • Spark Streaming
  • Spark Streaming leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Datasets) transformations on those mini-batches of data.

78 of 262

  • MLlib (Machine Learning Library)
  • MLlib is a distributed machine learning framework on top of Spark, made possible by the distributed memory-based Spark architecture. According to benchmarks done by the MLlib developers against the Alternating Least Squares (ALS) implementations, Spark MLlib is nine times as fast as the Hadoop disk-based version of Apache Mahout (before Mahout gained a Spark interface).
  • GraphX
  • GraphX is a distributed graph-processing framework on top of Spark. It provides an API for expressing graph computation that can model the user-defined graphs by using Pregel abstraction API. It also provides an optimized runtime for this abstraction.

79 of 262

  • What is Scala?
  • Scala is a statically typed programming language that incorporates functional, object-oriented, and imperative programming approaches to increase the scalability of applications. It is a general-purpose programming language with a strong static type system. In Scala, everything is an object, whether it is a function or a number; it does not have the concept of primitive data types.
  • Scala primarily runs on the JVM platform; it can also be used to write software for native platforms using Scala Native, and for JavaScript runtimes through Scala.js.
  • This language was originally built for the Java Virtual Machine (JVM), and one of Scala’s strengths is that it makes it very easy to interact with Java code.
  • Scala is a scalable language used to write software for multiple platforms; hence it got the name “Scala”. The language is intended to solve the problems of Java

80 of 262

  • while simultaneously being more concise. Initially designed by Martin Odersky, it was released in 2003.
  • Why Scala?
    • Scala is the core language to be used in writing the most popular distributed big data processing framework Apache Spark. Big Data processing is becoming inevitable from small to large enterprises.
    • Extracting the valuable insights from data requires state of the art processing tools and frameworks.
  • Scala is easy to learn for object-oriented programmers, Java developers. It is becoming one of the popular languages in recent years.
  • Scala offers first-class functions for users

81 of 262

  • Where Scala can be used?
  • Web Applications
  • Utilities and Libraries
  • Data Streaming
  • Parallel batch processing
  • Concurrency and distributed application
  • Data analytics with Spark
  • AWS lambda Expression

82 of 262

  • Cloudera Impala:
  • Cloudera Impala is Cloudera’s open-source, massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.
  • Impala is a native analytic database for Hadoop and is shipped by vendors such as Cloudera, MapR, Oracle, and Amazon.
  • The project was announced in October 2012 with a public beta test distribution and became generally available in May 2013.

83 of 262

  • Impala enables users to issue low-latency SQL queries on data stored in HDFS and Apache HBase without requiring data movement or transformation.
  • Impala is integrated with Hadoop to use the same file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other Hadoop software.
  • Impala is promoted for analysts and data scientists to perform analytics on data stored in Hadoop via SQL or business intelligence tools.
  • The result is that large-scale data processing (via MapReduce) and interactive queries can be done on the same system using the same data and metadata – removing the need to migrate data sets into specialized systems and/or proprietary formats simply to perform analysis.

84 of 262

  • Features include:
  • Supports HDFS and Apache HBase storage,
  • Reads Hadoop file formats, including text, LZO, SequenceFile, Avro, RCFile, and Parquet,
  • Supports Hadoop security (Kerberos authentication),
  • Fine-grained, role-based authorization with Apache Sentry,
  • Uses metadata, ODBC driver, and SQL syntax from Apache Hive.

85 of 262

  • Databases & Types of Data and variables
  • Data Base: A Database is a collection of related data.
  • Database Management System: DBMS is a software or set of Programs used to define, construct and manipulate the data.
  • Relational Database Management System: RDBMS is a software system used to maintain relational databases. Many relational database systems provide the option of using SQL.

86 of 262

  • NoSQL:
  • A NoSQL database is a non-relational data management system that does not require a fixed schema. It avoids joins and is easy to scale. The major purpose of using a NoSQL database is for distributed data stores with humongous data storage needs. NoSQL is used for big data and real-time web apps; for example, companies like Twitter, Facebook, and Google collect terabytes of user data every single day.
  • NoSQL stands for “Not Only SQL” or “Not SQL.” Though a better term would be “NoREL”, NoSQL caught on. Carlo Strozzi introduced the NoSQL concept in 1998.
  • Traditional RDBMSs use SQL syntax to store and retrieve data for further insights. Instead, a NoSQL database system encompasses a wide range of database technologies that can store structured, semi-structured, unstructured, and polymorphic data.

87 of 262

88 of 262

  • Why NoSQL?
  • The concept of NoSQL databases became popular with Internet giants like Google, Facebook, Amazon, etc. who deal with huge volumes of data. The system response time becomes slow when you use RDBMS for massive volumes of data.
  • To resolve this problem, we could “scale up” our systems by upgrading our existing hardware. This process is expensive. The alternative for this issue is to distribute database load on multiple hosts whenever the load increases. This method is known as “scaling out.”

89 of 262

90 of 262

91 of 262

  • Document-oriented: JSON documents; e.g., MongoDB and CouchDB
  • Key-value: Redis and DynamoDB
  • Wide-column: Cassandra and HBase
  • Graph: Neo4j and Amazon Neptune
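  • To illustrate the document-oriented and key-value ideas without a database server, the sketch below (standard-library Python, made-up records) shows how each JSON document can carry its own structure, unlike a fixed relational schema:

    import json

    users = [
        {"_id": 1, "name": "Asha", "email": "asha@example.com"},
        {"_id": 2, "name": "Ravi", "orders": [{"item": "book", "qty": 2}]},   # extra nested field
    ]

    # a key-value view of the same data: id -> serialized JSON document
    kv_store = {doc["_id"]: json.dumps(doc) for doc in users}
    print(json.loads(kv_store[2])["orders"][0]["item"])   # prints: book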

92 of 262

93 of 262

94 of 262

  • Benefits of NoSQL
  • The NoSQL data model addresses several issues that the relational model is not designed to address:
  • Large volumes of structured, semi-structured, and unstructured data.
  • Object-oriented programming that is easy to use and flexible.
  • Efficient, scale-out architecture instead of expensive, monolithic architecture.
  •  

95 of 262

  • Variables:
  • Data consist of individuals and variables that give us information about those individuals. An individual can be an object or a person.
  • A variable is an attribute, such as a measurement or a label.
  • Two types of Data
    • Quantitative data(Numerical)
    • Categorical data

96 of 262

97 of 262

  • Quantitative Variables: Quantitative data contains numerical values that can be added, subtracted, divided, etc.
  • There are two types of quantitative variables: discrete and continuous.

98 of 262

99 of 262

100 of 262

  • Categorical variables: Categorical variables represent groupings of some kind. They are sometimes recorded as numbers, but the numbers represent categories rather than actual amounts of things.
  • There are three types of categorical variables: binary, nominal, and ordinal variables.
  •  
  •  

101 of 262

102 of 262

  • Imputation is the process of replacing missing data with substituted values.

  • Missing Imputations: 
  • Types of missing data
  • Missing data can be classified into one of three categories:
  • MCAR
  • Data which is Missing Completely At Random has nothing systematic about which observations are missing values. There is no relationship between missingness and either observed or unobserved covariates.

103 of 262

  • MAR: Missing At Random is weaker than MCAR. The missingness is still random, but due entirely to observed variables. For example, those from a lower socioeconomic status may be less willing to provide salary information (but we know their SES status). The key is that the missingness is not due to the values which are not observed. MCAR implies MAR but not vice-versa.

104 of 262

  • MNAR: If the data are Missing Not At Random, then the missingness depends on the values of the missing data. Censored data falls into this category. For example, individuals who are heavier are less likely to report their weight. In another example, a device measuring some response can only measure values above 0.5; anything below that is missing.

105 of 262

  • There are two broad approaches to handling gaps in data:
    • Missing data imputation
    • Model-based techniques
  • Imputations (Treatment of Missing Values):
  • Ignore the tuple: This is usually done when the class label is missing (assuming the mining task involves classification). This method is not very effective, unless the tuple contains several attributes with missing values. It is especially poor when the percentage of missing values per attribute varies considerably.

106 of 262

  • Fill in the missing value manually: In general, this approach is time-consuming and may not be feasible for a large data set with many missing values.
  • Use a global constant to fill in the missing value: Replace all missing attribute values with the same constant, such as a label like “Unknown” or -∞. If missing values are replaced by such a constant, the mining program may mistakenly think that they form an interesting concept.

107 of 262

  • Use the attribute mean to fill in the missing value: Considering the average value of that particular attribute and use this value to replace the missing value in that attribute column.
  • Use the attribute mean for all samples belonging to the same class as the given tuple:
  • For example, if classifying customers according to credit risk, replace the missing value with the average income value for customers in the same credit risk category as that of the given tuple.

108 of 262

  • Use the most probable value to fill in the missing value: This may be determined with regression, inference-based tools using a Bayesian formalism, or decision tree induction. For example, using the other customer attributes in your data set, you may construct a decision tree to predict the missing values for income.
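  • The sketch below (made-up values, using pandas and scikit-learn) applies several of these fill-in strategies side by side:

    import numpy as np
    import pandas as pd
    from sklearn.impute import SimpleImputer

    df = pd.DataFrame({"income": [40_000, np.nan, 55_000, 62_000],
                       "city": ["Delhi", "Pune", None, "Pune"]})

    df["income_const"] = df["income"].fillna(-1)                    # global constant
    df["income_mean"] = df["income"].fillna(df["income"].mean())    # attribute mean
    df["city_mode"] = df["city"].fillna(df["city"].mode()[0])       # most frequent label

    # the same mean strategy via scikit-learn, convenient inside ML pipelines
    imp = SimpleImputer(strategy="mean")
    df["income_sklearn"] = imp.fit_transform(df[["income"]]).ravel()
    print(df)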

109 of 262

  • Need for Business Modelling:
  • Companies that embrace big data analytics and transform their business models in parallel will create new opportunities for revenue streams, customers, products, and services. This is the main need for business modelling: having a big data strategy and vision that identifies and capitalizes on new opportunities.
  •  

110 of 262

  • Analytics applications to various Business Domains
  • Application of Modelling in Business:
  • Applications of Data Modelling can be termed as Business analytics.
  • Business analytics involves the collating, sorting, processing, and studying of business-related data using statistical models and iterative methodologies. The goal of BA is to narrow down which datasets are useful and which can increase revenue, productivity, and efficiency.

111 of 262

  • Business analytics (BA) is the combination of skills, technologies, and practices used to examine an organization's data and performance as a way to gain insights and make data-driven decisions in the future using statistical analysis.

112 of 262

  • Although business analytics is being leveraged in most commercial sectors and industries, the following applications are the most common.
  • Credit Card Companies
  • Credit and debit cards are an everyday part of consumer spending, and they are an ideal way of gathering information about a purchaser’s spending habits, financial situation, behavior trends, demographics, and lifestyle preferences.
  • Customer Relationship Management (CRM)
  • Excellent customer relations is critical for any company that wants to retain customer loyalty to stay in business for the long haul. CRM systems analyze important performance indicators such as demographics, buying patterns, socio-economic information, and lifestyle.

113 of 262

  • Finance
  • The financial world is a volatile place, and business analytics helps to extract insights that help organizations maneuver their way through tricky terrain. Corporations turn to business analysts to optimize budgeting, banking, financial planning, forecasting, and portfolio management.
  • Human Resources
  • Business analysts help the process by poring through data that characterizes high-performing candidates, such as educational background, attrition rate, the average length of employment, etc. By working with this information, business analysts help HR by forecasting the best fits between the company and candidates.

114 of 262

  • Manufacturing
  • Business analysts work with data to help stakeholders understand the things that affect operations and the bottom line. Identifying things like equipment downtime, inventory levels, and maintenance costs helps companies streamline inventory management, risks, and supply-chain management to create maximum efficiency.
  • Marketing
  • Business analysts help answer these questions and so many more, by measuring marketing and advertising metrics, identifying consumer behavior and the target audience, and analyzing market trends.

115 of 262

  • Data Modelling Techniques in Data Analytics:
  • What is Data Modelling?
    • Data Modelling is the process of analyzing the data objects and their relationship to the other objects. It is used to analyze the data requirements that are required for the business processes. The data models are created for the data to be stored in a database.
    • The Data Model's main focus is on what data is needed and how we have to organize data rather than what operations we have to perform.
    • A data model is basically an architect’s building plan: it is a way of documenting a complex software system design as a diagram that can be easily understood.

116 of 262

  •  
  • Uses of Data Modelling:
    • Data Modelling helps create a robust design with a data model that can show an organization's entire data on the same platform.
    • The database at the logical, physical, and conceptual levels can be designed with the help of the data model.
    • Data Modelling Tools help in the improvement of data quality.
    • Redundant data and missing data can be identified with the help of data models.
    • Building the data model is quite time-consuming, but it makes maintenance cheaper and faster.

117 of 262

  • Data Modelling Techniques:

118 of 262

  • Below given are 5 different types of techniques used to organize the data:
  • Hierarchical Technique
  • The hierarchical model is a tree-like structure. There is one root node, or we can say one parent node, and the other child nodes are sorted in a particular order. The hierarchical model is very rarely used now, but it can be used to model real-world hierarchical relationships.

119 of 262

  • Object-oriented Model
  • The object-oriented approach is the creation of objects that contain stored values. The object-oriented model supports data abstraction, inheritance, and encapsulation.
  • Network Technique
  • The network model provides us with a flexible way of representing objects and relationships between these entities. It has a feature known as a schema representing the data in the form of a graph. An object is represented inside a node and the relation between them as an edge, enabling them to maintain multiple parent and child records in a generalized manner.

120 of 262

  • Entity-relationship Model
  • ER model (Entity-relationship model) is a high-level relational model which is used to define data elements and relationship for the entities in a system. This conceptual design provides a better view of the data that helps us easy to understand. In this model, the entire database is represented in a diagram called an entity-relationship diagram, consisting of Entities, Attributes, and Relationships.

121 of 262

  • Relational Technique
  • Relational is used to describe the different relationships between the entities. And there are different sets of relations between the entities such as one to one, one to many, many to one, and many to many.

122 of 262

***END OF UNIT-II***

123 of 262

UNIT-III

  • What Is Regression?
  • Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between a dependent variable and one or more independent variables.
  • Linear regression is the most common form of this technique. Also called simple regression or ordinary least squares (OLS), linear regression establishes the linear relationship between two variables.

124 of 262

  • Types of Regression Analysis:
      • Linear Regression.
      • Logistic Regression.
      • Polynomial Regression.
      • Ridge Regression.
      • Lasso Regression.
      • Quantile Regression.
      • Bayesian Linear Regression.
      • Principal Components Regression.

125 of 262

  • 1. Linear Regression:
  • The most extensively used modeling technique is linear regression, which assumes a linear connection between a dependent variable (Y) and an independent variable (X).

  • It employs a regression line, also known as a best-fit line. The linear relationship is defined as Y = c + m*X + e, where ‘c’ denotes the intercept, ‘m’ denotes the slope of the line, and ‘e’ is the error term.

  • The linear regression model can be simple (with only one dependent and one independent variable) or multiple (with one dependent variable and more than one independent variable).
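  • A minimal sketch of fitting Y = c + m*X + e with scikit-learn (the data points are made up):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1], [2], [3], [4], [5]])      # independent variable
    y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])     # dependent variable

    model = LinearRegression().fit(X, y)
    print("intercept c:", model.intercept_)      # close to 1
    print("slope m:", model.coef_[0])            # close to 2
    print("prediction for X=6:", model.predict([[6]])[0])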

126 of 262

127 of 262

  • 2. Logistic Regression:
  • When the dependent variable is discrete, the logistic regression technique is applicable.
  • In other words, this technique is used to compute the probability of mutually exclusive occurrences such as pass/fail, true/false, 0/1, and so forth.
  • Thus, the target variable can take on only one of two values, and a sigmoid curve represents its connection to the independent variable, and probability has a value between 0 and 1.
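  • An illustrative sketch of logistic regression on a made-up pass/fail outcome; the fitted sigmoid outputs a probability between 0 and 1.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])   # study hours
    passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])                  # 0 = fail, 1 = pass

    clf = LogisticRegression().fit(hours, passed)
    print(clf.predict_proba([[4.5]])[0, 1])   # estimated probability of passing
    print(clf.predict([[4.5]]))               # hard 0/1 prediction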

128 of 262

129 of 262

  • 3. Polynomial Regression
  • Polynomial regression analysis represents a non-linear relationship between dependent and independent variables.

  • This technique is a variant of the multiple linear regression model, but the best fit line is curved rather than straight.
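A minimal sketch of polynomial regression as linear regression on polynomial features (degree 2 here, fitted to synthetic curved data):

```python
# Polynomial regression: a linear model fitted on polynomial features.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

X = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 - X.ravel() + np.random.normal(0, 0.3, 30)  # curved relationship

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(model.predict([[2.0]]))   # prediction on the curved best-fit line
```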

130 of 262

131 of 262

  • 4. Ridge Regression
  • The ridge regression technique is applied when the data exhibits multicollinearity, that is, when the independent variables are highly correlated.

  • Ridge regression reduces standard errors by biasing the regression estimates.

  • The lambda (λ) variable in the ridge regression equation resolves the multicollinearity problem.

132 of 262

133 of 262

  • 5. Lasso Regression
  • As with ridge regression, the lasso (Least Absolute Shrinkage and Selection Operator) technique penalizes the absolute magnitude of the regression coefficient.

  • Additionally, the lasso regression technique employs variable selection, which leads to the shrinkage of coefficient values to absolute zero.
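A small sketch contrasting Ridge and Lasso on two deliberately correlated features (synthetic data; scikit-learn's alpha parameter plays the role of lambda):

```python
# Ridge shrinks correlated coefficients; Lasso can drive some to exactly zero.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # highly correlated with x1 (multicollinearity)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("Ridge coefficients:", ridge.coef_)   # shrunk, but typically both non-zero
print("Lasso coefficients:", lasso.coef_)   # variable selection: some coefficients become zero
```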

134 of 262

135 of 262

  • 6. Quantile Regression
  • The quantile regression approach is an extension of the linear regression technique that models conditional quantiles rather than the conditional mean.

  • Statisticians and econometricians employ quantile regression when linear regression requirements are not met or when the data contains outliers.

136 of 262

137 of 262

  • 7. Bayesian Linear Regression
  • Machine learning utilizes Bayesian linear regression, a form of regression analysis, to calculate the values of regression coefficients using Bayes’ theorem.

  • Rather than determining the least-squares, this technique determines the features’ posterior distribution.

  • As a result, the approach outperforms ordinary linear regression in terms of stability.
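A brief sketch using scikit-learn's BayesianRidge, one common implementation of Bayesian linear regression; the data is synthetic and the prediction returns a posterior mean with a standard deviation:

```python
# Bayesian linear regression via BayesianRidge (one possible implementation).
import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.2, size=50)

model = BayesianRidge().fit(X, y)
mean, std = model.predict([[0.5, -0.5]], return_std=True)  # posterior predictive mean and std
print("prediction:", mean[0], "+/-", std[0])
```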

138 of 262

139 of 262

  • How to Create the Regression Model?
  • To create a regression model, follow these steps:
  • Define the Problem:
  • Figure out what you want to predict or explain (the outcome).
  • Identify the factors that might influence this outcome.
  • Gather Data:
  • Collect information related to your problem.
  • Make sure the data is clean and accurate.
  • Explore the Data:
  • Look for patterns and relationships between the factors and the outcome.
  • Use charts and graphs to visualize the data.

140 of 262

  • Choose a Model:
  • Select a type of model that best fits your problem (e.g., linear, logistic).
  • Train the Model:
  • Teach the model to recognize patterns in the data.
  • Use part of your data to train the model.
  • Test the Model:
  • Use a different part of your data to see how well the model works.
  • Check if it can accurately predict outcomes.
  • Make Predictions:
  • Use the trained model to predict new outcomes based on new data.
  • Interpret Results:
  • Understand what the model learned from the data.
  • Explain how the factors influence the outcome.

141 of 262

  • Homoscedasticity vs Heteroscedasticity:
  • The Assumption of homoscedasticity (meaning “same variance”) is central to linear regression models. Homoscedasticity describes a situation in which the error term (that is, the “noise” or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.
  • Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an independent variable.
  • The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as heteroscedasticity increases.
  • Homoscedasticity means “having the same scatter.” For it to exist in a set of data, the points must be about the same distance from the regression line.
  • The opposite is heteroscedasticity (“different scatter”), where points are at widely varying distances from the regression line.

142 of 262

  • Variable Rationalization:
  • The data set may have a large number of attributes. But some of those attributes can be irrelevant or redundant. The goal of Variable Rationalization is to improve the Data Processing in an optimal way through attribute subset selection.
  • This process finds a minimum set of attributes such that dropping the irrelevant attributes does not significantly affect the utility of the data, while the cost of data analysis is reduced.
  • Mining on a reduced data set also makes the discovered patterns easier to understand. As part of data processing, we use the below methods of attribute subset selection:
      • Stepwise Forward Selection
      • Stepwise Backward Elimination
      • Combination of Forward Selection and Backward Elimination
      • Decision Tree Induction.
  • All the above methods are greedy approaches for attribute subset selection.

143 of 262

  • 1. Stepwise Forward Selection: This procedure starts with an empty set of attributes as the minimal set. The most relevant attribute is chosen (the one with the minimum p-value) and added to the minimal set. In each iteration, one attribute is added to the reduced set (see the sketch after this list).
  • 2. Stepwise Backward Elimination: Here all the attributes are considered in the initial set of attributes. In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set.
  • 3. Combination of Forward Selection and Backward Elimination: The stepwise forward selection and backward elimination are combined so as to select the relevant attributes most efficiently. This is the most common technique, generally used for attribute selection.
  • 4. Decision Tree Induction: This approach uses a decision tree for attribute selection. It constructs a flow-chart-like structure whose nodes denote a test on an attribute. Each branch corresponds to an outcome of the test, and each leaf node is a class prediction. Attributes that are not part of the tree are considered irrelevant and hence discarded.
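A greedy sketch of stepwise forward selection by p-value, as referenced in item 1 above; it assumes a pandas DataFrame `df` with a numeric target column (the names here are hypothetical) and uses statsmodels OLS p-values:

```python
# Greedy forward selection: add the most significant attribute each iteration.
import statsmodels.api as sm

def forward_selection(df, target, significance=0.05):
    remaining = [c for c in df.columns if c != target]
    selected = []
    while remaining:
        pvals = {}
        for candidate in remaining:
            X = sm.add_constant(df[selected + [candidate]])
            pvals[candidate] = sm.OLS(df[target], X).fit().pvalues[candidate]
        best = min(pvals, key=pvals.get)           # attribute with minimum p-value
        if pvals[best] < significance:             # keep only significant attributes
            selected.append(best)
            remaining.remove(best)
        else:
            break
    return selected
```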

144 of 262

  • Model Building Life Cycle in Data Analytics:
  • When we come across a business analytical problem, we often proceed towards execution without acknowledging the stumbling blocks, and we try to implement and predict outcomes before realizing the misfortunes. The data science model-building life cycle lays out the problem-solving steps that avoid this.
  • Let’s understand every model building step in-depth,
  • The data science model-building life cycle includes some important steps to follow. The following are the steps to follow to build a Data Model

145 of 262

146 of 262

  • Problem Definition
  • Hypothesis Generation
  • Data Collection
  • Data Exploration/Transformation
  • Predictive Modelling
  • Model Deployment

147 of 262

  • Problem Definition
    • The first step in constructing a model is to understand the industrial problem in a more comprehensive way. To identify the purpose of the problem and the prediction target, we must define the project objectives appropriately.
    • Therefore, to proceed with an analytical approach, we have to recognize the obstacles first. Remember, excellent results always depend on a better understanding of the problem.

148 of 262

  • Hypothesis Generation
    • Hypothesis generation is the guessing approach through which we derive some essential data parameters that have a significant correlation with the prediction target.
    • Your hypothesis research must be in-depth, taking the perspective of every stakeholder into account. We search for every suitable factor that can influence the outcome.
    • Hypothesis generation focuses on what you can create rather than what is available in the dataset.
  • Data Collection
    • Data collection is gathering data from relevant sources regarding the analytical problem, then we extract meaningful insights from the data for prediction.

149 of 262

150 of 262

  • The data gathered must have:
    • Proficiency in answering the hypothesis questions.
    • Capacity to elaborate on every data parameter.
    • Effectiveness to justify your research.
    • Competency to predict outcomes accurately.
  •  
  • Data Exploration/Transformation
    • The data you collected may be in unfamiliar shapes and sizes. It may contain unnecessary features, null values, unanticipated small values, or immense values. So, before applying any algorithmic model to data, we have to explore it first.
    • By inspecting the data, we get to understand the explicit and hidden trends in data. We find the relation between data features and the target variable.
    • Usually, a data scientist invests 60–70% of project time in data exploration alone.
    • There are several sub steps involved in data exploration:

151 of 262

      • Feature Identification:
        • You need to analyze which data features are available and which ones are not.
        • Identify independent and target variables.
        • Identify data types and categories of these variables.

      • Univariate Analysis:
        • We inspect each variable one by one. This kind of analysis depends on whether the variable is categorical or continuous.
          • Continuous variable: We mainly look for statistical trends like mean, median, standard deviation, skewness, and many more in the dataset.
          • Categorical variable: We use a frequency table to understand the spread of data for each category. We can measure the counts and frequency of occurrence of values.

152 of 262

      • Multi-variate Analysis:
        • Bi-variate and multi-variate analysis help to discover the relations between two or more variables.
        • For continuous variables we can compute the correlation; for categorical variables we look for association and dissociation between them.
      • Filling Null Values:
        • Usually, the dataset contains null values, which lower the potential of the model. For a continuous variable, we fill these null values using the mean or median of that specific column. For null values present in a categorical column, we replace them with the most frequently occurring category. Remember, do not delete those rows, because you may lose information.
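A minimal pandas sketch of these fill rules; the column names and values are hypothetical:

```python
# Fill nulls: mean for a continuous column, most frequent value for a categorical column.
import pandas as pd

df = pd.DataFrame({
    "age": [25, None, 31, 40, None],                       # continuous column
    "city": ["Delhi", "Mumbai", None, "Delhi", "Delhi"],   # categorical column
})

df["age"] = df["age"].fillna(df["age"].mean())
df["city"] = df["city"].fillna(df["city"].mode()[0])
print(df)
```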

153 of 262

  • Predictive Modeling
    • Predictive modeling is a mathematical approach to create a statistical model to forecast future behavior based on input test data.
  • Steps involved in predictive modeling:
    • Algorithm Selection:
      • When we have a labeled, structured dataset and want to estimate a continuous or categorical outcome, we use supervised machine learning methodologies such as regression and classification techniques. When we have unlabeled data and want to predict the clusters to which a particular input sample belongs, we use unsupervised algorithms. In practice, a data scientist applies multiple algorithms to get a more accurate model.

154 of 262

    • Train Model:
      • After selecting the algorithm and preparing the data, we train the model on the input data using the chosen algorithm. Training determines the correspondence between the independent variables and the prediction targets.
    • Model Prediction:
      • We make predictions by giving the input test data to the trained model. We measure the accuracy using a cross-validation strategy or an ROC curve on the test data.
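A minimal sketch of the train / predict / evaluate steps with a held-out test set and cross-validation; the data is synthetic and the classifier and split ratio are arbitrary choices for illustration:

```python
# Train on one part of the data, evaluate on another, and cross-validate.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)    # Train Model
print("test accuracy:", model.score(X_test, y_test))               # Model Prediction on unseen data
print("5-fold CV accuracy:", cross_val_score(model, X_train, y_train, cv=5).mean())
```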

  • Model Deployment
    • There is nothing better than deploying the model in a real-time environment. It helps us to gain analytical insights into the decision-making procedure. You constantly need to update the model with additional features for customer satisfaction.
    • To predict business decisions, plan market strategies, and create personalized customer interests, we integrate the machine learning model into the existing production domain.
    • For example, when you browse the Amazon website, the product recommendations you see are based on your interests; that is a deployed model at work.

155 of 262

156 of 262

  • SUMMARY OF DA MODEL LIFE CYCLE:
    • Understand the purpose of the business analytical problem.
    • Generate hypotheses before looking at data.
    • Collect reliable data from well-known resources.
    • Invest most of the time in data exploration to extract meaningful insights from the data.
    • Choose a suitable algorithm to train the model and use test data to evaluate it.
    • Deploy the model into the production environment so it will be available to users and strategize to make business decisions effectively.

157 of 262

  • Logistic Regression: Model Theory, Model fit Statistics, Model Construction
  • Introduction:
    • Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables.
    • The outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or False, etc., but instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0 and 1.
    • In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function, which predicts two maximum values (0 or 1).
    • The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous, or whether a mouse is obese based on its weight.
    • Logistic regression uses the same predictive-modeling machinery as regression, which is why it is called regression; but because it is used to classify samples, it falls under classification algorithms.

158 of 262

  • Types of Logistic Regressions:
  • On the basis of the categories, Logistic Regression can be classified into three types:

      • Binomial: In binomial Logistic regression, there can be only two possible types of the dependent variables, such as 0 or 1, Pass or Fail, etc.
      • Multinomial: In multinomial Logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dogs", or "sheep"
      • Ordinal: In ordinal Logistic regression, there can be 3 or more possible ordered types of dependent variables, such as "low", "Medium", or "High".

159 of 262

  • Definition: Multi-collinearity:
      • Multicollinearity is a statistical phenomenon in which multiple independent variables show high correlation with each other and are too inter-related.
      • Multicollinearity, also called collinearity, is an undesired situation for any statistical regression model since it diminishes the reliability of the model itself.
      • If two or more independent variables are too highly correlated, the results obtained from the regression will be distorted because the independent variables are effectively dependent on each other.
  • Assumptions for Logistic Regression:
      • The dependent variable must be categorical in nature.
      • The independent variable should not have multi-collinearity.

160 of 262

  • Logistic Regression Equation:
    • The Logistic regression equation can be obtained from the Linear Regression equation. The mathematical steps to get Logistic Regression equations are given below:
    • Instead of a linear function, logistic regression uses a more complex hypothesis based on the ‘sigmoid function’, also known as the ‘logistic function’.
    • The hypothesis of logistic regression limits the output between 0 and 1. Linear functions fail to represent it, as they can take values greater than 1 or less than 0, which is not possible per the hypothesis of logistic regression.

161 of 262

  • 0 ≤ hθ(x) ≤ 1   (Logistic Regression Hypothesis Expectation)
  • Logistic Function (Sigmoid Function):
    • The sigmoid function is a mathematical function used to map the predicted values to probabilities.
    • The sigmoid function maps any real value into another value within the range of 0 and 1, and so forms an S-shaped curve.
    • The value of the logistic regression must be between 0 and 1, which cannot go beyond this limit, so it forms a curve like the "S" form.
    • The below image is showing the logistic function:

162 of 262

163 of 262

  • The sigmoid function can be interpreted as the probability that an observation belongs to Class 1 (versus Class 0).
  • So the regression model makes its predictions as (Hypothesis Representation):
  • z = sigmoid(y) = σ(y) = 1 / (1 + e^(-y))
  • When using linear regression, we used the line equation:
  • y = b0 + b1*x1 + b2*x2 + ... + bn*xn
    • In the above equation, y is the response variable, x1, x2, ..., xn are the predictor variables,
  • and
  • b0, b1, b2, ..., bn are the coefficients, which are numeric constants.
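A tiny sketch of the hypothesis: the linear combination is passed through the sigmoid to produce a probability between 0 and 1 (the coefficients below are made up for illustration):

```python
# Sigmoid maps the linear combination b0 + b1*x into a probability in (0, 1).
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

b0, b1 = -1.0, 2.0          # illustrative coefficients
x = 1.5
y_linear = b0 + b1 * x      # linear part: b0 + b1*x1 + ... + bn*xn
print("probability of class 1:", sigmoid(y_linear))
```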

164 of 262

  • Confusion Matrix (or) Error Matrix (or) Contingency Table:
  • What is a Confusion Matrix?
  • “A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making. It is a specific table layout that allows visualization of the performance of an algorithm, typically a supervised learning one (in unsupervised learning it is usually called a matching matrix).”
  • For a binary classification problem, we would have a 2 x 2 matrix as shown below with 4 values:

165 of 262

166 of 262

  • Let’s decipher the matrix:
  •  
    • The target variable has two values: Positive or Negative
    • The columns represent the actual values of the target variable
    • The rows represent the predicted values of the target variable
  •  
      • True Positive
      • True Negative
      • False Positive – Type 1 Error
      • False Negative – Type 2 Error
  •  
  • Why do we need a Confusion Matrix?
      • Precision vs Recall
      • F1-score

167 of 262

  • Understanding True Positive, True Negative, False Positive and False Negative in a Confusion Matrix
  • True Positive (TP)
    • The predicted value matches the actual value
    • The actual value was positive and the model predicted a positive value
  • True Negative (TN)
    • The predicted value matches the actual value
    • The actual value was negative and the model predicted a negative value
  • False Positive (FP) – Type 1 error
    • The predicted value was falsely predicted
    • The actual value was negative but the model predicted a positive value
  • Also known as the Type 1 error

168 of 262

  • False Negative (FN) – Type 2 error
    • The predicted value was falsely predicted
    • The actual value was positive but the model predicted a negative value
    • Also known as the Type 2 error
  • To evaluate the performance of a model, we have the performance metrics called,
  • Accuracy, Precision, Recall & F1-Score.
  • Accuracy:
  • Accuracy is the most intuitive performance measure; it is simply the ratio of correctly predicted observations to the total observations.
    • Accuracy is a great measure of how good the model is, but only under the right conditions.
    • Accuracy is dependable only when you have symmetric datasets, where the counts of false positives and false negatives are almost the same.
    • Accuracy = (TP + TN) / (TP + FP + TN + FN)

169 of 262

  • Precision:
  • Precision = TP / (TP + FP)

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.

  • It tells us how many of the cases predicted as positive actually turned out to be positive.

170 of 262

  • Recall: (Sensitivity)
  • Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class.
  • Recall = TP / (TP + FN)
  • Recall is a useful metric in cases where False Negative trumps False Positive.
    • Recall is important in medical cases where it doesn’t matter whether we raise a false alarm but the actual positive cases should not go undetected!
  • F1-Score:
  • F1-score is the harmonic mean of Precision and Recall: F1 = 2 * (Precision * Recall) / (Precision + Recall). It gives a combined idea of these two metrics and is maximum when Precision is equal to Recall.
  • Therefore, this score takes both false positives and false negatives into account.

171 of 262

172 of 262

  • Example:
  • Suppose we had a classification dataset with 1000 data points. We fit a classifier on it and get the below confusion matrix:
  •  
  • The different values of the Confusion matrix would be as follows:
    • True Positive (TP) = 560
  • -Means 560 positive class data points were correctly classified by the model.
  •  
    • True Negative (TN) = 330
  • -Means 330 negative class data points were correctly classified by the model.
  •  
    • False Positive (FP) = 60
  • -Means 60 negative class data points were incorrectly classified as belonging to the positive class by the model.
  •  
    • False Negative (FN) = 50
  • -Means 50 positive class data points were incorrectly classified as belonging to the negative class by the model.
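A quick check of the metrics for these values, using the formulas above:

```python
# Metrics for the worked example: TP=560, TN=330, FP=60, FN=50 (1000 points total).
TP, TN, FP, FN = 560, 330, 60, 50

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")    # 0.890
print(f"Precision: {precision:.3f}")   # ~0.903
print(f"Recall:    {recall:.3f}")      # ~0.918
print(f"F1-score:  {f1:.3f}")          # ~0.911
```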

173 of 262

  • AUC (Area Under Curve) ROC (Receiver Operating Characteristics) Curves: Performance measurement is an essential task in data-model evaluation, and AUC-ROC is one of the most important evaluation metrics for checking any classification model’s performance. It is also written as AUROC (Area Under the Receiver Operating Characteristics). So when it comes to a classification problem, we can count on an AUC-ROC curve.
  • When we need to check or visualize the performance of a classification problem, we use the AUC (Area Under The Curve) ROC (Receiver Operating Characteristics) curve.
  • What is the AUC - ROC Curve?
  • AUC - ROC curve is a performance measurement for the classification problems at various threshold settings.

174 of 262

175 of 262

  • ROC curve
  • An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. This curve plots two parameters:

    • True Positive Rate
    • False Positive Rate

    • True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
      • TPR = TP / (TP + FN)
    • False Positive Rate (FPR) is defined as follows:
      • FPR = FP / (FP + TN)
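A minimal sketch of computing TPR/FPR pairs and the AUC with scikit-learn on synthetic data (any probabilistic classifier could be substituted):

```python
# ROC curve: TPR and FPR at every threshold, summarized by the AUC.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score

X, y = make_classification(n_samples=400, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

probs = LogisticRegression(max_iter=1000).fit(X_train, y_train).predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)   # one (FPR, TPR) pair per threshold
print("AUC:", roc_auc_score(y_test, probs))
```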

176 of 262

177 of 262

  • Analytics applications to various Business Domains:
  • Application of Modelling in Business:
    • Applications of Data Modelling can be termed as Business analytics.
  • Credit Card Companies
  • Credit and debit cards are an everyday part of consumer spending, and they are an ideal way of gathering information about a purchaser’s spending habits, financial situation, behaviour trends, demographics, and lifestyle preferences.
  • Customer Relationship Management (CRM)
  • Excellent customer relations is critical for any company that wants to retain customer loyalty to stay in business for the long haul.
  • Finance
  • The financial world is a volatile place, and business analytics helps to extract insights that help organizations maneuver their way through tricky terrain.
  • Human Resources
  • Business analysts help the process by poring over data that characterizes high-performing candidates, such as educational background, attrition rate, the average length of employment, etc.

178 of 262

  • Manufacturing
  • Business analysts work with data to help stakeholders understand the things that affect operations and the bottom line. Identifying things like equipment downtime, inventory levels, and maintenance costs helps companies streamline inventory management, risks, and supply-chain management to create maximum efficiency.
  • Marketing:
  • Business analysts help answer key marketing questions by measuring marketing and advertising metrics, identifying consumer behavior and the target audience, and analyzing market trends.

179 of 262

*** END OF UNIT-III***

180 of 262

UNIT-IV Object Segmentation & Time Series Methods

  • Supervised and Unsupervised Learning
  • Supervised Learning:
  • Supervised learning is a machine learning method in which models are trained using labeled data. In supervised learning, models need to find the mapping function to map the input variable (X) to the output variable (Y).
  • We find a relation between x and y such that y = f(x).

181 of 262

  • Unsupervised Machine Learning:
  • Unsupervised learning is another machine learning method, in which patterns are inferred from unlabeled input data. The goal of unsupervised learning is to find the structure and patterns in the input data. Unsupervised learning does not need any supervision; instead, it finds patterns in the data on its own.
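A side-by-side sketch on synthetic data: the supervised model is given labels y and learns y = f(x), while the unsupervised model is given only X and groups similar points on its own:

```python
# Supervised: fit a mapping from X to provided labels y.
# Unsupervised: find structure (clusters) in X without labels.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(100, 2))

y = 2 * X[:, 0] - X[:, 1]                          # labels exist -> supervised
supervised = LinearRegression().fit(X, y)

unsupervised = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)   # no labels
print(supervised.coef_, unsupervised.labels_[:10])
```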

182 of 262

183 of 262

  • Segmentation
  • Segmentation refers to the act of segmenting data according to your company’s needs in order to refine your analyses based on a defined context. It is a technique of splitting customers into separate groups depending on their attributes or behavior.

184 of 262

185 of 262

  • Steps:
  • Define purpose – state what the segmentation will be used for, as discussed above.
  • Identify critical parameters – Some of the variables that come to mind are skill, motivation, vintage, department, education, etc. Let us say that, based on past experience, we know that skill and motivation are the most important parameters. Also, for the sake of simplicity, we select just 2 variables. Taking additional variables increases the complexity, but can be done if it adds value.
  • Granularity – Let us say we are able to classify both skill and motivation into High and Low using various techniques.
  • There are two broad set of methodologies for segmentation:

186 of 262

  • Objective (supervised) segmentation
  • Non-Objective (unsupervised) segmentation
  • Objective Segmentation
  • Segmentation to identify the type of customers who would respond to a particular offer.
  • Segmentation to identify high spenders among customers who will use the e- commerce channel for festive shopping.
  • Segmentation to identify customers who will default on their credit obligation for a loan or credit card.

187 of 262

  • Non-Objective Segmentation:

188 of 262

  • Segmentation of the customer base to understand the specific profiles which exist within the customer base so that multiple marketing actions can be personalized for each segment
  • Segmentation of geographies on the basis of affluence and lifestyle of people living in each geography so that sales and distribution strategies can be formulated accordingly.
  • Hence, it is critical that the segments created on the basis of an objective segmentation methodology must be different with respect to the stated objective (e.g. response to an offer).

189 of 262

  • Regression Vs Segmentation
  • Regression analysis focuses on finding a relationship between a dependent variable and one or more independent variables.
  • Predicts the value of a dependent variable based on the value of at least one independent variable.
  • Explains the impact of changes in an independent variable on the dependent variable.

190 of 262

  • Decision Tree Classification Algorithm:  
  • Decision Tree is a supervised learning technique that can be used for both classification and Regression problems, but mostly it is preferred for solving Classification problems.
  • Decision Trees usually mimic human thinking ability while making a decision, so it is easy to understand.
  • A decision tree simply asks a question, and based on the answer (Yes/No), it further splits into subtrees.

191 of 262

  • There are two main types of Decision Trees:
  • Classification trees (Yes/No types)
  • A typical example is a classification tree where the outcome is a variable like ‘fit’ or ‘unfit’. Here the decision variable is Categorical.
  • Regression trees (Continuous data types)
  • Here the decision or the outcome variable is Continuous, e.g. a number like 123.
  • Decision Tree Terminologies
  • Root Node: Root node is from where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
  • Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
  • Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
  • Branch/Sub Tree: A tree formed by splitting the tree.
  • Pruning: Pruning is the process of removing the unwanted branches from the tree.
  •  
  • Parent/Child node: The root node of the tree is called the parent node, and other nodes are called the child nodes.

192 of 262

  • Decision Tree Representation:
  • Each non-leaf node is connected to a test that splits its set of possible answers into subsets corresponding to different test results.
  • Each branch carries a particular test result's subset to another node.
  • Each node is connected to a set of possible answers.
  • Below diagram explains the general structure of a decision tree:

193 of 262

194 of 262

  • A decision tree is an arrangement of tests that provides an appropriate classification at every step in an analysis.
  • "In general, decision trees represent a disjunction of conjunctions of constraints on the attribute-values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions" (Mitchell, 1997, p.53).
  • More specifically, decision trees classify instances by sorting them down the tree from the root node to some leaf node, which provides the classification of the instance. Each node in the tree specifies a test of some attribute of the instance, and each branch descending from that node corresponds to one of the possible values of that attribute.

195 of 262

  • Appropriate Problems for Decision Tree Learning
  • Decision tree learning is generally best suited to problems with the following characteristics:
  • Instances are represented by attribute-value pairs.
    • There is a finite list of attributes (e.g. hair colour) and each instance stores a value for that attribute (e.g. blonde).
    • When each attribute has a small number of distinct values (e.g. blonde, brown, red) it is easier for the decision tree to reach a useful solution.

196 of 262

  • How does the Decision Tree algorithm Work?
  • The decision of making strategic splits heavily affects a tree’s accuracy. The decision criteria are different for classification and regression trees.
  •  
  • Decision trees use multiple algorithms to decide to split a node into two or more sub- nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can say that the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in most homogeneous sub-nodes.

197 of 262

  • Tree Building: Decision tree learning is the construction of a decision tree from class- labeled training tuples. A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node. There are many specific decision-tree algorithms. Notable ones include the following.
  • ID3 → (extension of D3)
  • C4.5 → (successor of ID3)
  • CART → (Classification And Regression Tree)
  • CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
  • MARS → (multivariate adaptive regression splines): Extends decision trees to handle numerical data better

198 of 262

  • Conditional Inference Trees → Statistics-based approach that uses non-parametric tests as splitting criteria, corrected for multiple testing to avoid overfitting.
  •  
  • The ID3 algorithm builds decision trees using a top-down greedy search approach through the space of possible branches with no backtracking. A greedy algorithm, as the name suggests, always makes the choice that seems to be the best at that moment.
  •  

199 of 262

  • The complete process can be better understood using the below algorithm:
  • Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
  • Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
  • Step-3: Divide S into subsets that contain possible values for the best attribute.
  • Step-4: Generate the decision tree node, which contains the best attribute.
  • Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3.
  • Step-6: Continue this process until a stage is reached where you cannot classify the nodes further; these final nodes are the leaf nodes.

200 of 262

  • Entropy:
  • Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a coin is an example of an action that provides information that is random.
  • The entropy H(X) is zero when the probability is either 0 or 1. Entropy is maximum when the probability is 0.5, because that reflects perfect randomness in the data and there is no chance of perfectly determining the outcome.

201 of 262

  • Information Gain
  • Information gain or IG is a statistical property that measures how well a given attribute separates the training examples according to their target classification. Constructing a decision tree is all about finding an attribute that returns the highest information gain and the smallest entropy.
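A small sketch of the two measures discussed here, computed on made-up label counts (the 9/5 split is only illustrative) for one hypothetical candidate split:

```python
# Entropy H = -sum(p_i * log2(p_i)); information gain = parent entropy minus
# the weighted entropy of the children produced by a split.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

parent = ["yes"] * 9 + ["no"] * 5                                   # 9 positive, 5 negative
left, right = ["yes"] * 6 + ["no"] * 2, ["yes"] * 3 + ["no"] * 3    # one candidate split

weighted_child = (len(left) / len(parent)) * entropy(left) + \
                 (len(right) / len(parent)) * entropy(right)
print("information gain:", entropy(parent) - weighted_child)
```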

202 of 262

Basic algorithm for inducing a decision tree from training tuples:

Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.

Input:
  Data partition D, which is a set of training tuples and their associated class labels;
  attribute_list, the set of candidate attributes;
  Attribute_selection_method, a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split point or a splitting subset.

Output: A decision tree.

Method:
  create a node N;
  if the tuples in D are all of the same class C then
      return N as a leaf node labeled with the class C;
  if attribute_list is empty then
      return N as a leaf node labeled with the majority class in D;   // majority voting
  apply Attribute_selection_method(D, attribute_list) to find the “best” splitting criterion;
  label node N with the splitting criterion;
  if the splitting attribute is discrete-valued and multiway splits are allowed then   // not restricted to binary trees
      attribute_list = attribute_list - splitting attribute;
  for each outcome j of the splitting criterion   // partition the tuples and grow subtrees for each partition
      let Dj be the set of data tuples in D satisfying outcome j;   // a partition
      if Dj is empty then
          attach a leaf labeled with the majority class in D to node N;
      else
          attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
  return N;

203 of 262
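The pseudocode above is the generic induction procedure. A minimal runnable sketch of the same idea, using scikit-learn's DecisionTreeClassifier with entropy (information gain) as the attribute selection measure and the bundled iris data purely for illustration:

```python
# Induce a decision tree from labeled tuples and inspect its flow-chart structure.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

print(export_text(tree))     # the learned tree as text: internal tests and leaf labels
print(tree.predict(X[:1]))   # classify a tuple by sorting it down the tree to a leaf
```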

204 of 262

  • Advantages of Decision Tree:
  • Simple to understand and interpret. People are able to understand decision tree models after a brief explanation.
  • Requires little data preparation. Other techniques often require data normalization, dummy variables to be created, and blank values to be removed.
  • Able to handle both numerical and categorical data. Other techniques are usually specialized in analysing datasets that have only one type of variable. (For example, relation rules can be used only with nominal variables while neural networks can be used only with numerical variables.)

205 of 262

  • Tools used to make Decision Tree:
  • Many data mining software packages provide implementations of one or more decision tree algorithms. Several examples include:
  • SAS Enterprise Miner
  • Matlab
  • R (an open-source software environment for statistical computing which includes several CART implementations such as the rpart, party and randomForest packages)
  • Weka (a free and open-source data mining suite, contains many decision tree algorithms)
  • Orange (a free data mining software suite, which includes the tree module orngTree)
  • KNIME
  • Microsoft SQL Server
  • Scikit-learn (a free and open-source machine learning library for the Python programming language).

206 of 262

  • Multiple Decision Trees: Classification & Regression Trees:
  • Classification and Regression Trees are decision-tree methods that are used for classification and regression learning tasks.
  • The Classification and Regression Tree (CART) methodology was introduced in 1984 by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone.
  • Classification Trees:
  • A classification tree is an algorithm where the target variable is fixed or categorical. The algorithm is then used to identify the “class” within which a target variable would most likely fall.

207 of 262

208 of 262

  • Regression Trees
  • A regression tree refers to an algorithm where the target variable is continuous, and the algorithm is used to predict its value.

209 of 262

  • Difference Between Classification and Regression Trees
  • Classification trees are used when the dataset needs to be split into classes that belong to the response variable. In many cases, the classes are simply Yes or No.
  • In other words, they are just two and mutually exclusive. In some cases, there may be more than two classes in which case a variant of the classification tree algorithm is used.
  • Regression trees, on the other hand, are used when the response variable is continuous. For instance, if the response variable is something like the price of a property or the temperature of the day, a regression tree is used.

210 of 262

  • CART: CART stands for Classification And Regression Tree.
  • The CART algorithm was introduced by Breiman et al. (1984). A CART tree is a binary decision tree that is constructed by splitting a node into two child nodes repeatedly, beginning with the root node that contains the whole learning sample. The CART growing method attempts to maximize within-node homogeneity.

211 of 262

212 of 262

  • Decision tree pruning:�Pruning is a data compression technique in machine learning and search algorithms that reduces the size of decision trees by removing sections of the tree that are non-critical and redundant to classify instances. Pruning reduces the complexity of the final classifier, and hence improves predictive accuracy by the reduction of overfitting.

213 of 262

  • Pruning Techniques:
  • Pruning processes can be divided into two types: Pre-Pruning & Post-Pruning
    • Pre-pruning procedures prevent a complete induction of the training set by using a stopping criterion in the induction algorithm (e.g., maximum tree depth, or information gain(Attr) > minGain). They are considered more efficient because they never induce the full tree; trees remain small from the start.
    • Post-Pruning (or just pruning) is the most common way of simplifying trees. Here, nodes and subtrees are replaced with leaves to reduce complexity.
  • The procedures are differentiated on the basis of their approach in the tree: Top-down approach & Bottom-Up approach

214 of 262

  • Bottom-up pruning approach:
    • These procedures start at the last node in the tree (the lowest point).
    • Following recursively upwards, they determine the relevance of each individual node.
    • If the relevance for the classification is not given, the node is dropped or replaced by a leaf.
    • The advantage is that no relevant sub-trees can be lost with this method.
    • These methods include Reduced Error Pruning (REP), Minimum Cost Complexity Pruning (MCCP), or Minimum Error Pruning (MEP).
  • Top-down pruning approach:
    • In contrast to the bottom-up method, this method starts at the root of the tree. Following the structure below, a relevance check is carried out which decides whether a node is relevant for the classification of all n items or not.
    • By pruning the tree at an inner node, it can happen that an entire sub-tree (regardless of its relevance) is dropped.

215 of 262

  • CHAID:
  • CHAID stands for CHI-squared Automatic Interaction Detector. Morgan and Sonquist (1963) proposed a simple method for fitting trees to predict a quantitative variable.
  • Each predictor is tested for splitting as follows: sort all the n cases on the predictor and examine all n-1 ways to split the cluster in two. For each possible split, compute the within-cluster sum of squares about the mean of the cluster on the dependent variable.

216 of 262

  • GINI Index Impurity Measure:
  • GINI Index Used by the CART (classification and regression tree) algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability fi of each item being chosen times the probability 1-fi of a mistake in categorizing that item.
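A small sketch of the Gini impurity computation described above, equivalently written as 1 - sum(f_i^2), on made-up label counts:

```python
# Gini impurity of a node: sum over classes of f_i * (1 - f_i) = 1 - sum(f_i^2).
from collections import Counter

def gini(labels):
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(["yes"] * 5 + ["no"] * 5))   # 0.5 -> maximum impurity for two classes
print(gini(["yes"] * 10))               # 0.0 -> pure node
```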

217 of 262

  • Overfitting and Underfitting

218 of 262

  • Time Series Methods:
    • Time series forecasting focuses on analyzing data changes across equally spaced time intervals.
    • Time series analysis is used in a wide variety of domains, ranging from econometrics to geology and earthquake prediction; it’s also used in almost all applied sciences and engineering.
    • Time-series databases are highly popular and provide a wide spectrum of numerous applications such as stock market analysis, economic and sales forecasting, budget analysis, to name a few.

219 of 262

    • The different types of models and analyses that can be created through time series analysis are:
      • Classification: To Identify and assign categories to the data.
      • Curve fitting: Plot the data along a curve and study the relationships of variables present within the data.
      • Descriptive analysis: Help Identify certain patterns in time-series data such as trends, cycles, or seasonal variation.
      • Explanative analysis: To understand the data and its relationships, the dependent features, and cause and effect and its tradeoff.
      • Exploratory analysis: Describe and focus on the main characteristics of the time series data, usually in a visual format.
      • Forecasting: Predicting future data based on historical trends. Using the historical data as a model for future data and predicting scenarios that could happen along with the future plot points.

220 of 262

      • Intervention analysis: The Study of how an event can change the data.
      • Segmentation: Splitting the data into segments to discover the underlying properties from the source information.

  • Components of Time Series:
  • Long term trend – The smooth long term direction of time series where the data can increase or decrease in some pattern.
  • Seasonal variation – Patterns of change in a time series within a year which tends to repeat every year.
  • Cyclical variation – It is much like seasonal variation, but the rise and fall of the time series occur over periods longer than one year.
  • Irregular variation – Any variation that is not explainable by any of the three above-mentioned components. It can be classified into stationary and non-stationary variation.
  • Stationary variation: when the data neither increases nor decreases, i.e. it is completely random, the variation is called stationary. When the data has some explainable portion remaining that can be analyzed further, it is called non-stationary variation.

221 of 262

222 of 262

  • ARIMA & ARMA:
  • What is ARIMA?
    • In time series analysis, ARIMA is an acronym that stands for AutoRegressive Integrated Moving Average; it is a generalization of the autoregressive moving average (ARMA) model. These models are fitted to time series data either to better understand the data or to predict future points in the series (forecasting).
    • They are applied in cases where the data show evidence of non-stationarity.
    • ARIMA is a popular and very widely used statistical method for time series forecasting and analysis.

223 of 262

    • The parameters are substituted with an integer value to indicate the specific ARIMA model being used quickly. The parameters of the ARIMA model are further described as follows:
  •  
      • p: Stands for the number of lag observations included in the model, also known as the lag order.
      • d: The number of times the raw observations are differenced, also called the degree of differencing.
      • q: Is the size of the moving average window and also called the order of moving average.
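A hedged sketch of fitting an ARIMA(p, d, q) model with the statsmodels library (assumed to be installed); the series below is a synthetic random walk, so the fitted numbers are only illustrative:

```python
# Fit ARIMA(p=1, d=1, q=1) to a non-stationary series and forecast ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = pd.Series(np.cumsum(rng.normal(size=200)))   # random walk: needs d=1 differencing

model = ARIMA(series, order=(1, 1, 1)).fit()   # order=(p, d, q)
print(model.summary())
print(model.forecast(steps=5))                 # forecast the next 5 points
```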

224 of 262

  • Univariate stationary processes (ARMA)
  • A covariance stationary process is an ARMA(p, q) process of autoregressive order p and moving average order q if it can be written as
  • y_t = c + φ1*y_(t-1) + ... + φp*y_(t-p) + ε_t + θ1*ε_(t-1) + ... + θq*ε_(t-q), where ε_t is white noise.
  • The acronym ARIMA stands for Auto-Regressive Integrated Moving Average. Lags of the stationarized series in the forecasting equation are called "autoregressive" terms, lags of the forecast errors are called "moving average" terms, and a time series which needs to be differenced to be made stationary is said to be an "integrated" version of a stationary series. Random-walk and random-trend models, autoregressive models, and exponential smoothing models are all special cases of ARIMA models.

225 of 262

  • ETL Approach:
  • Extract, Transform and Load (ETL) refers to a process in database usage and especially in data warehousing that:
    • Extracts data from homogeneous or heterogeneous data sources
    • Transforms the data for storing it in proper format or structure for querying and analysis purpose
    • Loads it into the final target (database, more specifically, operational data store, data mart, or data warehouse)

226 of 262

  • Commercially available ETL tools include:
    • Anatella
    • Alteryx
    • CampaignRunner
    • ESF Database Migration Toolkit
    • Informatica PowerCenter
    • Talend
    • IBM InfoSphere DataStage
    • Ab Initio
    • Oracle Data Integrator (ODI)
    • Oracle Warehouse Builder (OWB)
    • Microsoft SQL Server Integration Services (SSIS)
    • Tomahawk Business Integrator by Novasoft Technologies.
    • Pentaho Data Integration (Kettle), an open-source data integration framework
    • Stambia

227 of 262

  • Extract:
  • The Extract step covers the data extraction from the source system and makes it accessible for further processing.
    • Update notification - if the source system is able to provide a notification that a record has been changed and describe the change, this is the easiest way to get the data.
    • Incremental extract - some systems may not be able to provide notification that an update has occurred, but they are able to identify which records have been modified and provide an extract of such records. During further ETL steps, the system needs to identify changes and propagate it down. Note, that by using daily extract, we may not be able to handle deleted records properly.
    • Full extract - some systems are not able to identify which data has been changed at all, so a full extract is the only way one can get the data out of the system. The full extract requires keeping a copy of the last extract in the same format in order to be able to identify changes. Full extract handles deletions as well.
    • When using Incremental or Full extracts, the extract frequency is extremely important. Particularly for full extracts; the data volumes can be in tens of gigabytes.

228 of 262

  • Transform:
    • The transform step applies a set of rules to transform the data from the source to the target.
    • This includes converting any measured data to the same dimension (i.e. conformed dimension) using the same units so that they can later be joined.
    • The transformation step also requires joining data from several sources, generating aggregates, generating surrogate keys, sorting, deriving new calculated values, and applying advanced validation rules.
  • Load:
    • During the load step, it is necessary to ensure that the load is performed correctly and with as little resources as possible. The target of the Load process is often a database.
    • In order to make the load process efficient, it is helpful to disable any constraints and indexes before the load and enable them back only after the load completes. Referential integrity needs to be maintained by the ETL tool to ensure consistency.
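A toy end-to-end sketch of the three ETL steps using pandas and sqlite3; the file, column, and table names are hypothetical and the conversion rule is only an assumed example of a transformation:

```python
# Extract -> Transform -> Load, in miniature.
import pandas as pd
import sqlite3

# Extract: pull data from a source file.
raw = pd.read_csv("sales_source.csv")               # hypothetical source extract

# Transform: conform units, apply a simple validation rule.
raw["amount_usd"] = raw["amount_eur"] * 1.1         # assumed conversion to a common unit
clean = raw.dropna(subset=["customer_id"])          # basic validation: drop incomplete rows

# Load: write into the target data store.
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales_fact", conn, if_exists="append", index=False)
```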

229 of 262

  • Managing ETL Process
  • The ETL process seems quite straightforward. As with every application, there is a possibility that the ETL process fails. This can be caused by missing extracts from one of the systems, missing values in one of the reference tables, or simply a connection or power outage. Therefore, it is necessary to design the ETL process with fail-recovery in mind.

230 of 262

231 of 262

***END OF UNIT-IV***

232 of 262

UNIT-V Data Visualization

  • Data Visualization
  •  
  • Data visualization is the art and practice of gathering, analyzing, and graphically representing empirical information.
  • They are sometimes called information graphics, or even just charts and graphs.
  • The goal of visualizing data is to tell the story in the data.
  • Telling the story is predicated on understanding the data at a deep level and gathering insight from comparisons of the data points.

233 of 262

  • Why data visualization?
  •  
  • Gain insight into an information space by mapping data onto graphical primitives.
  • Provide a qualitative overview of large data sets.
  • Search for patterns, trends, structure, irregularities, and relationships among data.
  • Help find interesting regions and suitable parameters for further quantitative analysis.
  • Provide a visual proof of computer representations derived.
  •  

234 of 262

  • Categorization of visualization methods
  • Pixel-oriented visualization techniques
  • Geometric projection visualization techniques
  • Icon-based visualization techniques
  • Hierarchical visualization techniques
  • Visualizing complex data and relations

235 of 262

  • Pixel-Oriented Visualization Techniques

236 of 262

  • Geometric Projection Visualization Techniques
  • Visualization of geometric transformations and projections of the data. Methods include:
    • Direct visualization
    • Scatterplot and scatterplot matrices
    • Landscapes
    • Projection pursuit technique: helps users find meaningful projections of multidimensional data
    • Prosection views
    • Hyperslice

237 of 262

  • Line Plot: This is the plot that you can see in the nooks and corners of any sort of analysis between 2 variables.
  • The line plots are nothing but the values on a series of data points will be connected with straight lines.
  • Bar Plot
  • This is one of the most widely used plots; we see it many times not just in data analysis but in any field where a trend analysis is presented.
  • We can visualize the data in an appealing plot and convey the details to others straightforwardly.
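A minimal matplotlib sketch of a line plot and a bar plot over the same made-up monthly values:

```python
# Line plot: values connected with straight lines. Bar plot: one bar per category.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(months, sales, marker="o")
ax1.set_title("Line Plot")
ax2.bar(months, sales)
ax2.set_title("Bar Plot")
plt.show()
```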

238 of 262

239 of 262

  • Stacked Bar Graph:

240 of 262

  • Unlike a Multi-set Bar Graph which displays their bars side-by-side, Stacked Bar Graphs segment their bars. Stacked Bar Graphs are used to show how a larger category is divided into smaller categories and what the relationship of each part has on the total amount. There are two types of Stacked Bar Graphs:
  • Simple Stacked Bar Graphs place each value for the segment after the previous one. The total value of the bar is all the segment values added together.

241 of 262

  • Scatter Plot:

242 of 262

  • It is one of the most commonly used plots used for visualizing simple data in Machine learning and Data Science.
  • This plot gives us a representation where each point in the dataset is plotted with respect to any 2 to 3 features (columns).
  • Scatter plots are available in both 2-D as well as in 3-D. The 2-D scatter plot is the common one, where we will primarily try to find the patterns, clusters, and separability of the data.

243 of 262

  • Box and Whisker Plot
  • This plot can be used to obtain more statistical details about the data.
  • The straight lines at the maximum and minimum are also called whiskers.
  • Points that lie outside the whiskers will be considered as an outlier.
  • The box plot also gives us a description of the 25th, 50th, and 75th percentiles (the quartiles).

244 of 262

  • Pie Chart :
  • A pie chart shows a static number and how categories represent parts of a whole, i.e. the composition of something. A pie chart represents numbers as percentages, and the total sum of all segments needs to equal 100%.

245 of 262

  • Donut Chart:
  • A donut chart is essentially a Pie Chart with an area of the centre cut out. Pie Charts are sometimes criticised for focusing readers on the proportional areas of the slices to one another and to the chart as a whole. This makes it tricky to see the differences between slices, especially when you try to compare multiple Pie Charts together.

246 of 262

247 of 262

  • Marimekko Chart:

248 of 262

  • Also known as a Mosaic Plot.
  • Marimekko Charts are used to visualise categorical data over a pair of variables. In a Marimekko Chart, both axes are variable with a percentage scale, that determines both the width and height of each segment.
  • Icon-Based Visualization Techniques
    • It uses small icons to represent multidimensional data values
    • Visualization of the data values as features of icons
    • Typical visualization methods
      • Chernoff Faces
      • Stick Figures

249 of 262

  • Chernoff Faces:

250 of 262

  • A way to display variables on a two-dimensional surface, e.g., let x be eyebrow slant, y be eye size, z be nose length, etc.
    • The figure shows faces produced using 10 characteristics: head eccentricity, eye size, eye spacing, eye eccentricity, pupil size, eyebrow slant, nose size, mouth shape, mouth size, and mouth opening. Each characteristic is assigned one of 10 possible values.

251 of 262

252 of 262

    • Circle Packing is a variation of a Treemap that uses circles instead of rectangles. Containment within each circle represents a level in the hierarchy: each branch of the tree is represented as a circle and its sub-branches are represented as circles inside of it. The area of each circle can also be used to represent an additional arbitrary value, such as quantity or file size. Colour may also be used to assign categories or to represent another variable via different shades.

253 of 262

  • Sunburst Diagram

254 of 262

    • Also known as a Sunburst Chart, Ring Chart, Multi-level Pie Chart, Belt Chart, or Radial Treemap.
    • This type of visualisation shows hierarchy through a series of rings, that are sliced for each category node. Each ring corresponds to a level in the hierarchy, with the central circle representing the root node and the hierarchy moving outwards from it.

255 of 262

  • Treemap:

256 of 262

257 of 262

    • Treemaps are an alternative way of visualising the hierarchical structure of a Tree Diagram while also displaying quantities for each category via area size. Each category is assigned a rectangle area with its subcategory rectangles nested inside it.

258 of 262

  • Visualizing Complex Data and Relations:

259 of 262

    • For a large data set of high dimensionality, it would be difficult to visualize all dimensions at the same time.
    • Hierarchical visualization techniques partition all dimensions into subsets (i.e., subspaces).
    • The subspaces are visualized in a hierarchical manner
    • “Worlds-within-Worlds,” also known as n-Vision, is a representative hierarchical visualization method.

260 of 262

  • Word Cloud:

261 of 262

  • Also known as a Tag Cloud.
  • A visualisation method that displays how frequently words appear in a given body of text, by making the size of each word proportional to its frequency.
  • All the words are then arranged in a cluster or cloud of words. Alternatively, the words can also be arranged in any format: horizontal lines, columns or within a shape.

262 of 262

***END OF UNIT-V***