1 of 124

Big Data Analytics - 17CI18

2 of 124

COURSE OUTCOMES (COS)

After the completion of this course, the student will be able to:

CO1: Identify Big Data and its Business Implications.

CO2: Access and Process Data on Distributed File System.

CO3: Manage Job Execution in Hadoop Environment.

CO4: Develop Big Data Solutions using Hadoop Eco System.

CO5: Apply Machine Learning Techniques using R.

3 of 124

Welcome to the World of Big Data

4 of 124

Data: What Makes It Big?

How big is big?

No Single Definition

The term ‘big data’ is self-explanatory − a collection of huge data sets that normal computing techniques cannot process.

“Big Data” is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.

5 of 124

6 of 124

7 of 124

Big Facts About Big Data

As of 2013, experts believed that 90% of the world’s data was generated from 2011 to 2012.

In 2018, more than 2.5 quintillion bytes of data were created every day.

The amount of data in the world was estimated to be 44 zettabytes at the dawn of 2020.

Google handles a staggering 1.2 trillion searches every year.

8 of 124

Of Bits and Bytes…….

A gigabyte is equal to 1,024 megabytes.

A terabyte is equal to 1,024 gigabytes.

A petabyte is equal to 1,024 terabytes.

An exabyte is equal to 1,024 petabytes.

A zettabyte is equal to 1,024 exabytes.

A yottabyte is equal to 1,024 zettabytes.

10²¹ bytes (roughly one billion terabytes) make up one zettabyte.

  • Looking at these figures, one can easily understand why the name Big Data is apt and imagine the challenges involved in its storage and processing.
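
The powers-of-1,024 relationships above can be checked with a few lines of Python (a throwaway sketch for illustration, not part of the slides):

```python
# Each unit is 1,024 times the previous one (binary prefixes).
units = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte", "yottabyte"]

sizes = {}
size = 1
for unit in units:
    size *= 1024
    sizes[unit] = size  # size of one unit, in bytes

# A zettabyte is 1,024 exabytes and on the order of 10**21 bytes.
print(sizes["zettabyte"])            # 1180591620717411303424
print(sizes["zettabyte"] // 10**21)  # 1 (same order of magnitude)
```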

9 of 124

Three attributes stand out as defining Big Data characteristics

Huge volume of data: Rather than thousands or millions of rows, Big Data can be billions of rows and millions of columns.

Complexity of data types and structures: Big Data reflects the variety of new data sources, formats, and structures, including digital traces being left on the web and other digital repositories for subsequent analysis.

Speed of new data creation and growth: Big Data can describe high velocity data, with rapid data ingestion and near real time analysis.

10 of 124

Big Data- the 4Vs

11 of 124

Characteristics of Big Data: 1-Scale (Volume)

  • Data Volume
    • 44x increase from 2009 to 2020
  • Data volume is increasing exponentially

12 of 124

Characteristics of Big Data: 2-Complexity (Variety)

13 of 124

Characteristics of Big Data: 3-Speed (Velocity)

  • Data is being generated fast and needs to be processed fast
  • Online Data Analytics
  • Late decisions 🡺 missing opportunities
  • Examples
    • E-Promotions: Based on your current location, your purchase history, and what you like 🡺 send promotions right now for the store next to you

    • Healthcare monitoring: sensors monitoring your activities and body 🡺 any abnormal measurements require immediate reaction.

14 of 124

Characteristics of Big Data: 4-Accuracy/Trustworthiness (Veracity)

15 of 124

The 4Vs in a Nutshell

16 of 124

The Sources….

17 of 124

Sources of Big Data Deluge

18 of 124

Who’s Generating Big Data

Social media and networks

(all of us are generating data)

Scientific instruments

(collecting all sorts of data)

Mobile devices

(tracking all objects all the time)

Sensor technology and networks

(measuring all kinds of data)

  • Progress and innovation are no longer hindered by the ability to collect data
  • But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion


19 of 124

The Evolution…

20 of 124

Evolution of Big Data by technology

21 of 124

Evolution of Big Data by Internet Of Things

22 of 124

Evolution of Big Data by Social Media

23 of 124

Evolution of Big Data by other factors

24 of 124

The Model Has Changed…

  • The Model of Generating/Consuming Data has Changed

Old Model: Few companies are generating data, all others are consuming data

New Model: all of us are generating data, and all of us are consuming data


25 of 124

Harnessing Big Data

(DBMSs)

(Data Warehousing)

  • OLTP: Online Transaction Processing
  • OLAP: Online Analytical Processing
  • RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)


26 of 124

Value of Big Data Analytics

  • Big data is more real-time in nature than traditional DW applications.
  • Traditional DW architectures are not well-suited for big data apps


27 of 124

Challenges in Handling Big Data

  • The bottleneck is in technology
    • New architecture, algorithms, techniques are needed
  • Also in technical skills
    • Experts in using the new technology and dealing with big data


28 of 124

Challenge #1: Insufficient understanding and acceptance of big data

Challenge #2: Confusing variety of big data technologies

Challenge #3: Paying loads of money

Challenge #4: Complexity of managing data quality

Challenge #5: Dangerous big data security holes

Challenge #6: Tricky process of converting big data into valuable insights

Challenge #7: Troubles of upscaling

29 of 124

30 of 124

Types of big data

Big Data can be found in three forms:

  • Structured
  • Unstructured
  • Semi-structured

31 of 124

Structured

  • Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
  • The format is well known in advance, so value can be derived from it.
  • Currently, typical sizes are in the range of multiple zettabytes.

32 of 124

Structured Data

Employee_ID | Employee_Name   | Gender | Department | Salary_In_lacs
2365        | Rajesh Kulkarni | Male   | Finance    | 650000
3398        | Pratibha Joshi  | Female | Admin      | 650000
7465        | Shushil Roy     | Male   | Admin      | 500000
7500        | Shubhojit Das   | Male   | Finance    | 500000
7699        | Priya Sane      | Female | Finance    | 550000
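
Because structured data has a fixed, known format, it maps directly onto ordinary records. A minimal Python sketch of the employee table above (field names abbreviated for illustration):

```python
# The employee table as a list of records with a fixed schema.
employees = [
    {"id": 2365, "name": "Rajesh Kulkarni", "gender": "Male",   "dept": "Finance", "salary": 650000},
    {"id": 3398, "name": "Pratibha Joshi",  "gender": "Female", "dept": "Admin",   "salary": 650000},
    {"id": 7465, "name": "Shushil Roy",     "gender": "Male",   "dept": "Admin",   "salary": 500000},
    {"id": 7500, "name": "Shubhojit Das",   "gender": "Male",   "dept": "Finance", "salary": 500000},
    {"id": 7699, "name": "Priya Sane",      "gender": "Female", "dept": "Finance", "salary": 550000},
]

# A fixed format makes querying straightforward, e.g. all Finance employees:
finance = [e["name"] for e in employees if e["dept"] == "Finance"]
print(finance)  # ['Rajesh Kulkarni', 'Shubhojit Das', 'Priya Sane']
```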

33 of 124

Unstructured Data

34 of 124

Unstructured Data

  • Data with unknown form or structure.
  • Apart from being huge, it poses multiple challenges in terms of processing it to derive value.
  • A typical example of unstructured data is a heterogeneous data source containing a combination of simple text files, images, videos, etc.
  • Its raw, unstructured form adds complexity.

35 of 124

Semi-structured

  • Semi-structured data can contain both forms of data. It appears structured in form, but it is not actually defined the way a table is in a relational DBMS.
  • An example of semi-structured data is data represented in an XML file.

36 of 124

Semi Structured

<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>

<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>

<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>

<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>

<rec><name>Jeremiah J.</name><gender>Male</gender><age>35</age></rec>
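
The &lt;rec&gt; records above carry their structure in tags rather than in a fixed table schema, so fields can be pulled out with an XML parser. A Python sketch using a subset of the records shown (the fragments are wrapped in a root element to make the XML well formed):

```python
import xml.etree.ElementTree as ET

records = """
<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
"""

# Wrap the record fragments in a root element so the XML is well formed.
root = ET.fromstring("<people>" + records + "</people>")

# Extract (name, age) pairs from each <rec> element.
people = [(r.findtext("name"), int(r.findtext("age"))) for r in root]
print(people)  # [('Prashant Rao', 35), ('Seema R.', 41), ('Satish Mane', 29)]
```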

37 of 124

A Contrast of the Three Types

38 of 124

Why Big Data Analytics

Big Data analytics is a process used to extract meaningful insights, such as hidden patterns, unknown correlations, market trends, and customer preferences.

39 of 124

Big Data Analytics Advantages

Cost Savings: helps in identifying more efficient ways of doing business.

Time Reductions: helps businesses analyze data immediately and make quick decisions based on the learnings.

New Product Development: by knowing the trends of customer needs and satisfaction through analytics, you can create products according to the wants of customers.

Understand the Market Conditions: by analyzing big data, you can get a better understanding of current market conditions.

Control Online Reputation: big data tools can do sentiment analysis, so you can get feedback about who is saying what about your company.

40 of 124

Big Data Analytics Applications

41 of 124

  • Ecommerce - Predicting customer trends and optimizing prices are a few of the ways e-commerce uses Big Data analytics

  • Marketing - Big Data analytics helps to drive high ROI marketing campaigns, which result in improved sales

  • Education - Used to develop new and improve existing courses based on market requirements

  • Healthcare - With the help of a patient’s medical history, Big Data analytics is used to predict how likely they are to have health issues

  • Media and entertainment - Used to understand the demand of shows, movies, songs, and more to deliver a personalized recommendation list to its users

42 of 124

  • Banking - Customer income and spending patterns help to predict the likelihood of choosing various banking offers, like loans and credit cards

  • Telecommunications - Used to forecast network capacity and improve customer experience

  • Government - Big Data analytics helps governments in law enforcement, among other things

43 of 124

Big Data Use Cases

44 of 124

Netflix -Big Data & User Experience

  • When it comes to gathering data, Netflix’s huge user base of over 148 million subscribers gives it a massive advantage. It then focuses on the following metrics:
  • Date on which content was watched
  • The device on which the content was watched
  • How the nature of the content watched varied based on the device
  • Searches on its platform
  • Portions of content that got re-watched
  • Whether content was paused
  • User location data
  • Time of the day and week in which content was watched and how it influences the kind of content watched
  • Metadata from third parties like Nielsen
  • Social media data from Facebook and Twitter

45 of 124

  • Netflix focuses on giving each user just what the user wants through a personalized content ranker that organizes each Netflix user’s collection based on personal information collected about the user.
  • Netflix ranks top and trending content not only based on how popular the content is but also based on personal information available about the user. The content is promoted on the basis of the user’s Netflix activity.
  • Recently viewed content is sorted based on an analysis of whether users are expected to continue watching or rewatching, or whether users stopped watching due to not finding the content interesting. Netflix doesn’t bore its users.

46 of 124

Manufacturing Big Data Use Cases

The digital revolution has transformed the manufacturing industry. Manufacturers are now finding new ways to harness all the data they generate to improve operational efficiency, streamline business processes, and uncover valuable insights that will drive profits and growth.

Predictive Maintenance: Big data can help predict equipment failure. Potential issues can be discovered by analyzing both structured data (equipment year, make, and model) and multistructured data (log entries, sensor data, error messages, engine temperature, and other factors). With this data, manufacturers can maximize parts and equipment uptime and deploy maintenance more cost-effectively.
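
As an illustration of the idea (a toy sketch, not a production method), flagging abnormal sensor readings can start with something as simple as comparing each value against the historical mean. The engine-temperature numbers below are made up:

```python
import statistics

# Hypothetical engine-temperature log (degrees C); values are invented.
readings = [88.1, 89.4, 87.9, 90.2, 88.7, 121.5, 89.0, 88.3]

mean = statistics.mean(readings)
stdev = statistics.pstdev(readings)

# Flag readings more than 2 standard deviations above the mean.
anomalies = [r for r in readings if r > mean + 2 * stdev]
print(anomalies)  # the 121.5 spike stands out
```

Real predictive-maintenance models combine many such signals (logs, error messages, equipment metadata) rather than a single threshold, but the principle of learning "normal" from history and flagging deviations is the same.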

Operational Efficiency: Operational efficiency is one of the areas in which big data can have the most impact on profitability. With big data, you can analyze and assess production processes, proactively respond to customer feedback, and anticipate future demands.

Production Optimization: Optimizing production lines can decrease costs and increase revenue. Big data can help manufacturers understand the flow of items through their production lines and see which areas can benefit. Data analysis will reveal which steps lead to increased production time and which areas are causing delays.

47 of 124

Retail Big Data Use Cases

Competition is fierce in retail. To stay ahead, companies strive to differentiate themselves. Big data is being used across all stages of the retail process—from product predictions to demand forecasting to in-store optimization. Using big data, retailers are finding new ways to innovate.

  • Product Development

  • Customer Experience
  • Customer Lifetime Value
  • The In-Store Shopping Experience
  • Pricing Analytics and Optimization

48 of 124

Healthcare Big Data Use Cases

Healthcare organizations are using big data for everything from improving profitability to helping save lives. Healthcare companies, hospitals, and researchers collect massive amounts of data. But all of this data isn’t useful in isolation. It becomes important when the data is analyzed to highlight trends and threats in patterns and create predictive models.

Patient Experience and Outcomes

Claims Fraud

Healthcare Billing Analytics

49 of 124

Oil and Gas Big Data Use Cases

For the past few years, the oil and gas industry has been leveraging big data to find new ways to innovate. The industry has long made use of data sensors to track and monitor the performance of oil wells, machinery, and operations. Oil and gas companies have been able to harness this data to monitor well activity, create models of the Earth to find new oil sources, and perform many other value-added tasks.

Predictive Equipment Maintenance: Oil and gas companies often lack visibility into the condition of their equipment, especially in remote offshore and deep-water locations. Big data can help by providing insight so companies can predict the remaining optimal life of their systems and components, ensuring that their assets operate at optimum production efficiency.

Oil Exploration and Discovery: Exploring for oil and gas can be expensive. But companies can make use of the vast amount of data generated in the drilling and production process to make informed decisions about new drilling sites. Data generated from seismic monitors can be used to find new oil and gas sources by identifying traces that were previously overlooked.

Oil Production Optimization: Unstructured sensor and historical data can be used to optimize oil well production. By creating predictive models, companies can measure well production to understand usage rates. With deeper data analysis, engineers can determine why actual well outputs aren't tallying with their predictions.

50 of 124

Telecommunications Big Data Use Cases

The popularity of smart phones and other mobile devices has given telecommunications companies tremendous growth opportunities. But there are challenges as well, as organizations work to keep pace with customer demands for new digital services while managing an ever-expanding volume of data.

Optimize Network Capacity: Optimal network performance is essential for a telecom's success. Network usage analytics can help companies identify areas with excess capacity and reroute bandwidth as needed. Big data analytics can help them plan for infrastructure investments and design new services that meet customer demands. With these insights, telecoms are able to maintain customer loyalty and avoid losing revenue to competitors.

Telecom Customer Churn: By analyzing the data telecoms already have about service quality, convenience, and other factors, telecoms can predict overall customer satisfaction. And they can set up alerts when customers are at risk of churning, and take action with retention campaigns and proactive offers.

New Product Offerings: Usage analytics and customer data can guide the design of new digital services that meet customer demands.

51 of 124

Financial Services Big Data Use Cases

Forward-thinking banks and financial services firms are capitalizing on big data. From capturing new market opportunities to reducing fraud, financial services organizations have been able to convert big data into a competitive advantage.

Fraud and Compliance: When it comes to security, it's not just a few rogue hackers; the financial services industry is up against entire expert teams, while security landscapes and compliance requirements are constantly evolving. Using big data, companies can identify patterns that indicate fraud and aggregate large volumes of information to streamline regulatory reporting.

Anti-Money Laundering: Financial services firms are under more pressure than ever before from governments passing anti-money laundering laws. These laws require that banks show proof of proper diligence and submit suspicious activity reports. In this extraordinarily complicated arena, big data analytics can help companies identify potential fraud patterns.

Financial Regulatory and Compliance Analytics: Financial services companies must be in compliance with a wide variety of requirements concerning risk, conduct, and transparency. At the same time, banks must comply with the Dodd-Frank Act, Basel III, and other regulations that require detailed reporting.

52 of 124

To Conclude…

  • Big data is bound to play a bigger role in the future as all of us become more 'digital'.

  • Further technological developments will lead to new frameworks.

  • Almost every sector (e.g., corporate, government, non-profit) will eventually deploy big data analytics to enhance its performance.

53 of 124

A Gentle Introduction to Hadoop

54 of 124

Hadoop is a well-adopted, standards-based, open-source software framework built on the foundation of Google’s MapReduce and Google File System.

It’s meant to leverage the power of massively parallel processing to take advantage of Big Data, generally by using lots of inexpensive commodity servers.

Hadoop is designed to abstract away much of the complexity of distributed processing.

This lets developers focus on the task at hand, instead of getting lost in the technical details of deploying such a functionally rich environment.

55 of 124

Hadoop

🢝 Hadoop is an Apache open-source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models.

🢝 A Hadoop framework application works in an environment that provides distributed storage and computation across clusters of computers.

🢝 Hadoop is designed to scale up from a single server to thousands of machines, each offering local computation and storage.

56 of 124

MapReduce

  • Hadoop MapReduce is a software framework for easily writing applications that process vast amounts of data in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
  • The Map Task: the first task, which takes input data and converts it into a set of data in which individual elements are broken down into tuples (key/value pairs).
  • The Reduce Task: takes the output from a map task as input and combines those data tuples into a smaller set of tuples. The reduce task is always performed after the map task.
  • Typically both the input and the output are stored in a file system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
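
Real MapReduce jobs run on the cluster (typically written in Java), but the map → shuffle → reduce flow can be imitated in a few lines of Python with the classic word-count example:

```python
from collections import defaultdict

lines = ["big data needs big tools", "hadoop processes big data"]

# Map: each input line is turned into (word, 1) tuples.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group the values by key (done by the framework in real Hadoop).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine each key's values into a single count.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In Hadoop, the map and reduce steps run on many nodes at once, and the shuffle moves intermediate tuples between them; the logic per key is the same as above.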

57 of 124

58 of 124

Hadoop Distributed File System (HDFS)

Takes care of the storage part of the Hadoop architecture.

HDFS breaks files into small pieces of data known as blocks. The default block size in HDFS is 128 MB.

We can configure the block size as per requirements.

These blocks are stored in the cluster in a distributed manner on different nodes.

This provides a mechanism for MapReduce to process the data in parallel in the cluster.
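
Given the 128 MB default block size, the number of blocks a file occupies is a ceiling division; a quick sketch with a hypothetical 700 MB file (and the default three-way replication described on the next slide):

```python
import math

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor

file_size_mb = 700    # hypothetical 700 MB file, chosen for illustration

blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
print(blocks)                 # 6 blocks (5 full blocks + 1 partial of 60 MB)
print(blocks * REPLICATION)   # 18 block copies stored across the cluster
```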

59 of 124

Advantages of HDFS

Fault Tolerance – Each data block is replicated three times (everything is stored on three machines/DataNodes by default) in the cluster. This helps protect the data against DataNode (machine) failure.

Space – Just add more data nodes if you need more disk space.

Scalability – Unlike traditional databases HDFS is highly scalable because it can store and distribute very large datasets across many nodes that can operate in parallel.

Flexibility – It can store any kind of data, whether it is structured, semi-structured, or unstructured.

Cost-effective – HDFS uses direct-attached storage and shares the cost of the network and computers it runs on with MapReduce. It is also open-source software.

60 of 124

Basic Data Analytic Methods Using R

61 of 124

Introduction to R

R is a programming language and software framework for statistical analysis and graphics.

The following R code illustrates a typical analytical situation in which a dataset is imported, the contents of the dataset are examined, and some model building tasks are executed.

In the following scenario, the annual sales in U.S. dollars for 10,000 retail customers have been provided in the form of a comma-separated value (CSV) file. The read.csv() function is used to import the CSV file. This dataset is stored in the R variable sales using the assignment operator <-.

62 of 124

# import a CSV file of the total annual sales for each customer

sales <- read.csv("c:/data/yearly_sales.csv")

# examine the imported dataset

head(sales)

#The summary() function provides some descriptive statistics, such as the mean and median, for each data column. Additionally, the minimum and maximum values as well as the 1st and 3rd quartiles are provided. Because the gender column contains only two possible characters, "F" (female) or "M" (male), summary() reports the count of each.

summary(sales)

63 of 124

Quartiles

  • 1st quartile (25%), median (50%), and 3rd quartile (75%)
  • Example data: 1 2 3 4 5 6 7 8 9
  • Interquartile range (IQR): the range between the 1st and 3rd quartiles
  • Five-number summary:
    • Minimum (1), 1st quartile (3), median (5), 3rd quartile (7), maximum (9)
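
As a cross-check outside R (a Python illustration, not part of the R session), the standard statistics module reproduces the slide's quartiles for the data 1 through 9:

```python
import statistics

data = list(range(1, 10))  # 1 through 9, as on the slide

# The "inclusive" method treats the data as the whole population;
# for this dataset it yields the same quartiles as the slide.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
print(min(data), q1, median, q3, max(data))  # 1 3.0 5.0 7.0 9
print(q3 - q1)  # interquartile range (IQR) = 4.0
```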

64 of 124

#Plotting a dataset’s contents can provide information about the relationships between the various columns.

# plot num_of_orders vs. sales

plot(sales$num_of_orders, sales$sales_total,
     main="Number of Orders vs. Sales")

The plot() function generates a scatterplot of the number of orders (sales$num_of_orders) against the annual sales (sales$sales_total).

Note: The $ is used to reference a specific column in the dataset sales.

65 of 124

  • Each point corresponds to the number of orders and the total sales for each customer.

  • The plot indicates that the annual sales are proportional to the number of orders placed.

66 of 124

Data Import and Export

The dataset was imported into R using the read.csv() function

#sales <- read.csv("c:/data/yearly_sales.csv")

The setwd() function can be used to set the working directory for the subsequent import and export operations

#setwd("c:/data/")

#sales <- read.csv("yearly_sales.csv")

read.table()

#sales_table <- read.table("yearly_sales.csv", header=TRUE, sep=",")

read.delim()

#sales_delim <- read.delim("yearly_sales.csv", sep=",")

67 of 124

add a column for the average sales per order

#sales$per_order <- sales$sales_total/sales$num_of_orders

export data as tab delimited without the row names

#write.table(sales, "sales_modified.txt", sep="\t", row.names=FALSE)

68 of 124

Big Data Analytics

  • Big data analytics analyzes the collected data and finds patterns in it.
  • The velocity, veracity, variety, and volume of data lying with organizations must be put to work to gain actionable insights out of the same.
  • Organizations leveraging big data analytics must thoroughly understand the best practices for big data first to be able to use the most relevant data for analysis.

69 of 124

Best Practices for Big data Analytics

70 of 124

Best practices for Big Data Analytics

71 of 124

1. UNDERSTAND THE BUSINESS REQUIREMENTS

Analyzing and understanding the business requirements and organizational goals is the first and the foremost step that must be carried out even before leveraging big data analytics into your projects.

The business users must understand which projects in their company must use big data analytics to make maximum profit.

72 of 124

2. DETERMINE THE COLLECTED DIGITAL ASSETS

The second big data best practice is to identify the type of data pouring into the organization, as well as the data generated in-house. Usually, the data collected is disorganized and in varying formats. Moreover, some data is never even exploited (read: dark data), and it is essential that organizations identify this data too.

73 of 124

3. IDENTIFY WHAT IS MISSING

The third practice is analyzing and understanding what is missing. Once you have collected the data needed for a project, identify the additional information that might be required for that particular project and where it can come from. For instance, if you want to leverage big data analytics in your organization to understand your employees' well-being, then along with information such as login/logout times, medical reports, and email reports, you need some additional information about, say, the employees' stress levels. This information can be provided by co-workers or leaders.

74 of 124

4. COMPREHEND WHICH BIG DATA ANALYTICS MUST BE LEVERAGED

After analyzing and collecting data from different sources, it's time for the organization to understand which big data technologies, such as predictive analytics, stream analytics, data preparation, fraud detection, sentiment analysis, and so on can be best used for the current business requirements.

For instance, big data analytics helps HR teams identify the right talent faster during recruitment, combining social media and job-portal data using predictive and sentiment analysis.

75 of 124

5. ANALYZE DATA CONTINUOUSLY

This is the final best practice that an organization must follow when it comes to big data. You must always be aware of what data is lying with your organization and what is being done with it.

Check the health of your data periodically to never miss out on any important but hidden signals in the data.

Before implementing any new technology in your organization, it is vital to have a strategy to help you get the most out of it. With adequate and accurate data at their disposal, companies must also follow the above mentioned big data practices to extract value from this data.

76 of 124

Stages of Big Data Analytical Evolution

The process of dealing with big data is quite different from handling traditional data. Big Data processing consists of

  • Collecting
  • Storing
  • Organizing
  • Analyzing
  • Extracting Hidden Information For Decision Making.

77 of 124

1. Data Collection

This is the first stage, which involves the collection of web data, log data, and structured and unstructured data from several types of data sources, such as mobile devices, sensor devices, and social media.

78 of 124

2. Storing

In this stage the collected data has to be stored in distributed database systems and servers. The introduction of NoSQL made storing big data easier: since NoSQL does not have a fixed schema and there are no relationships between entities, it is used to store dynamic and unstructured data.

79 of 124

3. Data Organization

In this stage, data is arranged and organized as structured, unstructured, and semi-structured data so that it can be accessed and analyzed.

80 of 124

4. Analysis

After the data is arranged and organized, the analysis stage is applied. Analyzing large data sets involves more complexity and computation. Ongoing research and surveys aim to find algorithms and mathematical models that minimize computational and storage costs.

The extracted hidden information will be useful for industries, academicians, and the government to take the necessary actions and decisions.

The infrastructure needed for big data should be highly scalable, should support statistical analytics and data mining, and should allow automated decisions, based on the analytical model, to be made quickly.

81 of 124

5. Data Visualization

Once information has been extracted from the data, it has to be represented visually. The representation is generally done using data visualization tools that enable decision makers to grasp difficult concepts and patterns easily.

82 of 124

State of the Practice in Analytics

83 of 124

Business Drivers for Advanced Analytics

84 of 124

BI Versus Data Science

85 of 124

Current Analytical Architecture

86 of 124

Drivers of Big Data

87 of 124

Emerging Big Data Ecosystem and a New Approach to Analytics

88 of 124

The Data Scientist

89 of 124

Profile of a Data Scientist

90 of 124

Data scientists have five main sets of skills and behavioral characteristics:

Quantitative skills: such as mathematics or statistics

Technical aptitude: namely, software engineering, machine learning, and programming skills

Skeptical mind-set and critical thinking: It is important that data scientists can examine their work critically rather than in a one-sided way.

91 of 124

Curious and creative: Data scientists are passionate about data and finding creative ways to solve problems and portray information.

Communicative and collaborative: Data scientists must be able to articulate the business value in a clear way and collaboratively work with other groups, including project sponsors and key stakeholders.

92 of 124

Data Analytics Lifecycle

93 of 124

Data Analytics Lifecycle Overview

The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects.

The lifecycle has six phases, and project work can occur in several phases at once.

For most phases in the lifecycle, the movement can be either forward or backward.

This iterative depiction of the lifecycle is intended to more closely portray a real project, in which aspects of the project move forward and may return to earlier stages as new information is uncovered and team members learn more about various stages of the project.

This enables participants to move iteratively through the process and drive toward operationalizing the project work.

94 of 124

Key Roles for a Successful Analytics Project

95 of 124

Background and Overview of Data Analytics Lifecycle

96 of 124

Phase 1: Discovery

Learning the Business Domain

Resources

Framing the Problem

Identifying Key Stakeholders

Interviewing the Analytics Sponsor

Developing Initial Hypotheses

Identifying Potential Data Sources

97 of 124

Phase 2: Data Preparation

Preparing the Analytic Sandbox

Performing ETLT

Learning About the Data

Data Conditioning

Survey and Visualize

Common Tools for the Data Preparation Phase

98 of 124

Phase 3: Model Planning

Data Exploration and Variable Selection

Model Selection

Common Tools for the Model Planning Phase

99 of 124

Phase 4: Model Building

Common Tools for the Model Building Phase

Commercial Tools:

SAS Enterprise Miner allows users to run predictive and descriptive models based on large volumes of data from across the enterprise. It interoperates with other large data stores, has many partnerships, and is built for enterprise-level computing and analytics.

SPSS Modeler (provided by IBM and now called IBM SPSS Modeler) offers methods to explore and analyze data through a GUI.

Matlab provides a high-level language for performing a variety of data analytics, algorithms, and data exploration.

Alpine Miner provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.

STATISTICA and Mathematica are also popular and well-regarded data mining and analytics tools.

100 of 124

Free or Open Source Tools:

R and PL/R: R was described earlier in the model planning phase, and PL/R is a procedural language for PostgreSQL with R. Using this approach means that R commands can be executed in-database. This technique provides higher performance and is more scalable than running R in memory.

Octave, a free software programming language for computational modeling, has some of the functionality of Matlab. Because it is freely available, Octave is used in major universities when teaching machine learning.

WEKA is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.

Python is a programming language that provides toolkits for machine learning and analysis, such as scikit-learn, numpy, scipy, and pandas, and related data visualization using matplotlib.

SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools. MADlib provides an open-source machine learning library of algorithms that can be executed in-database, for PostgreSQL or Greenplum.
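The in-database idea above can be sketched with Python's built-in sqlite3 module standing in for a full analytic database such as PostgreSQL with MADlib (the table and column names are invented for the example); the point is that the aggregation runs inside the database engine rather than in the desktop tool's memory:

```python
import sqlite3

# An in-memory SQLite database stands in for an analytic data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 150.0), ("west", 80.0), ("west", 120.0)],
)

# The GROUP BY aggregation executes inside the database engine; only the
# small result set is pulled back into the Python process.
rows = conn.execute(
    "SELECT region, AVG(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 125.0), ('west', 100.0)]
conn.close()
```

With a real server-side database, the same pattern avoids moving billions of rows over the network into an in-memory tool.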

101 of 124

Phase 5: Communicate Results

After executing the model, the team needs to compare the outcomes of the modelling to the criteria established for success and failure.

It is critical to articulate the results properly and position the findings in a way that is appropriate for the audience.

The team needs to determine whether it succeeded or failed in its objectives. People are often reluctant to admit failure, but in this instance it should not be treated as a true failure; rather, the data failed to accept or reject a given hypothesis adequately.

The key is to remember that the team must be rigorous enough with the data to determine whether it will prove or disprove the hypotheses outlined in Phase 1 (Discovery).
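Comparing model outcomes to the pre-agreed criteria can be as simple as the following sketch; the metric names and thresholds here are hypothetical, standing in for whatever was agreed with the sponsor in Phase 1:

```python
# Success criteria agreed with the project sponsor in Phase 1 (hypothetical).
criteria = {"accuracy": 0.80, "recall": 0.70}

# Outcomes measured after executing the model (hypothetical).
outcomes = {"accuracy": 0.84, "recall": 0.66}

# The project meets its objectives only if every criterion is satisfied;
# a miss is still informative -- it rejects the initial hypotheses.
failed = {m for m, threshold in criteria.items() if outcomes[m] < threshold}
verdict = "met all criteria" if not failed else f"missed: {sorted(failed)}"
print(verdict)  # missed: ['recall']
```

Making the check explicit like this keeps the success/failure discussion tied to the criteria, not to how the team feels about the results.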

102 of 124

Phase 6: Operationalize

In the final phase, the team communicates the benefits of the project more broadly and sets up a pilot project to deploy the work in a controlled way before broadening the work to a full enterprise or ecosystem of users.

Rather than deploying these models immediately on a wide-scale basis, the risk can be managed more effectively, and the team can learn, by undertaking a small-scope pilot deployment before a wide-scale rollout.

This approach enables the team to learn about the performance and related constraints of the model in a production environment on a small scale and make adjustments before a full deployment.

Part of the operationalizing phase includes creating a mechanism for performing ongoing monitoring of model accuracy and, if accuracy degrades, finding ways to retrain the model.
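A minimal sketch of that monitoring mechanism, assuming accuracy is measured on each new batch of labelled production data (the baseline, tolerance, and batch figures are invented for illustration):

```python
def needs_retraining(batch_accuracies, baseline=0.90, tolerance=0.05):
    """Flag the model for retraining once accuracy on the most recent
    batches degrades more than `tolerance` below the accepted baseline."""
    if not batch_accuracies:
        return False
    recent = batch_accuracies[-3:]
    return sum(recent) / len(recent) < baseline - tolerance

# Accuracy holds up at first, then drifts downward in production.
history = [0.91, 0.90, 0.89, 0.84, 0.82, 0.79]
print(needs_retraining(history[:3]))  # False
print(needs_retraining(history))      # True
```

In practice the trigger would feed an automated retraining pipeline, but the core of the mechanism is exactly this comparison against a baseline.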

103 of 124

Key outputs from a successful analytics project

104 of 124

105 of 124

Best Practices for Big data Analytics

106 of 124

1. UNDERSTAND THE BUSINESS REQUIREMENTS

Analyzing and understanding the business requirements and organizational goals is the first and foremost step, and it must be carried out even before applying big data analytics to your projects.

Business users must understand which projects in their company should use big data analytics to maximize profit.

107 of 124

2. DETERMINE THE COLLECTED DIGITAL ASSETS

The second best practice is to identify the type of data pouring into the organization, as well as the data generated in-house. Usually, the collected data is disorganized and in varying formats. Moreover, some data is never even exploited (so-called dark data), and it is essential that organizations identify this data too.

108 of 124

3. IDENTIFY WHAT IS MISSING

The third practice is analyzing and understanding what is missing.

Once you have collected the data needed for a project, identify the additional information that might be required for that particular project and where it can come from.

For instance, if you want to use big data analytics in your organization to understand your employees' well-being, then along with information such as login/logout times, medical reports, and email records, you need some additional information about, say, the employees' stress levels. This information can be provided by co-workers or team leaders.

109 of 124

4. COMPREHEND WHICH BIG DATA ANALYTICS MUST BE LEVERAGED

After analyzing and collecting data from different sources, it is time for the organization to understand which big data technologies, such as predictive analytics, stream analytics, data preparation, fraud detection, sentiment analysis, and so on, can best serve the current business requirements.

For instance, big data analytics helps HR teams identify the right talent faster during recruitment by combining social media and job portal data using predictive and sentiment analysis.

110 of 124

5. ANALYZE DATA CONTINUOUSLY

This is the final best practice that an organization must follow when it comes to big data. You must always be aware of what data your organization holds and what is being done with it.

Check the health of your data periodically so you never miss any important but hidden signals in the data.

Before implementing any new technology in your organization, it is vital to have a strategy to help you get the most out of it. With adequate and accurate data at their disposal, companies must also follow the above-mentioned big data practices to extract value from this data.

111 of 124

Stages of Big Data Analytical Evolution

The process of dealing with big data is quite different from handling traditional data. Big Data processing consists of

  • Collecting
  • Storing
  • Organizing
  • Analyzing
  • Extracting Hidden Information For Decision Making.

112 of 124

1. Data Collection

This is the first stage, which involves collecting web data, log data, and structured and unstructured data from several types of data sources, such as mobile devices, sensors, and social media.

113 of 124

2. Storing

In this stage, the collected data has to be stored in distributed database systems and servers.

The introduction of NoSQL databases made storing big data much easier.

Since NoSQL does not enforce a fixed schema or relationships between entities, it is well suited to storing dynamic and unstructured data.
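The schema-less point can be illustrated with a tiny document-store sketch in plain Python, with JSON documents standing in for a real NoSQL database such as MongoDB (the field names are invented for the example):

```python
import json

# In a document store, records in one collection need not share a schema.
collection = []

def insert(doc):
    # Each record is stored as a self-describing JSON document.
    collection.append(json.dumps(doc))

# Heterogeneous records: no fixed columns, no foreign-key relationships.
insert({"user": "alice", "clicks": 12})
insert({"user": "bob", "location": {"lat": 17.4, "lon": 78.5}, "tags": ["mobile"]})

docs = [json.loads(d) for d in collection]
print(len(docs), sorted(docs[1].keys()))  # 2 ['location', 'tags', 'user']
```

A relational table would force both records into one column layout up front; here each document simply carries whatever fields it has.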

114 of 124

3. Data Organization

In this stage, data is arranged and organized as structured, semi-structured, and unstructured data so that it can be accessed and analyzed.


115 of 124

4. Analysis

After the data is arranged and organized, the analysis stage begins. Analyzing large data sets involves greater complexity and computation, and research continues into algorithms and mathematical models that minimize computational and storage costs.

The extracted hidden information is useful for industry, academia, and government in taking the necessary actions and decisions.

The infrastructure needed for big data should be highly scalable, support statistical analytics and data mining, and enable quick automated decisions based on analytical models.
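One standard way to reduce the storage cost mentioned above is to compute summary statistics in a single pass over the data instead of holding the whole set in memory; Welford's online algorithm for mean and variance is a classic sketch of this idea (shown here over a small list standing in for a large data stream):

```python
def online_mean_variance(stream):
    """Welford's one-pass algorithm: constant memory no matter how
    large the data stream is."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n          # running mean
        m2 += delta * (x - mean)   # running sum of squared deviations
    variance = m2 / (n - 1) if n > 1 else 0.0
    return mean, variance

# Iterator stands in for data too large to keep in memory at once.
mean, var = online_mean_variance(iter([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
print(mean, var)
```

The same pattern (bounded state, one pass) underlies much of stream analytics.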

116 of 124

5. Data Visualization

Once information has been extracted from the data, it has to be presented visually. This is generally done using data visualization tools, which enable decision makers to grasp difficult concepts and patterns easily.

117 of 124

State of the Practice in Analytics

118 of 124

Business Drivers for Advanced Analytics

119 of 124

BI Versus Data Science

120 of 124

Current Analytical Architecture

121 of 124

Drivers of Big Data

122 of 124

Emerging Big Data Ecosystem and a New Approach to Analytics

123 of 124

Question Time

124 of 124

Have a Great Learning Time!