1 of 214

Miss Sadhana S Kekan

Information Technology

Modern College of Engineering, Shivajinagar, Pune

Email: sadhanakekan@moderncoe.edu.in

DATA SCIENCE AND BIG DATA ANALYTICS

2 of 214

Books :

  1. Krish Krishnan, Data warehousing in the age of Big Data, Elsevier, ISBN: 9780124058910, 1st Edition

Unit Objectives

  1. To introduce the basic need for Big Data and Data Science to handle huge amounts of data.
  2. To understand the application and impact of Big Data.

INTRODUCTION: DATA SCIENCE AND BIG DATA

Unit outcomes:

  1. To understand Big Data primitives.
  2. To learn different programming platforms for big data analytics.

Outcome Mapping: PEO: I, V; PEO: c, e; CO: 1, 2; PSO: 3, 4

3 of 214

INTRODUCTION: DATA SCIENCE AND BIG DATA

  • Data science and Big Data
  • Defining Data science and Big Data
  • Big Data examples
  • Data explosion
  • Data volume
  • Data Velocity
  • Big data infrastructure and challenges
  • Big Data Processing Architectures
  • Data Warehouse
  • Re-Engineering the Data Warehouse
  • Shared everything and shared nothing Architecture
  • Big data learning approaches.

4 of 214

  • Data science is a process that examines:

- from where the information can be taken

- what it signifies

- how it can be converted into a useful resource in the creation of business & IT strategies.

  • Goal : extract value from data in all its forms.

  • Draws on the disciplines of mathematics, statistics, and computer science.

  • Includes methods like machine learning, cluster analysis, data mining & visualization.

INTRODUCTION: DATA SCIENCE AND BIG DATA

5 of 214

  • By mining huge quantities of structured and unstructured data, organizations can:

- reduce costs

- raise efficiency

- identify new market opportunities

- enhance the organization’s competitive advantage.

INTRODUCTION: DATA SCIENCE AND BIG DATA

6 of 214

Fig: Data Science — the intersection of hacker mindset, statistics, advanced computing, visualization, math, domain expertise, data engineering, and the scientific method.

INTRODUCTION: DATA SCIENCE AND BIG DATA

7 of 214

Data Scientists

  • Convert the organization’s raw data into useful information.

  • Manage and understand large amounts of data.

  • Create data visualization models that facilitate demonstrating the business value of digital information.

  • Can illustrate digital information easily with the help of smartphones, Internet of Things devices, and social media.

8 of 214

Fig: The Data Science Pipeline

9 of 214

Data and its structure

  • Data comes in many forms, but at a high level, it falls into three categories: structured, semi-structured, and unstructured.
  • Structured data :

- highly organized data

- exists within a repository such as a database (or a comma-separated values [CSV] file).

- easily accessible.

- format of the data makes it appropriate for queries and computation (by using languages such as Structured Query Language (SQL)).

  • Unstructured data: lacks any content structure at all (for example, an audio stream or natural language text).
  • Semi-structured data: includes metadata, or data that can be more easily processed than unstructured data by using semantic tagging.
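A minimal, illustrative sketch (Python standard library only; the file names and fields are hypothetical) of how each shape of data is typically read:

import csv
import json

# Structured: a CSV file with a fixed schema is directly queryable/computable.
with open('sales.csv') as f:                      # hypothetical file
    rows = list(csv.DictReader(f))                # each row is a dict keyed by column name
    total = sum(float(r['amount']) for r in rows) # hypothetical 'amount' column

# Semi-structured: JSON carries its own tags, so it parses easily,
# but individual records may differ in which fields they contain.
with open('events.json') as f:                    # hypothetical file
    events = json.load(f)
    logins = [e for e in events if e.get('type') == 'login']

# Unstructured: raw text (or audio, images) has no content structure at all;
# it needs extra processing (NLP, signal processing, ...) before analysis.
with open('review.txt') as f:                     # hypothetical file
    text = f.read()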

10 of 214

Data and its structure

Figure : Models of data

11 of 214

Data engineering

Data wrangling:

  • Process of manipulating raw data to make it useful for data analytics or to train a machine learning model.

  • Include:

- sourcing the data from one or more data sets (in addition to reducing the set to the required data),

- normalizing the data so that data merged from multiple data sets is consistent.

- parsing data into some structure or storage for further use.

  • It is the process by which you identify, collect, merge, and preprocess one or more data sets in preparation for data cleansing.
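A minimal sketch of these wrangling steps, assuming the pandas library (the slides do not prescribe a tool) and hypothetical file and column names:

import pandas as pd

# Source: pull only the required columns from each data set.
orders = pd.read_csv('orders.csv', usecols=['customer_id', 'order_date', 'amount'])
customers = pd.read_csv('customers.csv', usecols=['customer_id', 'country'])

# Normalize: make the data sets consistent before merging (key types, date formats).
orders['customer_id'] = orders['customer_id'].astype(str)
customers['customer_id'] = customers['customer_id'].astype(str)
orders['order_date'] = pd.to_datetime(orders['order_date'])

# Merge and parse into a structure/storage for further use (here a CSV file).
merged = orders.merge(customers, on='customer_id', how='left')
merged.to_csv('orders_by_customer.csv', index=False)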

12 of 214

Data cleansing

  • After you have collected and merged your data set, the next step is cleansing. 

  • Data sets in the wild are typically messy and infected with any number of common issues.

  • Common issues include missing values (or too many values), bad or incorrect delimiters (which segregate the data), inconsistent records, and insufficient parameters.

  • When data set is syntactically correct, the next step is to ensure that it is semantically correct.
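A hedged sketch of fixing the common issues listed above, again assuming pandas and a hypothetical file and columns:

import pandas as pd

# Bad or incorrect delimiters: state the real separator explicitly when reading.
df = pd.read_csv('survey.csv', sep=';')

# Missing values: drop records missing required fields, fill optional ones.
df = df.dropna(subset=['customer_id'])
df['age'] = df['age'].fillna(df['age'].median())

# Inconsistent records: normalize spellings and drop exact duplicates.
df['country'] = df['country'].str.strip().str.upper().replace({'U.S.': 'USA'})
df = df.drop_duplicates()

# Semantic correctness: values must also make sense, not just parse.
df = df[(df['age'] >= 0) & (df['age'] <= 120)]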

13 of 214

Data preparation/preprocessing

  • The final step in data engineering.

  • This step assumes that you have a cleansed data set that might not be ready for processing by a machine learning algorithm.

  • Using normalization, you transform an input feature to distribute the data evenly into an acceptable range for the machine learning algorithm.
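As a small illustration (plain Python, not tied to any particular algorithm), min-max normalization rescales a feature into a fixed range such as [0, 1]:

def min_max_normalize(values, low=0.0, high=1.0):
    # Rescale a list of numbers linearly into [low, high].
    lo, hi = min(values), max(values)
    if hi == lo:                    # constant feature: avoid division by zero
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

# A wide-ranging input feature is squeezed into an acceptable range.
print(min_max_normalize([25000, 40000, 120000]))   # [0.0, 0.157..., 1.0]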

14 of 214

Machine learning

  • Create and validate a machine learning model.

  • Sometimes, the machine learning model is the product, which is deployed in the context of an application to provide some capability (such as classification or prediction).

  • In other cases, the product isn’t the trained machine learning algorithm but rather the data that it produces.

15 of 214

Model learning

  • In one model, the algorithm processes the data and creates a new data product as the result.

  • But, in a production sense, the machine learning model is the product itself, deployed to provide insight or add value (such as the deployment of a neural network to provide prediction capabilities for an insurance market).

16 of 214

Machine learning approaches:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning
  • Supervised learning:

- the algorithm is trained to produce the correct class, and the model is altered when it fails to do so.

- The model is trained until it reaches some level of accuracy.

  • Unsupervised learning:

- has no class labels; instead, it inspects the data and groups it based on some structure that is hidden within the data.

- these types of algorithms can be used in recommendation systems by grouping customers based on their viewing or purchasing history.
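The contrast in a few lines of Python, using scikit-learn as an assumed (not prescribed) library and tiny made-up data:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Supervised: features X come with correct classes y, and the model is trained on them.
X = [[180, 80], [175, 70], [160, 55], [155, 50]]      # e.g. height, weight
y = ['adult', 'adult', 'teen', 'teen']
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[170, 65]]))                       # class for unseen data

# Unsupervised: no classes are given; the algorithm groups rows by hidden structure,
# e.g. grouping customers by purchase history for recommendations.
purchases = [[9, 0], [8, 1], [1, 7], [0, 9]]          # counts per product category
print(KMeans(n_clusters=2, n_init=10).fit_predict(purchases))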

Machine learning approaches

17 of 214

Reinforcement learning

- is sometimes described as a semi-supervised learning approach.

- provides a reward after the model makes some number of decisions that lead to a satisfactory result.

Model validation

  • Model validation is used to understand how the model will behave in production after it is trained.
  • For that purpose, a small amount of the available training data is reserved to be tested against the final model (called test data).
  • The training data is used to train the machine learning model.
  • Test data is used when the model is complete to validate how well it generalizes to unseen data.
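A minimal, standard-library sketch of reserving test data (the 80/20 split is a common convention, not mandated by the slides):

import random

def train_test_split(records, test_fraction=0.2, seed=42):
    # Reserve a fraction of the data for validating the final model.
    rng = random.Random(seed)
    shuffled = records[:]                    # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]    # training data, test data

data = list(range(100))                      # stand-in for real labelled records
train, test = train_test_split(data)
print(len(train), len(test))                 # 80 20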

18 of 214

Operations and Model Deployment

Operations:

  • end goal of the data science pipeline.
  • creating a visualization for data product.
  • Deploying machine learning model in a production environment to operate on unseen data to provide prediction or classification.

Model deployment:

  • When the product of the machine learning phase is a model then it will be deployed into some production environment to apply to new data.
  • This model could be a prediction system. 

Example prediction system — input (I/P): historical financial data, e.g., sales and revenue; output (O/P): a classification of whether a company is a reasonable acquisition target.

19 of 214

Operations (continued)

Model visualization:

  • In smaller-scale data science, the product is data rather than a model produced in the machine learning phase.
  • The data product answers some questions about the original data set.
  • Options for visualization are vast and can be produced with, for example, the R programming language.

20 of 214

Summary- Definitions of Data Science

  • Data science is a field within Big Data that seeks to provide meaningful information from huge amounts of complex data.

  • It is a system used for retrieving information in different forms, whether structured or unstructured.

  • It combines different fields of work in statistics & computation in order to understand the data for the purpose of decision making.

21 of 214

INTRODUCTION: DATA SCIENCE AND BIG DATA

  • Data science and Big Data
  • Defining Data science and Big Data
  • Big Data examples
  • Data explosion
  • Data volume
  • Data Velocity
  • Big data infrastructure and challenges
  • Big Data Processing Architectures
  • Data Warehouse
  • Re-Engineering the Data Warehouse
  • Shared everything and shared nothing Architecture
  • Big data learning approaches.

22 of 214

Introduction to Big Data

  • Big Data is a huge amount of data.

  • Organizations use data generated through various sources to run their business.

  • They analyze the data to understand and interpret market trends, study customer behavior, and make financial decisions.

  • Big Data consists of large datasets that cannot be managed efficiently by a common DBMS.

  • These datasets range from terabytes to exabytes.

23 of 214

Introduction to Big Data

  • Mobile phones, credit cards, Radio Frequency Identification (RFID) devices, and social networking platforms create huge amounts of data that may reside unutilized on unknown servers for many years.

  • With the evolution of Big Data, this data can be accessed and analyzed on a regular basis to generate useful information.

  • The sheer Volume, Variety, Velocity, and Veracity of data is signified by the term “Big Data”.

  • Big Data is structured, unstructured, semi-structured, or heterogeneous in nature.

24 of 214

Introduction to Big Data

  • Traditional DBMS, warehousing, and analysis systems fail to analyze such huge amounts of data.

  • Big Data is stored in a distributed-architecture file system.

  • Hadoop by Apache is widely used for storing and managing Big Data.

  • Sorting, organizing, and analyzing this critical data in a systematic manner, and then acting on it, is what Big Data is about.

  • The process of capturing or collecting Big Data is known as “datafication”.

  • Big Data is datafied so that it can be used productively.

25 of 214

Introduction to Big Data

  • Big data can be made useful by:

- organizing it

- determining what we can do with it.

Big Data

Is a new data challenge that requires leveraging existing systems differently.

Is classified in terms of 4 Vs: Volume, Variety, Velocity, Veracity.

Is usually unstructured & qualitative in nature.

26 of 214

Real-world Examples of Big Data

  • Consumer product companies and retail organizations are observing data on social media websites such as Facebook and Twitter. These sites help them analyze customer behavior, preferences, and product perception. Accordingly, the companies can line up their upcoming products to gain profits; this is known as social media analytics.

  • Sports teams are using data for tracking ticket sales & even for tracking team strategies.

27 of 214

  • Big data is a pool of huge amounts of data of all types , shapes and formats collected from different sources.

Evolution of Bigdata

  • Big Data is the new stage of data evolution, driven by the velocity, variety, and volume of data.

  • Velocity: the speed with which data flows into an organization.

  • Variety: the varied forms of data, such as structured, semi-structured, and unstructured.

  • Volume: the amount of data an organization has to deal with.

Big Data

28 of 214

Big Data

Structuring Big data

  • Structuring means arranging the available data in a manner such that it becomes easy to study, analyze, and derive conclusions from it.

  • Structuring data helps in understanding user behavior, requirements, and preferences in order to make personalized recommendations for every individual.

  • Eg: when a user regularly visits or purchases from online shopping sites, then each time he logs in, the system can present a recommended list of products that may interest the user on the basis of his earlier purchases or searches.

  • Different types of data (e.g., images, text, audio) can be structured only if they are sorted and organized in some logical pattern.

29 of 214

Big Data

Fig: Concepts of Big Data — parallel processing, data science, artificial intelligence, data mining, distributed systems, data storage, and analysis.

30 of 214

Big Data

Fig: Types of Data — structured, semi-structured, and unstructured data.

31 of 214

Elements of Big Data

  • Volume
  • Velocity
  • Variety
  • Veracity

32 of 214

Volume

  • Amount of data generated by organizations.

  • The volume of data in most organizations approaches exabytes.

  • Organizations are doing their best to handle this ever-increasing volume of data.

  • The Internet alone generates a huge amount of data.

  • Eg: the Internet has around 14.3 trillion live pages; 48 billion web pages are indexed by Google and 14 billion by Microsoft Bing.

33 of 214

Velocity

  • Rate at which data is generated, captured & shared.

  • Is flow of data from various sources such as networks, human resources, social media etc.

  • The data can be huge & flows in continuous manner.

  • Enterprises can capitalize on data only if it is captured & shared in real time.

  • Information processing systems such as CRM & ERP face problems associated with data, which keeps adding up but cannot be processed quickly.

34 of 214

Velocity

  • These systems are only able to process data in batches every few hours.

  • Eg: eBay analyzes around 5 million transactions per day in real time to detect & prevent frauds arising from the use of PayPal.

  • Sources of high-velocity data include:
  • IT devices: routers , switches, firewalls.
  • Social media: Facebook posts , tweets etc.
  • Portable devices: Mobile

  • Examples of data velocity :

Amazon, FB, Yahoo, Google, Sensor data, Mobile networks etc.

35 of 214

Variety

  • Data generated from different types of sources such as Internal, External, Social etc.

  • Data comes in different formats (images, text, videos).

  • Single source can generate data in varied formats.

  • Eg: GPS & Social networking sites such as Facebook produce data of all types, including text, images, videos.

36 of 214

Veracity

  • Refers to the uncertainty of data, that is, whether the obtained data is correct and consistent.

  • Out of the huge amount of data, only correct and consistent data can be used for further analysis.

  • Unstructured and semi-structured data take a lot of effort to clean and make suitable for analysis.

37 of 214

Data Explosion

  • Data explosion is the rapid growth of data.

  • One reason for this explosion is innovation.

  • Innovation has transformed the way we engage in business, provide services, and the associated measurement of value and profitability.

  • Three basic trends build up the data:
  • Business model transformation
  • Globalization
  • Personalization of services

38 of 214

Business model transformation

  • Modern companies have moved toward service-oriented rather than product-oriented business models.

  • In a service-oriented model, the value of the organization from the customer’s point of view is measured by how effective the service is rather than how useful the product is.

  • The amount of data produced and consumed by every organization today exceeds what the same organization produced prior to the business transformation.

  • Higher-priority data is kept at the center, and supporting data that is required but was previously not available or accessible can now be made available and accessible through multiple channels.

39 of 214

Globalization

  • Is a key trend that has radically changed the commerce of the world, starting from manufacturing to customer service.

Personalization of services

  • Business transformation’s maturity index is measured by the extent of personalization of services and the value perceived by their customers from such transformation.

40 of 214

Big Data Processing Architectures

  • Big data architecture is designed to handle the processing and analysis of data that is too large or complex for a traditional DBMS.

41 of 214

Fig: Components of Big Data architecture

42 of 214

Big Data Processing Architectures

Data sources.

  • All big data solutions start with one or more data sources.
  • Examples include:
  • Application data stores, such as relational databases.
  • Static files produced by applications, such as web server log files.
  • Real-time data sources, such as IoT devices.

Data storage.

  • Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats.
  • This kind of store is often called a data lake.

43 of 214

Big Data Processing Architectures

Batch processing.

  • Data files are processed using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis.
  • Usually these jobs involve reading source files, processing them, and writing the output to new files.

Real-time message ingestion.

  • If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing.

Stream processing.

After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink.

44 of 214

Big Data Processing Architectures

Analytical data store.

Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools.

Analysis and reporting.

The goal of most big data solutions is to provide insights into the data through analysis and reporting.

Orchestration:

  • Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard.
  • To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie with Sqoop.

45 of 214

INTRODUCTION: DATA SCIENCE AND BIG DATA

  • Data science and Big Data
  • Defining Data science and Big Data
  • Big Data examples
  • Data explosion
  • Data volume
  • Data Velocity
  • Big data infrastructure and challenges
  • Big Data Processing Architectures
  • Data Warehouse
  • Re-Engineering the Data Warehouse
  • Shared everything and shared nothing Architecture
  • Big data learning approaches.

46 of 214

Data processing infrastructure challenges

47 of 214

Data processing infrastructure challenges

Storage

  • The first and major problem for big data is storage.

  • As Big Data increases rapidly, there is a need to process this huge data as well as to store it.

  • We need roughly an additional 0.5 times the storage to process and store the intermediate result set.

  • Storage has been a problem in the world of transaction processing and data warehousing.

  • Due to the design of the underlying software, we do not consume all the storage that is available on a disk.

  • Another problem with storage is the cost per byte.

48 of 214

Data processing infrastructure challenges

Transportation

  • One of the biggest issues is moving data between different systems and then storing it or loading it into memory for manipulation.

  • This continuous movement of data has been one of the reasons that structured data processing evolved to be restrictive in nature (the data had to be transported between the compute and storage layers).

  • Network technologies have allowed the bandwidth of the transport layers to become much bigger and more scalable.

49 of 214

Data processing infrastructure challenges

Processing

  • Processing combines some form of logical and mathematical calculations together in one cycle of operation.

  • It is divided into three main areas:

1. CPU or processor.

2. Memory

3. Software

50 of 214

Data processing infrastructure challenges

CPU or processor.

  • With each generation:

- the computing speed and processing power have increased

-leading to more processing capabilities

- access to wider memory.

- architecture evolution within the software layers.

Memory.

  • The storage of data to disk for offline processing drove the need for storage evolution and data management.

  • As processor evolution improved the capability of the processor, memory has become cheaper and faster.

  • The amount of memory allocated to a process residing within a system has also changed significantly.

51 of 214

Data processing infrastructure challenges

Software

  • The main component of data processing.

  • It is used to develop the programs that transform and process the data.

  • Software across different layers from operating systems to programming languages has evolved generationally.

  • Translates sequenced instruction sets into machine language that is used to process data with the infrastructure layers of CPU + memory + storage.

52 of 214

Data processing infrastructure challenges

Speed or throughput

  • The biggest continuing challenge.

  • Speed is a combination of various architecture layers: hardware, software, networking, and storage.

53 of 214

Big Data Processing Architectures

54 of 214

Big Data Processing Architectures

Centralized Processing Architecture

    • All the data is collected to a single centralized storage area and processed by a single computer

    • Evolved with transaction processing and are well suited for small organizations with one location of service.

Advantages :

  • requires minimal resources both from people & system perspectives.

      • Centralized processing is very successful when the collection and consumption of data occurs at the same location.

55 of 214

Big Data Processing Architectures

Distributed Processing Architecture

Data and its processing are distributed across geographies or data centers

Types:

1. Client–server architecture

Client: collection and presentation

Server: processing and management

2. Three-tier architecture

Client, server, middle tier

Middle tier: processing logic

3. n-tier architecture

Clients, middleware, applications, and servers are isolated into tiers.

Any tier can be scaled independently.

56 of 214

Big Data Processing Architectures

4. Cluster architecture.

  • Machines are connected in a network architecture .
  • Software and hardware work together to process data or compute requirements in parallel.
      • Each machine in a cluster is associated with a task that is processed locally and the result sets are collected to a master server that returns it back to the user.

5. Peer-to-peer architecture.

  • No dedicated servers and clients; instead, all the processing responsibilities are allocated among all machines, known as peers.
  • Each machine can perform the role of a client or server or just process data.

57 of 214

Big Data Processing Architectures

Distributed processing advantages :

Scalability of systems and resources can be achieved based on isolated needs.

Processing and management of information can be architected based on desired unit of operation.

– Parallel processing of data reducing time latencies.

Distributed processing Disadvantages:

– Data redundancy

– Process redundancy

– Resource overhead

– Volumes

58 of 214

Big Data Processing Architectures

  • When working with very large data sets, it can take a long time to run the sort of queries that clients need.

  • These queries can't be performed in real time, and often require algorithms such as MapReduce that operate in parallel across the entire data set.

  • The results are then stored separately from the raw data and used for querying.

  • One drawback to this approach is that it introduces latency — if processing takes a few hours, a query may return results that are several hours old.

  • Ideally, you would like to get some results in real time (perhaps with some loss of accuracy), and combine these results with the results from the batch analytics.

59 of 214

Big Data Processing Architectures

  • The lambda architecture, first proposed by Nathan Marz, addresses this problem by creating two paths for data flow.

  • All data coming into the system goes through these two paths:

  • batch layer (cold path) stores all of the incoming data in its raw form and performs batch processing on the data. The result of this processing is stored as a batch view.

  • speed layer (hot path) analyzes data in real time. This layer is designed for low latency, at the expense of accuracy.

60 of 214

Big Data Processing Architectures

Lambda Architecture

The batch layer feeds into a serving layer that indexes the batch view for efficient querying.

The speed layer updates the serving layer with incremental updates based on the most recent data.

61 of 214

Big Data Processing Architectures
  • The hot and cold paths converge at the analytics client application.

  • If the client needs to display timely, yet potentially less accurate data in real time, it will acquire its result from the hot path.

  • Otherwise, it will select results from the cold path to display less timely but more accurate data.

  • In other words, the hot path has data for a relatively small window of time, after which the results can be updated with more accurate data from the cold path.

62 of 214

Big Data Processing Architectures
  • The raw data stored at the batch layer is immutable.

  • Incoming data is always appended to the existing data, and the previous data is never overwritten.

  • Any changes to the value of a particular data item are stored as a new timestamped event record.

  • This allows for recomputation at any point in time across the history of the data collected.

  • The ability to recompute the batch view from the original raw data is important, because it allows for new views to be created as the system evolves.

63 of 214

Big Data Processing Architectures

Lambda Architecture

Batch Layer (Cold Path)

Stores all incoming data and performs batch processing

Manages all historical data

Recomputes the results, e.g., using machine learning models

Results come at high latency due to computational cost

Data can only be appended, not updated or deleted

Data is stored using in-memory databases or long-term persistent stores such as NoSQL storage

Uses MapReduce

Speed Layer (Hot Path)

Provides low-latency results

Data is processed in real time

Uses incremental algorithms

Creating and deleting data sets is possible

64 of 214

Big Data Processing Architectures

Lambda Architecture

Serving Layer:

Users fire queries against the serving layer

Applications:

Ad-hoc queries

Netflix, Twitter, Yahoo

Pros:

The batch layer manages historical data, so errors are low when the system crashes

Good speed and reliability

Fault tolerance and scalable processing

Cons:

Caching overhead, complexity, duplicate computation

Difficult to migrate or reorganize

65 of 214

Kappa Architecture

  • A drawback to the lambda architecture is its complexity. Processing logic appears in two different places — the cold and hot paths — using different frameworks. This leads to duplicate computation logic and the complexity of managing the architecture for both paths.

  • The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture.

  • It has the same basic goals as the lambda architecture, but with an important distinction: All data flows through a single path, using a stream processing system.

Big Data Processing Architectures

66 of 214

  • Kappa Architecture

Big Data Processing Architectures

67 of 214

Big Data Processing Architectures

  • There are some similarities to the lambda architecture's batch layer, in that the event data is immutable and all of it is collected, instead of a subset.

  • The data is ingested as a stream of events into a distributed and fault tolerant unified log.

  • These events are ordered, and the current state of an event is changed only by a new event being appended.

  • Similar to a lambda architecture's speed layer, all event processing is performed on the input stream and persisted as a real-time view.

  • If you need to recompute the entire data set (equivalent to what the batch layer does in lambda), you simply replay the stream, typically using parallelism to complete the computation in a timely fashion.

68 of 214

Kappa Architecture

  • A simplified lambda architecture with the batch layer removed
  • The speed layer is capable of handling both real-time and batch data
  • Only two layers: stream processing and serving
  • All event processing is performed on the input stream and persisted as a real-time view
  • The speed layer is designed using Apache Storm or Spark

Big Data Processing Architectures

69 of 214

70 of 214

Zeta architecture

  • This is a next-generation enterprise architecture proposed by Jim Scott.

  • It is a pluggable architecture consisting of a distributed file system, real-time data storage, a pluggable compute model/execution engine, a deployment/container management system, a solution architecture, enterprise applications, and dynamic and global resource management.

Big Data Processing Architectures

71 of 214

Zeta architecture diagram

Big Data Processing Architectures

72 of 214

The Traditional Research Approach

Fig: The traditional research approach — clients send queries to an integration system (with metadata), which pulls data on demand from multiple sources through wrappers.

  • Query-driven (lazy, on-demand)

73 of 214

  • Delay in query processing
    • Slow or unavailable information sources
    • Complex filtering and integration
  • Inefficient and potentially expensive for frequent queries
  • Competes with local processing at sources
  • Hasn’t caught on in industry

Big Data Processing Architectures

74 of 214

Data Warehouse

75 of 214

The Warehousing Approach

Fig: The warehousing approach — extractors/monitors pull data from the sources into an integration system (with metadata), which loads the data warehouse that clients query directly.

  • Information is integrated in advance
  • Stored in the warehouse for direct querying and analysis

76 of 214

The Warehousing Approach

  • A technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible.
  • A decision support database maintained separately from the organization’s operational database.

77 of 214

The Warehousing Approach

Definition: A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin]

  • By comparison: an OLTP (on-line transaction processor) or operational system is used to deal with the everyday running of one aspect of an enterprise.

  • OLTP systems are usually designed independently of each other and it is difficult for them to share information.

78 of 214

Characteristics of Data Warehouse

  • Subject oriented. Data are organized based on how the users refer to them.
  • Integrated. All inconsistencies regarding naming convention and value representations are removed.
  • Nonvolatile. Data are stored in read-only format and do not change over time.
  • Time variant. Data are not current but normally time series.

79 of 214

Characteristics of Data Warehouse

  • Summarized. Operational data are mapped into a decision-usable format.
  • Large volume. Time series data sets are normally quite large.
  • Not normalized. DW data can be, and often are, redundant.
  • Metadata. Data about data are stored.
  • Data sources. Data come from internal and external unintegrated operational systems.

80 of 214

Need of Data Warehouses

  • Consolidation of information resources
  • Improved query performance
  • Separate research and decision support functions from the operational systems
  • Foundation for data mining, data visualization, advanced reporting and OLAP tools

81 of 214

Data Warehouse

A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process is called a data warehouse.

Subject-oriented

The DWH is organized around the major subjects of the enterprise (e.g., customer, product, sales) rather than application areas (customer invoicing, stock control, product sale).

Integrated: Data comes from enterprise-wide applications in different formats.

Time-variant

The DWH holds data referring to different points and intervals in time, so it behaves differently at different time intervals.

Non-volatile

New data is always added to the existing data rather than replacing it.

82 of 214

Merits of a Data Warehouse

  1. Delivers enhanced Business Intelligence
  2. Ensures data quality and consistency

The DWH supports data conversion into a common and standard format

No discrepancies

3. Saves time and money

Saves users’ time since the data is in one place

DWH execution does not require extensive IT support or a high number of channels

4. Tracks historical data intelligently

Provides updates about changing trends

5. Generates high revenue

83 of 214

Advantages of Warehousing Approach

  • High query performance
    • But not necessarily most current information
  • Doesn’t interfere with local processing at sources
    • Complex queries at warehouse
    • OLTP at information sources
  • Information copied at warehouse
    • Can modify, annotate, summarize, restructure, etc.
    • Can store historical information
    • Security, no auditing
  • Has caught on in industry

84 of 214

Which are our lowest/highest margin customers?

Who are my customers and what products are they buying?

Which customers are most likely to go to the competition?

What impact will new products/services have on revenue and margins?

What product promotions have the biggest impact on revenue?

What is the most effective distribution channel?

85 of 214

  • Knowledge discovery
    • Making consolidated reports
    • Finding relationships and correlations
    • Data mining
    • Examples
      • Banks identifying credit risks
      • Insurance companies searching for fraud
      • Medical research

86 of 214

Comparison Chart of Database Types

Data warehouse | Operational system

Subject oriented | Transaction oriented
Large (hundreds of GB up to several TB) | Small (MB up to several GB)
Historic data | Current data
De-normalized table structure (few tables, many columns per table) | Normalized table structure (many tables, few columns per table)
Batch updates | Continuous updates
Usually very complex queries | Simple to complex queries

87 of 214

Comparison Chart of Database Types

Data warehouse | Big Data

Extracts data from a variety of SQL-based data sources (relational databases) and helps generate analytic reports. | Handles huge data coming from various heterogeneous resources, including social media.
Mainly handles structured data. | Can handle structured, unstructured, and semi-structured data.
Analytics works on informed (specific) information. | Has a lot of data, so analytics provides information by extracting useful information from the data.
Does not use a distributed file system. | Uses a distributed file system.
Never erases previous data when new data is added. | Also never erases previous data when new data is added, but sometimes real-time data streams are processed.
Time to fetch data simultaneously is longer. | Time to fetch data simultaneously is shorter, using the Hadoop File System.

88 of 214

Reengineering the data warehouse

89 of 214

Reengineering the Data Warehouse

Enterprise data warehouse platform

There are several layers of infrastructure that make up the platform for the EDW:

1. The hardware platform:

● Database server:

– Processor

– Memory

– BUS architecture

● Storage server:

– Class of disk

– Controller

● Network

2. Operating system

3. Application software:

● Database

● Utilities

90 of 214

Data distribution in a data warehouse

Operational data store

Reengineering the Data Warehouse

91 of 214

Choices for reengineering the data warehouse

Replatforming

  • Replatform the data warehouse to a new platform including all hardware & infrastructure.

  • There are several new technology options and depending on the requirement of the organization, any of these technologies can be deployed.

  • The choices include data warehouse appliances, commodity platforms, tiered storage, private cloud, and in-memory technologies.

Reengineering the Data Warehouse

92 of 214

Reengineering the Data Warehouse

Benefits:

● Move the data warehouse to a scalable and reliable platform.

● The underlying infrastructure and the associated application software layers can be architected to provide security, lower maintenance, and increased reliability.

● Optimize the application and database code.

● Provide some additional opportunities to use new functionality.

● Make it possible to rearchitect things in a different/better way, which is almost impossible to do in an existing setup.

93 of 214

Reengineering the Data Warehouse

Disadvantages:

Takes a long cycle time to complete, leading to disruption of business activities.

● Replatforming often means reverse engineering complex business processes and rules that may be undocumented or custom developed in the current platform.

● May not be feasible for certain aspects of data processing or there may be complex calculations that need to be rewritten if they cannot be directly supported by the functionality of the new platform.

● Replatforming is not economical in environments that have large legacy platforms, as it consumes too many business process cycles to reverse engineer the logic and document it.

94 of 214

Data warehouse platform.

95 of 214

Platform engineering

  • Modify parts of the infrastructure and get great gains in scalability and performance.

  • Used in the automotive industry where the focus was on improving quality, reducing costs, and delivering services and products to end users in a highly cost-efficient manner.

  • Applied to the data warehouse, platform engineering can translate to:

● Reduce the cost of the data warehouse.

● Increase efficiencies of processing.

● Simplify the complexities in the acquisition, processing, and delivery of data.

● Reduce redundancies.

96 of 214

Platform engineering

  • Platform reengineering can be done at multiple layers:
  • Storage level: storage layer of the data is engineered to process data at very high speeds for high or low volumes.
  • Server reengineering: hardware and its components can be replaced with more modern components that can be supported in the configuration
  • Network reengineering: In this approach the network layout and the infrastructure are reengineered.
  • Data warehouse appliances: In this approach the entire data warehouse or data mart can be ported to a data warehouse appliance. The data warehouse appliance is an integrated stack of hardware, software, and network components designed and engineered to handle data warehouse rigors.
  • Application server: In this approach the application server is customized to process reports and analytic layers across a clustered architecture.

97 of 214

Platform engineering

Data engineering

  • The data structures are reengineered to create better performance.

  • The data model developed as a part of the initial data warehouse is often scrubbed and new additions are made to the data model.

Typical changes include:

  • Partitioning—a table can be vertically partitioned depending on the usage of columns, thus reducing the span of I/O operations. Another partition technique is horizontal partitioning where the table is partitioned by date or numeric ranges into smaller slices.

  • Colocation—a table and all its associated tables can be colocated in the same storage region.

98 of 214

Platform engineering

  • Distribution—a large table can be broken into a distributed set of smaller tables and used.

  • New data types—several new data types like geospatial and temporal data can be used in the data architecture and current workarounds for such data can be retired. This will provide a significant performance boost.

  • New database functions—several new databases provide native functions like scalar tables and indexed views, and can be utilized to create performance boosts.

99 of 214

Architectures

100 of 214

Shared-everything architecture

  • A system architecture where all resources are shared, including storage, memory, and the processor.

  • Two variations of shared-everything architecture are:

  • Symmetric multiprocessing (SMP)

  • Distributed shared memory (DSM).

101 of 214

Symmetric multiprocessing (SMP):

- Processors share a single pool of memory for read-write access concurrently and uniformly, without latency.
- Referred to as uniform memory access (UMA) architecture.
- Drawback: when multiple processors share a single system bus, the bandwidth for simultaneous memory access is choked; therefore the scalability of such a system is very limited.

Distributed shared memory (DSM):

- Addresses the scalability problem by providing multiple pools of memory for processors to use.
- Referred to as non-uniform memory access (NUMA) architecture.
- The latency to access memory depends on the relative distances of the processors and their dedicated memory pools.

102 of 214

  • Both SMP and DSM architectures have been deployed for many transaction processing systems , where the transactional data is small in size and has a short burst cycle of resource requirements.

  • Data warehouses have been deployed on the shared-everything architecture for many years.

  • Due to the intrinsic architecture limitations, the direct impact has been on cost and performance.

  • Analytical applications and Big Data cannot be processed on a shared-everything architecture.

Shared-everything architecture

103 of 214

Fig: Shared-everything architecture

104 of 214

Shared-nothing architecture

  • Is a distributed computing architecture where multiple systems (called nodes) are networked to form a scalable system.

  • Each node has its own private memory, disks, and storage devices independent of any other node in the configuration.

  • None of the nodes share memory or disk storage.

  • Each processor has its own local memory & local disk.

  • Intercommunication channel is used by the processors to communicate.

  • Processors can independently act as a server to serve the data of local disk.

105 of 214

Fig: Shared-nothing architecture

106 of 214

Shared-nothing architecture

  • The key flexibility of the architecture is its scalability.

  • This is the underlying architecture for data warehouse appliances and large-scale data processing.

  • The extensibility and infinite scalability of this architecture make it the platform architecture for Internet and web applications.

  • The key feature is that the operating system, not the application server, owns responsibility for controlling and sharing hardware resources.

  • A system can assign dedicated applications or partition its data among the different nodes to handle a particular task.

107 of 214

Advantages of shared-nothing architecture

  • Scalable.
  • When a node is added, transmission capacity increases.
  • Failure is local (failure of one node does not affect other nodes).

Disadvantages of shared-nothing architecture

  • The cost of communication is higher than in a shared-memory architecture.
  • Sending data involves software interaction.
  • More coordination is required.

108 of 214


109 of 214


110 of 214

What is big data?

Big Data is any thing which is crash Excel.

Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM.


Or, in other words, Big Data is data in volumes too great to process by traditional methods.

https://twitter.com/devops_borat

111 of 214

Data accumulation


  • Today, data is accumulating at tremendous rates
    • click streams from web visitors
    • supermarket transactions
    • sensor readings
    • video camera footage
    • GPS trails
    • social media interactions

...

  • It really is becoming a challenge to store and process it all in a meaningful way

112 of 214

From WWW to VVV


  • Volume
    • data volumes are becoming unmanageable
  • Variety
    • data complexity is growing
    • more types of data captured than previously
  • Velocity
    • some data is arriving so rapidly that it must either be processed instantly, or lost
    • this is a whole subfield called “stream processing”

113 of 214

The promise of Big Data

  • Data contains information of great business value
  • If you can extract those insights you can make far better decisions
  • ...but is data really that valuable?

114 of 214


115 of 214


116 of 214

“quadrupling the average cow's milk production since your parents were born”


"When Freddie [as he is known] had no daughter records our equations predicted from his DNA that he would be the best bull," USDA research geneticist Paul VanRaden emailed me with a detectable hint of pride. "Now he is the best progeny tested bull (as predicted)."

117 of 214

Some more examples


  • Sports
    • basketball increasingly driven by data analytics
    • soccer beginning to follow
  • Entertainment
    • House of Cards designed based on data analysis
    • increasing use of similar tools in Hollywood
  • “Visa Says Big Data Identifies Billions of Dollars in Fraud”
    • new Big Data analytics platform on Hadoop
  • “Facebook is about to launch Big Data play”
    • starting to connect Facebook with real life

https://delicious.com/larsbot/big-data

118 of 214

Ok, ok, but ... does it apply to our customers?


  • Norwegian Food Safety Authority
    • accumulates data on all farm animals
    • birth, death, movements, medication, samples, ...
  • Hafslund
    • time series from hydroelectric dams, power prices, meters of individual customers, ...
  • Social Security Administration
    • data on individual cases, actions taken, outcomes...
  • Statoil
    • massive amounts of data from oil exploration, operations, logistics, engineering, ...
  • Retailers
    • see Target example above
    • also, connection between what people buy, weather forecast, logistics, ...

119 of 214

How to extract insight from data?

Monthly Retail Sales in New South Wales (NSW) Retail Department Stores


120 of 214

Types of algorithms

  • Clustering
  • Association learning
  • Parameter estimation
  • Recommendation engines
  • Classification
  • Similarity matching
  • Neural networks
  • Bayesian networks
  • Genetic algorithms


121 of 214

Basically, it’s all maths...

  • Linear algebra
  • Calculus
  • Probability theory
  • Graph theory
  • ...


https://twitter.com/devops_borat

Only 10% in devops are know how of work with Big Data.

Only 1% are realize they are need 2 Big Data for fault tolerance

122 of 214

Big data skills gap

  • Hardly anyone knows this stuff
  • It’s a big field, with lots and lots of theory
  • And it’s all maths, so it’s tricky to learn


http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap

http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap

123 of 214

Two orthogonal aspects


  • Analytics / machine learning
    • learning insights from data
  • Big data
    • handling massive data volumes
  • Can be combined, or used separately

124 of 214

Data science?


125 of 214

How to process Big Data?


  • If relational databases are not enough, what is?

https://twitter.com/devops_borat

Mining of Big Data is problem solve in 2013 with zgrep

126 of 214

MapReduce

  • A framework for writing massively parallel code
  • Simple, straightforward model
  • Based on “map” and “reduce” functions from functional programming (LISP)
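A conceptual sketch only (one machine, plain Python) of the same map/reduce idea that Hadoop and similar frameworks run in parallel across a cluster:

from functools import reduce
from collections import Counter

documents = ["big data is big", "data science uses big data"]

# Map: each document independently emits word counts - trivially parallelizable.
mapped = [Counter(doc.split()) for doc in documents]

# Reduce: partial results with the same keys are merged into the final answer.
word_counts = reduce(lambda a, b: a + b, mapped, Counter())
print(word_counts)          # Counter({'big': 3, 'data': 3, ...})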


127 of 214

NoSQL and Big Data


  • Not really that relevant
  • Traditional databases handle big data sets, too
  • NoSQL databases have poor analytics
  • MapReduce often works from text files
    • can obviously work from SQL and NoSQL, too
  • NoSQL is more for high throughput
    • basically, AP from the CAP theorem, instead of CP
  • In practice, really Big Data is likely to be a mix
    • text files, NoSQL, and SQL

128 of 214

The 4th V: Veracity


https://twitter.com/devops_borat

“The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.”

Daniel Boorstin, in The Discoverers (1983)

95% of time, when is clean Big Data is get Little Data

129 of 214

Data quality

  • A huge problem in practice
    • any manually entered data is suspect
    • most data sets are in practice deeply problematic
  • Even automatically gathered data can be a problem
    • systematic problems with sensors
    • errors causing data loss
    • incorrect metadata about the sensor
  • Never, never, never trust the data without checking it!
    • garbage in, garbage out, etc


130 of 214

Approaches to learning


  • Supervised
    • we have training data with correct answers
    • use training data to prepare the algorithm
    • then apply it to data without a correct answer
  • Unsupervised
    • no training data
    • throw data into the algorithm, hope it makes some kind of sense out of the data

131 of 214

Approaches to learning


  • Prediction
    • predicting a variable from data
  • Classification
    • assigning records to predefined groups
  • Clustering
    • splitting records into groups based on similarity
  • Association learning
    • seeing what often appears together with what

132 of 214

Issues


  • Data is usually noisy in some way
    • imprecise input values
    • hidden/latent input values
  • Inductive bias
    • basically, the shape of the algorithm we choose
    • may not fit the data at all
    • may induce underfitting or overfitting
  • Machine learning without inductive bias is not possible

133 of 214

Underfitting

  • Using an algorithm that cannot capture the full complexity of the data


134 of 214

Overfitting

  • Tuning the algorithm so carefully it starts matching the noise in the training data
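A tiny, hedged illustration (plain Python, made-up noisy data): a model that simply memorizes the training set (1-nearest neighbour) scores perfectly on the data it was tuned to, but noticeably worse on fresh data from the same source:

import random

random.seed(0)

def noisy_sample(n):
    # Points x in [0, 1]; the true class is x > 0.5, but 20% of labels are flipped (noise).
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < 0.2:
            label = not label
        data.append((x, label))
    return data

train, test = noisy_sample(50), noisy_sample(50)

def predict(x, memory):
    # 1-nearest neighbour: copy the label of the closest training point.
    return min(memory, key=lambda point: abs(point[0] - x))[1]

for name, data in (('train', train), ('test', test)):
    accuracy = sum(predict(x, train) == y for x, y in data) / len(data)
    print(name, round(accuracy, 2))   # 1.0 on training data, clearly lower on test data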


135 of 214

“What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.”


136 of 214

Testing


  • When doing this for real, testing is crucial
  • Testing means splitting your data set
    • training data (used as input to algorithm)
    • test data (used for evaluation only)
  • Need to compute some measure of performance
    • precision/recall
    • root mean square error
  • A huge field of theory here
    • will not go into it in this course
    • very important in practice
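Two of the measures mentioned above, sketched in plain Python on made-up predictions:

def precision_recall(predicted, actual):
    # Precision and recall for a binary classifier (True = positive class).
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    return tp / (tp + fp), tp / (tp + fn)

def rmse(predicted, actual):
    # Root mean square error for numeric predictions.
    return (sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)) ** 0.5

print(precision_recall([True, True, False, True], [True, False, False, True]))  # (0.666..., 1.0)
print(rmse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))                                   # ~0.408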

137 of 214

Missing values


  • Usually, there are missing values in the data set
    • that is, some records have some NULL values
  • These cause problems for many machine learning algorithms
  • Need to solve somehow
    • remove all records with NULLs
    • use a default value
    • estimate a replacement value

...
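A small sketch of the three strategies above, assuming pandas and a made-up frame:

import pandas as pd

df = pd.DataFrame({'age': [34, None, 29, None],
                   'city': ['Oslo', 'Bergen', None, 'Oslo']})

dropped = df.dropna()                                            # remove all records with NULLs
defaulted = df.fillna({'city': 'unknown'})                       # use a default value
estimated = df.assign(age=df['age'].fillna(df['age'].median()))  # estimate a replacement value

print(dropped, defaulted, estimated, sep='\n\n')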

138 of 214

Terminology


  • Vector
    • one-dimensional array
  • Matrix
    • two-dimensional array
  • Linear algebra
    • algebra with vectors and matrices
    • addition, multiplication, transposition, ...

139 of 214

Top 10 algorithms


140 of 214

Top 10 machine learning algs

1. C4.5

2. k-means clustering

3. Support vector machines

4. the Apriori algorithm

5. the EM algorithm

6. PageRank

7. AdaBoost

8. k-nearest neighbours classification

9. Naïve Bayes

10. CART


From a survey at the IEEE International Conference on Data Mining (ICDM) in December 2006: “Top 10 algorithms in data mining”, by X. Wu et al.

141 of 214

C4.5


  • Algorithm for building decision trees
    • basically trees of boolean expressions
    • each node splits the data set in two
    • leaves assign items to classes
  • Decision trees are useful not just for classification
    • they can also teach you something about the classes
  • C4.5 is a bit involved to learn
    • the ID3 algorithm is much simpler
  • CART (#10) is another algorithm for learning decision trees

142 of 214

Support Vector Machines


  • A way to do binary classification on matrices
  • Support vectors are the data points nearest to the hyperplane that divides the classes
  • SVMs maximize the distance between SVs and the boundary
  • Particularly valuable because of “the kernel trick”: using a transformation to a higher dimension to handle more complex class boundaries

  • A bit of work to learn, but manageable

143 of 214

Apriori


  • An algorithm for “frequent itemsets”
    • basically, working out which items frequently appear together

    • for example, what goods are often bought together in the supermarket?
    • used for Amazon’s “customers who bought this...”
  • Can also be used to find association rules
    • that is, “people who buy X often buy Y” or similar
  • Apriori is slow
    • a faster, further development is FP-growth

144 of 214

Expectation Maximization


  • A deeply interesting algorithm I’ve seen used in a number of contexts
    • very hard to understand what it does
    • very heavy on the maths
  • Essentially an iterative algorithm
    • skips between an “expectation” step and a “maximization” step

    • tries to optimize the output of a function
  • Can be used for
    • clustering
    • a number of more specialized examples, too

145 of 214

PageRank


  • Basically a graph analysis algorithm
    • identifies the most prominent nodes
    • used for weighting search results on Google
  • Can be applied to any graph
    • for example an RDF data set
  • Basically works by simulating random walk
    • estimating the likelihood that a walker would be on a given node at a given time

    • actual implementation is linear algebra
  • The basic algorithm has some issues
    • “spider traps”
    • graph must be connected
    • straightforward solutions to these exist
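A toy power-iteration sketch in plain Python on a hypothetical four-node graph; the damping factor is the standard fix for spider traps and weakly connected graphs, and real implementations do the same computation as linear algebra at scale:

links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}   # hypothetical link graph
nodes = list(links)
rank = {n: 1.0 / len(nodes) for n in nodes}
damping = 0.85

for _ in range(50):                          # iterate until the ranks stabilize
    new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
    for node, outgoing in links.items():
        share = damping * rank[node] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))   # most prominent nodes first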

146 of 214

AdaBoost


  • Algorithm for “ensemble learning”
  • That is, for combining several algorithms
    • and training them on the same data
  • Combining more algorithms can be very effective
    • usually better than a single algorithm
  • AdaBoost basically weights training samples
    • giving the most weight to those which are classified the worst

147 of 214

Naïve Bayes


148 of 214

Bayes’s Theorem


  • Basically a theorem for combining probabilities
    • I’ve observed A, which indicates H is true with probability 70%
    • I’ve also observed B, which indicates H is true with probability 85%
    • what should I conclude?
  • Naïve Bayes is basically using this theorem
    • with the assumption that A and B are independent
    • this assumption is nearly always false, hence “naïve”

149 of 214

Simple example


  • Is the coin fair or not?
    • we throw it 10 times, get 9 heads and one tail

    • we try again, get 8 heads and two tails

  • What do we know now?
    • can combine data and recompute
    • or just use Bayes’s Theorem directly

>>> compute_bayes([0.92, 0.84])

0.9837067209775967

150 of 214

Ways I’ve used Bayes


  • Duke
    • record deduplication engine
    • estimate probability of duplicate for each property
    • combine probabilities with Bayes
  • Whazzup
    • news aggregator that finds relevant news
    • works essentially like spam classifier on next slide
  • Tine recommendation prototype
    • recommends recipes based on previous choices
    • also like spam classifier
  • Classifying expenses
    • using export from my bank
    • also like spam classifier

151 of 214

Bayes against spam


  • Take a set of emails, divide it into spam and non-spam (ham)
    • count the number of times a feature appears in each of the two sets
    • a feature can be a word or anything you please
  • To classify an email, for each feature in it
    • consider the probability of email being spam given that feature to be (spam count) / (spam count + ham count)
    • ie: if “viagra” appears 99 times in spam and 1 in ham, the probability is 0.99
  • Then combine the probabilities with Bayes

152 of 214

Running the script


  • I pass it
    • 1000 emails from my Bouvet folder
    • 1000 emails from my Spam folder
  • Then I feed it
    • 1 email from another Bouvet folder
    • 1 email from another Spam folder

153 of 214

Code

import sys, glob

# spamdir, hamdir, PATTERN, SAMPLES, corpus, featurize and classify
# are defined in the full script (see the link below).

# scan spam
for spam in glob.glob(spamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(spam):
        corpus.spam(token)

# scan ham
for ham in glob.glob(hamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(ham):
        corpus.ham(token)

# compute probability
for email in sys.argv[3 : ]:
    print email
    p = classify(email)
    if p < 0.2:
        print '  Spam', p
    else:
        print '  Ham', p

https://github.com/larsga/py-snippets/tree/master/machine-learning/spam

154 of 214

Classify


class Feature:

    def __init__(self, token):
        self._token = token
        self._spam = 0
        self._ham = 0

    def spam(self):
        self._spam += 1

    def ham(self):
        self._ham += 1

    def spam_probability(self):
        return (self._spam + PADDING) / float(self._spam + self._ham + (PADDING * 2))

def compute_bayes(probs):
    product = reduce(operator.mul, probs)
    lastpart = reduce(operator.mul, map(lambda x: 1-x, probs))
    if product + lastpart == 0:
        return 0 # happens rarely, but happens
    else:
        return product / (product + lastpart)

def classify(email):
    return compute_bayes([corpus.spam_probability(f) for f in featurize(email)])

155 of 214

Ham output


Ham 1.0

Received:2013        0.00342935528121
Date:2013            0.00624219725343
<br                  0.0291715285881
background-color:    0.03125
background-color:    0.03125
background-color:    0.03125
background-color:    0.03125
background-color:    0.03125
Received:Mar         0.0332667997339
Date:Mar             0.0362756952842
...
Postboks             0.998107494322
Postboks             0.998107494322
Postboks             0.998107494322
+47                  0.99787414966
+47                  0.99787414966
+47                  0.99787414966
+47                  0.99787414966
Lars                 0.996863237139
Lars                 0.996863237139
23                   0.995381062356

So, clearly most of the spam is from March 2013...

156 of 214

Spam output


Spam 2.92798502037e-16

Received:-0400                                        0.0115646258503
Received:-0400                                        0.0115646258503
Received-SPF:(ontopia.virtual.vps-host.net:           0.0135823429542
Received-SPF:receiver=ontopia.virtual.vps-host.net;   0.0135823429542
                                                      0.0139318885449
                                                      0.0139318885449
Received:ontopia.virtual.vps-host.net                 0.0170863309353
Received:(8.13.1/8.13.1)                              0.0170863309353
Received:ontopia.virtual.vps-host.net                 0.0170863309353
Received:(8.13.1/8.13.1)                              0.0170863309353
...
Received:2012                                         0.986111111111
Received:2012                                         0.986111111111
$                                                     0.983193277311
Received:Oct                                          0.968152866242
Received:Oct                                          0.968152866242
Date:2012                                             0.959459459459
20                                                    0.938864628821
+                                                     0.936526946108
+                                                     0.936526946108
+                                                     0.936526946108

...and the ham from October 2012

157 of 214

More solid testing


  • Using the SpamAssassin public corpus
  • Training with 500 emails from
    • spam
    • easy_ham (2002)
  • Test results
    • spam_2: 1128 spam, 269 misclassified as ham
    • easy_ham 2003: 2283 ham, 217 misclassified as spam
  • Results are pretty good for 30 minutes of effort...

158 of 214

Linear regression

159 of 214

Linear regression

  • Let’s say we have a number of numerical parameters for an object
  • We want to use these to predict some other value
  • Examples
    • estimating real estate prices
    • predicting the rating of a beer

...

160 of 214

Estimating real estate prices

  • Take parameters
    • x1 square meters
    • x2 number of rooms
    • x3 number of floors
    • x4 energy cost per year
    • x5 meters to nearest subway station
    • x6 years since built
    • x7 years since last refurbished

...

  • a x1 + b x2 + c x3 + ... = price
    • strip out the x-es and you have a vector
    • collect N samples of real flats with prices = matrix
    • welcome to the world of linear algebra

161 of 214

Our data set: beer ratings

  • Ratebeer.com
    • a web site for rating beer
    • scale of 0.5 to 5.0
  • For each beer we know
    • alcohol %
    • country of origin
    • brewery
    • beer style (IPA, pilsener, stout, ...)
  • But ... only one attribute is numeric!
    • how to solve?

162 of 214

Example

ABV   .se   .nl   .us   .uk   IIPA   Black IPA   Pale ale   Bitter   Rating
8.5   1.0   0.0   0.0   0.0   1.0    0.0         0.0        0.0      3.5
8.0   0.0   1.0   0.0   0.0   0.0    1.0         0.0        0.0      3.7
6.2   0.0   0.0   1.0   0.0   0.0    0.0         1.0        0.0      3.2
4.4   0.0   0.0   0.0   1.0   0.0    0.0         0.0        1.0      3.2
...   ...   ...   ...   ...   ...    ...         ...        ...      ...

Basically, we turn each category into a column of 0.0 or 1.0 values.

163 of 214

Normalization

  • If some columns have much bigger values than the others they will automatically dominate predictions
  • We solve this by normalization
  • Basically, all values get resized into the 0.0-1.0 range
  • For ABV we set a ceiling of 15%

compute with min(15.0, abv) / 15.0
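
A small sketch of both rules (the generic min-max rescaling for the other numeric columns is an assumption; the slides only spell out the ABV rule):

def normalize_abv(abv):
    # ABV capped at 15%, then scaled into the 0.0-1.0 range
    return min(15.0, abv) / 15.0

def normalize_column(values):
    # generic min-max rescaling for any other numeric column (assumed approach)
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [(v - low) / float(high - low) for v in values]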

164 of 214

Adding more data

  • To get a bit more data, I manually added a description of each beer style
  • Each beer style got a 0.0-1.0 rating on
    • colour (pale/dark)
    • sweetness
    • hoppiness
    • sourness
  • These ratings are kind of coarse because all beers of the same style get the same value

165 of 214

Making predictions

  • We’re looking for a formula
    • a * abv + b * .se + c * .nl + d * .us + ... = rating
  • We have n examples

a * 8.5 + b * 1.0 + c * 0.0 + d * 0.0 + ... = 3.5

  • We have one unknown per column
    • as long as we have more rows than columns we can solve for the unknowns (in the least-squares sense)

  • Interestingly, matrix operations can be used to solve this easily

166 of 214

Matrix formulation

  • Let’s say
    • x is our data matrix
    • y is a vector with the ratings and
    • w is a vector with the a, b, c, ... values
  • That is: x * w = y
    • this is the same as the original equation
    • a x1 + b x2 + c x3 + ... = rating
  • If we solve this for w, we get the least-squares solution w = (xᵀ x)⁻¹ xᵀ y, which is exactly what the Numpy code below computes

167 of 214

Enter Numpy

  • Numpy is a Python library for matrix operations
  • It has built-in types for vectors and matrices
  • Means you can very easily work with matrices in Python
  • Why matrices?
    • much easier to express what we want to do
    • library written in C and very fast
    • takes care of rounding errors, etc

168 of 214

Quick Numpy example

>>> from numpy import *
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> [range(10)] * 10
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
>>> m = mat([range(10)] * 10)
>>> m
matrix([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> m.T
matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
        [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
        [6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
        [7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
        [8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
        [9, 9, 9, 9, 9, 9, 9, 9, 9, 9]])

169 of 214

Numpy solution

  • We load the data into
    • a list: scores
    • a list of lists: parameters
  • Then:

x_mat = mat(parameters)
y_mat = mat(scores).T

x_tx = x_mat.T * x_mat
assert linalg.det(x_tx)   # x.T * x must be invertible for this to work
ws = x_tx.I * (x_mat.T * y_mat)
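
For reference, the same weights can be computed with Numpy's least-squares routine, which avoids forming the inverse explicitly (a sketch, not the code used for the results on the next slides):

from numpy import mat, linalg

def fit(parameters, scores):
    x_mat = mat(parameters)
    y_mat = mat(scores).T
    # lstsq returns (solution, residuals, rank, singular values)
    ws, residuals, rank, singular_values = linalg.lstsq(x_mat, y_mat)
    return ws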

170 of 214

Does it work?


  • We only have very rough information about each beer (abv, country, style)
    • so very detailed prediction isn’t possible
    • but we should get some indication
  • Here are the results based on my ratings
    • 10% imperial stout from US 3.9
    • 4.5% pale lager from Ukraine 2.8
    • 5.2% German schwarzbier 3.1
    • 7.0% German doppelbock 3.5

171 of 214

Beyond prediction


  • We can use this for more than just prediction
  • We can also use it to see which columns contribute the most to the rating, that is, which aspects of a beer best predict the rating

  • If we look at the w vector we see the following

Aspect       LMG    grove
ABV          0.56   1.1
colour       0.46   0.42
sweetness    0.25   0.51
hoppiness    0.45   0.41
sourness     0.29   0.87

  • Could also use correlation

172 of 214

Did we underfit?


  • Who says the relationship between ABV and the rating is linear?
    • perhaps very low and very high ABV are both negative?
    • we cannot capture that with linear regression
  • Solution
    • add computed columns for parameters raised to higher powers
    • abv², abv³, abv⁴, ...
    • beware of overfitting...

173 of 214

Scatter plot


[Scatter plot: rating against ABV in %, annotated with the freeze-distilled Brewdog beers]

Code in Github, requires matplotlib

174 of 214

Trying again


175 of 214

Matrix factorization


  • Another way to do recommendations is matrix factorization
    • basically, make a user/item matrix with ratings
    • try to find two smaller matrices that, when multiplied together, give you the original matrix
    • that is, original with missing values filled in
  • Why does that work?
    • I don’t know
    • I tried it, couldn’t get it to work
    • therefore we’re not covering it
    • known to be a very good method, however

176 of 214

Clustering


177 of 214

Clustering


  • Basically, take a set of objects and sort them into groups
    • objects that are similar go into the same group

  • The groups are not defined beforehand
  • Sometimes the number of groups to create is input to the algorithm
  • Many, many different algorithms for this

178 of 214

Sample data


  • Our sample data set is data about aircraft from DBpedia
  • For each aircraft model we have
    • name
    • length (m)
    • height (m)
    • wingspan (m)
    • number of crew members
    • operational ceiling, or max height (m)
    • max speed (km/h)
    • empty weight (kg)
  • We use a subset of the data
    • 149 aircraft models which all have values for all of these properties
  • Also, all values normalized to the 0.0-1.0 range

179 of 214

Distance


  • All clustering algorithms require a distance function
    • that is, a measure of similarity between two objects
  • Any kind of distance function can be used
    • generally, lower values mean more similar
  • Examples of distance functions
    • metric distance
    • vector cosine
    • RMSE

...
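
For example, the RMSE distance used on the aircraft data a little later can be as simple as this (assuming two equal-length lists of normalized numbers):

def rmse(a, b):
    # root mean square of the per-property differences
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / float(len(a))) ** 0.5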

180 of 214

k-means clustering


  • Input: the number of clusters to create (k)
  • Pick k objects
    • these are your initial clusters
  • For all objects, find nearest cluster
    • assign the object to that cluster
  • For each cluster, compute mean of all properties
    • use these mean values to compute distance to clusters
    • the mean is often referred to as a “centroid”
    • go back to previous step
  • Continue until no objects change cluster
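
A minimal sketch of that loop (assuming each object is a list of normalized numbers and distance is any pairwise function such as the rmse above; the real aircraft code is in the Github sample linked a few slides later):

import random

def kmeans(objects, k, distance, max_iterations=100):
    # pick k objects as the initial centroids
    centroids = [list(c) for c in random.sample(objects, k)]
    assignment = [None] * len(objects)
    for _ in range(max_iterations):
        # assignment step: each object joins the nearest centroid's cluster
        changed = False
        for ix, obj in enumerate(objects):
            nearest = min(range(k), key=lambda c: distance(obj, centroids[c]))
            if assignment[ix] != nearest:
                assignment[ix] = nearest
                changed = True
        if not changed:
            break   # no object changed cluster, so we are done
        # update step: each centroid becomes the mean of its members
        # (a cluster that ends up empty simply keeps its old centroid)
        for c in range(k):
            members = [objects[ix] for ix in range(len(objects)) if assignment[ix] == c]
            if members:
                centroids[c] = [sum(col) / float(len(members)) for col in zip(*members)]
    return assignment, centroids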

181 of 214

First attempt at aircraft


  • We leave out name and number built when doing comparison
  • We use RMSE as the distance measure
  • We set k = 5
  • What happens?
    • first iteration: all 149 assigned to a cluster
    • second: 11 models change cluster
    • third: 7 change
    • fourth: 5 change
    • fifth: 5 change
    • sixth: 2
    • seventh: 1
    • eighth: 0

182 of 214

3 jet bombers, one propeller bomber. Not too bad.

Cluster 5

cluster5, 4 models

ceiling : 13400.0

maxspeed : 1149.7

crew : 7.5

length : 47.275

height : 11.65

emptyweight : 69357.5

wingspan : 47.18

The Myasishchev M-50 was a Soviet prototype four-engine supersonic bomber which never attained service

The Tupolev Tu-16 was a twin-engine jet bomber used by the Soviet Union.

The Myasishchev M-4 Molot is a four-engined strategic bomber

The Convair B-36 “Peacemaker” was a strategic bomber built by Convair and operated solely by the United States Air Force (USAF) from 1949 to 1959


183 of 214

Small, slow propeller aircraft. Not too bad.

Cluster 4


cluster4, 56 models

ceiling : 5898.2

maxspeed : 259.8

crew : 2.2

length : 10.0

height : 3.3

emptyweight : 2202.5

wingspan : 13.8

The Avia B.135 was a Czechoslovak cantilever monoplane fighter aircraft

The North American B-25 Mitchell was an American twin-engined medium bomber

The Yakovlev UT-1 was a single-seater trainer aircraft

The Yakovlev UT-2 was a single-seater trainer aircraft

The Siebel Fh 104 Hallore was a small German twin-engined transport, communications and liaison aircraft

The Messerschmitt Bf 108 Taifun was a German single-engine sports and touring

aircraft

The Airco DH.2 was a single-seat biplane "pusher" aircraft

184 of 214

Small, very fast jet planes. Pretty good.

Cluster 3


cluster3, 12 models

ceiling : 16921.1

maxspeed : 2456.9

crew : 2.67

length : 17.2

height : 4.92

emptyweight : 9941

wingspan : 10.1

The Mikoyan MiG-29 is a fourth- generation jet fighter aircraft

The Vought F-8 Crusader was a single-engine, supersonic [fighter] aircraft

The English Electric Lightning is a supersonic jet fighter aircraft of the Cold War era, noted for its great speed.

The Dassault Mirage 5 is a supersonic attack aircraft

The Northrop T-38 Talon is a two- seat, twin-engine supersonic jet trainer

The Mikoyan MiG-35 is a further development of the MiG-29

185 of 214

Biggish, kind of slow planes. Some oddballs in this group.

Cluster 2

cluster2, 27 models

ceiling : 6447.5

maxspeed : 435

crew : 5.4

length : 24.4

height : 6.7

emptyweight : 16894

wingspan : 32.8

The Bartini Beriev VVA-14 (vertical

take-off amphibious aircraft)

The Aviation Traders ATL-98 Carvair was a large piston-engine transport aircraft.

The Fokker 50 is a turboprop- powered airliner

The PB2Y Coronado was a large flying boat patrol bomber

The Junkers Ju 89 was a heavy bomber

The Beriev Be-200 Altair is a

multipurpose amphibious aircraft

The Junkers Ju 290 was a long-range transport, maritime patrol aircraft and heavy bomber

186 of 214

Small, fast planes. Mostly good, though the Canberra is a poor fit.

Cluster 1

cluster1, 50 models

ceiling : 11612

maxspeed : 726.4

crew : 1.6

length : 11.9

height : 3.8

emptyweight : 5303

wingspan : 13

The Adam A700 AdamJet was a proposed six-seat civil utility aircraft

The Learjet 23 is a ... twin-engine, high-speed business jet

The Learjet 24 is a ... twin-engine, high-speed business jet

The Curtiss P-36 Hawk was an American- designed and built fighter aircraft

The Kawasaki Ki-61 Hien was a Japanese World War II fighter aircraft

The Grumman F3F was the last American biplane fighter aircraft

The English Electric Canberra is a first-generation jet-powered light bomber

The Heinkel He 100 was a German pre- World War II fighter aircraft

187 of 214

Clusters, summarizing

  • Cluster 1: small, fast aircraft (750 km/h)
  • Cluster 2: big, slow aircraft (450 km/h)
  • Cluster 3: small, very fast jets (2500 km/h)
  • Cluster 4: small, very slow planes (250 km/h)
  • Cluster 5: big, fast jet planes (1150 km/h)


For a first attempt to sort through the data, this is not bad at all

https://github.com/larsga/py-snippets/tree/master/machine-learning/aircraft

188 of 214

Agglomerative clustering

  • Put all objects in a pile
  • Make a cluster of the two objects closest to one another
    • from here on, treat clusters like objects

  • Repeat second step until satisfied


There is code for this, too, in the Github sample
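
A minimal single-linkage sketch of the same idea (stopping at a target number of clusters is one possible reading of “until satisfied”):

def agglomerative(objects, distance, target):
    # start with every object in its own cluster
    clusters = [[obj] for obj in objects]
    while len(clusters) > target:
        # find the two clusters whose closest members are closest to each other
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # merge them and treat the result as a single object from now on
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters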

189 of 214

Principal component analysis


190 of 214

PCA


  • Basically, using eigenvalue analysis to find out which variables contain the most information
    • the maths are pretty involved
    • and I’ve forgotten how it works
    • and I’ve thrown out my linear algebra book
    • and ordering a new one from Amazon takes too long
    • ...so we’re going to do this intuitively

191 of 214

An example data set

  • Two variables
  • Three classes
  • What’s the longest line we could draw through the data?
  • That line is a vector in two dimensions
  • What dimension dominates?
    • that’s right: the horizontal
    • this implies the horizontal contains most of the information in the data set
  • PCA identifies the most significant variables

192 of 214

Dimensionality reduction

  • After PCA we know which dimensions matter
    • based on that information we can decide to throw out less important dimensions
  • Result
    • smaller data set
    • faster computations
    • easier to understand


193 of 214

Trying out PCA


  • Let’s try it on the Ratebeer data
  • We know ABV has the most information
    • because it’s the only value specified for each individual beer
  • We also include a new column: alcohol
    • this is the amount of alcohol in a pint glass of the beer, measured in centiliters
    • this column basically contains no information at all; it’s computed from the abv column

194 of 214

Complete code


import rblib

from numpy import *

def eigenvalues(data, columns):
    covariance = cov(data - mean(data, axis = 0), rowvar = 0)
    eigvals = linalg.eig(mat(covariance))[0]
    indices = list(argsort(eigvals))
    indices.reverse() # so we get most significant first
    return [(columns[ix], float(eigvals[ix])) for ix in indices]

(scores, parameters, columns) = rblib.load_as_matrix('ratings.txt')

for (col, ev) in eigenvalues(parameters, columns):
    print "%40s %s" % (col, float(ev))

195 of 214

Output


                                     abv 0.184770392185
                                  colour 0.13154093951
                                   sweet 0.121781685354
                                   hoppy 0.102241100597
                                    sour 0.0961537687655
                                 alcohol 0.0893502031589
                           United States 0.0677552513387
                                    ....
                                 Eisbock -3.73028421245e-18
                                 Belarus -3.73028421245e-18
                                 Vietnam -1.68514561515e-17

196 of 214

MapReduce


197 of 214

University pre-lecture, 1991


  • My first meeting with university was Open University Day, in 1991
  • Professor Bjørn Kirkerud gave the computer science talk
  • His subject
    • some day processors will stop becoming faster
    • we’re already building machines with many processors
    • what we need is a way to parallelize software
    • preferably automatically, by feeding in normal source code and getting it parallelized back
  • MapReduce is basically the state of the art on that today

198 of 214

MapReduce

  • A framework for writing massively parallel code
  • Simple, straightforward model
  • Based on “map” and “reduce” functions from functional programming (LISP)


199 of 214

http://research.google.com/archive/mapreduce.html

Appeared in:

OSDI'04: Sixth Symposium on Operating System Design and

Implementation,

San Francisco, CA, December, 2004.


200 of 214

map and reduce


>>> "1 2 3 4 5 6 7 8".split()

['1', '2', '3', '4', '5', '6', '7', '8']

>>> l = map(int, "1 2 3 4 5 6 7 8".split())

>>> l

[1, 2, 3, 4, 5, 6, 7, 8]

>>> import operator

>>> reduce(operator.add, l)
36

201 of 214

MapReduce


  1. Split data into fragments
  2. Create a Map task for each fragment
    • the task outputs a set of (key, value) pairs
  3. Group the pairs by key
  4. Call Reduce once for each key
    • all pairs with same key passed in together
    • reduce outputs new (key, value) pairs

Tasks get spread out over worker nodes
Master node keeps track of completed/failed tasks
Failed tasks are restarted
Failed nodes are detected and avoided
Also scheduling tricks to deal with slow nodes
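
The flow in steps 1-4 can be simulated on a single machine in a few lines, just to illustrate the programming model (real Hadoop adds the distribution, storage and fault handling described above):

from itertools import groupby
from operator import itemgetter

def map_reduce(fragments, map_fn, reduce_fn):
    # steps 1-2: run the map task on every fragment, collecting (key, value) pairs
    pairs = [pair for fragment in fragments for pair in map_fn(fragment)]
    # step 3: group the pairs by key
    pairs.sort(key=itemgetter(0))
    groups = groupby(pairs, key=itemgetter(0))
    # step 4: call reduce once per key, with all the values for that key
    return [out for key, group in groups
                for out in reduce_fn(key, [value for _, value in group])]

# word count expressed in this model:
def wc_map(text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

print(map_reduce(["to be or", "not to be"], wc_map, wc_reduce))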

202 of 214

Communications


  • HDFS
    • Hadoop Distributed File System
    • input data, temporary results, and results are stored as files here
    • Hadoop takes care of making files available to nodes
  • Hadoop RPC
    • how Hadoop communicates between nodes
    • used for scheduling tasks, heartbeat etc
  • Most of this is in practice hidden from the developer

203 of 214

Does anyone need MapReduce?


  • I tried to do book recommendations with linear algebra
  • Basically, doing matrix multiplication to produce the full user/item matrix with blanks filled in
  • My Mac wound up freezing
  • 185,973 books x 77,805 users = 14,469,629,265 cells
    • assuming 2 bytes per float, that is about 28 GB of RAM

  • So it doesn’t necessarily take that much to have some use for MapReduce

204 of 214

The word count example


  • Classic example of using MapReduce
  • Takes an input directory of text files
  • Processes them to produce word frequency counts
  • To start up, copy data into HDFS
    • bin/hadoop dfs -mkdir <hdfs-dir>
    • bin/hadoop dfs -copyFromLocal <local-dir> <hdfs- dir>

205 of 214

WordCount – the mapper

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

206 of 214

WordCount – the reducer


public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values)
            sum += val.get();
        context.write(key, new IntWritable(sum));
    }
}

207 of 214

The Hadoop ecosystem


  • Pig
    • dataflow language for setting up MR jobs
  • HBase
    • NoSQL database to store MR input in
  • Hive
    • SQL-like query language on top of Hadoop
  • Mahout
    • machine learning library on top of Hadoop
  • Hadoop Streaming
    • utility for writing mappers and reducers as command-line tools in other languages

208 of 214

Word count in HiveQL


CREATE TABLE input (line STRING);

LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

-- temporary table to hold words...
CREATE TABLE words (word STRING);

add file splitter.py;

INSERT OVERWRITE TABLE words
SELECT TRANSFORM(line)
USING 'python splitter.py' AS word
FROM input;

SELECT word, COUNT(*)
FROM input
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word;

209 of 214

Word count in Pig


input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag

-- datatype, then flatten the bag to get one word on each row

words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word

word_groups = GROUP filtered_words BY word;

-- count the entries in each group

word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count

ordered_word_count = ORDER word_count BY count DESC;

STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

210 of 214

Applications of MapReduce


  • Linear algebra operations
    • easily mapreducible
  • SQL queries over heterogeneous data
    • basically requires only a mapping to tables
    • relational algebra easy to do in MapReduce
  • PageRank
    • basically one big set of matrix multiplications
    • the original application of MapReduce
  • Recommendation engines
    • the SON algorithm
  • ...

211 of 214

Apache Mahout

  • Has three main application areas
    • others are welcome, but this is mainly what’s there now
  • Recommendation engines
    • several different similarity measures
    • collaborative filtering
    • Slope-one algorithm
  • Clustering
    • k-means and fuzzy k-means
    • Latent Dirichlet Allocation
  • Classification
    • stochastic gradient descent
    • Support Vector Machines
    • Naïve Bayes


212 of 214

SQL to relational algebra

select lives.person_name, city
from works, lives
where company_name = 'FBC'
  and works.person_name = lives.person_name


213 of 214

Translation to MapReduce


  • σ(company_name=‘FBC’, works)
    • map: for each record r in works, verify the condition, and pass (r, r) if it matches
    • reduce: receive (r, r) and pass it on unchanged
  • π(person_name, σ(...))
    • map: for each record r in input, produce a new record r’ with only the wanted columns, pass (r’, r’)
    • reduce: receive (r’, [r’, r’, r’ ...]), output (r’, r’)
  • ⋈(π(...), lives)
    • map:
      • for each record r in π(...), output (person_name, r)
      • for each record r in lives, output (person_name, r)
    • reduce: receive (key, [record, record, ...]), and perform the actual join
  • ...
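
The join step, sketched with the map_reduce() simulation from the earlier MapReduce slide (the record layout is an assumption: each input record is a (table, row) pair):

def join_map(record):
    table, row = record   # e.g. ('works', {'person_name': ..., 'company_name': ...})
    return [(row['person_name'], (table, row))]

def join_reduce(person_name, tagged_rows):
    works = [row for table, row in tagged_rows if table == 'works']
    lives = [row for table, row in tagged_rows if table == 'lives']
    # every works row meets every lives row with the same person_name
    return [(person_name, (w, l)) for w in works for l in lives]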

214 of 214

Lots of SQL-on-MapReduce tools


  • Tenzing     (Google)
  • Hive        (Apache Hadoop)
  • YSmart      (Ohio State)
  • SQL-MR      (AsterData)
  • HadoopDB    (Hadapt)
  • Polybase    (Microsoft)
  • RainStor    (RainStor Inc.)
  • ParAccel    (ParAccel Inc.)
  • Impala      (Cloudera)
  • ...