1 of 214

Miss Sadhana S Kekan

Information Technology

Modern College of Engineering, Shivajinagar, Pune

Email: sadhanakekan@moderncoe.edu.in

DATA SCIENCE AND BIG DATA ANALYTICS

2 of 214

Books :

  1. Krish Krishnan, Data warehousing in the age of Big Data, Elsevier, ISBN: 9780124058910, 1st Edition

Unit Objectives

  1. To introduce the basic need for Big Data and Data Science to handle huge amounts of data.
  2. To understand the application and impact of Big Data.

INTRODUCTION: DATA SCIENCE AND BIG DATA

Unit outcomes:

  1. To understand Big Data primitives.
  2. To learn different programming platforms for big data analytics.

Outcome Mapping: PEO: I, V; PEO: c, e; CO: 1, 2; PSO: 3, 4

3 of 214

INTRODUCTION: DATA SCIENCE AND BIG DATA

  • Data science and Big Data
  • Defining Data science and Big Data
  • Big Data examples
  • Data explosion
  • Data volume
  • Data Velocity
  • Big data infrastructure and challenges
  • Big Data Processing Architectures
  • Data Warehouse
  • Re-Engineering the Data Warehouse
  • Shared everything and shared nothing Architecture
  • Big data learning approaches.

4 of 214

  • Data science is a process that examines:

- from where the information can be taken

- what it signifies

- how it can be converted into a useful resource in the creation of business & IT strategies.

  • Goal : extract value from data in all its forms.

  • Draws on the disciplines of mathematics, statistics, and computer science.

  • Includes methods like machine learning, cluster analysis, data mining & visualization.

INTRODUCTION: DATA SCIENCE AND BIG DATA

5 of 214

  • By mining huge quantities of structured and unstructured data, organizations can:

- reduce costs

- raise efficiency

- identify new market opportunities

- enhance the organization’s competitive advantage.

INTRODUCTION: DATA SCIENCE AND BIG DATA

6 of 214

Fig: Data Science — the intersection of hacker mindset, statistics, advanced computing, visualization, math, domain expertise, data engineering, and the scientific method.

INTRODUCTION: DATA SCIENCE AND BIG DATA

7 of 214

Data Scientists

  • Convert the organization’s raw data into useful information.

  • Manage and understand large amounts of data.

  • Create data visualization models that facilitate demonstrating the business value of digital information.

  • Can illustrate digital information easily with the help of smartphones, Internet of Things devices, and social media.

8 of 214

Fig: The Data Science Pipeline

9 of 214

Data and its structure

  • Data comes in many forms, but at a high level, it falls into three categories: structured, semi-structured, and unstructured.
  • Structured data :

- highly organized data

- exists within a repository such as a database (or a comma-separated values [CSV] file).

- easily accessible.

- format of the data makes it appropriate for queries and computation (by using languages such as Structured Query Language (SQL)).

  • Unstructured data: lacks any content structure at all (for example, an audio stream or natural language text).
  • Semi-structured data: includes metadata, or data that can be more easily processed than unstructured data by using semantic tagging.
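A minimal, illustrative sketch (Python standard library only; the file names and fields are hypothetical) of how each shape of data is typically read:

import csv
import json

# Structured: a CSV file with a fixed schema is directly queryable/computable.
with open('sales.csv') as f:                      # hypothetical file
    rows = list(csv.DictReader(f))                # each row is a dict keyed by column name
    total = sum(float(r['amount']) for r in rows) # hypothetical 'amount' column

# Semi-structured: JSON carries its own tags, so it parses easily,
# but individual records may differ in which fields they contain.
with open('events.json') as f:                    # hypothetical file
    events = json.load(f)
    logins = [e for e in events if e.get('type') == 'login']

# Unstructured: raw text (or audio, images) has no content structure at all;
# it needs extra processing (NLP, signal processing, ...) before analysis.
with open('review.txt') as f:                     # hypothetical file
    text = f.read()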

10 of 214

Data and its structure

Figure : Models of data

11 of 214

Data engineering

Data wrangling:

  • Process of manipulating raw data to make it useful for data analytics or to train a machine learning model.

  • Include:

- sourcing the data from one or more data sets (in addition to reducing the set to the required data),

- normalizing the data so that data merged from multiple data sets is consistent.

- parsing data into some structure or storage for further use.

  • It is the process by which you identify, collect, merge, and preprocess one or more data sets in preparation for data cleansing.
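A minimal sketch of these wrangling steps, assuming the pandas library (the slides do not prescribe a tool) and hypothetical file and column names:

import pandas as pd

# Source: pull only the required columns from each data set.
orders = pd.read_csv('orders.csv', usecols=['customer_id', 'order_date', 'amount'])
customers = pd.read_csv('customers.csv', usecols=['customer_id', 'country'])

# Normalize: make the data sets consistent before merging (key types, date formats).
orders['customer_id'] = orders['customer_id'].astype(str)
customers['customer_id'] = customers['customer_id'].astype(str)
orders['order_date'] = pd.to_datetime(orders['order_date'])

# Merge and parse into a structure/storage for further use (here a CSV file).
merged = orders.merge(customers, on='customer_id', how='left')
merged.to_csv('orders_by_customer.csv', index=False)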

12 of 214

Data cleansing

  • After you have collected and merged your data set, the next step is cleansing. 

  • Data sets in the wild are typically messy and infected with any number of common issues.

  • Common issues include missing values (or too many values), bad or incorrect delimiters (which segregate the data), inconsistent records, and insufficient parameters.

  • When data set is syntactically correct, the next step is to ensure that it is semantically correct.
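A hedged sketch of fixing the common issues listed above, again assuming pandas and a hypothetical file and columns:

import pandas as pd

# Bad or incorrect delimiters: state the real separator explicitly when reading.
df = pd.read_csv('survey.csv', sep=';')

# Missing values: drop records missing required fields, fill optional ones.
df = df.dropna(subset=['customer_id'])
df['age'] = df['age'].fillna(df['age'].median())

# Inconsistent records: normalize spellings and drop exact duplicates.
df['country'] = df['country'].str.strip().str.upper().replace({'U.S.': 'USA'})
df = df.drop_duplicates()

# Semantic correctness: values must also make sense, not just parse.
df = df[(df['age'] >= 0) & (df['age'] <= 120)]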

13 of 214

Data preparation/preprocessing

  • The final step in data engineering.

  • This step assumes that you have a cleansed data set that might not be ready for processing by a machine learning algorithm.

  • Using normalization, you transform an input feature to distribute the data evenly into an acceptable range for the machine learning algorithm.
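As a small illustration (plain Python, not tied to any particular algorithm), min-max normalization rescales a feature into a fixed range such as [0, 1]:

def min_max_normalize(values, low=0.0, high=1.0):
    # Rescale a list of numbers linearly into [low, high].
    lo, hi = min(values), max(values)
    if hi == lo:                    # constant feature: avoid division by zero
        return [low for _ in values]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in values]

# A wide-ranging input feature is squeezed into an acceptable range.
print(min_max_normalize([25000, 40000, 120000]))   # [0.0, 0.157..., 1.0]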

14 of 214

Machine learning

  • Create and validate a machine learning model.

  • Sometimes, the machine learning model is the product, which is deployed in the context of an application to provide some capability (such as classification or prediction).

  • In other cases, the product isn’t the trained machine learning algorithm but rather the data that it produces.

15 of 214

Model learning

  • In one model, the algorithm processes the data and creates a new data product as the result.

  • But, in a production sense, the machine learning model is the product itself, deployed to provide insight or add value (such as the deployment of a neural network to provide prediction capabilities for an insurance market).

16 of 214

Machine learning approaches:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning
  • Supervised learning:

- the algorithm is trained to produce the correct class, and the model is altered when it fails to do so.

- The model is trained until it reaches some level of accuracy.

  • Unsupervised learning:

- has no class labels; instead, it inspects the data and groups it based on some structure that is hidden within the data.

- these types of algorithms can be used in recommendation systems by grouping customers based on their viewing or purchasing history.
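The contrast in a few lines of Python, using scikit-learn as an assumed (not prescribed) library and tiny made-up data:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Supervised: features X come with correct classes y, and the model is trained on them.
X = [[180, 80], [175, 70], [160, 55], [155, 50]]      # e.g. height, weight
y = ['adult', 'adult', 'teen', 'teen']
clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([[170, 65]]))                       # class for unseen data

# Unsupervised: no classes are given; the algorithm groups rows by hidden structure,
# e.g. grouping customers by purchase history for recommendations.
purchases = [[9, 0], [8, 1], [1, 7], [0, 9]]          # counts per product category
print(KMeans(n_clusters=2, n_init=10).fit_predict(purchases))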

Machine learning approaches

17 of 214

Reinforcement learning

- is sometimes described as a semi-supervised learning approach.

- provides a reward after the model makes some number of decisions that lead to a satisfactory result.

Model validation

  • Model validation is used to understand how the model will behave in production after it is trained.
  • For that purpose, a small amount of the available training data is reserved to be tested against the final model (called test data).
  • The training data is used to train the machine learning model.
  • Test data is used when the model is complete to validate how well it generalizes to unseen data.
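A minimal, standard-library sketch of reserving test data (the 80/20 split is a common convention, not mandated by the slides):

import random

def train_test_split(records, test_fraction=0.2, seed=42):
    # Reserve a fraction of the data for validating the final model.
    rng = random.Random(seed)
    shuffled = records[:]                    # leave the caller's list untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]    # training data, test data

data = list(range(100))                      # stand-in for real labelled records
train, test = train_test_split(data)
print(len(train), len(test))                 # 80 20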

18 of 214

Operations and Model Deployment

Operations:

  • end goal of the data science pipeline.
  • creating a visualization for data product.
  • Deploying machine learning model in a production environment to operate on unseen data to provide prediction or classification.

Model deployment:

  • When the product of the machine learning phase is a model then it will be deployed into some production environment to apply to new data.
  • This model could be a prediction system. 

Example prediction system — input (I/P): historical financial data, e.g., sales and revenue; output (O/P): a classification of whether a company is a reasonable acquisition target.

19 of 214

Operations (continued)

Model visualization:

  • In smaller-scale data science, the product is data rather than a model produced in the machine learning phase.
  • The data product answers some questions about the original data set.
  • Options for visualization are vast and can be produced with, for example, the R programming language.

20 of 214

Summary- Definitions of Data Science

  • Data science is a field within Big Data that seeks to provide meaningful information from huge amounts of complex data.

  • It is a system used for retrieving information in different forms, whether structured or unstructured.

  • It combines different fields of work in statistics & computation in order to understand the data for the purpose of decision making.

21 of 214

INTRODUCTION: DATA SCIENCE AND BIG DATA

  • Data science and Big Data
  • Defining Data science and Big Data
  • Big Data examples
  • Data explosion
  • Data volume
  • Data Velocity
  • Big data infrastructure and challenges
  • Big Data Processing Architectures
  • Data Warehouse
  • Re-Engineering the Data Warehouse
  • Shared everything and shared nothing Architecture
  • Big data learning approaches.

22 of 214

Introduction to Big Data

  • Big Data is a huge amount of data.

  • Organizations use data generated through various sources to run their business.

  • They analyze the data to understand and interpret market trends, study customer behavior, and make financial decisions.

  • Big Data consists of large datasets that cannot be managed efficiently by a common DBMS.

  • These datasets range from terabytes to exabytes.

23 of 214

Introduction to Big Data

  • Mobile phones, credit cards, Radio Frequency Identification (RFID) devices, and social networking platforms create huge amounts of data that may reside unutilized on unknown servers for many years.

  • With the evolution of Big Data, this data can be accessed and analyzed on a regular basis to generate useful information.

  • The sheer Volume, Variety, Velocity, and Veracity of data is signified by the term “Big Data”.

  • Big Data is structured, unstructured, semi-structured, or heterogeneous in nature.

24 of 214

Introduction to Big Data

  • Traditional DBMS, warehousing, and analysis systems fail to analyze such huge amounts of data.

  • Big Data is stored in a distributed-architecture file system.

  • Hadoop by Apache is widely used for storing and managing Big Data.

  • Sorting, organizing, and analyzing this critical data in a systematic manner, and then acting on it, is what Big Data is about.

  • The process of capturing or collecting Big Data is known as “datafication”.

  • Big Data is datafied so that it can be used productively.

25 of 214

Introduction to Big Data

  • Big data can be made useful by:

- organizing it

- determining what we can do with it.

Big Data

Is a new data challenge that requires leveraging existing systems differently.

Is classified in terms of 4 Vs: Volume, Variety, Velocity, Veracity.

Is usually unstructured & qualitative in nature.

26 of 214

Real-world Examples of Big Data

  • Consumer product companies and retail organizations are observing data on social media websites such as Facebook and Twitter. These sites help them analyze customer behavior, preferences, and product perception. Accordingly, the companies can line up their upcoming products to gain profits; this is known as social media analytics.

  • Sports teams are using data for tracking ticket sales & even for tracking team strategies.

27 of 214

  • Big data is a pool of huge amounts of data of all types , shapes and formats collected from different sources.

Evolution of Bigdata

  • Big Data is the new stage of data evolution, driven by the velocity, variety, and volume of data.

  • Velocity: the speed with which data flows into an organization.

  • Variety: the varied forms of data, such as structured, semi-structured, and unstructured.

  • Volume: the amount of data an organization has to deal with.

Big Data

28 of 214

Big Data

Structuring Big data

  • Structuring means arranging the available data in a manner such that it becomes easy to study, analyze, and derive conclusions from it.

  • Structuring data helps in understanding user behavior, requirements, and preferences in order to make personalized recommendations for every individual.

  • Eg: when a user regularly visits or purchases from online shopping sites, then each time he logs in, the system can present a recommended list of products that may interest the user on the basis of his earlier purchases or searches.

  • Different types of data (e.g., images, text, audio) can be structured only if they are sorted and organized in some logical pattern.

29 of 214

Big Data

Fig: Concepts of Big Data — parallel processing, data science, artificial intelligence, data mining, distributed systems, data storage, and analysis.

30 of 214

Big Data

Fig: Types of Data — structured, semi-structured, and unstructured data.

31 of 214

Elements of Big Data

  • Volume
  • Velocity
  • Variety
  • Veracity

32 of 214

Volume

  • Amount of data generated by organizations.

  • The volume of data in most organizations approaches exabytes.

  • Organizations are doing their best to handle this ever-increasing volume of data.

  • The Internet alone generates a huge amount of data.

  • Eg: the Internet has around 14.3 trillion live pages; 48 billion web pages are indexed by Google and 14 billion by Microsoft Bing.

33 of 214

Velocity

  • Rate at which data is generated, captured & shared.

  • Is flow of data from various sources such as networks, human resources, social media etc.

  • The data can be huge & flows in continuous manner.

  • Enterprises can capitalize on data only if it is captured & shared in real time.

  • Information processing systems such as CRM & ERP face problems associated with data, which keeps adding up but cannot be processed quickly.

34 of 214

Velocity

  • These systems are only able to process data in batches every few hours.

  • Eg: eBay analyzes around 5 million transactions per day in real time to detect & prevent frauds arising from the use of PayPal.

  • Sources of high-velocity data include:
  • IT devices: routers , switches, firewalls.
  • Social media: Facebook posts , tweets etc.
  • Portable devices: Mobile

  • Examples of data velocity :

Amazon, FB, Yahoo, Google, Sensor data, Mobile networks etc.

35 of 214

Variety

  • Data generated from different types of sources such as Internal, External, Social etc.

  • Data comes in different formats (images, text, videos).

  • Single source can generate data in varied formats.

  • Eg: GPS & Social networking sites such as Facebook produce data of all types, including text, images, videos.

36 of 214

Veracity

  • Refers to the uncertainty of data, that is, whether the obtained data is correct and consistent.

  • Out of the huge amount of data, only correct and consistent data can be used for further analysis.

  • Unstructured and semi-structured data take a lot of effort to clean and make suitable for analysis.

37 of 214

Data Explosion

  • Data explosion is the rapid growth of data.

  • One reason for this explosion is innovation.

  • Innovation has transformed the way we engage in business, provide services, and the associated measurement of value and profitability.

  • Three basic trends build up the data:
  • Business model transformation
  • Globalization
  • Personalization of services

38 of 214

Business model transformation

  • Modern companies have moved toward service-oriented rather than product-oriented business models.

  • In a service-oriented model, the value of the organization from the customer’s point of view is measured by how effective the service is rather than how useful the product is.

  • The amount of data produced and consumed by every organization today exceeds what the same organization produced prior to the business transformation.

  • Higher-priority data is kept at the center, and supporting data that is required but was previously not available or accessible can now be made available and accessible through multiple channels.

39 of 214

Globalization

  • Is a key trend that has radically changed the commerce of the world, starting from manufacturing to customer service.

Personalization of services

  • Business transformation’s maturity index is measured by the extent of personalization of services and the value perceived by their customers from such transformation.

40 of 214

Big Data Processing Architectures

  • Big data architecture is designed to handle the processing and analysis of data that is too large or complex for a traditional DBMS.

41 of 214

Fig: Components of Big Data architecture

42 of 214

Big Data Processing Architectures

Data sources.

  • All big data solutions start with one or more data sources.
  • Examples include:
  • Application data stores, such as relational databases.
  • Static files produced by applications, such as web server log files.
  • Real-time data sources, such as IoT devices.

Data storage.

  • Data for batch processing operations is typically stored in a distributed file store that can hold high volumes of large files in various formats.
  • This kind of store is often called a data lake.

43 of 214

Big Data Processing Architectures

Batch processing.

  • Data files are processed using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis.
  • Usually these jobs involve reading source files, processing them, and writing the output to new files.

Real-time message ingestion.

  • If the solution includes real-time sources, the architecture must include a way to capture and store real-time messages for stream processing.

Stream processing.

After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink.

44 of 214

Big Data Processing Architectures

Analytical data store.

Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools.

Analysis and reporting.

The goal of most big data solutions is to provide insights into the data through analysis and reporting.

Orchestration:

  • Most big data solutions consist of repeated data processing operations, encapsulated in workflows, that transform source data, move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results straight to a report or dashboard.
  • To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie with Sqoop.

45 of 214

INTRODUCTION: DATA SCIENCE AND BIG DATA

  • Data science and Big Data
  • Defining Data science and Big Data
  • Big Data examples
  • Data explosion
  • Data volume
  • Data Velocity
  • Big data infrastructure and challenges
  • Big Data Processing Architectures
  • Data Warehouse
  • Re-Engineering the Data Warehouse
  • Shared everything and shared nothing Architecture
  • Big data learning approaches.

46 of 214

Data processing infrastructure challenges

47 of 214

Data processing infrastructure challenges

Storage

  • The first and major problem for big data is storage.

  • As Big Data increases rapidly, there is a need to process this huge data as well as to store it.

  • We need roughly an additional 0.5 times the storage to process and store the intermediate result set.

  • Storage has been a problem in the world of transaction processing and data warehousing.

  • Due to the design of the underlying software, we do not consume all the storage that is available on a disk.

  • Another problem with storage is the cost per byte.

48 of 214

Data processing infrastructure challenges

Transportation

  • One of the biggest issues is moving data between different systems and then storing it or loading it into memory for manipulation.

  • This continuous movement of data has been one of the reasons that structured data processing evolved to be restrictive in nature (the data had to be transported between the compute and storage layers).

  • Network technologies have allowed the bandwidth of the transport layers to become much bigger and more scalable.

49 of 214

Data processing infrastructure challenges

Processing

  • Processing combines some form of logical and mathematical calculations together in one cycle of operation.

  • It is divided into three main areas:

1. CPU or processor.

2. Memory

3. Software

50 of 214

Data processing infrastructure challenges

CPU or processor.

  • With each generation:

- the computing speed and processing power have increased

-leading to more processing capabilities

- access to wider memory.

- architecture evolution within the software layers.

Memory.

  • The storage of data to disk for offline processing drove the need for storage evolution and data management.

  • As processor evolution improved the capability of the processor, memory has become cheaper and faster.

  • The amount of memory allocated to a process residing within a system has also changed significantly.

51 of 214

Data processing infrastructure challenges

Software

  • The main component of data processing.

  • It is used to develop the programs that transform and process the data.

  • Software across different layers from operating systems to programming languages has evolved generationally.

  • Translates sequenced instruction sets into machine language that is used to process data with the infrastructure layers of CPU + memory + storage.

52 of 214

Data processing infrastructure challenges

Speed or throughput

  • The biggest continuing challenge.

  • Speed is a combination of various architecture layers: hardware, software, networking, and storage.

53 of 214

Big Data Processing Architectures

54 of 214

Big Data Processing Architectures

Centralized Processing Architecture

    • All the data is collected to a single centralized storage area and processed by a single computer

    • Evolved with transaction processing and are well suited for small organizations with one location of service.

Advantages :

  • requires minimal resources both from people & system perspectives.

      • Centralized processing is very successful when the collection and consumption of data occurs at the same location.

55 of 214

Big Data Processing Architectures

Distributed Processing Architecture

Data and its processing are distributed across geographies or data centers

Types:

1. Client–server architecture

Client: collection and presentation

Server: processing and management

2. Three-tier architecture

Client, server, middle tier

Middle tier: processing logic

3. n-tier architecture

Clients, middleware, applications, and servers are isolated into tiers.

Any tier can be scaled independently.

56 of 214

Big Data Processing Architectures

4. Cluster architecture.

  • Machines are connected in a network architecture .
  • Software and hardware work together to process data or compute requirements in parallel.
      • Each machine in a cluster is associated with a task that is processed locally and the result sets are collected to a master server that returns it back to the user.

5. Peer-to-peer architecture.

  • No dedicated servers and clients; instead, all the processing responsibilities are allocated among all machines, known as peers.
  • Each machine can perform the role of a client or server or just process data.

57 of 214

Big Data Processing Architectures

Distributed processing advantages :

Scalability of systems and resources can be achieved based on isolated needs.

Processing and management of information can be architected based on desired unit of operation.

– Parallel processing of data reducing time latencies.

Distributed processing Disadvantages:

– Data redundancy

– Process redundancy

– Resource overhead

– Volumes

58 of 214

Big Data Processing Architectures

  • When working with very large data sets, it can take a long time to run the sort of queries that clients need.

  • These queries can't be performed in real time, and often require algorithms such as MapReduce that operate in parallel across the entire data set.

  • The results are then stored separately from the raw data and used for querying.

  • One drawback to this approach is that it introduces latency — if processing takes a few hours, a query may return results that are several hours old.

  • Ideally, you would like to get some results in real time (perhaps with some loss of accuracy), and combine these results with the results from the batch analytics.

59 of 214

Big Data Processing Architectures

  • The lambda architecture, first proposed by Nathan Marz, addresses this problem by creating two paths for data flow.

  • All data coming into the system goes through these two paths:

  • batch layer (cold path) stores all of the incoming data in its raw form and performs batch processing on the data. The result of this processing is stored as a batch view.

  • speed layer (hot path) analyzes data in real time. This layer is designed for low latency, at the expense of accuracy.

60 of 214

Big Data Processing Architectures

Lambda Architecture

The batch layer feeds into a serving layer that indexes the batch view for efficient querying.

The speed layer updates the serving layer with incremental updates based on the most recent data.

61 of 214

Big Data Processing Architectures
  • The hot and cold paths converge at the analytics client application.

  • If the client needs to display timely, yet potentially less accurate data in real time, it will acquire its result from the hot path.

  • Otherwise, it will select results from the cold path to display less timely but more accurate data.

  • In other words, the hot path has data for a relatively small window of time, after which the results can be updated with more accurate data from the cold path.

62 of 214

Big Data Processing Architectures
  • The raw data stored at the batch layer is immutable.

  • Incoming data is always appended to the existing data, and the previous data is never overwritten.

  • Any changes to the value of a particular data item are stored as a new timestamped event record.

  • This allows for recomputation at any point in time across the history of the data collected.

  • The ability to recompute the batch view from the original raw data is important, because it allows for new views to be created as the system evolves.

63 of 214

Big Data Processing Architectures

Lambda Architecture

Batch Layer (Cold Path)

Stores all incoming data and performs batch processing

Manages all historical data

Recomputes the results, e.g., using machine learning models

Results come at high latency due to computational cost

Data can only be appended, not updated or deleted

Data is stored using in-memory databases or long-term persistent stores such as NoSQL storage

Uses MapReduce

Speed Layer (Hot Path)

Provides low-latency results

Data is processed in real time

Uses incremental algorithms

Creating and deleting data sets is possible

64 of 214

Big Data Processing Architectures

Lambda Architecture

Serving Layer:

Users fire queries against the serving layer

Applications:

Ad-hoc queries

Netflix, Twitter, Yahoo

Pros:

The batch layer manages historical data, so errors are low when the system crashes

Good speed and reliability

Fault tolerance and scalable processing

Cons:

Caching overhead, complexity, duplicate computation

Difficult to migrate or reorganize

65 of 214

Kappa Architecture

  • A drawback to the lambda architecture is its complexity. Processing logic appears in two different places — the cold and hot paths — using different frameworks. This leads to duplicate computation logic and the complexity of managing the architecture for both paths.

  • The kappa architecture was proposed by Jay Kreps as an alternative to the lambda architecture.

  • It has the same basic goals as the lambda architecture, but with an important distinction: All data flows through a single path, using a stream processing system.

Big Data Processing Architectures

66 of 214

  • Kappa Architecture

Big Data Processing Architectures

67 of 214

Big Data Processing Architectures

  • There are some similarities to the lambda architecture's batch layer, in that the event data is immutable and all of it is collected, instead of a subset.

  • The data is ingested as a stream of events into a distributed and fault tolerant unified log.

  • These events are ordered, and the current state of an event is changed only by a new event being appended.

  • Similar to a lambda architecture's speed layer, all event processing is performed on the input stream and persisted as a real-time view.

  • If you need to recompute the entire data set (equivalent to what the batch layer does in lambda), you simply replay the stream, typically using parallelism to complete the computation in a timely fashion.

68 of 214

Kappa Architecture

  • A simplified lambda architecture with the batch layer removed
  • The speed layer is capable of handling both real-time and batch data
  • Only two layers: stream processing and serving
  • All event processing is performed on the input stream and persisted as a real-time view
  • The speed layer is designed using Apache Storm or Spark

Big Data Processing Architectures

69 of 214

70 of 214

Zeta architecture

  • This is a next-generation enterprise architecture proposed by Jim Scott.

  • It is a pluggable architecture consisting of a distributed file system, real-time data storage, a pluggable compute model/execution engine, a deployment/container management system, a solution architecture, enterprise applications, and dynamic and global resource management.

Big Data Processing Architectures

71 of 214

Zeta architecture diagram

Big Data Processing Architectures

72 of 214

The Traditional Research Approach

Fig: The traditional research approach — clients send queries to an integration system (with metadata), which pulls data on demand from multiple sources through wrappers.

  • Query-driven (lazy, on-demand)

73 of 214

  • Delay in query processing
    • Slow or unavailable information sources
    • Complex filtering and integration
  • Inefficient and potentially expensive for frequent queries
  • Competes with local processing at sources
  • Hasn’t caught on in industry

Big Data Processing Architectures

74 of 214

Data Warehouse

75 of 214

The Warehousing Approach

Fig: The warehousing approach — extractors/monitors pull data from the sources into an integration system (with metadata), which loads the data warehouse that clients query directly.

  • Information is integrated in advance
  • Stored in the warehouse for direct querying and analysis

76 of 214

The Warehousing Approach

  • A technique for assembling and managing data from various sources for the purpose of answering business questions, thus making decisions that were not previously possible.
  • A decision support database maintained separately from the organization’s operational database.

77 of 214

The Warehousing Approach

Definition: A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin]

  • By comparison: an OLTP (on-line transaction processor) or operational system is used to deal with the everyday running of one aspect of an enterprise.

  • OLTP systems are usually designed independently of each other and it is difficult for them to share information.

78 of 214

Characteristics of Data Warehouse

  • Subject oriented. Data are organized based on how the users refer to them.
  • Integrated. All inconsistencies regarding naming convention and value representations are removed.
  • Nonvolatile. Data are stored in read-only format and do not change over time.
  • Time variant. Data are not current but normally time series.

79 of 214

Characteristics of Data Warehouse

  • Summarized. Operational data are mapped into a decision-usable format.
  • Large volume. Time series data sets are normally quite large.
  • Not normalized. DW data can be, and often are, redundant.
  • Metadata. Data about data are stored.
  • Data sources. Data come from internal and external unintegrated operational systems.

80 of 214

Need of Data Warehouses

  • Consolidation of information resources
  • Improved query performance
  • Separate research and decision support functions from the operational systems
  • Foundation for data mining, data visualization, advanced reporting and OLAP tools

81 of 214

Data Warehouse

A subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process is called a data warehouse.

Subject-oriented

The DWH is organized around the major subjects of the enterprise (e.g., customer, product, sales) rather than application areas (customer invoicing, stock control, product sale).

Integrated: Data comes from enterprise-wide applications in different formats.

Time-variant

The DWH holds data referring to different points and intervals in time, so it behaves differently at different time intervals.

Non-volatile

New data is always added to the existing data rather than replacing it.

82 of 214

Merits of a Data Warehouse

  1. Delivers enhanced Business Intelligence
  2. Ensures data quality and consistency

The DWH supports data conversion into a common and standard format

No discrepancies

3. Saves time and money

Saves users’ time since the data is in one place

DWH execution does not require extensive IT support or a high number of channels

4. Tracks historical data intelligently

Provides updates about changing trends

5. Generates high revenue

83 of 214

Advantages of Warehousing Approach

  • High query performance
    • But not necessarily most current information
  • Doesn’t interfere with local processing at sources
    • Complex queries at warehouse
    • OLTP at information sources
  • Information copied at warehouse
    • Can modify, annotate, summarize, restructure, etc.
    • Can store historical information
    • Security, no auditing
  • Has caught on in industry

84 of 214

Which are our lowest/highest margin customers?

Who are my customers and what products are they buying?

Which customers are most likely to go to the competition?

What impact will new products/services have on revenue and margins?

What product promotions have the biggest impact on revenue?

What is the most effective distribution channel?

85 of 214

  • Knowledge discovery
    • Making consolidated reports
    • Finding relationships and correlations
    • Data mining
    • Examples
      • Banks identifying credit risks
      • Insurance companies searching for fraud
      • Medical research

86 of 214

Comparison Chart of Database Types

Data warehouse | Operational system

Subject oriented | Transaction oriented
Large (hundreds of GB up to several TB) | Small (MB up to several GB)
Historic data | Current data
De-normalized table structure (few tables, many columns per table) | Normalized table structure (many tables, few columns per table)
Batch updates | Continuous updates
Usually very complex queries | Simple to complex queries

87 of 214

Comparison Chart of Database Types

Data warehouse | Big Data

Extracts data from a variety of SQL-based data sources (relational databases) and helps generate analytic reports. | Handles huge data coming from various heterogeneous resources, including social media.
Mainly handles structured data. | Can handle structured, unstructured, and semi-structured data.
Analytics works on informed (specific) information. | Has a lot of data, so analytics provides information by extracting useful information from the data.
Does not use a distributed file system. | Uses a distributed file system.
Never erases previous data when new data is added. | Also never erases previous data when new data is added, but sometimes real-time data streams are processed.
Time to fetch data simultaneously is longer. | Time to fetch data simultaneously is shorter, using the Hadoop File System.

88 of 214

Reengineering the data warehouse

89 of 214

Reengineering the Data Warehouse

Enterprise data warehouse platform

There are several layers of infrastructure that make up the platform for the EDW:

1. The hardware platform:

● Database server:

– Processor

– Memory

– BUS architecture

● Storage server:

– Class of disk

– Controller

● Network

2. Operating system

3. Application software:

● Database

● Utilities

90 of 214

Data distribution in a data warehouse

Operational data store

Reengineering the Data Warehouse

91 of 214

Choices for reengineering the data warehouse

Replatforming

  • Replatform the data warehouse to a new platform including all hardware & infrastructure.

  • There are several new technology options and depending on the requirement of the organization, any of these technologies can be deployed.

  • The choices include data warehouse appliances, commodity platforms, tiered storage, private cloud, and in-memory technologies.

Reengineering the Data Warehouse

92 of 214

Reengineering the Data Warehouse

Benefits:

● Move the data warehouse to a scalable and reliable platform.

● The underlying infrastructure and the associated application software layers can be architected to provide security, lower maintenance, and increased reliability.

● Optimize the application and database code.

● Provide some additional opportunities to use new functionality.

● Make it possible to rearchitect things in a different/better way, which is almost impossible to do in an existing setup.

93 of 214

Reengineering the Data Warehouse

Disadvantages:

Takes a long cycle time to complete, leading to disruption of business activities.

● Replatforming often means reverse engineering complex business processes and rules that may be undocumented or custom developed in the current platform.

● May not be feasible for certain aspects of data processing or there may be complex calculations that need to be rewritten if they cannot be directly supported by the functionality of the new platform.

● Replatforming is not economical in environments that have large legacy platforms, as it consumes too many business process cycles to reverse engineer the logic and document it.

94 of 214

Data warehouse platform.

95 of 214

Platform engineering

  • Modify parts of the infrastructure and get great gains in scalability and performance.

  • Used in the automotive industry where the focus was on improving quality, reducing costs, and delivering services and products to end users in a highly cost-efficient manner.

  • Applied to the data warehouse, platform engineering can translate to:

● Reduce the cost of the data warehouse.

● Increase efficiencies of processing.

● Simplify the complexities in the acquisition, processing, and delivery of data.

● Reduce redundancies.

96 of 214

Platform engineering

  • Platform reengineering can be done at multiple layers:
  • Storage level: storage layer of the data is engineered to process data at very high speeds for high or low volumes.
  • Server reengineering: hardware and its components can be replaced with more modern components that can be supported in the configuration
  • Network reengineering: In this approach the network layout and the infrastructure are reengineered.
  • Data warehouse appliances: In this approach the entire data warehouse or data mart can be ported to a data warehouse appliance. The data warehouse appliance is an integrated stack of hardware, software, and network components designed and engineered to handle data warehouse rigors.
  • Application server: In this approach the application server is customized to process reports and analytic layers across a clustered architecture.

97 of 214

Platform engineering

Data engineering

  • The data structures are reengineered to create better performance.

  • The data model developed as a part of the initial data warehouse is often scrubbed and new additions are made to the data model.

Typical changes include:

  • Partitioning—a table can be vertically partitioned depending on the usage of columns, thus reducing the span of I/O operations. Another partition technique is horizontal partitioning where the table is partitioned by date or numeric ranges into smaller slices.

  • Colocation—a table and all its associated tables can be colocated in the same storage region.

98 of 214

Platform engineering

  • Distribution—a large table can be broken into a distributed set of smaller tables and used.

  • New data types—several new data types like geospatial and temporal data can be used in the data architecture and current workarounds for such data can be retired. This will provide a significant performance boost.

  • New database functions—several new databases provide native functions like scalar tables and indexed views, and can be utilized to create performance boosts.

99 of 214

Architectures

100 of 214

Shared-everything architecture

  • A system architecture where all resources are shared, including storage, memory, and the processor.

  • Two variations of shared-everything architecture are:

  • Symmetric multiprocessing (SMP)

  • Distributed shared memory (DSM).

101 of 214

Symmetric multiprocessing (SMP):

- Processors share a single pool of memory for read-write access concurrently and uniformly, without latency.
- Referred to as uniform memory access (UMA) architecture.
- Drawback: when multiple processors share a single system bus, the bandwidth for simultaneous memory access is choked; therefore the scalability of such a system is very limited.

Distributed shared memory (DSM):

- Addresses the scalability problem by providing multiple pools of memory for processors to use.
- Referred to as non-uniform memory access (NUMA) architecture.
- The latency to access memory depends on the relative distances of the processors and their dedicated memory pools.

102 of 214

  • Both SMP and DSM architectures have been deployed for many transaction processing systems , where the transactional data is small in size and has a short burst cycle of resource requirements.

  • Data warehouses have been deployed on the shared-everything architecture for many years.

  • Due to the intrinsic architecture limitations, the direct impact has been on cost and performance.

  • Analytical applications and Big Data cannot be processed on a shared-everything architecture.

Shared-everything architecture

103 of 214

Fig: Shared-everything architecture

104 of 214

Shared-nothing architecture

  • Is a distributed computing architecture where multiple systems (called nodes) are networked to form a scalable system.

  • Each node has its own private memory, disks, and storage devices independent of any other node in the configuration.

  • None of the nodes share memory or disk storage.

  • Each processor has its own local memory & local disk.

  • Intercommunication channel is used by the processors to communicate.

  • Processors can independently act as a server to serve the data of local disk.

105 of 214

Fig: Shared-nothing architecture

106 of 214

Shared-nothing architecture

  • The key flexibility of the architecture is its scalability.

  • This is the underlying architecture for data warehouse appliances and large-scale data processing.

  • The extensibility and infinite scalability of this architecture make it the platform architecture for Internet and web applications.

  • The key feature is that the operating system, not the application server, owns responsibility for controlling and sharing hardware resources.

  • A system can assign dedicated applications or partition its data among the different nodes to handle a particular task.

107 of 214

Advantages of shared-nothing architecture

  • Scalable.
  • When a node is added, transmission capacity increases.
  • Failure is local (failure of one node does not affect other nodes).

Disadvantages of shared-nothing architecture

  • The cost of communication is higher than in a shared-memory architecture.
  • Sending data involves software interaction.
  • More coordination is required.

108 of 214


109 of 214


110 of 214

What is big data?

Big Data is any thing which is crash Excel.

Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM.


Or, in other words, Big Data is data in volumes too great to process by traditional methods.

https://twitter.com/devops_borat

111 of 214

Data accumulation


  • Today, data is accumulating at tremendous rates
    • click streams from web visitors
    • supermarket transactions
    • sensor readings
    • video camera footage
    • GPS trails
    • social media interactions

...

  • It really is becoming a challenge to store and process it all in a meaningful way

112 of 214

From WWW to VVV


  • Volume
    • data volumes are becoming unmanageable
  • Variety
    • data complexity is growing
    • more types of data captured than previously
  • Velocity
    • some data is arriving so rapidly that it must either be processed instantly, or lost
    • this is a whole subfield called “stream processing”

113 of 214

The promise of Big Data

  • Data contains information of great business value
  • If you can extract those insights you can make far better decisions
  • ...but is data really that valuable?

114 of 214


115 of 214


116 of 214

“quadrupling the average cow's milk production since your parents were born”


"When Freddie [as he is known] had no daughter records our equations predicted from his DNA that he would be the best bull," USDA research geneticist Paul VanRaden emailed me with a detectable hint of pride. "Now he is the best progeny tested bull (as predicted)."

117 of 214

Some more examples


  • Sports
    • basketball increasingly driven by data analytics
    • soccer beginning to follow
  • Entertainment
    • House of Cards designed based on data analysis
    • increasing use of similar tools in Hollywood
  • “Visa Says Big Data Identifies Billions of Dollars in Fraud”
    • new Big Data analytics platform on Hadoop
  • “Facebook is about to launch Big Data play”
    • starting to connect Facebook with real life

https://delicious.com/larsbot/big-data

118 of 214

Ok, ok, but ... does it apply to our customers?


  • Norwegian Food Safety Authority
    • accumulates data on all farm animals
    • birth, death, movements, medication, samples, ...
  • Hafslund
    • time series from hydroelectric dams, power prices, meters of individual customers, ...
  • Social Security Administration
    • data on individual cases, actions taken, outcomes...
  • Statoil
    • massive amounts of data from oil exploration, operations, logistics, engineering, ...
  • Retailers
    • see Target example above
    • also, connection between what people buy, weather forecast, logistics, ...

119 of 214

How to extract insight from data?

Monthly Retail Sales in New South Wales (NSW) Retail Department Stores


120 of 214

Types of algorithms

  • Clustering
  • Association learning
  • Parameter estimation
  • Recommendation engines
  • Classification
  • Similarity matching
  • Neural networks
  • Bayesian networks
  • Genetic algorithms


121 of 214

Basically, it’s all maths...

  • Linear algebra
  • Calculus
  • Probability theory
  • Graph theory
  • ...


https://twitter.com/devops_borat

Only 10% in devops are know how of work with Big Data.

Only 1% are realize they are need 2 Big Data for fault tolerance

122 of 214

Big data skills gap

  • Hardly anyone knows this stuff
  • It’s a big field, with lots and lots of theory
  • And it’s all maths, so it’s tricky to learn


http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap

http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap

123 of 214

Two orthogonal aspects


  • Analytics / machine learning
    • learning insights from data
  • Big data
    • handling massive data volumes
  • Can be combined, or used separately

124 of 214

Data science?


125 of 214

How to process Big Data?


  • If relational databases are not enough, what is?

https://twitter.com/devops_borat

Mining of Big Data is problem solve in 2013 with zgrep

126 of 214

MapReduce

  • A framework for writing massively parallel code
  • Simple, straightforward model
  • Based on “map” and “reduce” functions from functional programming (LISP)
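A conceptual sketch only (one machine, plain Python) of the same map/reduce idea that Hadoop and similar frameworks run in parallel across a cluster:

from functools import reduce
from collections import Counter

documents = ["big data is big", "data science uses big data"]

# Map: each document independently emits word counts - trivially parallelizable.
mapped = [Counter(doc.split()) for doc in documents]

# Reduce: partial results with the same keys are merged into the final answer.
word_counts = reduce(lambda a, b: a + b, mapped, Counter())
print(word_counts)          # Counter({'big': 3, 'data': 3, ...})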


127 of 214

NoSQL and Big Data


  • Not really that relevant
  • Traditional databases handle big data sets, too
  • NoSQL databases have poor analytics
  • MapReduce often works from text files
    • can obviously work from SQL and NoSQL, too
  • NoSQL is more for high throughput
    • basically, AP from the CAP theorem, instead of CP
  • In practice, really Big Data is likely to be a mix
    • text files, NoSQL, and SQL

128 of 214

The 4th V: Veracity


https://twitter.com/devops_borat

“The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.”

Daniel Boorstin, in The Discoverers (1983)

95% of time, when is clean Big Data is get Little Data

129 of 214

Data quality

  • A huge problem in practice
    • any manually entered data is suspect
    • most data sets are in practice deeply problematic
  • Even automatically gathered data can be a problem
    • systematic problems with sensors
    • errors causing data loss
    • incorrect metadata about the sensor
  • Never, never, never trust the data without checking it!
    • garbage in, garbage out, etc


130 of 214

Approaches to learning


  • Supervised
    • we have training data with correct answers
    • use training data to prepare the algorithm
    • then apply it to data without a correct answer
  • Unsupervised
    • no training data
    • throw data into the algorithm, hope it makes some kind of sense out of the data

131 of 214

Approaches to learning


  • Prediction
    • predicting a variable from data
  • Classification
    • assigning records to predefined groups
  • Clustering
    • splitting records into groups based on similarity
  • Association learning
    • seeing what often appears together with what

132 of 214

Issues


  • Data is usually noisy in some way
    • imprecise input values
    • hidden/latent input values
  • Inductive bias
    • basically, the shape of the algorithm we choose
    • may not fit the data at all
    • may induce underfitting or overfitting
  • Machine learning without inductive bias is not possible

133 of 214

Underfitting

  • Using an algorithm that cannot capture the full complexity of the data


134 of 214

Overfitting

  • Tuning the algorithm so carefully it starts matching the noise in the training data
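A tiny, hedged illustration (plain Python, made-up noisy data): a model that simply memorizes the training set (1-nearest neighbour) scores perfectly on the data it was tuned to, but noticeably worse on fresh data from the same source:

import random

random.seed(0)

def noisy_sample(n):
    # Points x in [0, 1]; the true class is x > 0.5, but 20% of labels are flipped (noise).
    data = []
    for _ in range(n):
        x = random.random()
        label = x > 0.5
        if random.random() < 0.2:
            label = not label
        data.append((x, label))
    return data

train, test = noisy_sample(50), noisy_sample(50)

def predict(x, memory):
    # 1-nearest neighbour: copy the label of the closest training point.
    return min(memory, key=lambda point: abs(point[0] - x))[1]

for name, data in (('train', train), ('test', test)):
    accuracy = sum(predict(x, train) == y for x, y in data) / len(data)
    print(name, round(accuracy, 2))   # 1.0 on training data, clearly lower on test data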


135 of 214

“What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.”


136 of 214

Testing


  • When doing this for real, testing is crucial
  • Testing means splitting your data set
    • training data (used as input to algorithm)
    • test data (used for evaluation only)
  • Need to compute some measure of performance
    • precision/recall
    • root mean square error
  • A huge field of theory here
    • will not go into it in this course
    • very important in practice
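Two of the measures mentioned above, sketched in plain Python on made-up predictions:

def precision_recall(predicted, actual):
    # Precision and recall for a binary classifier (True = positive class).
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    return tp / (tp + fp), tp / (tp + fn)

def rmse(predicted, actual):
    # Root mean square error for numeric predictions.
    return (sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)) ** 0.5

print(precision_recall([True, True, False, True], [True, False, False, True]))  # (0.666..., 1.0)
print(rmse([2.5, 0.0, 2.0], [3.0, -0.5, 2.0]))                                   # ~0.408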

137 of 214

Missing values


  • Usually, there are missing values in the data set
    • that is, some records have some NULL values
  • These cause problems for many machine learning algorithms
  • Need to solve somehow
    • remove all records with NULLs
    • use a default value
    • estimate a replacement value

...
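A small sketch of the three strategies above, assuming pandas and a made-up frame:

import pandas as pd

df = pd.DataFrame({'age': [34, None, 29, None],
                   'city': ['Oslo', 'Bergen', None, 'Oslo']})

dropped = df.dropna()                                            # remove all records with NULLs
defaulted = df.fillna({'city': 'unknown'})                       # use a default value
estimated = df.assign(age=df['age'].fillna(df['age'].median()))  # estimate a replacement value

print(dropped, defaulted, estimated, sep='\n\n')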

138 of 214

Terminology


  • Vector
    • one-dimensional array
  • Matrix
    • two-dimensional array
  • Linear algebra
    • algebra with vectors and matrices
    • addition, multiplication, transposition, ...

139 of 214

Top 10 algorithms


140 of 214

Top 10 machine learning algs

1. C4.5

2. k-means clustering

3. Support vector machines

4. the Apriori algorithm

5. the EM algorithm

6. PageRank

7. AdaBoost

8. k-nearest neighbours classification

9. Naïve Bayes

10. CART


From a survey at the IEEE International Conference on Data Mining (ICDM) in December 2006: “Top 10 algorithms in data mining”, by X. Wu et al.

141 of 214

C4.5


  • Algorithm for building decision trees
    • basically trees of boolean expressions
    • each node splits the data set in two
    • leaves assign items to classes
  • Decision trees are useful not just for classification
    • they can also teach you something about the classes
  • C4.5 is a bit involved to learn
    • the ID3 algorithm is much simpler
  • CART (#10) is another algorithm for learning decision trees

142 of 214

Support Vector Machines


  • A way to do binary classification on matrices
  • Support vectors are the data points nearest to the hyperplane that divides the classes
  • SVMs maximize the distance between SVs and the boundary
  • Particularly valuable because of “the kernel trick”: using a transformation to a higher dimension to handle more complex class boundaries

  • A bit of work to learn, but manageable

143 of 214

Apriori


  • An algorithm for “frequent itemsets”
    • basically, working out which items frequently appear together

    • for example, what goods are often bought together in the supermarket?
    • used for Amazon’s “customers who bought this...”
  • Can also be used to find association rules
    • that is, “people who buy X often buy Y” or similar
  • Apriori is slow
    • a faster, further development is FP-growth

144 of 214

Expectation Maximization


  • A deeply interesting algorithm I’ve seen used in a number of contexts
    • very hard to understand what it does
    • very heavy on the maths
  • Essentially an iterative algorithm
    • skips between an “expectation” step and a “maximization” step

    • tries to optimize the output of a function
  • Can be used for
    • clustering
    • a number of more specialized examples, too

145 of 214

PageRank


  • Basically a graph analysis algorithm
    • identifies the most prominent nodes
    • used for weighting search results on Google
  • Can be applied to any graph
    • for example an RDF data set
  • Basically works by simulating random walk
    • estimating the likelihood that a walker would be on a given node at a given time

    • actual implementation is linear algebra
  • The basic algorithm has some issues
    • “spider traps”
    • graph must be connected
    • straightforward solutions to these exist
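A toy power-iteration sketch in plain Python on a hypothetical four-node graph; the damping factor is the standard fix for spider traps and weakly connected graphs, and real implementations do the same computation as linear algebra at scale:

links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}   # hypothetical link graph
nodes = list(links)
rank = {n: 1.0 / len(nodes) for n in nodes}
damping = 0.85

for _ in range(50):                          # iterate until the ranks stabilize
    new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
    for node, outgoing in links.items():
        share = damping * rank[node] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

print(sorted(rank.items(), key=lambda kv: -kv[1]))   # most prominent nodes first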

146 of 214

AdaBoost


  • Algorithm for “ensemble learning”
  • That is, for combining several algorithms
    • and training them on the same data
  • Combining more algorithms can be very effective
    • usually better than a single algorithm
  • AdaBoost basically weights training samples
    • giving the most weight to those which are classified the worst

147 of 214

Naïve Bayes


148 of 214

Bayes’s Theorem


  • Basically a theorem for combining probabilities
    • I’ve observed A, which indicates H is true with probability 70%
    • I’ve also observed B, which indicates H is true with probability 85%
    • what should I conclude?
  • Naïve Bayes is basically using this theorem
    • with the assumption that A and B are independent
    • this assumption is nearly always false, hence “naïve”

149 of 214

Simple example


  • Is the coin fair or not?
    • we throw it 10 times, get 9 heads and one tail

    • we try again, get 8 heads and two tails

  • What do we know now?
    • can combine data and recompute
    • or just use Bayes’s Theorem directly

>>> compute_bayes([0.92, 0.84])

0.9837067209775967

150 of 214

Ways I’ve used Bayes


  • Duke
    • record deduplication engine
    • estimate probability of duplicate for each property
    • combine probabilities with Bayes
  • Whazzup
    • news aggregator that finds relevant news
    • works essentially like spam classifier on next slide
  • Tine recommendation prototype
    • recommends recipes based on previous choices
    • also like spam classifier
  • Classifying expenses
    • using export from my bank
    • also like spam classifier

151 of 214

Bayes against spam


  • Take a set of emails, divide it into spam and non-spam (ham)
    • count the number of times a feature appears in each of the two sets
    • a feature can be a word or anything you please
  • To classify an email, for each feature in it
    • consider the probability of email being spam given that feature to be (spam count) / (spam count + ham count)
    • ie: if “viagra” appears 99 times in spam and 1 in ham, the probability is 0.99
  • Then combine the probabilities with Bayes

152 of 214

Running the script


  • I pass it
    • 1000 emails from my Bouvet folder
    • 1000 emails from my Spam folder
  • Then I feed it
    • 1 email from another Bouvet folder
    • 1 email from another Spam folder

153 of 214

Code

import sys, glob

# spamdir, hamdir, PATTERN, SAMPLES, corpus, featurize and classify
# are defined in the full script (see the link below).

# scan spam
for spam in glob.glob(spamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(spam):
        corpus.spam(token)

# scan ham
for ham in glob.glob(hamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(ham):
        corpus.ham(token)

# compute probability
for email in sys.argv[3 : ]:
    print email
    p = classify(email)
    if p < 0.2:
        print '  Spam', p
    else:
        print '  Ham', p

https://github.com/larsga/py-snippets/tree/master/machine-learning/spam

154 of 214

Classify


class Feature:

    def __init__(self, token):
        self._token = token
        self._spam = 0
        self._ham = 0

    def spam(self):
        self._spam += 1

    def ham(self):
        self._ham += 1

    def spam_probability(self):
        return (self._spam + PADDING) / float(self._spam + self._ham + (PADDING * 2))

def compute_bayes(probs):
    product = reduce(operator.mul, probs)
    lastpart = reduce(operator.mul, map(lambda x: 1-x, probs))
    if product + lastpart == 0:
        return 0 # happens rarely, but happens
    else:
        return product / (product + lastpart)

def classify(email):
    return compute_bayes([corpus.spam_probability(f) for f in featurize(email)])

155 of 214

Ham output


Ham 1.0

Received:2013        0.00342935528121
Date:2013            0.00624219725343
<br                  0.0291715285881
background-color:    0.03125
background-color:    0.03125
background-color:    0.03125
background-color:    0.03125
background-color:    0.03125
Received:Mar         0.0332667997339
Date:Mar             0.0362756952842
...
Postboks             0.998107494322
Postboks             0.998107494322
Postboks             0.998107494322
+47                  0.99787414966
+47                  0.99787414966
+47                  0.99787414966
+47                  0.99787414966
Lars                 0.996863237139
Lars                 0.996863237139
23                   0.995381062356

So, clearly most of the spam is from March 2013...

156 of 214

Spam output


Spam 2.92798502037e-16

Received:-0400                                        0.0115646258503
Received:-0400                                        0.0115646258503
Received-SPF:(ontopia.virtual.vps-host.net:           0.0135823429542
Received-SPF:receiver=ontopia.virtual.vps-host.net;   0.0135823429542
                                                      0.0139318885449
                                                      0.0139318885449
Received:ontopia.virtual.vps-host.net                 0.0170863309353
Received:(8.13.1/8.13.1)                              0.0170863309353
Received:ontopia.virtual.vps-host.net                 0.0170863309353
Received:(8.13.1/8.13.1)                              0.0170863309353
...
Received:2012                                         0.986111111111
Received:2012                                         0.986111111111
$                                                     0.983193277311
Received:Oct                                          0.968152866242
Received:Oct                                          0.968152866242
Date:2012                                             0.959459459459
20                                                    0.938864628821
+                                                     0.936526946108
+                                                     0.936526946108
+                                                     0.936526946108

...and the ham from October 2012

157 of 214

More solid testing


  • Using the SpamAssassin public corpus
  • Training with 500 emails from
    • spam
    • easy_ham (2002)
  • Test results
    • spam_2: 1128 spam, 269 misclassified as ham
    • easy_ham 2003: 2283 ham, 217 misclassified as spam
  • Results are pretty good for 30 minutes of effort...

158 of 214

Linear regression

159 of 214

Linear regression

  • Let’s say we have a number of numerical parameters for an object
  • We want to use these to predict some other value
  • Examples
    • estimating real estate prices
    • predicting the rating of a beer

...

160 of 214

Estimating real estate prices

  • Take parameters
    • x1 square meters
    • x2 number of rooms
    • x3 number of floors
    • x4 energy cost per year
    • x5 meters to nearest subway station
    • x6 years since built
    • x7 years since last refurbished

...

  • a x1 + b x2 + c x3 + ... = price
    • strip out the x-es and you have a vector
    • collect N samples of real flats with prices = matrix
    • welcome to the world of linear algebra

161 of 214

Our data set: beer ratings

  • Ratebeer.com
    • a web site for rating beer
    • scale of 0.5 to 5.0
  • For each beer we know
    • alcohol %
    • country of origin
    • brewery
    • beer style (IPA, pilsener, stout, ...)
  • But ... only one attribute is numeric!
    • how to solve?

162 of 214

Example

ABV   .se   .nl   .us   .uk   IIPA   Black IPA   Pale ale   Bitter   Rating
8.5   1.0   0.0   0.0   0.0   1.0    0.0         0.0        0.0      3.5
8.0   0.0   1.0   0.0   0.0   0.0    1.0         0.0        0.0      3.7
6.2   0.0   0.0   1.0   0.0   0.0    0.0         1.0        0.0      3.2
4.4   0.0   0.0   0.0   1.0   0.0    0.0         0.0        1.0      3.2
...   ...   ...   ...   ...   ...    ...         ...        ...      ...

Basically, we turn each category into a column of 0.0 or 1.0 values.

163 of 214

Normalization

  • If some columns have much bigger values than the others they will automatically dominate predictions
  • We solve this by normalization
  • Basically, all values get resized into the 0.0-1.0 range
  • For ABV we set a ceiling of 15%

compute with min(15.0, abv) / 15.0
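
A small sketch of both rules (the generic min-max rescaling for the other numeric columns is an assumption; the slides only spell out the ABV rule):

def normalize_abv(abv):
    # ABV capped at 15%, then scaled into the 0.0-1.0 range
    return min(15.0, abv) / 15.0

def normalize_column(values):
    # generic min-max rescaling for any other numeric column (assumed approach)
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [(v - low) / float(high - low) for v in values]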

164 of 214

Adding more data

  • To get a bit more data, I manually added a description of each beer style
  • Each beer style got a 0.0-1.0 rating on
    • colour (pale/dark)
    • sweetness
    • hoppiness
    • sourness
  • These ratings are kind of coarse because all beers of the same style get the same value

165 of 214

Making predictions

  • We’re looking for a formula
    • a * abv + b * .se + c * .nl + d * .us + ... = rating
  • We have n examples

a * 8.5 + b * 1.0 + c * 0.0 + d * 0.0 + ... = 3.5

  • We have one unknown per column
    • as long as we have more rows than columns we can solve for the unknowns (in the least-squares sense)

  • Interestingly, matrix operations can be used to solve this easily

166 of 214

Matrix formulation

  • Let’s say
    • x is our data matrix
    • y is a vector with the ratings and
    • w is a vector with the a, b, c, ... values
  • That is: x * w = y
    • this is the same as the original equation
    • a x1 + b x2 + c x3 + ... = rating
  • If we solve this for w, we get the least-squares solution w = (xᵀ x)⁻¹ xᵀ y, which is exactly what the Numpy code below computes

167 of 214

Enter Numpy

  • Numpy is a Python library for matrix operations
  • It has built-in types for vectors and matrices
  • Means you can very easily work with matrices in Python
  • Why matrices?
    • much easier to express what we want to do
    • library written in C and very fast
    • takes care of rounding errors, etc

168 of 214

Quick Numpy example

>>> from numpy import *
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> [range(10)] * 10
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
>>> m = mat([range(10)] * 10)
>>> m
matrix([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> m.T
matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
        [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
        [6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
        [7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
        [8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
        [9, 9, 9, 9, 9, 9, 9, 9, 9, 9]])

169 of 214

Numpy solution

  • We load the data into
    • a list: scores
    • a list of lists: parameters
  • Then:

x_mat = mat(parameters)
y_mat = mat(scores).T

x_tx = x_mat.T * x_mat
assert linalg.det(x_tx)   # x.T * x must be invertible for this to work
ws = x_tx.I * (x_mat.T * y_mat)
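
For reference, the same weights can be computed with Numpy's least-squares routine, which avoids forming the inverse explicitly (a sketch, not the code used for the results on the next slides):

from numpy import mat, linalg

def fit(parameters, scores):
    x_mat = mat(parameters)
    y_mat = mat(scores).T
    # lstsq returns (solution, residuals, rank, singular values)
    ws, residuals, rank, singular_values = linalg.lstsq(x_mat, y_mat)
    return ws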

170 of 214

Does it work?


  • We only have very rough information about each beer (abv, country, style)
    • so very detailed prediction isn’t possible
    • but we should get some indication
  • Here are the results based on my ratings
    • 10% imperial stout from US 3.9
    • 4.5% pale lager from Ukraine 2.8
    • 5.2% German schwarzbier 3.1
    • 7.0% German doppelbock 3.5

171 of 214

Beyond prediction


  • We can use this for more than just prediction
  • We can also use it to see which columns contribute the most to the rating, that is, which aspects of a beer best predict the rating

  • If we look at the w vector we see the following

Aspect       LMG    grove
ABV          0.56   1.1
colour       0.46   0.42
sweetness    0.25   0.51
hoppiness    0.45   0.41
sourness     0.29   0.87

  • Could also use correlation

172 of 214

Did we underfit?


  • Who says the relationship between ABV and the rating is linear?
    • perhaps very low and very high ABV are both negative?
    • we cannot capture that with linear regression
  • Solution
    • add computed columns for parameters raised to higher powers
    • abv², abv³, abv⁴, ...
    • beware of overfitting...

173 of 214

Scatter plot


[Scatter plot: rating against ABV in %, annotated with the freeze-distilled Brewdog beers]

Code in Github, requires matplotlib

174 of 214

Trying again


175 of 214

Matrix factorization


  • Another way to do recommendations is matrix factorization
    • basically, make a user/item matrix with ratings
    • try to find two smaller matrices that, when multiplied together, give you the original matrix
    • that is, original with missing values filled in
  • Why does that work?
    • I don’t know
    • I tried it, couldn’t get it to work
    • therefore we’re not covering it
    • known to be a very good method, however

176 of 214

Clustering


177 of 214

Clustering


  • Basically, take a set of objects and sort them into groups
    • objects that are similar go into the same group

  • The groups are not defined beforehand
  • Sometimes the number of groups to create is input to the algorithm
  • Many, many different algorithms for this

178 of 214

Sample data


  • Our sample data set is data about aircraft from DBpedia
  • For each aircraft model we have
    • name
    • length (m)
    • height (m)
    • wingspan (m)
    • number of crew members
    • operational ceiling, or max height (m)
    • max speed (km/h)
    • empty weight (kg)
  • We use a subset of the data
    • 149 aircraft models which all have values for all of these properties
  • Also, all values normalized to the 0.0-1.0 range

179 of 214

Distance


  • All clustering algorithms require a distance function
    • that is, a measure of similarity between two objects
  • Any kind of distance function can be used
    • generally, lower values mean more similar
  • Examples of distance functions
    • metric distance
    • vector cosine
    • RMSE

...
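
For example, the RMSE distance used on the aircraft data a little later can be as simple as this (assuming two equal-length lists of normalized numbers):

def rmse(a, b):
    # root mean square of the per-property differences
    return (sum((x - y) ** 2 for x, y in zip(a, b)) / float(len(a))) ** 0.5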

180 of 214

k-means clustering


  • Input: the number of clusters to create (k)
  • Pick k objects
    • these are your initial clusters
  • For all objects, find nearest cluster
    • assign the object to that cluster
  • For each cluster, compute mean of all properties
    • use these mean values to compute distance to clusters
    • the mean is often referred to as a “centroid”
    • go back to previous step
  • Continue until no objects change cluster
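
A minimal sketch of that loop (assuming each object is a list of normalized numbers and distance is any pairwise function such as the rmse above; the real aircraft code is in the Github sample linked a few slides later):

import random

def kmeans(objects, k, distance, max_iterations=100):
    # pick k objects as the initial centroids
    centroids = [list(c) for c in random.sample(objects, k)]
    assignment = [None] * len(objects)
    for _ in range(max_iterations):
        # assignment step: each object joins the nearest centroid's cluster
        changed = False
        for ix, obj in enumerate(objects):
            nearest = min(range(k), key=lambda c: distance(obj, centroids[c]))
            if assignment[ix] != nearest:
                assignment[ix] = nearest
                changed = True
        if not changed:
            break   # no object changed cluster, so we are done
        # update step: each centroid becomes the mean of its members
        # (a cluster that ends up empty simply keeps its old centroid)
        for c in range(k):
            members = [objects[ix] for ix in range(len(objects)) if assignment[ix] == c]
            if members:
                centroids[c] = [sum(col) / float(len(members)) for col in zip(*members)]
    return assignment, centroids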

181 of 214

First attempt at aircraft


  • We leave out name and number built when doing comparison
  • We use RMSE as the distance measure
  • We set k = 5
  • What happens?
    • first iteration: all 149 assigned to a cluster
    • second: 11 models change cluster
    • third: 7 change
    • fourth: 5 change
    • fifth: 5 change
    • sixth: 2
    • seventh: 1
    • eighth: 0

182 of 214

3 jet bombers, one propeller bomber. Not too bad.

Cluster 5

cluster5, 4 models

ceiling : 13400.0

maxspeed : 1149.7

crew : 7.5

length : 47.275

height : 11.65

emptyweight : 69357.5

wingspan : 47.18

The Myasishchev M-50 was a Soviet prototype four-engine supersonic bomber which never attained service

The Tupolev Tu-16 was a twin-engine jet bomber used by the Soviet Union.

The Myasishchev M-4 Molot is a four-engined strategic bomber

The Convair B-36 “Peacemaker” was a strategic bomber built by Convair and operated solely by the United States Air Force (USAF) from 1949 to 1959


183 of 214

Small, slow propeller aircraft. Not too bad.

Cluster 4


cluster4, 56 models

ceiling : 5898.2

maxspeed : 259.8

crew : 2.2

length : 10.0

height : 3.3

emptyweight : 2202.5

wingspan : 13.8

The Avia B.135 was a Czechoslovak cantilever monoplane fighter aircraft

The North American B-25 Mitchell was an American twin-engined medium bomber

The Yakovlev UT-1 was a single-seater trainer aircraft

The Yakovlev UT-2 was a single-seater trainer aircraft

The Siebel Fh 104 Hallore was a small German twin-engined transport, communications and liaison aircraft

The Messerschmitt Bf 108 Taifun was a German single-engine sports and touring

aircraft

The Airco DH.2 was a single-seat biplane "pusher" aircraft

184 of 214

Small, very fast jet planes. Pretty good.

Cluster 3


cluster3, 12 models

ceiling : 16921.1

maxspeed : 2456.9

crew : 2.67

length : 17.2

height : 4.92

emptyweight : 9941

wingspan : 10.1

The Mikoyan MiG-29 is a fourth- generation jet fighter aircraft

The Vought F-8 Crusader was a single-engine, supersonic [fighter] aircraft

The English Electric Lightning is a supersonic jet fighter aircraft of the Cold War era, noted for its great speed.

The Dassault Mirage 5 is a supersonic attack aircraft

The Northrop T-38 Talon is a two- seat, twin-engine supersonic jet trainer

The Mikoyan MiG-35 is a further development of the MiG-29

185 of 214

Biggish, kind of slow planes. Some oddballs in this group.

Cluster 2

cluster2, 27 models

ceiling : 6447.5

maxspeed : 435

crew : 5.4

length : 24.4

height : 6.7

emptyweight : 16894

wingspan : 32.8

The Bartini Beriev VVA-14 (vertical

take-off amphibious aircraft)

The Aviation Traders ATL-98 Carvair was a large piston-engine transport aircraft.

The Fokker 50 is a turboprop- powered airliner

The PB2Y Coronado was a large flying boat patrol bomber

The Junkers Ju 89 was a heavy bomber

The Beriev Be-200 Altair is a

multipurpose amphibious aircraft

The Junkers Ju 290 was a long-range transport, maritime patrol aircraft and heavy bomber

186 of 214

Small, fast planes. Mostly good, though the Canberra is a poor fit.

Cluster 1

cluster1, 50 models

ceiling : 11612

maxspeed : 726.4

crew : 1.6

length : 11.9

height : 3.8

emptyweight : 5303

wingspan : 13

The Adam A700 AdamJet was a proposed six-seat civil utility aircraft

The Learjet 23 is a ... twin-engine, high-speed business jet

The Learjet 24 is a ... twin-engine, high-speed business jet

The Curtiss P-36 Hawk was an American- designed and built fighter aircraft

The Kawasaki Ki-61 Hien was a Japanese World War II fighter aircraft

The Grumman F3F was the last American biplane fighter aircraft

The English Electric Canberra is a first-generation jet-powered light bomber

The Heinkel He 100 was a German pre- World War II fighter aircraft

187 of 214

Clusters, summarizing

  • Cluster 1: small, fast aircraft (750 km/h)
  • Cluster 2: big, slow aircraft (450 km/h)
  • Cluster 3: small, very fast jets (2500 km/h)
  • Cluster 4: small, very slow planes (250 km/h)
  • Cluster 5: big, fast jet planes (1150 km/h)


For a first attempt to sort through the data, this is not bad at all

https://github.com/larsga/py-snippets/tree/master/machine-learning/aircraft

188 of 214

Agglomerative clustering

  • Put all objects in a pile
  • Make a cluster of the two objects closest to one another
    • from here on, treat clusters like objects

  • Repeat second step until satisfied


There is code for this, too, in the Github sample
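
A minimal single-linkage sketch of the same idea (stopping at a target number of clusters is one possible reading of “until satisfied”):

def agglomerative(objects, distance, target):
    # start with every object in its own cluster
    clusters = [[obj] for obj in objects]
    while len(clusters) > target:
        # find the two clusters whose closest members are closest to each other
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # merge them and treat the result as a single object from now on
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters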

189 of 214

Principal component analysis


190 of 214

PCA


  • Basically, using eigenvalue analysis to find out which variables contain the most information
    • the maths are pretty involved
    • and I’ve forgotten how it works
    • and I’ve thrown out my linear algebra book
    • and ordering a new one from Amazon takes too long
    • ...so we’re going to do this intuitively

191 of 214

An example data set

  • Two variables
  • Three classes
  • What’s the longest line we could draw through the data?
  • That line is a vector in two dimensions
  • What dimension dominates?
    • that’s right: the horizontal
    • this implies the horizontal contains most of the information in the data set
  • PCA identifies the most significant variables

192 of 214

Dimensionality reduction

  • After PCA we know which dimensions matter
    • based on that information we can decide to throw out less important dimensions
  • Result
    • smaller data set
    • faster computations
    • easier to understand


193 of 214

Trying out PCA


  • Let’s try it on the Ratebeer data
  • We know ABV has the most information
    • because it’s the only value specified for each individual beer
  • We also include a new column: alcohol
    • this is the amount of alcohol in a pint glass of the beer, measured in centiliters
    • this column basically contains no information at all; it’s computed from the abv column

194 of 214

Complete code


import rblib

from numpy import *

def eigenvalues(data, columns):
    covariance = cov(data - mean(data, axis = 0), rowvar = 0)
    eigvals = linalg.eig(mat(covariance))[0]
    indices = list(argsort(eigvals))
    indices.reverse() # so we get most significant first
    return [(columns[ix], float(eigvals[ix])) for ix in indices]

(scores, parameters, columns) = rblib.load_as_matrix('ratings.txt')

for (col, ev) in eigenvalues(parameters, columns):
    print "%40s %s" % (col, float(ev))

195 of 214

Output


                                     abv 0.184770392185
                                  colour 0.13154093951
                                   sweet 0.121781685354
                                   hoppy 0.102241100597
                                    sour 0.0961537687655
                                 alcohol 0.0893502031589
                           United States 0.0677552513387
                                    ....
                                 Eisbock -3.73028421245e-18
                                 Belarus -3.73028421245e-18
                                 Vietnam -1.68514561515e-17

196 of 214

MapReduce


197 of 214

University pre-lecture, 1991


  • My first meeting with university was Open University Day, in 1991
  • Professor Bjørn Kirkerud gave the computer science talk
  • His subject
    • some day processors will stop becoming faster
    • we’re already building machines with many processors
    • what we need is a way to parallelize software
    • preferably automatically, by feeding in normal source code and getting it parallelized back
  • MapReduce is basically the state of the art on that today

198 of 214

MapReduce

  • A framework for writing massively parallel code
  • Simple, straightforward model
  • Based on “map” and “reduce” functions from functional programming (LISP)


199 of 214

http://research.google.com/archive/mapreduce.html

Appeared in:

OSDI'04: Sixth Symposium on Operating System Design and

Implementation,

San Francisco, CA, December, 2004.


200 of 214

map and reduce


>>> "1 2 3 4 5 6 7 8".split()

['1', '2', '3', '4', '5', '6', '7', '8']

>>> l = map(int, "1 2 3 4 5 6 7 8".split())

>>> l

[1, 2, 3, 4, 5, 6, 7, 8]

>>> import operator

>>> reduce(operator.add, l)
36

201 of 214

MapReduce


  1. Split data into fragments
  2. Create a Map task for each fragment
    • the task outputs a set of (key, value) pairs
  3. Group the pairs by key
  4. Call Reduce once for each key
    • all pairs with same key passed in together
    • reduce outputs new (key, value) pairs

Tasks get spread out over worker nodes
Master node keeps track of completed/failed tasks
Failed tasks are restarted
Failed nodes are detected and avoided
Also scheduling tricks to deal with slow nodes
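
The flow in steps 1-4 can be simulated on a single machine in a few lines, just to illustrate the programming model (real Hadoop adds the distribution, storage and fault handling described above):

from itertools import groupby
from operator import itemgetter

def map_reduce(fragments, map_fn, reduce_fn):
    # steps 1-2: run the map task on every fragment, collecting (key, value) pairs
    pairs = [pair for fragment in fragments for pair in map_fn(fragment)]
    # step 3: group the pairs by key
    pairs.sort(key=itemgetter(0))
    groups = groupby(pairs, key=itemgetter(0))
    # step 4: call reduce once per key, with all the values for that key
    return [out for key, group in groups
                for out in reduce_fn(key, [value for _, value in group])]

# word count expressed in this model:
def wc_map(text):
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    return [(word, sum(counts))]

print(map_reduce(["to be or", "not to be"], wc_map, wc_reduce))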

202 of 214

Communications


  • HDFS
    • Hadoop Distributed File System
    • input data, temporary results, and results are stored as files here
    • Hadoop takes care of making files available to nodes
  • Hadoop RPC
    • how Hadoop communicates between nodes
    • used for scheduling tasks, heartbeat etc
  • Most of this is in practice hidden from the developer

203 of 214

Does anyone need MapReduce?


  • I tried to do book recommendations with linear algebra
  • Basically, doing matrix multiplication to produce the full user/item matrix with blanks filled in
  • My Mac wound up freezing
  • 185,973 books x 77,805 users = 14,469,629,265 cells
    • assuming 2 bytes per float, that is about 28 GB of RAM

  • So it doesn’t necessarily take that much to have some use for MapReduce

204 of 214

The word count example


  • Classic example of using MapReduce
  • Takes an input directory of text files
  • Processes them to produce word frequency counts
  • To start up, copy data into HDFS
    • bin/hadoop dfs -mkdir <hdfs-dir>
    • bin/hadoop dfs -copyFromLocal <local-dir> <hdfs- dir>

205 of 214

WordCount – the mapper

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);
        }
    }
}

206 of 214

WordCount – the reducer


public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values)
            sum += val.get();
        context.write(key, new IntWritable(sum));
    }
}

207 of 214

The Hadoop ecosystem


  • Pig
    • dataflow language for setting up MR jobs
  • HBase
    • NoSQL database to store MR input in
  • Hive
    • SQL-like query language on top of Hadoop
  • Mahout
    • machine learning library on top of Hadoop
  • Hadoop Streaming
    • utility for writing mappers and reducers as command-line tools in other languages

208 of 214

Word count in HiveQL


CREATE TABLE input (line STRING);

LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

-- temporary table to hold words...
CREATE TABLE words (word STRING);

add file splitter.py;

INSERT OVERWRITE TABLE words
SELECT TRANSFORM(line)
USING 'python splitter.py' AS word
FROM input;

SELECT word, COUNT(*)
FROM input
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word;

209 of 214

Word count in Pig


input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);

-- Extract words from each line and put them into a pig bag

-- datatype, then flatten the bag to get one word on each row

words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word

word_groups = GROUP filtered_words BY word;

-- count the entries in each group

word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count

ordered_word_count = ORDER word_count BY count DESC;

STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';

210 of 214

Applications of MapReduce


  • Linear algebra operations
    • easily mapreducible
  • SQL queries over heterogeneous data
    • basically requires only a mapping to tables
    • relational algebra easy to do in MapReduce
  • PageRank
    • basically one big set of matrix multiplications
    • the original application of MapReduce
  • Recommendation engines
    • the SON algorithm
  • ...

211 of 214

Apache Mahout

  • Has three main application areas
    • others are welcome, but this is mainly what’s there now
  • Recommendation engines
    • several different similarity measures
    • collaborative filtering
    • Slope-one algorithm
  • Clustering
    • k-means and fuzzy k-means
    • Latent Dirichlet Allocation
  • Classification
    • stochastic gradient descent
    • Support Vector Machines
    • Naïve Bayes


212 of 214

SQL to relational algebra

select lives.person_name, city
from works, lives
where company_name = 'FBC'
  and works.person_name = lives.person_name


213 of 214

Translation to MapReduce


  • σ(company_name=‘FBC’, works)
    • map: for each record r in works, verify the condition, and pass (r, r) if it matches
    • reduce: receive (r, r) and pass it on unchanged
  • π(person_name, σ(...))
    • map: for each record r in input, produce a new record r’ with only the wanted columns, pass (r’, r’)
    • reduce: receive (r’, [r’, r’, r’ ...]), output (r’, r’)
  • ⋈(π(...), lives)
    • map:
      • for each record r in π(...), output (person_name, r)
      • for each record r in lives, output (person_name, r)
    • reduce: receive (key, [record, record, ...]), and perform the actual join
  • ...
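
The join step, sketched with the map_reduce() simulation from the earlier MapReduce slide (the record layout is an assumption: each input record is a (table, row) pair):

def join_map(record):
    table, row = record   # e.g. ('works', {'person_name': ..., 'company_name': ...})
    return [(row['person_name'], (table, row))]

def join_reduce(person_name, tagged_rows):
    works = [row for table, row in tagged_rows if table == 'works']
    lives = [row for table, row in tagged_rows if table == 'lives']
    # every works row meets every lives row with the same person_name
    return [(person_name, (w, l)) for w in works for l in lives]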

214 of 214

Lots of SQL-on-MapReduce tools


  • Tenzing     (Google)
  • Hive        (Apache Hadoop)
  • YSmart      (Ohio State)
  • SQL-MR      (AsterData)
  • HadoopDB    (Hadapt)
  • Polybase    (Microsoft)
  • RainStor    (RainStor Inc.)
  • ParAccel    (ParAccel Inc.)
  • Impala      (Cloudera)
  • ...