Miss Sadhana S Kekan
Information Technology
Modern College of Engineering, Shivajinagar, Pune
Email: sadhanakekan@moderncoe.edu.in
DATA SCIENCE AND BIG DATA ANALYTICS
Books:
Unit Objectives
INTRODUCTION: DATA SCIENCE AND BIG DATA
Unit outcomes:
Outcome Mapping: PEO: I, V; PEO: c, e; CO: 1, 2; PSO: 3, 4
INTRODUCTION: DATA SCIENCE AND BIG DATA
- from where the information can be taken
- what it signifies
- how it can be converted into a useful resource in the creation of business & IT strategies.
INTRODUCTION: DATA SCIENCE AND BIG DATA
- reduce costs
- raise efficiency
- identify new market opportunities
- enhance the organization's competitive advantage.
INTRODUCTION: DATA SCIENCE AND BIG DATA
Fig: Data Science (the intersection of hacker mindset, statistics, advanced computing, visualization, math, domain expertise, data engineering, and the scientific method)
INTRODUCTION: DATA SCIENCE AND BIG DATA
Data Scientists
Fig: The Data Science Pipeline
Data and its structure
Structured data:
- highly organized data
- exists within a repository such as a database (or a comma- separated values [CSV] file).
- easily accessible.
- the format of the data makes it appropriate for queries and computation, using languages such as Structured Query Language (SQL).
Data and its structure
Fig: Models of data
Data engineering
Data wrangling:
- sourcing the data from one or more data sets (in addition to reducing the set to the required data),
- normalizing the data so that data merged from multiple data sets is consistent.
- parsing data into some structure or storage for further use.
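A minimal sketch of these three wrangling steps using pandas (the file names and columns here are invented for illustration, not taken from the slides):

import pandas as pd

# Source: pull only the required columns from two data sets.
orders = pd.read_csv("orders.csv", usecols=["customer_id", "amount", "date"])
customers = pd.read_csv("customers.csv", usecols=["customer_id", "region"])

# Normalize: make the merged data consistent (one date format, one case).
orders["date"] = pd.to_datetime(orders["date"])
customers["region"] = customers["region"].str.strip().str.lower()

# Parse/merge into one structure for further use.
merged = orders.merge(customers, on="customer_id", how="inner")
merged.to_csv("wrangled.csv", index=False)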
Data cleansing
Data preparation/preprocessing
Machine learning
Model learning
Machine learning approaches:
Supervised learning:
- the algorithm is trained to produce the correct class, and the model is altered when it fails to do so.
- the model is trained until it reaches some level of accuracy.
Unsupervised learning:
- has no class labels; instead, it inspects the data and groups it based on some structure that is hidden within the data.
- these types of algorithms can be used in recommendation systems, by grouping customers based on their viewing or purchasing history.
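A hedged sketch of both approaches with scikit-learn (the iris data and the particular models are illustrative assumptions, not from the slides):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: trained on labelled classes, corrected when it fails.
clf = DecisionTreeClassifier().fit(X, y)
print("training accuracy:", clf.score(X, y))

# Unsupervised: no class labels; groups the data by hidden structure.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [list(km.labels_).count(i) for i in range(3)])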
Machine learning approaches
Reinforcement learning
- is a semi-supervised learning algorithm.
- provides a reward after the model makes some number of decisions that lead to a satisfactory result.
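To make the reward idea concrete, here is a toy sketch (a multi-armed bandit, the simplest reward-driven setting; the payout numbers are invented):

import random

true_payout = [0.3, 0.5, 0.8]   # hidden reward probability per action
value = [0.0, 0.0, 0.0]         # learned value estimate per action
counts = [0, 0, 0]

for step in range(10000):
    # epsilon-greedy: mostly exploit the best-looking action, sometimes explore
    if random.random() < 0.1:
        a = random.randrange(3)
    else:
        a = value.index(max(value))
    reward = 1 if random.random() < true_payout[a] else 0
    counts[a] += 1
    value[a] += (reward - value[a]) / counts[a]   # incremental average

print([round(v, 2) for v in value])   # approaches [0.3, 0.5, 0.8]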
Model validation
Operations:
Model deployment:
Fig: Prediction system. Input: historical financial data, e.g., sales and revenue. Output: a classification of whether a company is a reasonable acquisition target.
Model visualization:
Summary: Definitions of Data Science
INTRODUCTION: DATA SCIENCE AND BIG DATA
Introduction to Big Data
- organizing it
- determining what we can do with it.
Big Data
Is a new data challenge that requires leveraging existing systems differently.
Is classified in terms of 4 Vs: Volume, Variety, Velocity, Veracity.
Is usually unstructured and qualitative in nature.
Real-world Examples of Big Data
Evolution of Big Data
Big Data
Structuring Big Data
Fig: Concepts of Big Data (parallel processing, data science, artificial intelligence, data mining, distributed systems, data storage, analysis)
Big Data
Fig: Types of Data (structured data + semi-structured data + unstructured data = Big Data)
Big Data
Elements of Big Data
Velocity
e.g., Amazon, Facebook, Yahoo, Google, sensor data, mobile networks
Variety
Veracity
Data Explosion
Business model transformation
Globalization
Personalization of services
Big Data Processing Architectures
Fig: Components of Big Data architecture
Big Data Processing Architectures
Data sources.
Data storage.
Big Data Processing Architectures
Batch processing.
Real-time message ingestion.
Stream processing.
After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink.
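A minimal stream-processing sketch in Python (the message format and window size are assumptions for illustration):

from collections import defaultdict

def process(stream, window=60):
    # Tumbling-window aggregation: sum readings per sensor per window.
    sums = defaultdict(float)
    for ts, sensor, value in stream:
        if value < 0:                          # filtering: drop bad readings
            continue
        sums[(ts // window, sensor)] += value  # aggregating
    return sums                                # stand-in for the output sink

events = [(0, "s1", 1.5), (30, "s1", 2.5), (61, "s1", 4.0), (10, "s2", -9.0)]
print(dict(process(events)))   # {(0, 's1'): 4.0, (1, 's1'): 4.0}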
Big Data Processing Architectures
Analytical data store.
Many big data solutions prepare data for analysis and then serve the processed data in a structured format that can be queried using analytical tools.
Analysis and reporting.
The goal of most big data solutions is to provide insights into the data through analysis and reporting.
Orchestration:
INTRODUCTION: DATA SCIENCE AND BIG DATA
Data processing infrastructure challenges
Storage
Data processing infrastructure challenges
Transportation
Data processing infrastructure challenges
Processing
1. CPU or processor.
2. Memory
3. Software
Data processing infrastructure challenges
CPU or processor:
- computing speed and processing power have increased, leading to more processing capabilities
- access to wider memory
- architecture evolution within the software layers
Memory.
Data processing infrastructure challenges
Software
Data processing infrastructure challenges
Speed or throughput
Big Data Processing Architectures
Centralized Processing Architecture
Advantages:
Big Data Processing Architectures
Distributed Processing Architecture
Data and its processing are distributed across geographies or data centers
Types:
1. Client–Server Architecture
Client: collection and presentation
Server: processing and management
2. Three-Tier Architecture
Client, Server, Middle tier
Middle Tier: processing logic
3. n-Tier Architecture
clients, middleware, applications, and servers are isolated into tiers.
Any tier can be scaled independently
Big Data Processing Architectures
4. Cluster architecture.
5. Peer-to-peer architecture.
Big Data Processing Architectures
Distributed processing advantages :
– Scalability of systems and resources can be achieved based on isolated needs.
– Processing and management of information can be architected based on desired unit of operation.
– Parallel processing of data reducing time latencies.
Distributed processing Disadvantages:
– Data redundancy
– Process redundancy
– Resource overhead
– Volumes
Big Data Processing Architectures
Lambda Architecture
The batch layer feeds into a serving layer that indexes the batch view for efficient querying.
The speed layer updates the serving layer with incremental updates based on the most recent data.
Big Data Processing Architectures
Lambda Architecture
Batch Layer (Cold Path)
Stores all incoming data and performs batch processing
Manages all historical data
Recomputes results, e.g., by retraining a machine learning model
Results come at high latency due to computational cost
Data can only be appended, not updated or deleted
Data is stored in in-memory databases or in long-term persistent stores such as NoSQL storage
Uses MapReduce
Speed Layer
Provides low-latency results
Data is processed in real time
Incremental algorithms
Datasets can be created and deleted
Big Data Processing Architectures
Lambda Architecture
Serving Layer:
The user fires a query
Applications:
Ad-hoc queries
Netflix, Twitter, Yahoo
Pros:
The batch layer manages historical data, so errors are low when the system crashes
Good speed, reliability
Fault tolerance and scalable processing
Cons:
Caching overhead, complexity, duplicate computation
Difficult to migrate or reorganize
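A sketch of the core Lambda idea in a few lines of Python (the view contents are invented; a real serving layer would sit on indexed stores):

batch_view = {"page_a": 10400, "page_b": 7300}  # recomputed from all history
speed_view = {"page_a": 12, "page_c": 3}        # recent events only

def query(key):
    # Serving layer: merge the batch view with the incremental speed view.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("page_a"))   # 10412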
Kappa Architecture
Big Data Processing Architectures
Big Data Processing Architectures
Big Data Processing Architectures
Kappa Architecture
Big Data Processing Architectures
Zeta architecture
Big Data Processing Architectures
Zeta architecture diagram
Big Data Processing Architectures
The Traditional Research Approach
Fig: The traditional research approach: multiple sources, each behind a wrapper, feed an integration system (with metadata) that serves clients.
Big Data Processing Architectures
Data Warehouse
The Warehousing Approach
Fig: The warehousing approach: sources feed extractor/monitor components into an integration system (with metadata), which loads the data warehouse queried by clients.
The Warehousing Approach
Definition: A single, complete and consistent store of data obtained from a variety of different sources, made available to end users in a way they can understand and use in a business context. [Barry Devlin]
Characteristics of Data Warehouse
Need of Data Warehouses
Data Warehouse
A subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process is called a data warehouse.
Subject-oriented
The DWH is organized around the major subjects of the enterprise (e.g., customer, product, sales) rather than application areas (customer invoicing, stock control, product sales)
Integrated: data coming from enterprise-wide applications arrives in different formats
Time-Variant
Data in the DWH is associated with particular time intervals, so the DWH behaves differently at different times
Non-Volatile
New data is always appended to the existing data rather than replacing it
Merits of a Data Warehouse
1. Supports data conversion into a common and standard format
2. No discrepancies
3. Saves time and money
Saves users' time, since the data is in one place
DWH execution does not require IT support or a high number of channels
4. Tracks historical, intelligent data
Provides updates about changing trends
5. Generates high revenue
Advantages of Warehousing Approach
Which are our lowest/highest margin customers?
Who are my customers and what products are they buying?
Which customers are most likely to go to the competition?
What impact will new products/services have on revenue and margins?
What product promotions have the biggest impact on revenue?
What is the most effective distribution channel?
Comparison Chart of Database Types
Data warehouse | Operational system |
Subject oriented | Transaction oriented |
Large (hundreds of GB up to several TB) | Small (MB up to several GB) |
Historic data | Current data |
De-normalized table structure (few tables, many columns per table) | Normalized table structure (many tables, few columns per table) |
Batch updates | Continuous updates |
Usually very complex queries | Simple to complex queries |
Comparison Chart of Database Types
Data warehouse | Big Data |
Extracts data from a variety of SQL-based data sources (relational databases) and helps generate analytic reports. | Handles huge data coming from various heterogeneous sources, including social media. |
Mainly handles structured data. | Can handle structured, unstructured and semi-structured data. |
Analytics works on specific, curated information. | Holds a lot of data, so analytics extracts useful information from the raw data. |
Does not use a distributed file system. | Uses a distributed file system. |
Never erases previous data when new data is added. | Also never erases previous data when new data is added, but sometimes real-time data streams are processed. |
Simultaneous fetches take longer. | Simultaneous fetches are fast using the Hadoop Distributed File System. |
Reengineering the Data Warehouse
Enterprise data warehouse platform
There are several layers of infrastructure that make the platform for the EDW:
1. The hardware platform:
● Database server:
– Processor
– Memory
– BUS architecture
● Storage server:
– Class of disk
– Controller
● Network
2. Operating system
3. Application software:
● Database
● Utilities
Data distribution in a data warehouse
Operational data store
Reengineering the Data Warehouse
Choices for reengineering the data warehouse
Replatforming
Reengineering the Data Warehouse
Benefits:
● Moves the data warehouse to a scalable and reliable platform.
● The underlying infrastructure and the associated application software layers can be architected to provide security, lower maintenance, and increased reliability.
● Optimizes the application and database code.
● Provides some additional opportunities to use new functionality.
● Makes it possible to rearchitect things in a different/better way, which is almost impossible to do in an existing setup.
Reengineering the Data Warehouse
Disadvantages:
● Takes a long cycle time to complete, leading to disruption of business activities.
● Replatforming often means reverse engineering complex business processes and rules that may be undocumented or custom developed in the current platform.
● May not be feasible for certain aspects of data processing or there may be complex calculations that need to be rewritten if they cannot be directly supported by the functionality of the new platform.
● Replatforming is not economical in environments that have large legacy platforms, as it consumes too many business process cycles to reverse engineer logic and document it.
Data warehouse platform.
Platform engineering
● Reduce the cost of the data warehouse.
● Increase efficiencies of processing.
● Simplify the complexities in the acquisition, processing, and delivery of data.
● Reduce redundancies.
Platform engineering
Platform engineering
Data engineering
The data model is often scrubbed, and new additions are made to it.
Typical changes include:
Platform engineering
Architectures
Shared-everything architecture
Symmetric multiprocessing (SMP) | Distributed shared memory (DSM) |
Processors share a single pool of memory for concurrent, uniform read–write access without latency. | Addresses the scalability problem by providing multiple pools of memory for processors to use. |
Referred to as uniform memory access (UMA) architecture. | Referred to as non-uniform memory access (NUMA) architecture. |
The drawback: multiple processors share a single system bus, which chokes the bandwidth for simultaneous memory access, so the scalability of such a system is very limited. | Latency to access memory depends on the relative distances of the processors and their dedicated memory pools. |
Shared-everything architecture
Fig: Shared-everything architecture
Shared-everything architecture
Fig: Shared-nothing architecture
Advantages of Shared-nothing architecture
Disadvantages of Shared-nothing architecture
What is big data?
"Big Data is any thing which is crash Excel."
"Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM."
Or, in other words, Big Data is data in volumes too great to process by traditional methods.
https://twitter.com/devops_borat
Data accumulation
– ...
From WWW to VVV
The promise of Big Data
“quadrupling the average cow's milk production since your parents were born”
"When Freddie [as he is known] had no daughter records our equations predicted from his DNA that he would be the best bull," USDA research geneticist Paul VanRaden emailed me with a detectable hint of pride. "Now he is the best progeny tested bull (as predicted)."
Some more examples
https://delicious.com/larsbot/big-data
Ok, ok, but ... does it apply to our customers?
How to extract insight from data?
Monthly Retail Sales in New South Wales (NSW) Retail Department Stores
Types of algorithms
Basically, it’s all maths...
https://twitter.com/devops_borat
"Only 10% in devops are know how of work with Big Data."
"Only 1% are realize they are need 2 Big Data for fault tolerance."
Big data skills gap
http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap
http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap
Two orthogonal aspects
Data science?
How to process Big Data?
https://twitter.com/devops_borat
"Mining of Big Data is problem solve in 2013 with zgrep."
MapReduce
NoSQL and Big Data
The 4th V: Veracity
https://twitter.com/devops_borat
“The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.”
Daniel Boorstin, in The Discoverers (1983)
"95% of time, when is clean Big Data is get Little Data."
Data quality
Approaches to learning
Issues
Underfitting
Overfitting
“What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.”
Testing
Missing values
– ...
Terminology
Top 10 algorithms
Top 10 machine learning algs
1. C4.5
2. k-means clustering
3. Support vector machines
4. the Apriori algorithm
5. the EM algorithm
6. PageRank
7. AdaBoost
8. k-nearest neighbours (kNN)
9. Naive Bayes
10. CART
From a survey at IEEE International Conference on Data Mining (ICDM) in December 2006. “Top 10 algorithms in data mining”, by X. Wu et al
C4.5
Support Vector Machines
– using a transformation to a higher dimension to handle more complex class boundaries (see the sketch below)
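A small scikit-learn sketch of that idea (the data set is an assumption; the RBF kernel performs the implicit higher-dimensional transformation):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)   # struggles on concentric circles
rbf = SVC(kernel="rbf").fit(X, y)         # kernel trick: implicit transformation

print("linear kernel:", linear.score(X, y))
print("RBF kernel:", rbf.score(X, y))     # near 1.0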
Apriori
– finds items that frequently appear together (see the sketch below)
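A pure-Python sketch of the Apriori pruning idea (the baskets are invented):

from itertools import combinations
from collections import Counter

baskets = [{"beer", "chips"}, {"beer", "chips", "salsa"},
           {"beer", "diapers"}, {"chips", "salsa"}]
min_support = 2

# Pass 1: frequent single items.
counts = Counter(item for b in baskets for item in b)
frequent = {item for item, n in counts.items() if n >= min_support}

# Pass 2: only pairs of frequent items can themselves be frequent.
pairs = Counter(p for b in baskets
                for p in combinations(sorted(b & frequent), 2))
print([p for p, n in pairs.items() if n >= min_support])
# [('beer', 'chips'), ('chips', 'salsa')]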
Expectation Maximization
– alternates an "expectation" step with a "maximization" step (see the sketch below)
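A hedged sketch using scikit-learn's GaussianMixture, which runs EM internally (the data is synthetic):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(0, 1, 300),
                       rng.normal(5, 1, 300)]).reshape(-1, 1)

# E-step: soft-assign points to components; M-step: re-estimate parameters.
gm = GaussianMixture(n_components=2, random_state=0).fit(data)
print(gm.means_.ravel())   # close to [0, 5] (order may vary)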
PageRank
– the probability of the random surfer being on a given node at a given time (see the sketch below)
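A minimal power-iteration sketch on a tiny, made-up link graph:

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
nodes = list(links)
rank = {n: 1.0 / len(nodes) for n in nodes}
d = 0.85   # damping: probability the surfer follows a link

for _ in range(50):
    new = {n: (1 - d) / len(nodes) for n in nodes}
    for n, outs in links.items():
        for m in outs:
            new[m] += d * rank[n] / len(outs)   # spread rank along links
    rank = new

print({n: round(rank[n], 3) for n in nodes})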
AdaBoost
Naïve Bayes
Bayes’s Theorem
– "naïve" because it assumes the features are independent of each other
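For reference, the theorem and the naïve-independence form behind the classifier can be written as:

P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

P(C \mid F_1, \dots, F_n) \propto P(C) \prod_{i=1}^{n} P(F_i \mid C)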
Simple example
>>> compute_bayes([0.92, 0.84])
0.9837067209775967
Ways I’ve used Bayes
Bayes against spam
Running the script
Code
# scan spam
for spam in glob.glob(spamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(spam):
        corpus.spam(token)

# scan ham
for ham in glob.glob(hamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(ham):
        corpus.ham(token)

# compute probability
for email in sys.argv[3 : ]:
    print email
    p = classify(email)
    if p < 0.2:
        print '  Spam', p
    else:
        print '  Ham', p
https://github.com/larsga/py-snippets/tree/master/machine-learning/spam
Classify
class Feature:
    def __init__(self, token):
        self._token = token
        self._spam = 0
        self._ham = 0

    def spam(self):
        self._spam += 1

    def ham(self):
        self._ham += 1

    def spam_probability(self):
        return (self._spam + PADDING) / float(self._spam + self._ham + (PADDING * 2))

def compute_bayes(probs):
    product = reduce(operator.mul, probs)
    lastpart = reduce(operator.mul, map(lambda x: 1-x, probs))
    if product + lastpart == 0:
        return 0 # happens rarely, but happens
    else:
        return product / (product + lastpart)

def classify(email):
    return compute_bayes([corpus.spam_probability(f) for f in featurize(email)])
Ham output
Ham 1.0 | |
Received:2013 | 0.00342935528121 |
Date:2013 | 0.00624219725343 |
<br | 0.0291715285881 |
background-color: | 0.03125 |
background-color: | 0.03125 |
background-color: | 0.03125 |
background-color: | 0.03125 |
background-color: | 0.03125 |
Received:Mar | 0.0332667997339 |
Date:Mar | 0.0362756952842 |
... | |
Postboks | 0.998107494322 |
Postboks | 0.998107494322 |
Postboks | 0.998107494322 |
+47 | 0.99787414966 |
+47 | 0.99787414966 |
+47 | 0.99787414966 |
+47 | 0.99787414966 |
Lars | 0.996863237139 |
Lars | 0.996863237139 |
23 | 0.995381062356 |
So, clearly most of the spam is from March 2013...
Spam output
Spam 2.92798502037e-16 | |
Received:-0400 | 0.0115646258503 |
Received:-0400 | 0.0115646258503 |
Received-SPF:(ontopia.virtual.vps-host.net: | 0.0135823429542 |
Received-SPF:receiver=ontopia.virtual.vps-host.net; | 0.0135823429542 |
Received:<larsga@ontopia.net>; | 0.0139318885449 |
Received:<larsga@ontopia.net>; | 0.0139318885449 |
Received:ontopia.virtual.vps-host.net | 0.0170863309353 |
Received:(8.13.1/8.13.1) | 0.0170863309353 |
Received:ontopia.virtual.vps-host.net | 0.0170863309353 |
Received:(8.13.1/8.13.1) | 0.0170863309353 |
... | |
Received:2012 | 0.986111111111 |
Received:2012 | 0.986111111111 |
$ | 0.983193277311 |
Received:Oct | 0.968152866242 |
Received:Oct | 0.968152866242 |
Date:2012 | 0.959459459459 |
20 | 0.938864628821 |
+ | 0.936526946108 |
+ | 0.936526946108 |
+ | 0.936526946108 |
...and the ham from October 2012
More solid testing
Linear regression
– ...
Estimating real estate prices
– ...
Our data set: beer ratings
Example
ABV | .se | .nl | .us | .uk | IIPA | Black IPA | Pale ale | Bitter | Rating |
8.5 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.5 |
8.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 3.7 |
6.2 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.2 |
4.4 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3.2 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Basically, we turn each category into a column of 0.0 or 1.0 values.
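A tiny sketch of that encoding (hypothetical rows, matching the country columns in the table above):

rows = [{"abv": 8.5, "country": ".se"}, {"abv": 6.2, "country": ".us"}]
countries = [".se", ".nl", ".us", ".uk"]

def encode(row):
    # numeric feature first, then one 0.0/1.0 column per category value
    return [row["abv"]] + [1.0 if row["country"] == c else 0.0
                           for c in countries]

print(encode(rows[0]))   # [8.5, 1.0, 0.0, 0.0, 0.0]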
Normalization
– compute with min(15.0, abv) / 15.0
Adding more data
Making predictions
– a * 8.5 + b * 1.0 + c * 0.0 + d * 0.0 + ... = 3.5
solve the equation
Matrix formulation
Enter Numpy
Quick Numpy example
>>> from numpy import *
>>> range(10)
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> [range(10)] * 10
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]]
>>> m = mat([range(10)] * 10)
>>> m
matrix([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
>>> m.T
matrix([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
        [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
        [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
        [5, 5, 5, 5, 5, 5, 5, 5, 5, 5],
        [6, 6, 6, 6, 6, 6, 6, 6, 6, 6],
        [7, 7, 7, 7, 7, 7, 7, 7, 7, 7],
        [8, 8, 8, 8, 8, 8, 8, 8, 8, 8],
        [9, 9, 9, 9, 9, 9, 9, 9, 9, 9]])
Numpy solution
x_mat = mat(parameters)
y_mat = mat(scores).T
x_tx = x_mat.T * x_mat
assert linalg.det(x_tx)
ws = x_tx.I * (x_mat.T * y_mat)
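In equation form, this computes the ordinary least-squares solution (the assert guards against a singular matrix):

\hat{w} = (X^{\mathsf T} X)^{-1} X^{\mathsf T} y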
Does it work?
Beyond prediction
– that is, which aspects of a beer best predict the rating
Aspect | LMG | grove |
ABV | 0.56 | 1.1 |
colour | 0.46 | 0.42 |
sweetness | 0.25 | 0.51 |
hoppiness | 0.45 | 0.41 |
sourness | 0.29 | 0.87 |
Did we underfit?
Scatter plot
Fig: Scatter plot of rating (y-axis) against ABV in % (x-axis); the outliers are freeze-distilled Brewdog beers. Code in Github, requires matplotlib.
Trying again
Matrix factorization
Clustering
– objects that are similar go into the same group
Sample data
Distance
– ...
k-means clustering
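A minimal pure-Python k-means sketch before the aircraft results (1-D points for brevity; the speed values are invented):

import random

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                          # assignment step
            j = min(range(k), key=lambda c: abs(p - centroids[c]))
            clusters[j].append(p)
        for j, c in enumerate(clusters):          # update step: move centroids
            if c:
                centroids[j] = sum(c) / len(c)
    return centroids

speeds = [250.0, 260.0, 270.0, 1100.0, 1150.0, 2400.0, 2450.0]
print(kmeans(speeds, 3))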
First attempt at aircraft
3 jet bombers, one propeller bomber. Not too bad.
Cluster 5
cluster5, 4 models
ceiling : 13400.0
maxspeed : 1149.7
crew : 7.5
length : 47.275
height : 11.65
emptyweight : 69357.5
wingspan : 47.18
The Myasishchev M-50 was a Soviet prototype four-engine supersonic bomber which never attained service
The Tupolev Tu-16 was a twin-engine jet bomber used by the Soviet Union.
The Myasishchev M-4 Molot is a four-engined strategic bomber
The Convair B-36 "Peacemaker” was a strategic bomber built by Convair and operated solely by the United States Air Force (USAF) from 1949 to 1959
Small, slow propeller aircraft. Not too bad.
Cluster 4
cluster4, 56 models
ceiling : 5898.2
maxspeed : 259.8
crew : 2.2
length : 10.0
height : 3.3
emptyweight : 2202.5
wingspan : 13.8
The Avia B.135 was a Czechoslovak cantilever monoplane fighter aircraft
The North American B-25 Mitchell was an American twin-engined medium bomber
The Yakovlev UT-1 was a single-seater trainer aircraft
The Yakovlev UT-2 was a single-seater trainer aircraft
The Siebel Fh 104 Hallore was a small German twin-engined transport, communications and liaison aircraft
The Messerschmitt Bf 108 Taifun was a German single-engine sports and touring
aircraft
The Airco DH.2 was a single-seat biplane "pusher" aircraft
Small, very fast jet planes. Pretty good.
Cluster 3
cluster3, 12 models
ceiling : 16921.1
maxspeed : 2456.9
crew : 2.67
length : 17.2
height : 4.92
emptyweight : 9941
wingspan : 10.1
The Mikoyan MiG-29 is a fourth- generation jet fighter aircraft
The Vought F-8 Crusader was a single-engine, supersonic [fighter] aircraft
The English Electric Lightning is a supersonic jet fighter aircraft of the Cold War era, noted for its great speed.
The Dassault Mirage 5 is a supersonic attack aircraft
The Northrop T-38 Talon is a two- seat, twin-engine supersonic jet trainer
The Mikoyan MiG-35 is a further development of the MiG-29
Biggish, kind of slow planes. Some oddballs in this group.
Cluster 2
cluster2, 27 models
ceiling : 6447.5
maxspeed : 435
crew : 5.4
length : 24.4
height : 6.7
emptyweight : 16894
wingspan : 32.8
The Bartini Beriev VVA-14 (vertical
take-off amphibious aircraft)
The Aviation Traders ATL-98 Carvair was a large piston-engine transport aircraft.
The Fokker 50 is a turboprop- powered airliner
The PB2Y Coronado was a large flying boat patrol bomber
The Junkers Ju 89 was a heavy bomber
The Beriev Be-200 Altair is a
multipurpose amphibious aircraft
The Junkers Ju 290 was a long-range transport, maritime patrol aircraft and heavy bomber
Small, fast planes. Mostly good, though the Canberra is a poor fit.
Cluster 1
cluster1, 50 models
ceiling : 11612
maxspeed : 726.4
crew : 1.6
length : 11.9
height : 3.8
emptyweight : 5303
wingspan : 13
The Adam A700 AdamJet was a proposed six-seat civil utility aircraft
The Learjet 23 is a ... twin-engine, high-speed business jet
The Learjet 24 is a ... twin-engine, high-speed business jet
The Curtiss P-36 Hawk was an American- designed and built fighter aircraft
The Kawasaki Ki-61 Hien was a Japanese World War II fighter aircraft
The Grumman F3F was the last American biplane fighter aircraft
The English Electric Canberra is a first-generation jet-powered light bomber
The Heinkel He 100 was a German pre- World War II fighter aircraft
Clusters, summarizing
For a first attempt to sort through the data, this is not bad at all
https://github.com/larsga/py-snippets/tree/master/machine-learning/aircraft
Agglomerative clustering
– from here on, treat clusters like objects
There is code for this, too, in the Github sample
Principal component analysis
PCA
An example data set
variables
Dimensionality reduction
Trying out PCA
beer, measured in centiliters
Complete code
import rblib
from numpy import *
def eigenvalues(data, columns):
    covariance = cov(data - mean(data, axis = 0), rowvar = 0)
    eigvals = linalg.eig(mat(covariance))[0]
    indices = list(argsort(eigvals))
    indices.reverse() # so we get most significant first
    return [(columns[ix], float(eigvals[ix])) for ix in indices]

(scores, parameters, columns) = rblib.load_as_matrix('ratings.txt')
for (col, ev) in eigenvalues(parameters, columns):
    print "%40s %s" % (col, float(ev))
Output
abv            0.184770392185
colour         0.13154093951
sweet          0.121781685354
hoppy          0.102241100597
sour           0.0961537687655
alcohol        0.0893502031589
United States  0.0677552513387
....
Eisbock        -3.73028421245e-18
Belarus        -3.73028421245e-18
Vietnam        -1.68514561515e-17
MapReduce
University pre-lecture, 1991
MapReduce
http://research.google.com/archive/mapreduce.html
Appeared in: OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
map and reduce
>>> "1 2 3 4 5 6 7 8".split()
['1', '2', '3', '4', '5', '6', '7', '8']
>>> l = map(int, "1 2 3 4 5 6 7 8".split())
>>> l
[1, 2, 3, 4, 5, 6, 7, 8]
>>> import operator
>>> reduce(operator.add, l)
36
MapReduce
Tasks get spread out over worker nodes
Master node keeps track of completed/failed tasks
Failed tasks are restarted
Failed nodes are detected and avoided
Also scheduling tricks to deal with slow nodes
Communications
Does anyone need MapReduce?
– assuming 2 bytes per float = 28 GB of RAM
The word count example
WordCount – the mapper
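A sketch of the mapper, modeled on the standard Hadoop WordCount example (an assumption here; only the reducer below survives in the text):

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // emit (word, 1) for every token in the input line
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}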
WordCount – the reducer
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values)
            sum += val.get();
        context.write(key, new IntWritable(sum));
    }
}
The Hadoop ecosystem
command-line tools in other languages
Word count in HiveQL
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;

-- temporary table to hold words...
CREATE TABLE words (word STRING);

add file splitter.py;

INSERT OVERWRITE TABLE words
SELECT TRANSFORM(line)
USING 'python splitter.py' AS word
FROM input;

SELECT word, COUNT(*)
FROM input
LATERAL VIEW explode(split(line, ' ')) lTable AS word
GROUP BY word;
Word count in Pig
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
Applications of MapReduce
Apache Mahout
SQL to relational algebra
select lives.person_name, city
from works, lives
where company_name = 'FBC'
  and works.person_name = lives.person_name
Translation to MapReduce
– ... and pass (r, r) if it matches
– the actual join happens in the reduce step (see the sketch below)
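A hedged Python sketch of that translation (relation contents are invented; the map step tags each record with the join key, the reduce step performs the join):

from collections import defaultdict

works = [("joe", "FBC"), ("ann", "FBC"), ("bob", "IBM")]
lives = [("joe", "oslo"), ("ann", "pune"), ("bob", "york")]

# Map: select works rows matching company_name = 'FBC', key everything
# by person_name.
mapped = [(p, ("works",)) for (p, c) in works if c == "FBC"] + \
         [(p, ("lives", city)) for (p, city) in lives]

# Shuffle: group by key.
groups = defaultdict(list)
for key, val in mapped:
    groups[key].append(val)

# Reduce: the actual join.
result = [(p, v[1]) for p, vals in groups.items()
          for v in vals if v[0] == "lives" and ("works",) in vals]
print(result)   # [('joe', 'oslo'), ('ann', 'pune')]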
Lots of SQL-on-MapReduce tools
Apache Hadoop, Ohio State, AsterData, Microsoft, RainStor Inc., ParAccel Inc., Cloudera