
Introduction to Data Science

By

S.V.V.D.Jagadeesh

Sr. Assistant Professor

Dept of Artificial Intelligence & Data Science

LAKIREDDY BALI REDDY COLLEGE OF ENGINEERING


Saturday, December 21, 2024

Previous Class Discussions

  • Session Outcomes
  • The Data Science Process
  • Setting the Research Goal
  • Retrieving Data
  • Data Preparation
  • Data Exploration
  • Data Modeling or Model Building
  • Presentation and Automation
  • An Iterative Process

Session Outcomes

At the end of this session, students will be able to:

  • CO1: Understand the big data eco-system and data science process. (Understand - L2)


Big Data Eco-System

  • Currently, many big data tools and frameworks exist, and it's easy to get lost because new technologies appear rapidly.
  • It's much easier once you realize that the big data ecosystem can be grouped into technologies that have similar goals and functionalities:

1. Distributed File System

2. Distributed Programming Framework

3. Data Integration Framework

4. Machine Learning Frameworks

5. NoSQL Databases

6. Scheduling Tools


7. Benchmarking Tools

8. System Deployment

9. Service Programming

10. Security


Distributed File Systems

  • A distributed file system is similar to a normal file system, except that it runs on multiple servers at once.
  • Because it’s a file system, you can do almost all the same things you’d do on a normal file system.
  • Actions such as storing, reading, and deleting files and adding security to files are at the core of every file system, including the distributed one.


  • Distributed file systems have significant advantages:

■ They can store files larger than any single computer's disk.

■ Files are automatically replicated across multiple servers for redundancy or parallel operations, while the complexity of doing so is hidden from the user.

■ The system scales easily: you're no longer bound by the memory or storage restrictions of a single server.
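As a toy illustration of these ideas (pure Python on one machine; no real file system's API is used, and every name here is made up), block storage with replication can be sketched as:

```python
# Toy sketch: split a file into fixed-size blocks and replicate each block
# on several "servers" (plain dicts). Illustrative only -- real distributed
# file systems handle all of this transparently.

BLOCK_SIZE = 4          # bytes per block (tiny, for demonstration)
REPLICATION = 2         # copies kept of every block

servers = [{} for _ in range(3)]  # three empty "servers"

def put(filename, data):
    """Split data into blocks and store each block on REPLICATION servers."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for n, block in enumerate(blocks):
        # Spread replicas over the servers round-robin style.
        for r in range(REPLICATION):
            servers[(n + r) % len(servers)][(filename, n)] = block
    return len(blocks)

def get(filename, num_blocks):
    """Reassemble a file by fetching each block from any server holding it."""
    out = []
    for n in range(num_blocks):
        for s in servers:
            if (filename, n) in s:
                out.append(s[(filename, n)])
                break
    return b"".join(out)

n = put("report.txt", b"hello distributed world")
# get("report.txt", n) reassembles the original bytes, even if one
# server's copy of a block is lost.
```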

Distributed Programming Framework

  • Once you have the data stored on the distributed file system, you want to exploit it.
  • One important aspect of working on a distributed hard disk is that you won't move your data to your program; rather, you'll move your program to the data.
  • When you start from scratch with a normal general-purpose programming language such as C, Python, or Java, you need to deal with the complexities that come with distributed programming, such as restarting jobs that have failed, tracking the results from the different subprocesses, and so on.


  • Luckily, the open source community has developed many frameworks to handle this for you, and these give you a much better experience working with distributed data and dealing with many of the challenges it carries.

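A single-machine toy sketch of the map-reduce style such frameworks implement (illustrative only; a real framework would run the map phase on the servers that hold each partition):

```python
from collections import defaultdict

# Each "partition" stands in for data living on a different server.
partitions = [
    "big data tools",
    "big data frameworks",
    "data science",
]

def map_phase(partition):
    """Runs where the data lives: emit (word, 1) pairs for one partition."""
    return [(word, 1) for word in partition.split()]

def reduce_phase(pairs):
    """Combine the intermediate pairs into final word counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

intermediate = []
for p in partitions:            # a real framework does this in parallel
    intermediate.extend(map_phase(p))

word_counts = reduce_phase(intermediate)
# word_counts == {'big': 2, 'data': 3, 'tools': 1, 'frameworks': 1, 'science': 1}
```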

Data Integration Framework

  • Once you have a distributed file system in place, you need to add data.
  • You need to move data from one source to another, and this is where data integration frameworks such as Apache Sqoop and Apache Flume excel.
  • The process is similar to an extract, transform, and load (ETL) process in a traditional data warehouse.
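The extract, transform, and load flow can be sketched in miniature (pure Python with made-up records; real integration tools move data between actual systems):

```python
# Miniature ETL: extract rows from a "source", transform them, load them
# into a "target". Both stores are plain Python objects for illustration.

source = [
    {"name": "alice", "sales": "1200"},
    {"name": "bob",   "sales": "950"},
]
target = []  # stands in for a table on the distributed file system

def extract(rows):
    """Pull raw records from the source system."""
    return list(rows)

def transform(rows):
    """Clean and convert: proper-case names, sales as integers."""
    return [{"name": r["name"].title(), "sales": int(r["sales"])} for r in rows]

def load(rows, dest):
    """Append the cleaned records to the target store."""
    dest.extend(rows)

load(transform(extract(source)), target)
# target == [{'name': 'Alice', 'sales': 1200}, {'name': 'Bob', 'sales': 950}]
```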


Machine Learning Frameworks

  • When you have the data in place, it's time to extract the coveted insights.
  • This is where you rely on the fields of machine learning, statistics, and applied mathematics.
  • Once a single computer could do all the counting and calculations, a world of opportunities opened.
  • Ever since this breakthrough, people only need to derive the mathematical formulas, write them in an algorithm, and load their data.
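To illustrate the formula-to-algorithm point, here is ordinary least squares for a straight line y = a + b·x, computed directly from its closed-form formulas (pure Python; the data points are made up):

```python
# Ordinary least squares for a line, written straight from the formulas:
#   b = cov(x, y) / var(x),   a = mean(y) - b * mean(x)

def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

# "Load their data": points lying exactly on y = 2x + 1.
a, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
# a == 1.0 and b == 2.0
```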


  • Libraries available for machine learning:

■ PyBrain for neural networks—Neural networks are learning algorithms that mimic the human brain in learning mechanics and complexity. Neural networks are often regarded as advanced, black-box techniques.

■ NLTK, or Natural Language Toolkit—As the name suggests, its focus is working with natural language. It's an extensive library that comes bundled with a number of text corpora to help you model your own data.


■ Pylearn2—Another machine learning toolbox, but a bit less mature than Scikit-learn.

■ TensorFlow—A Python library for deep learning provided by Google.


NoSQL Databases

  • If you need to store huge amounts of data, you require software that's specialized in managing and querying this data.
  • Traditionally this has been the playing field of relational databases such as Oracle SQL, MySQL, Sybase IQ, and others.
  • While they're still the go-to technology for many use cases, new types of databases have emerged under the grouping of NoSQL databases.

Types of NoSQL Databases

  • Column databases—Data is stored in columns rather than rows, which allows some algorithms to perform much faster queries. Newer technologies use cell-wise storage. Table-like structures are still important.
  • Document stores—Document stores no longer use tables but store every observation in a document, which allows for a much more flexible data scheme.
  • Streaming data—Data is collected, transformed, and aggregated not in batches but in real time. Although it's categorized here as a database type to help with tool selection, it's more a particular type of problem that drove the creation of technologies such as Storm.

16 of 23

  • Key-value stores—Data isn't stored in a table; rather, you assign a key to every value, such as org.marketing.sales.2015: 20000. This scales well but places almost all the implementation effort on the developer.
  • SQL on Hadoop—Batch queries on Hadoop are written in a SQL-like language that uses the MapReduce framework in the background.
  • NewSQL—This class combines the scalability of NoSQL databases with the advantages of relational databases. They all have a SQL interface and a relational data model.
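In code, a key-value store behaves much like a dictionary whose structure lives entirely in the key names, as the example key above suggests (toy sketch; a real key-value store persists and distributes the data):

```python
# Toy key-value store: the "schema" lives entirely in the key names, so
# interpreting them is up to the developer.

store = {}
store["org.marketing.sales.2015"] = 20000
store["org.marketing.sales.2016"] = 25000
store["org.engineering.headcount.2015"] = 40

# Lookup by exact key is cheap; anything fancier (here, summing all
# marketing sales) means scanning and parsing the keys yourself.
marketing_sales = sum(
    v for k, v in store.items() if k.startswith("org.marketing.sales.")
)
# marketing_sales == 45000
```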


  • Graph databases—Not every problem is best stored in a table. Some problems translate more naturally into graph theory and are stored in graph databases. A classic example of this is a social network.
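A social network maps naturally onto a graph: people are nodes, friendships are edges. A minimal adjacency-set sketch (illustrative only; the names and the friend-suggestion query are made up):

```python
# Undirected friendship graph stored as adjacency sets.
friends = {
    "ann":   {"bob", "carol"},
    "bob":   {"ann", "dave"},
    "carol": {"ann"},
    "dave":  {"bob"},
}

def friends_of_friends(person):
    """People two hops away: friends of friends, minus direct friends and self."""
    result = set()
    for f in friends[person]:
        result |= friends[f]
    return result - friends[person] - {person}

# "Who might Ann know?" -- awkward in a table, natural in a graph.
suggestions = friends_of_friends("ann")
# suggestions == {'dave'}
```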


Scheduling Tools

  • Scheduling tools help you automate repetitive tasks and trigger jobs based on events such as adding a new file to a folder.
  • They are similar to tools such as cron on Linux but are developed specifically for big data.
  • You can use them, for instance, to start a MapReduce task whenever a new dataset is available in a directory.
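The trigger-on-new-file idea can be sketched as a polling loop (standard library only; `/data/incoming` and `start_mapreduce` are made-up placeholders, and real schedulers handle this far more robustly):

```python
import os

# Toy event trigger: compare directory listings between two polls and
# "launch" a job for every file that newly appeared. Real schedulers add
# retries, logging, and safeguards against partially written files.

def new_files(previous, current):
    """Files present now but absent on the last poll."""
    return sorted(set(current) - set(previous))

def poll_once(directory, seen, launch_job, listdir=os.listdir):
    """One polling pass: fire launch_job for each not-yet-seen file."""
    current = set(listdir(directory))
    for filename in new_files(seen, current):
        launch_job(filename)      # e.g. kick off a MapReduce task
    return current                # becomes `seen` for the next pass

# Usage sketch (loop forever, sleeping between passes):
# seen = set()
# while True:
#     seen = poll_once("/data/incoming", seen, start_mapreduce)
#     time.sleep(30)
```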


Benchmarking Tools

  • This class of tools was developed to optimize your big data installation by providing standardized profiling suites.
  • A profiling suite is taken from a representative set of big data jobs.
  • Benchmarking and optimizing the big data infrastructure and configuration aren't often jobs for data scientists themselves but for professionals specialized in setting up IT infrastructure; thus they aren't covered in this book.
  • Using an optimized infrastructure can make a big cost difference: if you can gain 10% on a cluster of 100 servers, for example, you save the cost of 10 servers.
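The cost example is simple arithmetic (the per-server cost below is a hypothetical figure, not from the source):

```python
# The slide's example as arithmetic: a 10% gain on a 100-server cluster.
servers = 100
efficiency_gain = 0.10
servers_saved = int(servers * efficiency_gain)   # 10 servers

# At a hypothetical $5,000 per server per year, that's $50,000 saved yearly.
yearly_savings = servers_saved * 5000
```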


System Deployment

  • Setting up a big data infrastructure isn't an easy task, and assisting engineers in deploying new applications into the big data cluster is where system deployment tools shine.
  • They largely automate the installation and configuration of big data components.
  • This isn't a core task of a data scientist.


Service Programming

  • Suppose you've made a world-class soccer prediction application on Hadoop, and you want to allow others to use its predictions.
  • However, you have no idea of the architecture or technology of everyone keen on using your predictions.
  • Service tools excel here by exposing big data applications to other applications as a service.
  • Data scientists sometimes need to expose their models through services.
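The service idea can be sketched as a request handler that speaks only JSON (toy example; the `predict` rule inside is invented, and a real service would sit behind an HTTP server):

```python
import json

# A trivial "model": predicted goals = 2 * shots on target (a made-up rule,
# standing in for whatever your Hadoop application computed).
def predict(shots_on_target):
    return 2 * shots_on_target

def handle_request(body):
    """Take a JSON request string, return a JSON response string.
    Consumers only need to speak JSON -- they never see Hadoop."""
    request = json.loads(body)
    goals = predict(request["shots_on_target"])
    return json.dumps({"predicted_goals": goals})

response = handle_request('{"shots_on_target": 4}')
# response == '{"predicted_goals": 8}'
```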


Security

  • You probably need fine-grained control over access to data, but you don't want to manage this on an application-by-application basis.
  • Big data security tools allow you central and fine-grained control over access to the data.
  • Big data security has become a topic in its own right, and data scientists are usually only confronted with it as data consumers; seldom will they implement the security themselves.


Summary

  • Session Outcomes
  • Big Data Eco-System
  • Distributed File System
  • Distributed Programming Framework
  • Data Integration Framework
  • Machine Learning Framework
  • NoSQL Databases
  • Scheduling Tools
  • Benchmarking Tools
  • System Deployment
  • Service Programming
  • Security
