DATA SCIENCE USING R

VIII SEMESTER

DS-427T

UNIT-3

Department of Computer Science and Engineering, BVCOE New Delhi Subject: Data Science Using R , Instructor: Ms Rachna Narula

12/4/2024

Linear Algebra For Data Science

Linear algebra, a fundamental branch of mathematics, involves the study of vectors, matrices, and linear transformations. In data science, linear algebra provides the backbone for various techniques used to analyze and interpret data. It helps in modeling relationships, optimizing algorithms, and performing complex calculations.


For instance, data scientists use linear algebra to manipulate datasets, build machine learning models, and perform dimensionality reduction. Understanding these mathematical principles is crucial for leveraging data effectively and developing advanced analytical solutions.

Importance of Linear Algebra in Data Science

Linear algebra is the bedrock upon which many data science techniques are built. It provides the mathematical framework for understanding and manipulating data, making it essential for various tasks in the field. Let’s break down its importance:

Linear algebra offers powerful tools to represent and manipulate data efficiently, enabling tasks such as data cleaning, transformation, and feature engineering.
Many machine learning algorithms, including linear regression, support vector machines, and neural networks, rely heavily on linear algebra operations for training and prediction.
Techniques like Principal Component Analysis (PCA) leverage linear algebra to reduce the dimensionality of data while preserving essential information, improving computational efficiency and model performance.
Linear algebra concepts enable the identification and extraction of patterns and relationships within data, facilitating tasks such as clustering, classification, and anomaly detection.


Key Concepts in Linear Algebra

Vectors and Matrices: Vectors are arrays of numbers that represent data points in a multidimensional space, while matrices are rectangular arrays used to organize and manipulate data. These structures are crucial for representing and analyzing datasets in data science.

Matrix Operations: Operations such as addition and multiplication allow data scientists to perform complex calculations and adjustments on datasets. For example, matrix multiplication is used in algorithms to combine and transform data efficiently.

Eigenvalues and Eigenvectors: These concepts help in understanding the structure of data. Eigenvalues and eigenvectors are integral to techniques like PCA, where they are used to reduce dimensionality and highlight the most significant features of a dataset.
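
The three concepts above can be tried directly in base R. The following is a minimal sketch; the numbers in `v`, `X`, and `S` are made up for illustration and do not come from the slides.

```r
# A vector: one data point with three features
v <- c(2, 1, 3)

# A 2 x 3 matrix: two observations (rows), three features (columns)
X <- matrix(c(1, 2, 3,
              4, 5, 6), nrow = 2, byrow = TRUE)

# Matrix addition is element-wise; %*% is matrix multiplication
X_plus <- X + X
Xv <- X %*% v            # 2 x 1 result: each row of X dotted with v

# Eigen decomposition of a small symmetric matrix
S <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
e <- eigen(S)

print(Xv)                # 13 and 31
print(e$values)          # 3 and 1
```

Note that `*` on two matrices in R multiplies element by element; matrix multiplication needs the `%*%` operator.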


Applications of Linear Algebra in Data Science

Machine Learning Algorithms: Linear algebra is crucial in training and optimizing machine learning models. For instance, linear regression uses matrix operations to fit a model to data, adjusting coefficients to minimize error and make predictions.
Image Processing: Linear algebra techniques are used to transform and enhance images. PCA, for example, can compress an image by reducing its dimensionality while preserving its essential features.
Natural Language Processing (NLP): In NLP, linear algebra is used to represent and analyze text through embeddings. Word2Vec, for example, creates vector representations of words, enabling efficient text analysis and semantic understanding.
Data Fitting and Predictions: Linear algebra is applied in creating predictive models. Polynomial regression, which fits a polynomial function to data, utilizes matrix operations to estimate relationships and make forecasts.
Network Analysis: Linear algebra helps analyze networks and graphs, such as social networks or web links. The PageRank algorithm, used by search engines to rank pages, relies on matrix operations to assess the importance of nodes within a network.
Optimization Problems: Solving complex optimization problems often involves linear algebra. Techniques like gradient descent use matrix operations to find optimal solutions in various applications, from machine learning to resource allocation.
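
To make the linear regression application concrete, the sketch below fits a straight line with explicit matrix operations (the normal equations) and compares the result with R's built-in `lm()`. The choice of the built-in `mtcars` dataset and of the variables `mpg` and `wt` is an assumption made for illustration only.

```r
# Built-in dataset: predict fuel efficiency (mpg) from weight (wt)
y <- mtcars$mpg
X <- cbind(Intercept = 1, wt = mtcars$wt)   # design matrix with an intercept column

# Normal equations: solve (X'X) beta = X'y for the coefficients
beta <- solve(t(X) %*% X, t(X) %*% y)

# Compare with R's built-in linear model fit
fit <- lm(mpg ~ wt, data = mtcars)
print(beta)
print(coef(fit))
```

Both approaches should give the same intercept and slope. In practice `lm()` is preferable because it uses a numerically more stable QR decomposition rather than forming X'X.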


Advanced Techniques in Linear Algebra for Data Science

Singular Value Decomposition (SVD): SVD decomposes matrices into singular values and vectors, aiding in dimensionality reduction and data compression. It is commonly used in recommendation systems and image processing.
Principal Component Analysis (PCA): PCA simplifies data by reducing its dimensionality while retaining as much of its variance as possible. It’s applied in feature reduction and data visualization, making complex datasets more manageable.
Tensor Decompositions: Tensors extend matrices to higher dimensions, allowing for the analysis of multi-dimensional data. Decomposing tensors helps handle complex datasets, such as those involving time-series or multi-view data.
Conjugate Gradient Method: This method is used for solving large linear systems efficiently. It’s often applied in numerical solutions for systems arising in scientific computations and machine learning.
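
SVD and PCA are both available in base R through `svd()` and `prcomp()`. The sketch below is illustrative only: the dataset (`mtcars`) and the three columns chosen are assumptions, not part of the slides.

```r
# Three numeric columns, centered and scaled to unit variance
X <- scale(as.matrix(mtcars[, c("mpg", "hp", "wt")]))

# SVD: X = U D V'
s <- svd(X)
X_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)

# Rank-1 approximation keeps only the largest singular value (data compression)
X_rank1 <- s$d[1] * s$u[, 1] %*% t(s$v[, 1])

# PCA on the same data; share of variance captured by each component
pca <- prcomp(X)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(var_explained, 3))
```

The two decompositions are closely related: for centered data, the standard deviations reported by `prcomp()` equal the singular values divided by `sqrt(n - 1)`.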


Graphs in R


One of the main reasons data analysts turn to R is its strong graphics capabilities. R offers a rich set of built-in functions and packages for creating various types of graphs. Graphs are a powerful tool for data visualization, making complex data patterns, trends, and relationships easier to understand. With R, users can create anything from simple charts such as pie, bar, and line graphs to more sophisticated plots like scatter plots, box plots, heat maps, and histograms. R supports several graphics systems, including the base plotting system, grid graphics, and lattice graphics. The 'ggplot2' package, a part of the tidyverse, has changed the way R users create high-quality, complex plots through its layering concept, which allows a plot to be built up step by step.


Using graphs in R often begins with data cleaning and preparation, followed by choosing the type of graph, customizing the plot's aesthetics such as colors, scales, and theme, and finally rendering the plot. R's graphing capabilities are not only versatile but also highly customizable, providing control over nearly every graphical parameter. This is especially true of 'ggplot2', which offers a coherent system for describing and building graphs. Despite the learning curve, mastering graphing in R helps data scientists, statisticians, and researchers communicate their findings and insights effectively, making it a powerful tool in data science and analytics.
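
As a small illustration of these graph types, the sketch below draws a histogram, a scatter plot, and a box plot with the base plotting system, so no extra packages are needed. The `mtcars` dataset and the chosen columns are assumptions made for the example.

```r
# Histogram of fuel efficiency; hist() also returns the bin counts
h <- hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

# Scatter plot of weight against fuel efficiency
plot(mtcars$wt, mtcars$mpg,
     main = "Weight vs. mpg",
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")

# Box plot of mpg grouped by number of cylinders
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Cylinders", ylab = "Miles per gallon")
```

When this is run non-interactively (for example with `Rscript`), R writes the plots to a PDF file instead of opening a window.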


Advanced Graphs


Customization

Graphical parameters control a graph's symbols, fonts, colors, and lines. Axis and text options customize a graph's axes and add reference lines, text annotations, and a legend. Combining plots organizes multiple plots into a single graph.
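
The three kinds of customization can be sketched with base graphics as follows. The dataset, colors, and annotation text are arbitrary choices for illustration.

```r
# Combining plots: lay out 1 row and 2 columns, remembering the old setting
old_par <- par(mfrow = c(1, 2))

# Graphical parameters: symbol (pch), color (col), and size (cex)
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "steelblue", cex = 1.2,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Scatter plot")

# Axes and text: reference line, text annotation, and legend
abline(h = mean(mtcars$mpg), lty = 2, col = "red")
text(4.5, 30, "dashed line = mean mpg", cex = 0.8)
legend("topright", legend = "cars", pch = 19, col = "steelblue")

# Second panel: line type (lty) and line width (lwd)
plot(sort(mtcars$mpg), type = "l", lty = 1, lwd = 2,
     xlab = "Rank", ylab = "Miles per gallon", main = "Sorted mpg")

# Restore the previous layout
par(old_par)
```

Saving the value returned by `par()` and restoring it afterwards keeps the layout change from leaking into later plots.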


Basic Concepts of Probability in R

Probability is the measure of the likelihood that an event will occur. The probability of an event A, denoted as P(A), lies between 0 and 1, where 0 indicates impossibility and 1 indicates certainty. Some key concepts include:

Sample Space (S): The set of all possible outcomes of a random experiment.
Event: Any subset of the sample space.
Probability of an Event: The likelihood of occurrence of an event. When all outcomes are equally likely, it is calculated as the ratio of favorable outcomes to the total number of outcomes.


Calculating Probabilities in R

R offers various functions and packages for calculating probabilities and performing statistical analyses. Some commonly used functions include:

dbinom(): Computes the probability mass function (PMF) for the binomial distribution.
pnorm(): Calculates the cumulative distribution function (CDF) for the normal distribution.
dpois(): Computes the PMF for the Poisson distribution.
punif(): Calculates the CDF for the uniform distribution.
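
Each of the four functions listed above can be called directly. The parameter values below are arbitrary examples, not values taken from the slides.

```r
# P(X = 3) for X ~ Binomial(n = 10, p = 0.5)
p_binom <- dbinom(3, size = 10, prob = 0.5)

# P(Z <= 1.96) for a standard normal Z
p_norm <- pnorm(1.96, mean = 0, sd = 1)

# P(X = 2) for X ~ Poisson(lambda = 3)
p_pois <- dpois(2, lambda = 3)

# P(U <= 0.3) for U ~ Uniform(0, 1)
p_unif <- punif(0.3, min = 0, max = 1)

print(c(binomial = p_binom, normal = p_norm,
        poisson = p_pois, uniform = p_unif))
```

R follows a naming pattern here: the prefix `d` gives the PMF or density, `p` the CDF, `q` the quantile function, and `r` random draws, so `rnorm()` and `qbinom()` exist alongside the functions shown.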


A Basic Example of Calculating Probability in R

# Define the sample space
sample_space <- c(1, 2, 3, 4, 5, 6)

# Define an event, for example, rolling an even number
event <- c(2, 4, 6)

# Calculate the probability of the event
probability <- length(event) / length(sample_space)
print(probability)
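
The same probability can also be estimated by simulation. The sketch below is an illustrative addition; the seed and the number of simulated rolls are arbitrary choices.

```r
# Simulate many rolls of a fair die and measure how often the roll is even
set.seed(42)                                   # arbitrary seed for reproducibility
rolls <- sample(1:6, size = 100000, replace = TRUE)
estimate <- mean(rolls %in% c(2, 4, 6))        # proportion of even rolls
print(estimate)
```

With this many rolls the estimate should land close to the exact value of 0.5, and it gets closer as the number of rolls grows.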


Probability of a Compound Event

A compound event is an event that combines two or more simple events. Unlike a simple event, which consists of a single outcome, a compound event involves multiple outcomes and can be more complex to analyze. The simple events can either occur independently or depend on each other.
Types of Compound Events

There are two main types of compound events:

Independent Events: Two events are independent if the occurrence of one does not affect the probability of the other.
Dependent Events: Two events are dependent if the occurrence of one influences the probability of the other.


Calculate Probability of a Compound Event

To calculate the probability of a compound event, use the rule that matches the type of event, as described below.
For Independent Events

For independent events, the probability of both events occurring is the product of their individual probabilities.

P(A and B) = P(A) × P(B)


Example

Example: Suppose we roll two fair six-sided dice. What is the probability of rolling a 4 on the first die and a 5 on the second die?

Solution:

Probability of rolling a 4 on the first die, P(A):

P(A) = 1/6

Probability of rolling a 5 on the second die, P(B):

P(B) = 1/6

Probability of both events occurring together:

P(A and B) = 1/6 × 1/6 = 1/36
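
The result can be checked in R by listing all 36 equally likely outcomes of the two dice. This is a small illustrative sketch; the variable names are arbitrary.

```r
# All 36 equally likely (die1, die2) pairs
outcomes <- expand.grid(die1 = 1:6, die2 = 1:6)

p_a    <- mean(outcomes$die1 == 4)                          # P(A) = 1/6
p_b    <- mean(outcomes$die2 == 5)                          # P(B) = 1/6
p_both <- mean(outcomes$die1 == 4 & outcomes$die2 == 5)     # P(A and B) = 1/36

# The product rule for independent events and the direct count agree
print(c(product = p_a * p_b, direct = p_both))
```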


Conditional Probability in R

Conditional probability is the probability that one event occurs given that another event has occurred. If A and B are two events, the probability of Event B conditioned on the occurrence of Event A is written P(B|A), that is, the conditional probability of Event B given that Event A has already occurred.


Similarly, the probability of occurrence of Event A conditioned on the occurrence of Event B is written P(A|B), which represents the conditional probability of Event A given that Event B has already occurred.

The formula for conditional probability can be represented as

P(A|B) = P(A ∩ B) / P(B)

This is valid only when P(B) ≠ 0, i.e. when event B is not an impossible event.

Similarly,

P(B|A) = P(A ∩ B) / P(A)

This is valid only when P(A) ≠ 0, i.e. when event A is not an impossible event.
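
A conditional probability is the probability of the intersection divided by the probability of the conditioning event. The sketch below applies this to a single roll of a fair die; the two events chosen are assumptions made for illustration.

```r
S <- 1:6              # sample space of one die roll
A <- c(2, 4, 6)       # event A: even number
B <- c(5, 6)          # event B: number greater than 4

# Probability of an event when all outcomes are equally likely
p <- function(event) length(event) / length(S)

p_A_and_B   <- p(intersect(A, B))     # only the outcome 6, so 1/6
p_A_given_B <- p_A_and_B / p(B)       # (1/6) / (2/6) = 1/2
p_B_given_A <- p_A_and_B / p(A)       # (1/6) / (3/6) = 1/3

print(c(p_A_given_B, p_B_given_A))
```

The two results differ, which shows that P(A|B) and P(B|A) are not interchangeable.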


[Figure: Venn diagram representation of conditional probability]


Exhaustive Events

Exhaustive Events are a set of events where at least one of the events must occur while performing an experiment. Exhaustive events are a set of events whose union makes up the complete sample space of the experiment.

The following slides cover the definition of exhaustive events, collectively exhaustive events, and examples of exhaustive events.


Exhaustive events, in the context of probability, refer to a set of events that collectively cover all possible outcomes of an experiment or situation. In other words, when we say that a set of events is exhaustive, it means that one of those events must occur. There are no other possible outcomes left.

For example, consider flipping a fair coin. The possible outcomes are heads (H) or tails (T). In this case, "getting heads" and "getting tails" are exhaustive events because they cover all possible outcomes when you flip the coin. The sample space, in this case, is {H, T}, and the events "getting heads" and "getting tails" are exhaustive since there are no other possible outcomes.


Mathematically, if E1, E2, ..., En are exhaustive events, their union (E1 ∪ E2 ∪ ... ∪ En) equals the entire sample space (S).

Definition of Exhaustive Events

Exhaustive events in probability refer to a collection of events that together cover all possible outcomes of an experiment or situation.
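
Whether a set of events is exhaustive can be checked in R by comparing their union with the sample space. The events `E1`, `E2`, and `E3` below are hypothetical examples for a single die roll.

```r
S  <- 1:6           # sample space of one die roll
E1 <- c(1, 2)
E2 <- c(3, 4)
E3 <- c(5, 6)

# Union of all three events, then compare with the sample space
covered <- Reduce(union, list(E1, E2, E3))
is_exhaustive <- setequal(covered, S)                # TRUE: every outcome is covered

# Dropping E3 leaves the outcomes 5 and 6 uncovered
is_exhaustive_without_E3 <- setequal(union(E1, E2), S)   # FALSE

print(c(is_exhaustive, is_exhaustive_without_E3))
```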


Events in Probability

In probability theory, events refer to outcomes or occurrences that can happen as a result of an experiment or observation. Consider the simple example of rolling a six-sided die. The possible outcomes when rolling the die are the numbers 1, 2, 3, 4, 5, and 6.

Consider two events:

Event A: Rolling an even number.
Event B: Rolling a number greater than 4.

For Event A, the possible outcomes are 2, 4, and 6. So, if you roll the die and get any of these numbers, you have experienced Event A.

For Event B, the possible outcomes are 5 and 6. If you roll the die and get either 5 or 6, you have experienced Event B.
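
Events A and B can be represented in R as subsets of the sample space. The sketch below is illustrative; the observed roll of 6 is an arbitrary choice.

```r
S <- 1:6                  # sample space
A <- S[S %% 2 == 0]       # Event A: even numbers (2, 4, 6)
B <- S[S > 4]             # Event B: numbers greater than 4 (5, 6)

roll <- 6                 # an example roll
in_A <- roll %in% A       # TRUE: the roll belongs to Event A
in_B <- roll %in% B       # TRUE: the roll belongs to Event B

both   <- intersect(A, B) # outcomes in both events: 6
either <- union(A, B)     # outcomes in at least one event: 2, 4, 6, 5

print(both)
print(either)
```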


Collectively Exhaustive Events

Collectively exhaustive events, in probability theory, refer to a set of events that together cover all possible outcomes, so at least one of them must occur. The events are allowed to overlap; when they also do not overlap, so that no outcome belongs to more than one event, they are both mutually exclusive and exhaustive.

Consider flipping a fair coin. The possible outcomes are either heads (H) or tails (T). In this case, the events "getting heads" and "getting tails" are collectively exhaustive because one of these outcomes must happen when the coin is flipped. There are no other possible outcomes besides heads or tails.


Mutually Exclusive and Exhaustive Events


Bayes’ Theorem

Bayes’ Theorem is used to determine the conditional probability of an event. It is used to find the probability of an event, based on prior knowledge of conditions that might be related to that event.

Bayes’ Theorem and Conditional Probability
Bayes’ theorem (also known as Bayes’ Rule or Bayes’ Law) is used to determine the conditional probability of event A when event B has already occurred.

The general statement of Bayes’ theorem is “The conditional probability of an event A, given the occurrence of another event B, is equal to the product of the probability of B given A and the probability of A, divided by the probability of event B”, provided P(B) ≠ 0, i.e.

P(A|B) = P(B|A)P(A) / P(B)


where,

P(A) and P(B) are the probabilities of events A and B
P(A|B) is the probability of event A given that event B has occurred
P(B|A) is the probability of event B given that event A has occurred

For example, suppose there are three bags, each containing some white and black marbles, and a white marble has been drawn at random. Bayes’ Theorem gives the probability that this marble came from the first bag.
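
The marble example can be worked through in R once numbers are fixed. The slides do not give the bag contents, so the counts below (2, 3, and 4 white marbles out of 5) and the equal chance of picking each bag are assumptions made purely for illustration.

```r
# Assumed: each bag is equally likely to be chosen
prior <- c(bag1 = 1/3, bag2 = 1/3, bag3 = 1/3)

# Assumed contents: P(white | bag) for bags holding 2, 3, and 4 white marbles out of 5
p_white_given_bag <- c(bag1 = 2/5, bag2 = 3/5, bag3 = 4/5)

# P(white) by the law of total probability
p_white <- sum(p_white_given_bag * prior)

# Bayes' theorem: P(bag | white) = P(white | bag) * P(bag) / P(white)
posterior <- p_white_given_bag * prior / p_white
print(posterior)       # bag1 = 2/9, bag2 = 3/9, bag3 = 4/9
```

Under these assumed numbers, seeing a white marble lowers the probability of the first bag from 1/3 to 2/9, because that bag holds the fewest white marbles.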


Random Variable

A random variable is a fundamental concept in statistics that bridges the gap between theoretical probability and real-world data. A random variable is a function that assigns a real value to each outcome in the sample space of a random experiment. For example, if you roll a die, you can assign a number to each possible outcome.

There are two basic types of random variables:

Discrete Random Variables (which take on a countable set of distinct values).
Continuous Random Variables (which can assume any value within a given range).


A random variable is defined as a function that maps from the sample space of an experiment to the real numbers. Mathematically, a random variable is expressed as

X: S → R

where,

X is the Random Variable (usually denoted by a capital letter)

S is Sample Space

R is Set of Real Numbers


Example 1

If two unbiased coins are tossed, find the random variable associated with that event.
Solution:

Suppose two unbiased coins are tossed.
X = number of heads. [X is a random variable, that is, a function on the sample space]

Here, the sample space is S = {HH, HT, TH, TT}, so X(HH) = 2, X(HT) = X(TH) = 1, and X(TT) = 0.
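
The mapping from outcomes to the number of heads can be written out in R. This is a minimal sketch of the example, assuming the four outcomes are equally likely because the coins are unbiased.

```r
S <- c("HH", "HT", "TH", "TT")     # sample space for two coin tosses

# X maps each outcome to its number of heads
X <- sapply(strsplit(S, ""), function(tosses) sum(tosses == "H"))
names(X) <- S
print(X)                            # HH = 2, HT = 1, TH = 1, TT = 0

# Probability of each value of X when all four outcomes are equally likely
pmf <- table(X) / length(S)
print(pmf)                          # P(X=0) = 0.25, P(X=1) = 0.50, P(X=2) = 0.25
```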


Types of Random Variables

Random variables are of two types:

Discrete Random Variable
Continuous Random Variable


Discrete Random Variable

A Discrete Random Variable takes on a finite or countably infinite number of values. The probability function associated with it is called the probability mass function (PMF).
PMF (Probability Mass Function)

If X is a discrete random variable and its PMF is pi = P(X = xi), then

0 ≤ pi ≤ 1
∑pi = 1, where the sum is taken over all possible values xi of X
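
Both PMF properties can be verified numerically. The sketch below uses the number of heads in two fair coin tosses, a Binomial(2, 0.5) variable, as an assumed example distribution.

```r
x <- 0:2                                   # possible values of X
p <- dbinom(x, size = 2, prob = 0.5)       # PMF values: 0.25, 0.50, 0.25

in_range <- all(p >= 0 & p <= 1)           # every probability lies in [0, 1]
total    <- sum(p)                         # probabilities sum to 1

print(p)
print(c(in_range = in_range, total = total))
```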


Continuous Random Variable

A Continuous Random Variable takes on an uncountably infinite number of values, such as any value within an interval. The probability function associated with it is called the probability density function (PDF).
PDF (Probability Density Function)

If X is a continuous random variable with P(x < X < x + dx) = f(x)dx, then

f(x) ≥ 0 for all x (a density value may exceed 1; only probabilities are bounded by 1)
∫ f(x) dx = 1 over all values of x

Then f(x) is said to be the PDF of the distribution.
Continuous Random Variables Example

Find the value of P(1 < X < 2), given that

f(x) = kx³ for 0 ≤ x ≤ 3, and f(x) = 0 otherwise,

where f(x) is a density function.
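
One way to work this example is numerically with `integrate()`. The sketch below reads the density as k times x cubed on the interval from 0 to 3 and zero elsewhere, which is an interpretation of the slide. It first finds k from the requirement that the density integrates to 1, then integrates the density from 1 to 2.

```r
# Shape of the density without the constant k
g <- function(x) x^3

# The density must integrate to 1 over [0, 3], so k = 1 / (integral of x^3) = 4/81
k <- 1 / integrate(g, lower = 0, upper = 3)$value

# Full density: k * x^3 on [0, 3], zero elsewhere
f <- function(x) ifelse(x >= 0 & x <= 3, k * x^3, 0)

# P(1 < X < 2) is the area under f between 1 and 2, which is 15/81 = 5/27
prob <- integrate(f, lower = 1, upper = 2)$value

print(c(k = k, probability = prob))
```

The same answer follows by hand: the integral of x³ from 0 to 3 is 81/4, so k = 4/81, and the integral of kx³ from 1 to 2 is (4/81)(16 − 1)/4 = 5/27.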


THANKS….
