1 of 73

Data Science and GoNum

Juan E. Vargas

https://bit.ly/4glzCMx

2 of 73

Data Science in Go and Beyond

Topics in this session include

  1. Deeper look at GoNum
  2. GoNum Mat and Stat packages
  3. How to use GoNum Mat and Stat for Data Science
  4. Deeper look at regression, from the perspective of linear algebra and statistical analysis

3 of 73

GoNum

By now we know a little about GoNum, which is the “Go version” of Python’s numpy.

Gonum is a set of packages designed for writing and running numerical and scientific algorithms at scale. Gonum contains libraries for

  1. linear algebra (matrices, vectors),
  2. statistics,
  3. probability distributions,
  4. sampling,
  5. tools for function differentiation, integration, and optimization,
  6. tools for network creation and analysis; and more.

Perhaps the best part about GoNum is that it is open source. The unrestricted access to the sources provides rich information, examples, etc., for those who aspire to write world-class software by learning from the best of the best. https://www.gonum.org/

4 of 73

GoNB: a modern Go kernel for Jupyter notebooks

Some data scientists who work in Python write their code in Jupyter Notebooks.

Can we also write Go from a Jupyter Notebook?

Go is a compiled language, but compilation is so fast that a REPL-like (Read-Eval-Print Loop) workflow is practical.

GoNB is a tool that leverages Go’s compilation speed to implement a Jupyter notebook kernel.

There is a small difference in semantics with respect to the Python kernels because Go (and hence GoNB) is compiled, not interpreted. For example, no LHS “independent values” can exist.

A good idea is to read through the tutorial.

We will do parts of the tutorial in the demo

5 of 73

Jupyter Notebooks

IPython was originally developed by Fernando Pérez, a Colombian-born physicist, software developer, and free software advocate. Fernando adapted IPython to create Jupyter Notebooks.

https://jupyter.org/

Project Jupyter is a project to develop open-source software, open standards, and services for interactive computing across multiple programming languages.

Project Jupyter's name is a reference to the three core programming languages supported by Jupyter, which are Julia, Python and R.

Its name and logo are an homage to Galileo's discovery of the moons of Jupiter, as documented in notebooks attributed to Galileo.

6 of 73

Installing GoNB

There is a live version of GoNB in Google's Colab (which I could not make work) that in principle can be used to interact with it.

Installing the Docker version of GoNB

sudo docker pull janpfeifer/gonb_jupyterlab:latest

sudo docker run -it --rm -p 8888:8888 -v "${PWD}":/notebooks/host janpfeifer/gonb_jupyterlab:latest

Installing GoNB (in Linux):

go install github.com/janpfeifer/gonb@latest && \

go install golang.org/x/tools/cmd/goimports@latest && \

go install golang.org/x/tools/gopls@latest

gonb --install

7 of 73

Using GoNB from VSC

  1. cd $GOPATH
  2. gonb --install
  3. Open VSC
  4. Create/read file.ipynb
  5. Select jupyter kernel
  6. Select gonb
  7. GoNB_tutorial.ipynb

DEMO

8 of 73

Matrix and Vector Operations

Visit Wikipedia for an intro to the basic ideas of vectors and matrices

https://en.wikipedia.org/wiki/Matrix_(mathematics)

Including

  1. Definition, notation, size, determinant, trace, row vector, column vector, square matrix, Identity matrix, basic operations (addition, scalar multiplication, subtraction, matrix multiplication), submatrix, linear equations (Ax = b),
  2. Linear transformations: inverse, transpose, SVD, … pseudo-inverse matrix
  3. Eigenvalues and eigenvectors

9 of 73

GoNum Matrix and Vector Operations

The mat package provides a set of types and ops for working with Linear Algebra, including:

1. Interfaces for Matrix classes (Matrix, Symmetric, Triangular)

2. Concrete implementations (Dense, SymDense, TriDense, VecDense)

3. Methods and functions for using matrix data (Add, Trace, SymRankOne)

4. Types for constructing and using matrix factorizations (QR, LU, etc.)

5. Complementary types for complex matrices, CMatrix, CSymDense, etc.

All matrix types can perform these operations and implement the Matrix interface. For example:

func (m *Dense) Mul(a, b Matrix)

where a and b may be any matrix type.
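As a quick illustration (not from the slides), here is a minimal Go sketch of creating two Dense matrices and multiplying them with the Mul method shown above; the values are arbitrary:

package main

import (
	"fmt"

	"gonum.org/v1/gonum/mat"
)

func main() {
	a := mat.NewDense(2, 3, []float64{1, 2, 3, 4, 5, 6})
	b := mat.NewDense(3, 2, []float64{7, 8, 9, 10, 11, 12})

	var c mat.Dense
	c.Mul(a, b) // c = a * b; the empty receiver is resized to hold the result

	fmt.Printf("c = %v\n", mat.Formatted(&c))
}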

10 of 73

GoNum Matrix and Vector Operations

The input arguments to most functions and methods are interfaces rather than concrete types, e.g. `func Trace(Matrix)` rather than `func Trace(*Dense)`, allowing flexible use of internal and external Matrix types.

When a matrix is the destination or receiver for a function or method, the operation will panic if the matrix is not the correct size. An exception is when the destination is empty; it is then resized to fit the result.

Matrix Interface

The Matrix interface is the common link between the concrete types of real matrices. It is defined by three methods: Dims, which returns the dimensions of the Matrix; At, which returns the element at the specified indices; and T, which returns a transpose of the Matrix.

All matrix types implement these three methods and therefore satisfy the basic Matrix interface. Once you get it, you will understand why this is better than traditional OOP inheritance.
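To make the interface idea concrete, here is a small hedged sketch (not from the slides): a helper that accepts any mat.Matrix, so it works equally well on a *Dense and on the transpose view returned by T(); the function name sumDiag is made up for illustration:

package main

import (
	"fmt"

	"gonum.org/v1/gonum/mat"
)

// sumDiag works with any type that satisfies mat.Matrix, which is why
// functions accept the interface rather than a concrete type like *Dense.
func sumDiag(m mat.Matrix) float64 {
	r, c := m.Dims()
	n := r
	if c < n {
		n = c
	}
	s := 0.0
	for i := 0; i < n; i++ {
		s += m.At(i, i)
	}
	return s
}

func main() {
	a := mat.NewDense(2, 2, []float64{1, 2, 3, 4})
	fmt.Println(sumDiag(a))     // works on *Dense
	fmt.Println(sumDiag(a.T())) // works on the transpose view returned by T()
}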

11 of 73

GoNum Matrix and Vector Operations

GoNum demo

GoNum demo on GoNB

(LearnGoNumNB)

The list of types, functions, etc., is very comprehensive, so it does not make sense to enumerate them here.

The best way to learn how to use GoNum, numpy, and for that matter LAPACK and BLAS, is to read the docs and use the libraries.

mkcog demo

/drv3

Basic functions, obj creation, basic LinAlg Ops

Show how GoNB works under VSC

Show how to use GoNum mat lib for matrix mul, transposing, inverse

12 of 73

Review: Linear Regression & Pseudo Inverse Matrix

Data is typically organized as a set of P parameters and a set of N records. Hence, data is represented as a matrix having N rows and P columns, or X[N,P]. The values in the cells may be numerical (continuous or discrete) or categorical (binary or multiple categories).

13 of 73

Yes, we will use linear algebra

14 of 73

Yes, Linear Algebra is a must.

Yes, we will use linear algebra

15 of 73

The set of observed data typically is configured as

{ (x1, y1), (x2, y2), . . . , (xn, yn) }

where each row in X[N,P] is a row vector xi of length p, and Y is a column vector of length n.

Typical Problem Setup

16 of 73

Given a set of p predictors X = (X1, X2, …, Xp), we would like to predict the value of a response Y.

We assume the existence of a function f such that

Y = f(X) + e

where e is a term inserted to account for error.

In principle, f is a formulation that captures the explicit, systematic nature of the data.

Problem Setup

17 of 73

The figure shows income vs. years of education.

The figure suggests that a function fitting the data might exist.

Problem Setup

18 of 73

We need to find the best estimate of f.

Y = f(x) + e

Given that Y depends on both f(X) and e, we must also account for e.

The accuracy of our predictions depends on the reducible error and the irreducible error.

Reducible errors are inaccuracies that we can try to reduce.

The other element, known as the irreducible error, comes from the error term e and cannot be reduced.

Problem Setup

19 of 73

Let f̂ be an estimate of f and Ŷ = f̂(X) the corresponding prediction. The expected squared prediction error decomposes as

E[(Y − Ŷ)²] = [f(X) − f̂(X)]² + Var(e)

where the first term is the reducible error and Var(e) is the irreducible error.

The focus of statistical learning is to estimate the best possible f, i.e., the function that minimizes the reducible error.

Problem Setup

20 of 73

To be successful, we must determine:

  • Which predictors are associated with the outcome?
  • Which predictors are NOT associated with the outcome?
  • Are some predictors more important than others? If so, how can we measure their importance?
  • What type of models can we use to answer these questions? Parametric models (linear, i.e., first degree, or higher degrees)? Non-parametric models?
  • Prediction accuracy vs. model interpretability
  • Methods: classification vs. regression; supervised learning or unsupervised learning?

21 of 73

LinReg is a relatively simple algorithm for supervised learning.

LinReg is useful because it illustrates the premises, assumptions, and working practices of ML.

LinReg is also useful because it shows the connection between statistical learning and linear algebra.

Therefore a solid understanding of LinReg and its methods provides a strong foundation for the general ML concepts, processes, and steps involved in more complex ML algorithms.

Why Linear Regression?

22 of 73

The book “An Introduction to Statistical Learning” (ISL book) IMHO is one of the best sources of information for anyone who wishes to learn Data Science and Machine Learning.

The ISL BOOK

https://www.statlearning.com/

The book explains how to set up data, how to choose algorithms, how to apply the chosen algorithms to the data in order to determine good estimates of f, and how to use f to perform predictions on previously unseen data sets.

We will use the “Advertising” data set from the ISL book, which includes 200 data records collected to assess how different advertising media (TV, Radio, Newspaper) influence sales.

Three input variables and one output variable, or { X1, X2, X3, Y }.

23 of 73

Note: (a) the difference in the X scales (300, 50, 100) and (b) the different inclinations of the blue lines.

24 of 73

In the simple linear regression case, the model f is a function of a single explanatory variable X. Therefore p = 1 and

Y = b0 + b1*X + e

We could estimate

Sales | TV, Sales | Radio, or Sales | Newspaper

25 of 73

The coefficients b0 and b1 are chosen by minimizing the residual sum of squares, or RSS = Σ (yi − ŷi)², where ŷi = b0 + b1*xi.

26 of 73

27 of 73

Code in GoLang
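The original slides show this code as images; as a stand-in, here is a minimal hedged sketch of simple linear regression with Gonum's stat.LinearRegression, using a few illustrative (x, y) pairs rather than the full Advertising data:

package main

import (
	"fmt"

	"gonum.org/v1/gonum/stat"
)

func main() {
	// Illustrative TV budgets (x) and sales (y); the real Advertising
	// data set from the ISL book has 200 records.
	x := []float64{230.1, 44.5, 17.2, 151.5, 180.8}
	y := []float64{22.1, 10.4, 9.3, 18.5, 12.9}

	// alpha is b0 (intercept), beta is b1 (slope); no weights, intercept included.
	alpha, beta := stat.LinearRegression(x, y, nil, false)

	// RSS = sum of squared residuals for the fitted line.
	rss := 0.0
	for i := range x {
		r := y[i] - (alpha + beta*x[i])
		rss += r * r
	}
	fmt.Printf("b0 = %.4f  b1 = %.4f  RSS = %.4f\n", alpha, beta, rss)
}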

28 of 73

29 of 73

30 of 73

31 of 73

In the simple LinReg case, f is just a function of a single explanatory variable X. Therefore p = 1 and

Y = b0 + b1*X + e

and using that setup we can estimate

Sales | TV, Sales | Radio, or Sales | Newspaper

Could we examine any combination of X parameters, including

p( Y | X1, X2 ), p( Y | X2, X3 ), p( Y | X1, X3 ), p( Y | X1, X2, X3 )?

32 of 73

In the multivariate linear regression case, f is a function of the set of p explanatory variables

X = [ x1, x2, . . . , xp ]

Multi-Variable Linear Regression

Now we have an expression, Y = X*B + e, in which X[n,p] is a matrix of n rows and p columns, Y is now a column vector of size n, and Beta is a vector of size p.

33 of 73

Using gradient descent we can find the parameters of the model (the B's).

Alternatively, the solution in matrix form (the normal equations) is

B = (Xᵀ X)⁻¹ Xᵀ Y

which can also be written out element by element in summation form.

By expressing the problem in terms of linear algebra, we can compute the B's using the linear algebra libraries in Python's numpy, Go's Gonum, R, Julia, etc.

34 of 73

The solution involves using linear algebra to obtain a “pseudo-inverse” or Moore-Penrose matrix (MPM) X⁺ = (Xᵀ X)⁻¹ Xᵀ. With X of size (n, p+1) (including the column of ones for the intercept), X⁺ is a (p+1, n) matrix, and B = X⁺ Y.

GoNum Mat to the rescue
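The slide's own demo is not reproduced here; as a stand-in, here is a hedged sketch of computing the betas through the pseudo-inverse with Gonum's mat package. olsBetas and the small design matrix are made up for illustration:

package main

import (
	"fmt"

	"gonum.org/v1/gonum/mat"
)

// olsBetas computes B = (X^T X)^{-1} X^T y, i.e. it applies the
// Moore-Penrose pseudo-inverse of X to y. X must already contain
// the leading column of ones for the intercept.
func olsBetas(x *mat.Dense, y *mat.VecDense) *mat.VecDense {
	var xtx mat.Dense
	xtx.Mul(x.T(), x)

	var xtxInv mat.Dense
	if err := xtxInv.Inverse(&xtx); err != nil {
		panic(err) // X^T X is singular
	}

	var pinv mat.Dense
	pinv.Mul(&xtxInv, x.T()) // pseudo-inverse X+, shape (p+1) x n

	var beta mat.VecDense
	beta.MulVec(&pinv, y)
	return &beta
}

func main() {
	// Tiny illustrative design matrix: intercept column plus one predictor.
	x := mat.NewDense(4, 2, []float64{
		1, 1,
		1, 2,
		1, 3,
		1, 4,
	})
	y := mat.NewVecDense(4, []float64{6, 5, 7, 10})

	beta := olsBetas(x, y)
	fmt.Printf("betas = %v\n", mat.Formatted(beta))
}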

35 of 73

GoNum Mat to the rescue

36 of 73

DEMO

37 of 73

Interpreting Results

The equation relating the variables is Sales = b0 + b1*TV + b2*Radio + b3*Newspaper + e

We must test which of the X predictors are significant in predicting the dependent variable Y.

Our “null hypothesis” H0 is that a given coefficient bj equals zero, i.e., that Xj has no relationship with Y (formalized after the list below).

The p-value, or Pr(>|t|), tells us whether there is evidence of a relationship between Y and each X predictor.

A common threshold for the P-value is 0.05. It means that 5% of the time we will falsely reject the null hypothesis, i.e., we might falsely conclude that there is a relationship.

  • A small P-value (< 0.05) means that the coefficient is likely not to equal zero.
  • A large P-value (> 0.05) means that we cannot conclude that the explanatory variable affects the dependent variable Y.
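For reference, the usual t-test behind these p-values can be written as follows (a standard formulation, with p counting the predictors excluding the intercept):

H_0:\ \beta_j = 0 \qquad H_a:\ \beta_j \neq 0, \qquad
t_j = \frac{\hat{\beta}_j}{\mathrm{SE}(\hat{\beta}_j)}, \qquad
\text{p-value} = 2\,\Pr\!\left(\,|T_{n-p-1}| > |t_j|\,\right)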

38 of 73

Interpreting Results

The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect).

A low p-value (< 0.05) indicates that we can reject the null hypothesis. In other words, a predictor with a low p-value is likely to be a meaningful addition to the model, because changes in the predictor's value are related to changes in the response variable.

Table 3.4 suggests that the Newspaper coefficient is not a meaningful addition to the model, because its p-value is high (0.8599).

39 of 73

Interpreting Results

RSS (Residual Sum of Squares): RSS = Σ (yi − ŷi)².

RSE (Residual Standard Error) is an absolute measure of the lack of fit of a model: RSE = sqrt( RSS / (n − p − 1) ), where p is the number of predictors (excluding the intercept).

TSS (Total Sum of Squares) measures the total variance in the response Y: TSS = Σ (yi − ȳ)².

40 of 73

Interpreting Results

R² (proportion of variance explained): R² = 1 − RSS/TSS.

SSM (Corrected Sum of Squares for Model), also called the Sum of Squares for Regression: SSM = Σ (ŷi − ȳ)² = TSS − RSS.

DFM (Degrees of Freedom of the Model): DFM = p, where p is the number of parameters in the model, excluding X0.

41 of 73

Interpreting Results

DFE (Degrees of Freedom for Error): DFE = n − p when p counts all estimated parameters including the intercept (equivalently, n − p − 1 if p excludes X0 as on the previous slide).

F-Test (F is the Fisher statistic)

The F-statistic is used to determine whether any of the independent variables in a multiple regression model are significant. To calculate the F-statistic:

F = (SSM / DFM) / (RSS / DFE)
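As a small illustration (with assumed, made-up RSS/TSS values, and p counting predictors without the intercept), these quantities combine as follows in Go:

package main

import (
	"fmt"
	"math"
)

func main() {
	// Assumed inputs for illustration: residual and total sums of squares,
	// n observations, p predictors (excluding the intercept).
	rss, tss := 556.8, 5417.1
	n, p := 200.0, 3.0

	r2 := 1 - rss/tss                            // proportion of variance explained
	rse := math.Sqrt(rss / (n - p - 1))          // residual standard error
	f := ((tss - rss) / p) / (rss / (n - p - 1)) // F-statistic for H0: all slopes are zero

	fmt.Printf("R^2 = %.3f  RSE = %.3f  F = %.1f\n", r2, rse, f)
}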

DEMO

42 of 73

Interpreting Results

I have a demo in Julia that goes through the calculations discussed in slides 39..41.

We will return to these calculations once we have some intro to Julia

/hm2/code/julia/DsjL/src/LinearRegression.jl

43 of 73

Heteroscedasticity

Homoscedasticity means that all observations (or error terms) have the same finite variance. This is also known as homogeneity of variance.

The complementary notion is called heteroscedasticity. Assuming a variable is homoscedastic when in reality it is heteroscedastic results in unbiased but inefficient point estimates and in biased estimates of standard errors.

This may also result in overestimating the goodness of fit as measured by the Pearson coefficient.

The existence of heteroscedasticity is a major concern in regression analysis and the analysis of variance, as it invalidates statistical tests of significance that assume that the modeling errors have the same variance. This condition exists when the variation in the data changes across sections of the data, as shown in the figure in the next slide, where the variation increases at higher values of the x-axis.

44 of 73

Heteroscedasticity

While the ordinary least squares estimator is still unbiased in the presence of heteroscedasticity, it is inefficient and generalized least squares should be used instead.

The F test can be used to assess heteroscedasticity under the assumption that the errors are independent and identically distributed (i.i.d.). We can perform the test using the fitted values of the model, the predictors in the model and a subset of the independent variables.

The variation increases in the higher values of the x-axis.

45 of 73

Heteroscedasticity

The ISL book also has an excellent presentation of the subject, including exercises in R and in Python.

Wikipedia has a very good page about homoscedasticity and heteroscedasticity:

https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity

46 of 73

F Test

An F-test can be used to compare the variances of two samples or the ratio of variances between multiple samples. An F-test is any test whose statistic has an F-distribution under the null hypothesis and the customary assumptions about the error term.

After several statistical models have been fitted to a data set, the F test can help to determine which model is the better fit for the sampled population.

More specifically, F-tests can determine if the means of a given set of normally distributed populations, all having the same standard deviation, are equal.

This plays an important role in the analysis of variance (ANOVA), which follows these assumptions:

  1. Normality (statistics)
  2. Homogeneity of variance
  3. Independence of errors and random sampling

Wikipedia

https://en.wikipedia.org/wiki/F-test

47 of 73

Leverage and Influence

Leverage is a measure of how far an observation's independent-variable values are from those of the other observations, i.e., how far it sits from the general trend that most observations follow. High-leverage points are often outliers, such as the red dot in the figure:

48 of 73

Leverage and Influence

Knowing that certain points have leverage, we could assess their level of influence by estimating the linear regression twice, first with all points included, then without the points with high leverage. For the data set in this example:

y = 2.96 + 5.04x (including all points)

y = 1.73 + 5.12x (excluding the red point)

49 of 73

Leverage and Influence

Doing a full regression analysis will confirm that the red point is an outlier, and will also reveal that its influence is not significant. That said, there might be situations in which a point exerts significant influence without being an outlier.

The next plot illustrates a point (in red) exerting significant influence. Examining the plot, we observe that the red point takes an extreme X value and does not follow the general trend.

For reference, the data set is here

50 of 73

Leverage and Influence

The regression analysis reveals the following models, with and without the red point:

y = 8.51 + 3.32x (including the red point)

y = 1.73 + 5.12x (excluding the red point)

Therefore we can (visually) conclude that the red point exerts significant influence.

Can we determine quantitatively the level of influence that certain points exert on a model?

51 of 73

Leverage and Influence

How can we determine quantitatively the level of influence that certain points exert on a model?

Let’s use another data set, available from UCLA. The data, located here, consists of 3 columns (labeled number, rain, wheat), with n = 26 rows. Rain is the independent variable (X) and wheat is the dependent variable (Y). Hence we are looking for a model of the form

y = b0 + b1*x + e

with rain as X and wheat as Y. For reasons that will become clear when we cover the Hat Matrix, for now the leverage of a point i is given by

h_i = 1/n + (x_i − x̄)² / ( var(X) * (n−1) )
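A hedged Go sketch of this leverage formula (leverages is a made-up helper; the rain values are placeholders, not the actual UCLA data):

package main

import (
	"fmt"

	"gonum.org/v1/gonum/stat"
)

// leverages returns h_i = 1/n + (x_i - mean)^2 / (var(X)*(n-1)) for every
// point of a single-predictor regression with an intercept.
func leverages(x []float64) []float64 {
	n := float64(len(x))
	mean := stat.Mean(x, nil)
	variance := stat.Variance(x, nil) // sample variance, divides by n-1
	h := make([]float64, len(x))
	for i, xi := range x {
		d := xi - mean
		h[i] = 1/n + d*d/(variance*(n-1))
	}
	return h
}

func main() {
	// Placeholder rain values standing in for the UCLA rain/wheat data.
	rain := []float64{12, 14, 13, 16, 18, 40}
	for i, h := range leverages(rain) {
		fmt.Printf("point %d: leverage %.3f\n", i+1, h)
	}
}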

52 of 73

Leverage and Influence

In order to estimate the leverage of any point, we first need the mean and the variance of X, which are easy to calculate. We then apply the leverage formula to point 1, whose rain value is 12.

53 of 73

Leverage and Influence

And so on for each point, as shown in column I of this URL.

We also need to perform the linear regression itself; doing so yields the values of the betas.

Using these betas we can calculate the residual error for each point, e_i = y_i − ŷ_i (in matrix form, e = y − X*B). The residuals for all points in the data set are shown in column H of the same URL.

54 of 73

Leverage and Influence

We can now use h1 and the residual error to calculate the standardized error. Column J in the URL shows those values, calculated as

t_i = e_i / ( se * sqrt(1 − h_i) )

where se is the standard deviation of the residuals. If we plot the data we note that points 25 and 26 look like outliers (the index in the URL is i+1 due to the header).

55 of 73

Leverage and Influence

Point 26 appears to have significant leverage; in fact its leverage is almost 10x the average leverage of the other points. We can now use the h's and the residual errors to calculate the standardized errors, as above.

The standardized errors for points 25 and 26 are one order of magnitude larger (10x) than the average standardized error, clearly suggesting that these points have a high influence on the model.
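A small hedged sketch of the standardized-error calculation (standardizedResiduals is a made-up helper; the residuals, leverages, and se values are placeholders):

package main

import (
	"fmt"
	"math"
)

// standardizedResiduals computes t_i = e_i / (se * sqrt(1 - h_i)),
// where se is the standard deviation of the residuals and h_i the leverage.
func standardizedResiduals(residuals, h []float64, se float64) []float64 {
	out := make([]float64, len(residuals))
	for i := range residuals {
		out[i] = residuals[i] / (se * math.Sqrt(1-h[i]))
	}
	return out
}

func main() {
	// Placeholder residuals, leverages, and residual standard deviation.
	residuals := []float64{0.5, -0.3, 0.8, -4.2}
	h := []float64{0.08, 0.10, 0.12, 0.60}
	se := 1.1

	for i, t := range standardizedResiduals(residuals, h, se) {
		fmt.Printf("point %d: standardized residual %.2f\n", i+1, t)
	}
}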

56 of 73

Hat Matrix

The Hat Matrix is said to put a “hat” onto the dependent variable y (producing ŷ). The Hat Matrix is also called the projection matrix because, conceptually, it projects y onto the space spanned by the independent variables (the X's).

The Hat Matrix is also called the influence matrix because it captures the influence that each observed value of y exerts on each fitted value ŷ.

The h-Matrix is discussed in depth in Chap 03, pp 45..47 of the ESL book:

https://hastie.su.domains/ElemStatLearn/

57 of 73

Hat Matrix

Recall that n is the number of rows in X. Hence even for moderately large values of n, e.g., n > 50, visualizing/printing the elements of H[n,n] would not be very useful.

Recall that we found the beta parameters using the Moore-Penrose pseudo-inverse:

B = X⁺ y = (Xᵀ X)⁻¹ Xᵀ y

Once we have the betas we can find the best estimate of y as

ŷ = X B

The Hat Matrix is a matrix H[n,n], where n is the number of rows in X. It can be obtained as

H = X (Xᵀ X)⁻¹ Xᵀ, so that ŷ = H y

58 of 73

Hat Matrix

Note how nicely the equations for the betas and the Hat Matrix both arise from X⁺.

We can also write H = X X⁺.

Once we have ŷ we can estimate the error residuals.
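A hedged Go sketch (not the slide's code) of building H = X X⁺ with Gonum and reading off the leverages on its diagonal; hatMatrix and the small design matrix are made up for illustration:

package main

import (
	"fmt"

	"gonum.org/v1/gonum/mat"
)

// hatMatrix computes H = X (X^T X)^{-1} X^T. The diagonal entries h[i,i]
// are the leverages of the observations.
func hatMatrix(x *mat.Dense) *mat.Dense {
	var xtx, xtxInv mat.Dense
	xtx.Mul(x.T(), x)
	if err := xtxInv.Inverse(&xtx); err != nil {
		panic(err)
	}
	var xPlus, h mat.Dense
	xPlus.Mul(&xtxInv, x.T()) // Moore-Penrose pseudo-inverse X+
	h.Mul(x, &xPlus)          // H = X X+
	return &h
}

func main() {
	// Small illustrative design matrix with an intercept column.
	x := mat.NewDense(5, 2, []float64{
		1, 1,
		1, 2,
		1, 3,
		1, 4,
		1, 10, // large x value -> high leverage
	})
	h := hatMatrix(x)
	n, _ := h.Dims()
	for i := 0; i < n; i++ {
		fmt.Printf("h[%d,%d] = %.3f\n", i, i, h.At(i, i))
	}
}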

59 of 73

Hat Matrix

Once we have ŷ we can estimate the error residuals as

e = y − ŷ = (I − H) y

An interesting property of H[n,n] is that the matrix (I − H) is symmetric. For reference, a symmetric matrix looks like:

0 1 2 3
1 0 4 5
2 4 0 6
3 5 6 0

Another important role that H plays is related to the covariance matrix of ŷ, since Cov(ŷ) = σ² H.

The value h_ij has a direct interpretation as the amount of leverage or influence exerted on the fitted value ŷ_i by the observed value y_j. Thus a look at the Hat Matrix can reveal sensitive points in the design, points at which the value of y has a large impact on the fit.

60 of 73

Hat Matrix

Example. The data set is at this URL. There are 4 columns labeled beam, gravity, moisture, and strength. Strength is the dependent variable, hence we are looking at a model of the form

strength = b0 + b1*gravity + b2*moisture + e

with gravity and moisture as the X's and strength as Y. When we perform the linear regression the betas are { 10.3015, 8.94, -0.266 }, and H[n,n] is

61 of 73

Hat Matrix

Note that the diagonal values of the Hat Matrix satisfy 0.0 ≤ h[i,i] ≤ 1.0.

According to the experts, the average value of the matrix diagonal elements ( h[i,i] ) should be about nCols/nRows.

The experts suggest that a reasonable rule of thumb to identify large values h[i,i] is

h[i,i] > 2*nCols/nRows

Looking at the numbers in H from the previous slide, we can see that h[3,3] = 0.604 is the largest value, and it is also slightly larger than 2*3/10 = 0.6. Therefore the data point i = 3 exerts the largest leverage on the data.

You can verify that conclusion by plotting the gravity vs. moisture data. When we calculate the residuals, we find that the first point (i=0) has more error than the 4th point (i=3), and hence exerts more influence.

DEMO

62 of 73

function hatMatrix()
    # ReadCSVFile and DfConfig are helper functions defined elsewhere in this
    # project (DsjL); CSV and Tables come from the CSV.jl and Tables.jl packages.
    df  = ReadCSVFile("/hm2/Data/ML_Data/ISL_Data/AdvertisingData.csv", "NO")
    df1 = DfConfig(df, [1])   # Ones
    dft = DfConfig(df, [2])   # TV
    dfr = DfConfig(df, [3])   # Radio
    dfn = DfConfig(df, [4])   # Newspaper
    dfs = DfConfig(df, [5])   # Sales

    xDf = DfConfig(df, [2, 3, 4])
    X = Matrix(xDf)

    # Hat matrix: H = X (X'X)^-1 X'
    H = X * inv(X' * X) * X'
    println("typeof H ", typeof(H))

    n = size(H, 1)
    p = size(H, 2)
    println("H Matrix dimensions rows= ", n, " cols = ", p)
    # println("First 10 rows and 10 cols of the Hat matrix:")

    # Find the max value and its linear index
    max_value, linear_index = findmax(H)
    println("Maximum value: ", max_value)
    println("linear_index : ", linear_index)
    # for i in 1:n
    #     println(H[i, i])
    # end

    CSV.write("/hm2/Data/ML_Data/ISL_Data/AdvertisingData_H_Matrix.csv",
              Tables.table(H), writeheader=false)
end

Output:

typeof H Matrix{Float64}
H Matrix dimensions rows= 200 cols = 200
Maximum value: 0.08330536038966604
linear_index : CartesianIndex(17, 17)

63 of 73

More Estimators: Variance-Covariance Matrix

https://en.wikipedia.org/wiki/Covariance_matrix

http://stattrek.com/matrix-algebra/covariance-matrix.aspx

The covariance matrix is also known as the dispersion matrix or the variance-covariance matrix. Cov is of size [n,n], where n is the number of random variables. The element Cov[i,j] contains the covariance between the ith and the jth random variables.

Because the covariance of a random variable with itself is that variable's variance, each element on the principal diagonal, i.e., Cov[i,i], is the variance of variable i.

Since Cov[i,j] == Cov[j,i], the covariance matrix is square and symmetric. Let X be defined as the collection of n random variables, each containing p elements.

64 of 73

Variance-Covariance Matrix

The inverse of Cov[n,n] is known as the concentration matrix or precision matrix.

The precision matrix is used to determine conditional independence among variables in a joint probability distribution (jpd), especially when normality assumptions are made. It is also important for constructing graphical models, both undirected networks (Markov random fields) and directed ones such as Bayesian networks.

In the context of univariate distributions, the precision matrix is a scalar precision, which is the reciprocal of the variance

65 of 73

Conditional Probability Example:

66 of 73

Variance-Covariance Matrix

The inverse of Cov[n,n] is known as the concentration matrix or precision matrix.

Using a data set comprised of 6 independent variables contained in this URL, the variance-covariance matrix is:

         lcavol  lweight   age     lbph    svi     lcp
lcavol    1.39    0.14     1.97    0.05    0.26    1.11
lweight   0.14    0.18     1.11    0.27    0.03    0.10
age       1.97    1.11    55.43    3.78    0.36    1.33
lbph      0.05    0.27     3.78    2.10   -0.05   -0.01
svi       0.26    0.03     0.36   -0.05    0.17    0.39
lcp       1.11    0.10     1.33   -0.01    0.39    1.96

The elements on the main diagonal are the variances, while all the others are covariances. From the table we observe that age has the highest variance and svi the lowest. Of course every analysis will be different; the data here is just to illustrate the ideas.

demo (if I have it)

67 of 73

Variance-Covariance Matrix

// VarianceCovarianceMatrixFunc calculates the variance-covariance matrix of the betas.
// This is part of the calculation of the standard error (SE) of the model, for all betas.
// I do calculations starting from the X NxP matrix where [x0,x1,x2,...xp] and the nRows yVector.
func (lrm *RegressionMatrix) VarianceCovarianceMatrixFunc() {
	var xdMat DataMatrix
	xdMat.GetXDataMatrix(lrm)

	_, nCols := xdMat.data.Dims()

	// stat.CovarianceMatrix fills a SymDense with the covariances of the data columns.
	vccM := mat.NewSymDense(nCols, nil)
	stat.CovarianceMatrix(vccM, xdMat.data, nil)

	nr, nc := vccM.Dims()
	fmt.Println("nr = ", nr, " nc = ", nc)
	for i := 0; i < nr; i++ {
		fmt.Println("")
		for j := 0; j < nc; j++ {
			fmt.Printf("%6.2f ", vccM.At(i, j))
		}
	}
	fmt.Println("DONE ")
}

demo (if I have it)

68 of 73

  1. Correlation is a method to estimate the degree of dependence between two or more variables.
  2. In many cases variables have linear relations, e.g., the height of parents and their children.
  3. Correlations can reveal predictive patterns and relationships for practical purposes. For example, electric bills may have a correlation between demand and weather.
  4. While it could be argued that a causal relationship may exist between extreme weather and higher electrical consumption, in general the presence or absence of correlation is not sufficient to infer causal relationships, because …
  5. Correlation does not imply causation.
  6. There are several correlation coefficients.
    1. Pearson coefficient, which unfortunately is only sensitive to linear relationships between two variables.
    2. More robust coefficients have been developed with better sensitivity to nonlinear relationships.
    3. Personally I am a fan of the Mutual Information (distance) as suggested by Kullback and Leibler.

Correlation: Key Points

demo (if I have it)

69 of 73

Pearson Correlation Coefficient

The Pearson coefficient is calculated using covariance. The formula is

ρ(X,Y) = cov(X,Y) / (σX σY) = E[ (X − μX)(Y − μY) ] / (σX σY)

where E is the expected value, cov is the covariance, and σ is the standard deviation. Using a data set comprised of 6 independent variables contained in this URL, the correlation matrix is:

         lcavol  lweight   age     lbph    svi     lcp
lcavol    1.00    0.28     0.22    0.03    0.54    0.68
lweight   0.28    1.00     0.35    0.44    0.16    0.16
age       0.22    0.35     1.00    0.35    0.12    0.13
lbph      0.03    0.44     0.35    1.00   -0.09   -0.01
svi       0.54    0.16     0.12   -0.09    1.00    0.67
lcp       0.68    0.16     0.13   -0.01    0.67    1.00

The values range from −1.0 to 1.0, and the values on the diagonal are 1.0. More importantly, the pair of variables with the strongest correlation is (lcavol, lcp) and the weakest is (lbph, lcavol).

If we were to do the correlations again, this time including (Xi, Y), the matrix could help us identify the Xi's with weak correlations with Y, suggesting that perhaps those Xi's should be taken out of the model.

demo (if I have it)
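If a demo is not at hand, Gonum's stat.CorrelationMatrix computes the same kind of table; here is a minimal sketch with an arbitrary 5x3 data matrix:

package main

import (
	"fmt"

	"gonum.org/v1/gonum/mat"
	"gonum.org/v1/gonum/stat"
)

func main() {
	// Illustrative 5x3 data matrix: rows are observations, columns are variables.
	x := mat.NewDense(5, 3, []float64{
		1.2, 3.1, 0.5,
		2.3, 2.9, 1.1,
		3.1, 4.2, 0.9,
		4.8, 5.0, 2.2,
		5.5, 6.1, 2.0,
	})

	corr := mat.NewSymDense(3, nil)
	stat.CorrelationMatrix(corr, x, nil) // Pearson correlations of the columns

	fmt.Printf("correlation matrix:\n%v\n", mat.Formatted(corr))
}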

70 of 73

Classification Methods in Data Science

Presentation : https://bit.ly/4dpeF0u

71 of 73

Dimensionality Reduction

The Curse of Dimensionality

Strictly speaking, purely linear modeling is not always sufficient, because the data often have high dimensionality. The methods we have seen so far feel “ad hoc” for dealing with higher dimensionality in the data.

By definition, dimensionality reduction is the transformation of data from a high-dimensional space into a lower-dimensional space.

High-dimensional data is difficult to process and visualize. There are several approaches, including Principal Component Analysis (PCA), Independent Component Analysis (ICA), SVD, Factor Analysis, cluster analysis, manifold learning (see Manifold_learning_algorithms), autoencoders, and restricted Boltzmann machines.

We will cover using SVD for PCA

72 of 73

SVD and PCA

Singular Value Decomposition (SVD) is one of the most fundamental matrix transformations for data science.
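A minimal hedged sketch of using Gonum's mat.SVD for PCA-style analysis: center the columns, factorize, and read the singular values and principal directions; the tiny data matrix is arbitrary:

package main

import (
	"fmt"

	"gonum.org/v1/gonum/mat"
	"gonum.org/v1/gonum/stat"
)

func main() {
	// Tiny illustrative data matrix: 4 observations, 2 variables.
	x := mat.NewDense(4, 2, []float64{
		2.5, 2.4,
		0.5, 0.7,
		2.2, 2.9,
		1.9, 2.2,
	})

	// Center each column before PCA.
	r, c := x.Dims()
	for j := 0; j < c; j++ {
		col := mat.Col(nil, j, x)
		m := stat.Mean(col, nil)
		for i := 0; i < r; i++ {
			x.Set(i, j, x.At(i, j)-m)
		}
	}

	// Thin SVD: X = U * Sigma * V^T. Columns of V are the principal directions,
	// and the squared singular values are proportional to the explained variance.
	var svd mat.SVD
	if ok := svd.Factorize(x, mat.SVDThin); !ok {
		panic("SVD factorization failed")
	}
	sigma := svd.Values(nil)

	var v mat.Dense
	svd.VTo(&v)

	fmt.Println("singular values:", sigma)
	fmt.Printf("principal directions (columns of V):\n%v\n", mat.Formatted(&v))
}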

73 of 73

Go MLX

https://github.com/gomlx/gomlx