Data Science and GoNum
Juan E. Vargas
https://bit.ly/4glzCMx
Data Science in Go and Beyond
Topics in this session include
GoNum
By now we know a little about GoNum, which is the “Go version” of Python’s numpy.
Gonum is a set of packages designed for writing and running numerical and scientific algorithms at scale. Gonum contains libraries for matrices and linear algebra, statistics and probability distributions, graphs, optimization, and more.
Perhaps the best part about GoNum is that it is open source. The unrestricted access to the sources provides rich information, examples, etc., for those who aspire to write world-class software by learning from the best of the best. https://www.gonum.org/
GoNB: a modern Go kernel for Jupyter notebooks
Some data scientists who work in Python write their code in Jupyter Notebooks.
Can we also write Go from a Jupyter Notebook?
Go is a compiled language, but its compilation is so fast that using it under a REPL (Read-Eval-Print Loop) makes this possible.
GoNB is a tool that leverages Go’s compilation speed to implement a Jupyter notebook kernel.
There is a small semantic difference with respect to the Python kernels because Go and GoNB compile the code rather than interpret it. For example, no free-standing “independent values” can exist on the left-hand side the way they can in an interpreted notebook.
A good idea is to read through the tutorial.
We will do parts of the tutorial in the demo
Jupyter Notebooks
IPython was originally developed by Fernando Pérez, a Colombian-born physicist, software developer, and free-software advocate. Fernando adapted IPython to create Jupyter Notebooks.
https://jupyter.org/
Project Jupyter develops open-source software, open standards, and services for interactive computing across multiple programming languages.
Project Jupyter's name is a reference to the three core programming languages supported by Jupyter, which are Julia, Python and R.
Its name and logo are an homage to Galileo's discovery of the moons of Jupiter, as documented in notebooks attributed to Galileo.
Installing GoNB
There is a live version of GoNB on Google Colab (which I could not make work) that in principle can be used to interact with the kernel without installing anything.
Installing the Docker version of GoNB
sudo docker pull janpfeifer/gonb_jupyterlab:latest
sudo docker run -it --rm -p 8888:8888 -v "${PWD}":/notebooks/host janpfeifer/gonb_jupyterlab:latest
Installing GoNB (in Linux):
go install github.com/janpfeifer/gonb@latest && \
go install golang.org/x/tools/cmd/goimports@latest && \
go install golang.org/x/tools/gopls@latest
gonb --install
Using GoNB from VSC
DEMO
Matrix and Vector Operations
Visit Wikipedia for an intro to the basic ideas of vectors and matrices
https://en.wikipedia.org/wiki/Matrix_(mathematics)
Including
GoNum Matrix and Vector Operations
The mat package provides a set of types and ops for working with Linear Algebra, including:
1. Interfaces for Matrix classes (Matrix, Symmetric, Triangular)
2. Concrete implementations (Dense, SymDense, TriDense, VecDense)
3. Methods and functions for using matrix data (Add, Trace, SymRankOne)
4. Types for constructing and using matrix factorizations (QR, LU, etc.)
5. Complementary types for complex matrices, CMatrix, CSymDense, etc.
All matrix types can perform operations and implement the Matrix interface. For example:
func (m *Dense) Mul(a, b Matrix)
where a and b can be any matrix type.
GoNum Matrix and Vector Operations
The input arguments to most functions and methods are interfaces rather than concrete types, e.g. `func Trace(Matrix)` rather than `func Trace(*Dense)`, allowing flexible use of internal and external Matrix types.
When a matrix is the destination or receiver for a function or method, the operation will panic if the matrix is not the correct size. An exception to this is when the destination is empty, in which case it is resized to accommodate the result.
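For instance, here is a minimal sketch of these conventions using gonum's mat package (the numbers are arbitrary): an empty Dense receiver is resized by Mul, the arguments can be any concrete type that satisfies mat.Matrix, and functions such as Trace take the interface.

package main

import (
    "fmt"

    "gonum.org/v1/gonum/mat"
)

func main() {
    // A 2x3 dense matrix and a 3x3 symmetric matrix; both satisfy mat.Matrix.
    a := mat.NewDense(2, 3, []float64{1, 2, 3, 4, 5, 6})
    s := mat.NewSymDense(3, []float64{
        2, 1, 0,
        1, 2, 1,
        0, 1, 2,
    })

    // An empty Dense is a legal destination: Mul resizes it to 2x3.
    var c mat.Dense
    c.Mul(a, s) // a is a *Dense, s is a *SymDense; both are accepted as mat.Matrix

    fmt.Printf("a*s =\n%v\n", mat.Formatted(&c))

    // Functions take the interface too: Trace(Matrix), not Trace(*Dense).
    fmt.Println("trace(s) =", mat.Trace(s))
}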
Matrix Interface
The Matrix interface is the common link between the concrete types of real matrices. The Matrix interface is defined by three methods: Dims, which returns the dimensions of the matrix; At, which returns the element at the location specified by the indexes; and T, which returns a transpose.
All matrix types can perform these three operations and therefore implement the basic Matrix interface. Once you get it, you will understand why this is better than traditional OOP inheritance.
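For reference, the Matrix interface in gonum's mat package is essentially the following (doc comments abridged):

type Matrix interface {
    // Dims returns the dimensions of a Matrix.
    Dims() (r, c int)

    // At returns the value of a matrix element at row i, column j.
    At(i, j int) float64

    // T returns the transpose of the Matrix.
    T() Matrix
}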
GoNum Matrix and Vector Operations
GoNum demo
GoNum demo on GoNB
(LearnGoNumNB)
The list of types, functions, etc., is very comprehensive and it does not make sense to enumerate them here.
The best way to learn how to use GoNum, numpy, and for that matter LAPACK and BLAS, is to read the docs and use the libraries.
mkcog demo
/drv3
Basic functions, obj creation, basic LinAlg Ops
Show how GoNB works under VSC
Show how to use GoNum mat lib for matrix mul, transposing, inverse
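A minimal sketch of the mat portion of that demo (multiplication, transpose, inverse); the values are arbitrary and only illustrate the calls:

package main

import (
    "fmt"
    "log"

    "gonum.org/v1/gonum/mat"
)

func main() {
    a := mat.NewDense(2, 2, []float64{4, 7, 2, 6})
    b := mat.NewDense(2, 2, []float64{1, 2, 3, 4})

    // Matrix multiplication: c = a * b.
    var c mat.Dense
    c.Mul(a, b)

    // Transpose: a.T() is a lightweight view; no data is copied.
    at := a.T()

    // Inverse: Inverse returns an error if the matrix is singular.
    var aInv mat.Dense
    if err := aInv.Inverse(a); err != nil {
        log.Fatalf("a is not invertible: %v", err)
    }

    fmt.Printf("a*b =\n%v\n", mat.Formatted(&c))
    fmt.Printf("a^T =\n%v\n", mat.Formatted(at))
    fmt.Printf("a^-1 =\n%v\n", mat.Formatted(&aInv))
}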
Review: Linear Regression & Pseudo Inverse Matrix
Data is typically organized as a set of P parameters and a set of N records. Hence, data is represented as a matrix having N rows and P columns, or X[N,P]. The values in the cells may be numerical (continuous or discrete) or categorical (binary or multiple categories).
Yes, we will use linear algebra
Yes, Linear Algebra is a must.
The set of observed data typically is configured as
{ ( x1,y1),
( x2,y2),
. . .
( xn,yn) }
where each row in X[N,P] is a row vector xi of length p, and Y is a column vector of length n.
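As a concrete sketch, this is how such an X[N,P] matrix and Y vector can be laid out with gonum's mat package (the values are made up purely for illustration):

package main

import (
    "fmt"

    "gonum.org/v1/gonum/mat"
)

func main() {
    // n = 3 records (rows), p = 2 predictors (columns), stored row by row.
    X := mat.NewDense(3, 2, []float64{
        1.0, 2.0, // x1
        3.0, 4.0, // x2
        5.0, 6.0, // x3
    })

    // Y is a column vector of length n.
    y := mat.NewVecDense(3, []float64{10.0, 20.0, 30.0})

    fmt.Printf("X =\n%v\n", mat.Formatted(X))
    fmt.Printf("y =\n%v\n", mat.Formatted(y))
}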
Typical Problem Setup
Given a set of p predictors X = (X1, X2, …, Xp ) we would like to
predict the value of a response Y.
We assume the existence of a function f such that
Y = f(x) + e
where e is a term inserted to account for error.
In principle, f is a formulation that captures the explicit, systematic
nature of the data.
Problem Setup
The figure shows income vs years of education.
The figure suggests that a function fitting the data might exist.
Problem Setup
We need to find the best estimate of f.
Y = f(x) + e
Given that f is a function of x and e, we must also find e.
The accuracy of our predictions depends on the reducible error and the irreducible error.
Reducible errors are inaccuracies that we can try to reduce.
The other element, known as the irreducible error, is part of f and can not be reduced.
Problem Setup
Let f' be an estimate of f, and Y' = f'(X) the resulting prediction. The expected (squared) error is simply
E[ (Y − Y')² ] = [ f(X) − f'(X) ]² + Var(e)
where the first term is the reducible error and the second, Var(e), is the irreducible error.
The focus of statistical learning is to estimate the best possible f, i.e., the function that minimizes the reducible error.
Problem Setup
To be successful, we must determine:
Which predictors are associated with the outcome?
Which predictors are NOT associated with the outcome?
Are some predictors more important than others? If so, how can we measure their importance?
What type of models can we use to answer these questions:
Parametric models?
Linear (first degree),
higher degrees?
Non-parametric models?
Prediction accuracy vs Model interpretability
Methods:
Classification vs regression
Supervised Learning ?
Unsupervised Learning?
LinReg is a relatively simple algorithm for supervised learning.
LinReg is useful because it illustrates the premises, assumptions, and working practices of ML.
LinReg is also useful because it shows the connection between statistical learning and linear algebra.
Therefore a good understanding of LinReg and its methods provides a solid foundation for the general ML concepts, processes, and steps involved in more complex ML algorithms.
Why Linear Regression?
The book “An Introduction to Statistical Learning” (ISL book) IMHO is one of the best sources of information for anyone who wishes to learn Data Science and Machine Learning.
The ISL BOOK
https://www.statlearning.com/
The book explains setting up data, shows how to choose algorithms, how to apply the chosen algorithms to the data in order to determine good estimates of F(X,Y), and how to use F to perform predictions with previously unseen data sets.
We will use the “Advertising” data set from the ISL book that includes 200 data records collected to assess how different advertising media (TV, Radio, Newspaper) influences sales.
Three input variables and one output variable, or { X1, X2, X3, Y }.
Note: (a) the difference in the X scales (300, 50, 100) and
(b) the different inclinations of the blue lines
In the simple linear regression case, the model f is a function of a single explanatory variable X.
Therefore p = 1 and
Y ≈ b0 + b1*X
We could estimate
Sales | TV
or
Sales | Radio
or
Sales | Newspaper
by choosing the b's that minimize the residual sum of squares, or RSS:
RSS = Σ ( yi − (b0 + b1*xi) )²
Code in GoLang
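As a sketch, the simple one-predictor fit can be done with gonum's stat.LinearRegression. The tv and sales slices below hold only a few illustrative records rather than the full 200-record Advertising data set:

package main

import (
    "fmt"

    "gonum.org/v1/gonum/stat"
)

func main() {
    // A few illustrative records; in the demo these slices hold the full
    // 200 TV and Sales values read from the Advertising data set.
    tv := []float64{230.1, 44.5, 17.2, 151.5, 180.8}
    sales := []float64{22.1, 10.4, 9.3, 18.5, 12.9}

    // Fit Sales ~ b0 + b1*TV by minimizing the RSS.
    b0, b1 := stat.LinearRegression(tv, sales, nil, false)
    fmt.Printf("b0 = %.4f  b1 = %.4f\n", b0, b1)

    // R^2 for the simple fit.
    r2 := stat.RSquared(tv, sales, nil, b0, b1)
    fmt.Printf("R^2 = %.4f\n", r2)
}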
In the simple LinReg case, F is just a function of a single explanatory variable X.
Therefore p = 1 and Y ≈ b0 + b1*X,
and using that setup we can estimate
Sales | TV
Sales | Radio
Sales | NewsPaper
Could we examine any combination of X parameters, including
p( Y | X1, X2)
p( Y | X2, X3)
p( Y | X1, X3)
p( Y | X1, X2, X3)
In the multivariate linear regression case, F is a function of the set of p explanatory variables
X = [ x1, x2, . . . , xp ]
Multi-Variable Linear Regression
Now we have an expression of the form Y = X*B + e, in which X[n,p]
is a matrix of n rows and p columns.
Y is now a column vector of size n.
B (the betas) is a column vector of size p.
We find the parameters of the model (the B's) by minimizing the RSS, either iteratively (e.g., with gradient descent) or in closed form.
The solution in matrix form (equivalently, in summation form, the normal equations) is:
B = (X^T X)^-1 X^T Y
By expressing the problem in terms of linear algebra, we can compute the B's using the linear algebra libraries in Python's numpy, Go's Gonum, R, Julia, etc.
The solution involves using linear algebra to obtain the “pseudo-inverse” or Moore-Penrose matrix X+ = (X^T X)^-1 X^T, which is a (p+1, n) matrix:
GoNum Mat to the rescue
DEMO
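Below is a hedged sketch of the pseudo-inverse computation with gonum's mat package. The design matrix X (with a leading column of ones for the intercept) and the response vector y are tiny made-up values, not the Advertising data:

package main

import (
    "fmt"
    "log"

    "gonum.org/v1/gonum/mat"
)

// pseudoInverseBetas computes B = (X^T X)^-1 X^T y, i.e. B = X+ y.
func pseudoInverseBetas(X *mat.Dense, y *mat.VecDense) (*mat.VecDense, error) {
    var xtx mat.Dense
    xtx.Mul(X.T(), X)

    var xtxInv mat.Dense
    if err := xtxInv.Inverse(&xtx); err != nil {
        return nil, err // X^T X is singular (e.g. perfectly collinear predictors)
    }

    var xPlus mat.Dense
    xPlus.Mul(&xtxInv, X.T()) // the (p+1, n) pseudo-inverse X+

    var b mat.VecDense
    b.MulVec(&xPlus, y)
    return &b, nil
}

func main() {
    // Tiny made-up example: a column of ones (intercept) plus one predictor.
    X := mat.NewDense(4, 2, []float64{
        1, 1,
        1, 2,
        1, 3,
        1, 4,
    })
    y := mat.NewVecDense(4, []float64{6, 5, 7, 10})

    b, err := pseudoInverseBetas(X, y)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("betas = %v\n", mat.Formatted(b.T()))
}

In practice a QR factorization or (*mat.Dense).Solve is numerically safer than forming the inverse explicitly; the version above simply mirrors the formula on the slide.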
Interpreting Results
The equation relating the variables is Sales = b0 + b1*TV + b2*Radio + b3*Newspaper + e
We must test which parameters in X’s are significant in predicting the dependent variable Y.
Our “Null Hypothesis” H0 is that a given coefficient has no effect, i.e., H0: bj = 0 (and, for the model as a whole, b1 = b2 = b3 = 0).
The p-value, Pr(>|t|), tells us whether there is a relationship between Y and each of the X predictors.
A common threshold for the P-value is 0.05. It means that 5% of the time we will falsely reject the null hypothesis, i.e., we might falsely conclude that there is a relationship.
Interpreting Results
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect).
A low p-value (< 0.05) indicates that we can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to the model because changes in the predictor's value are associated with changes in the response variable.
Table 3.4 suggests that the Newspaper coefficient is not a meaningful addition to the model, because its p value is high ( 0.8599 ).
Interpreting Results
RSS (Sum of Squared Residuals): RSS = Σ ( yi − ŷi )²
RSE (Residual Standard Error) is an absolute measure of the lack of fit of a model: RSE = sqrt( RSS / (n − p − 1) )
Total Sum of Squares (TSS) measures the total variance in the response Y: TSS = Σ ( yi − ȳ )²
Interpreting Results
R² (proportion of variance explained): R² = 1 − RSS/TSS = SSM/TSS
SSM (Corrected Sum of Squares for the Model), also called the Sum of Squares for Regression: SSM = Σ ( ŷi − ȳ )²
DFM (Degrees of Freedom of the Model) = p, where p is the number of parameters in the model, excluding X0
Interpreting Results
DFE (Degrees of Freedom for Error) = n − p − 1, with p as above (excluding the intercept X0)
F-Test (F is the Fisher statistic)
The F-statistic is used to determine whether any of the independent variables in a multiple regression model are significant. To calculate the F-statistic:
F = ( SSM / DFM ) / ( RSS / DFE )
DEMO
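The following sketch (separate from the Julia demo referenced below) computes RSS, TSS, R², RSE and the F-statistic directly from observed and fitted values; it uses the ISL convention in which p excludes the intercept, so the error degrees of freedom are n − p − 1, and the y values are made up:

package main

import (
    "fmt"
    "math"

    "gonum.org/v1/gonum/stat"
)

// fitStats returns RSS, TSS, R^2, RSE and the F-statistic for a fitted model
// with p predictors (intercept excluded) and n = len(y) observations.
func fitStats(y, yHat []float64, p int) (rss, tss, r2, rse, f float64) {
    n := float64(len(y))
    yBar := stat.Mean(y, nil)

    for i := range y {
        rss += (y[i] - yHat[i]) * (y[i] - yHat[i])
        tss += (y[i] - yBar) * (y[i] - yBar)
    }

    dfm := float64(p)  // model degrees of freedom
    dfe := n - dfm - 1 // error degrees of freedom
    r2 = 1 - rss/tss
    rse = math.Sqrt(rss / dfe)
    f = ((tss - rss) / dfm) / (rss / dfe)
    return rss, tss, r2, rse, f
}

func main() {
    // Made-up numbers purely to exercise the function.
    y := []float64{3.0, 4.5, 6.1, 7.9, 10.2}
    yHat := []float64{3.2, 4.4, 6.0, 8.1, 10.0}

    rss, tss, r2, rse, f := fitStats(y, yHat, 1)
    fmt.Printf("RSS=%.3f TSS=%.3f R2=%.3f RSE=%.3f F=%.3f\n", rss, tss, r2, rse, f)
}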
Interpreting Results
I have a demo in Julia that goes through the calculations discussed in slides 39..41.
We will return to these calculations once we have some intro to Julia
/hm2/code/julia/DsjL/src/LinearRegression.jl
Heteroscedasticity
Homoscedasticity is defined as data having the same finite variance. This is also known as homogeneity of variance.
The complementary notion is called heteroscedasticity. Assuming a variable is homoscedastic when in reality it is heteroscedastic results in unbiased but inefficient point estimates and in biased estimates of standard errors
This may also result in overestimating the goodness of fit as measured by the Pearson coefficient.
The existence of heteroscedasticity is a major concern in regression analysis and the analysis of variance, as it invalidates statistical tests of significance that assume that the modeling errors have the same variance. This condition exists when the variation in data changes in sections of the data, as shown in the figure in the next slide. The variation increases in the higher values of the x-axis.
Heteroscedasticity
While the ordinary least squares estimator is still unbiased in the presence of heteroscedasticity, it is inefficient and generalized least squares should be used instead.
The F test can be used to assess heteroscedasticity under the assumption that the errors are independent and identically distributed (i.i.d.). We can perform the test using the fitted values of the model, the predictors in the model and a subset of the independent variables.
The variation increases in the higher values of the x-axis.
Heteroscedasticity
The ISL book also has an excellent presentation of the subject that includes exercises in R and in Python.
There is a wonderful page in Wikipedia about Homoscedasticity and Heteroscedasticity:
https://en.wikipedia.org/wiki/Homoscedasticity_and_heteroscedasticity
F Test
An F-test can be used to compare the variances of two samples or the ratio of variances between multiple samples. The test statistic has an F-distribution under the null hypothesis and the customary assumptions about the error term.
After several statistical models have been fitted to a data set, the F test can help to determine which model is the better fit for the sampled population.
More specifically, F-tests can determine if the means of a given set of normally distributed populations, all having the same standard deviation, are equal.
This plays an important role in the analysis of variance (ANOVA), which relies on the same assumptions.
Wikipedia
https://en.wikipedia.org/wiki/F-test
Leverage and Influence
Leverage is a measure of how far away an observation's independent-variable values are from those of the other observations, i.e., from the general trend that most observations follow. High-leverage points are often outliers, such as the red dot in the figure:
Leverage and Influence
Knowing that certain points have leverage, we could assess their level of influence by estimating the linear regression twice, first with all points included, then without the points with high leverage. For the data set in this example:
y = 2.96 + 5.04 x Including all points
y = 1.73 + 5.12 x excluding the red point
Leverage and Influence
Doing a full regression analysis will confirm that the red point is an outlier. Doing a regression will also reveal that its influence is not significant. That said, there might be situations in which a point may exert significant influence without being an outlier.
The plot illustrates a point (in red) exerting significant influence. By examining the plot we observe that the red point does not follow the general trend (hence it is an outlier) as it takes an extreme X value.
For reference, the data set is here
Leverage and Influence
The regression analysis reveals the following models, with and without the red point:
y = 8.51 + 3.32x including the red point
y = 1.73 + 5.12x excluding the red point
Therefore we can (visually) conclude that the red point is significant.
Can we determine quantitatively the level of influence that certain points exert on a model?
Leverage and Influence
How can we determine quantitatively the level of influence that certain points exert on a model?
Let’s use another data set, available from UCLA. The data, located here, consists of 3 columns (labeled number, rain, wheat), with n = 26 rows. Rain is the independent variable (X) and wheat is the dependent variable (Y). Hence we are looking for a model of the form
wheat = b0 + b1*rain + e
with rain as X and wheat as Y. For reasons that will be clear when covering the topic of the Hat Matrix, for now the leverage for a point i is given by
h_i = 1/n + ( x_i − mean(X) )² / ( var(X) * (n−1) )
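As a sketch, the leverage formula above can be computed for a slice of x values with gonum's stat package (stat.Variance uses the n−1 denominator, matching the formula). Only the first rain value (12) comes from the slide; the rest are placeholders:

package main

import (
    "fmt"

    "gonum.org/v1/gonum/stat"
)

// leverages returns h_i = 1/n + (x_i - mean(x))^2 / ((n-1)*var(x)) for every point.
func leverages(x []float64) []float64 {
    n := float64(len(x))
    mean := stat.Mean(x, nil)
    variance := stat.Variance(x, nil) // sample variance, n-1 denominator

    h := make([]float64, len(x))
    for i, xi := range x {
        h[i] = 1/n + (xi-mean)*(xi-mean)/((n-1)*variance)
    }
    return h
}

func main() {
    // The first value (12) matches point 1 on the slide; the rest are placeholders
    // standing in for the remaining rain values of the UCLA data set.
    rain := []float64{12, 14, 9, 11, 30}
    for i, h := range leverages(rain) {
        fmt.Printf("h[%d] = %.4f\n", i, h)
    }
}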
Leverage and Influence
In order to estimate the leverage of any point we need to obtain first the mean and the variance of X, which we can easily calculate as
Applying the leverage formula for point 1, whose rain value = 12:
Leverage and Influence
And so on for each point as shown in column I of this URL .
We also need to perform linear regression. Doing so yields the following values for the betas:
Using these betas we can calculate the residual errors for each point. The residuals for all points in the data set are shown in column H of the same URL. In matrix form:
Leverage and Influence
We can now use h_i and the residual error to calculate the standardized residual. Column J in the URL shows those values, calculated as
r_i = e_i / ( se * sqrt( 1 − h_i ) )
Note that se is the std dev of the residuals. If we plot the data we would note that points 25 and 26 look like outliers (the index in the URL is i+1 due to the header).
Leverage and Influence
Point 26 appears to have significant leverage; in fact its leverage is almost 10x the average leverage of the other points. We can now use the h's and the residual errors to calculate the standardized residuals, as above.
The standardized residuals for points 25 and 26 are one order of magnitude (10x) larger than the average, clearly suggesting that these points have a high influence on the model.
Hat Matrix
The Hat Matrix is said to put a “hat” onto the dependent variable (y). The Hat Matrix is also called the projection matrix because, conceptually, it projects the observed y onto the column space of the independent variables (the X's).
The hat matrix is also called the influence matrix because it captures the influence that each observed response value exerts on each fitted value of y.
The h-Matrix is discussed in depth in Chap 03, pp 45..47 of the ESL book:
https://hastie.su.domains/ElemStatLearn/
Hat Matrix
Recall that n is the number of rows in X. Hence even for slightly large values of n, e.g.,
n > 50, visualizing/printing the elements within H[n,n] would not be so useful.
Recall that we found the beta parameters using the Moore-Penrose pseudo-inverse:
B = X+ y = (X^T X)^-1 X^T y
Once we have the betas we can find the best estimate of y as
y-Hat = X*B = X (X^T X)^-1 X^T y = H*y
so the hat matrix is H[n,n] = X (X^T X)^-1 X^T, where n is the number of rows in X.
Hat Matrix
Note how nicely the equations for the betas and the Hat Matrix evolve from X+.
We can also write H = X * X+.
Once we have y-Hat we can estimate the error residuals.
Hat Matrix
Once we have y-Hat we can estimate the error residuals: e = y − y-Hat = (I − H) y.
An interesting property of H[n,n] is that the matrix (I − H) is symmetric (and idempotent). For reference, a symmetric matrix looks like:
0 | 1 | 2 | 3 |
1 | 0 | 4 | 5 |
2 | 4 | 0 | 6 |
3 | 5 | 6 | 0 |
Another important role which H plays is related to the covariance matrix of y: the fitted values have covariance Var(y-Hat) = σ²H and the residuals have covariance Var(e) = σ²(I − H).
The value h[i,j] has a direct interpretation as the amount of leverage or influence exerted on the fitted value ŷi by the observation yj. Thus a look at the hat matrix can reveal sensitive points in the design, points at which the value of y has a large impact on the fit.
Hat Matrix
Example. The data set is at this URL. There are 4 columns labeled beam, gravity, moisture and strength. Strength is the dependent variable, hence we are looking at a model of the form
strength = b0 + b1*gravity + b2*moisture + e
with gravity and moisture as the X's and strength as Y. When we perform linear regression the betas are { 10.3015, 8.94, -0.266 }. Hence
H[n,n] is
Hat Matrix
Note that the values on the diagonal of the Hat Matrix satisfy 0.0 ≤ h[i,i] ≤ 1.0
According to the experts, the average value of the matrix diagonal elements ( h[i,i] ) should be about nCols/nRows.
The experts suggest that a reasonable rule of thumb to identify large values h[i,i] is
h[i,i] > 2*nCols/nRows
In looking at the numbers from H in the previous slide, we can see that h[3,3] = 0.604 is the largest value and it is also slightly larger than 2*3/10 = 0.6. Therefore the data point i=3 exerts the largest leverage on the data.
You can verify that conclusion by plotting the gravity vs moisture data. When we calculate the residuals, we find that the first point (i=0) has more error than the 4th point (i=3), and hence exerts more influence.
DEMO
# Assumes `using CSV, Tables` (and DataFrames for the df helpers); ReadCSVFile
# and DfConfig are helper functions defined elsewhere in this project.
function hatMatrix()
    df  = ReadCSVFile("/hm2/Data/ML_Data/ISL_Data/AdvertisingData.csv", "NO")
    df1 = DfConfig(df, [1])  # Ones
    dft = DfConfig(df, [2])  # TV
    dfr = DfConfig(df, [3])  # Radio
    dfn = DfConfig(df, [4])  # Newspaper
    dfs = DfConfig(df, [5])  # Sales

    xDf = DfConfig(df, [2, 3, 4])
    X   = Matrix(xDf)

    # H = X (X'X)^-1 X'
    H = X * inv(X' * X) * X'
    println("typeof H ", typeof(H))

    n = size(H, 1)
    p = size(H, 2)
    println("H Matrix dimensions rows= ", n, " cols = ", p)

    # println("First 10 rows and 10 cols of the Hat matrix:")
    # Find the max value and its linear index
    max_value, linear_index = findmax(H)
    println("Maximum value: ", max_value)
    println("linear_index : ", linear_index)

    # for i in 1:n
    #     println(H[i,i])
    # end
    CSV.write("/hm2/Data/ML_Data/ISL_Data/AdvertisingData_H_Matrix.csv", Tables.table(H), writeheader=false)
end
typeof H Matrix{Float64}
H Matrix dimensions rows= 200 cols = 200
Maximum value: 0.08330536038966604
linear_index : CartesianIndex(17, 17)
More Estimators: Variance-Covariance Matrix
https://en.wikipedia.org/wiki/Covariance_matrix
http://stattrek.com/matrix-algebra/covariance-matrix.aspx
The covariance matrix is also known as the dispersion matrix or the variance-covariance matrix. For n random variables, Cov is of size [n,n]; the element Cov[i,j] contains the covariance between the ith and the jth variables.
Because the covariance of a random variable with itself is the variable's variance, each element on the principal diagonal, Cov[i,i], is the variance of variable i.
Since Cov[i,j] == Cov[j,i], the covariance matrix is square and symmetric. Let X be defined as the collection of n random variables, each containing p elements.
Variance-Covariance Matrix
The inverse of Cov[n,n] is known as the concentration matrix or precision matrix.
The precision matrix is used to determine conditional independence among variables in a joint probability distribution, especially when normality assumptions are made. It is also important for constructing undirected graphical models such as Gaussian Markov random fields.
In the context of univariate distributions, the precision is a scalar: the reciprocal of the variance.
Conditional Probability Example:
Variance-Covariance Matrix
The inverse of Cov[n,n] is known as the concentration matrix or precision matrix.
Using a data set comprised of 6 independent variables contained in this URL the Variance Covariance Matrix is:
          lcavol  lweight    age   lbph    svi    lcp
lcavol      1.39     0.14   1.97   0.05   0.26   1.11
lweight     0.14     0.18   1.11   0.27   0.03   0.10
age         1.97     1.11  55.43   3.78   0.36   1.33
lbph        0.05     0.27   3.78   2.10  -0.05  -0.01
svi         0.26     0.03   0.36  -0.05   0.17   0.39
lcp         1.11     0.10   1.33  -0.01   0.39   1.96
The elements on the main diagonal are the variances, while all the others are covariances. From the table we observe that age has the highest variance and svi the lowest. Of course every analysis will be different; the data here is just to illustrate the ideas.
demo (if I have it)
Variance-Covariance Matrix
// VarianceCovarianceMatrixFunc calculates the variance-covariance matrix of the betas.
// This is part of the calculation of the standard error (SE) of the model, for all betas.
// I do the calculations starting from the X NxP matrix [x0, x1, x2, ..., xp] and the nRows yVector.
// Requires "fmt", "gonum.org/v1/gonum/mat" and "gonum.org/v1/gonum/stat";
// RegressionMatrix and DataMatrix are types defined elsewhere in this project.
func (lrm *RegressionMatrix) VarianceCovarianceMatrixFunc() {
    var xdMat DataMatrix
    xdMat.GetXDataMatrix(lrm)
    _, nCols := xdMat.data.Dims()

    // Compute the covariance matrix of the columns of X into a symmetric matrix.
    vccM := mat.NewSymDense(nCols, nil)
    stat.CovarianceMatrix(vccM, xdMat.data, nil)

    nr, nc := vccM.Dims()
    fmt.Println("nr = ", nr, " nc = ", nc)
    for i := 0; i < nr; i++ {
        fmt.Println("")
        for j := 0; j < nc; j++ {
            fmt.Printf("%6.2f ", vccM.At(i, j))
        }
    }
    fmt.Println("DONE ")
}
demo (if I have it)
Correlation: Key Points
demo (if I have it)
The Pearson correlation coefficient is calculated using covariance. The formula is
ρ(X,Y) = cov(X,Y) / (σX * σY) = E[ (X − μX)(Y − μY) ] / (σX * σY)
where E is the expected value and cov is the covariance. Using a data set comprised of 6 independent variables contained in this URL, the correlation matrix is:
          lcavol  lweight    age   lbph    svi    lcp
lcavol      1.00     0.28   0.22   0.03   0.54   0.68
lweight     0.28     1.00   0.35   0.44   0.16   0.16
age         0.22     0.35   1.00   0.35   0.12   0.13
lbph        0.03     0.44   0.35   1.00  -0.09  -0.01
svi         0.54     0.16   0.12  -0.09   1.00   0.67
lcp         0.68     0.16   0.13  -0.01   0.67   1.00
The values range over −1.0 ≤ ρ ≤ 1.0, and the values on the diagonal are 1.0. More importantly, the pair of variables with the strongest correlation is (lcavol, lcp) and the weakest is (lbph, lcavol).
If we were to do the correlations again, this time including (Xi,Y), the matrix could help us identify the Xi’s with weak correlations with Y, suggesting that perhaps those Xi’s should be taken out of the model.
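A sketch of how a correlation matrix like the one above can be computed with gonum; the data matrix here is a small placeholder rather than the 6-variable data set above:

package main

import (
    "fmt"

    "gonum.org/v1/gonum/mat"
    "gonum.org/v1/gonum/stat"
)

func main() {
    // Placeholder data: 4 observations of 3 variables (one variable per column).
    data := mat.NewDense(4, 3, []float64{
        1.0, 2.1, 0.5,
        2.0, 3.9, 0.4,
        3.0, 6.2, 0.7,
        4.0, 7.8, 0.2,
    })

    _, cols := data.Dims()
    corr := mat.NewSymDense(cols, nil)
    stat.CorrelationMatrix(corr, data, nil)

    fmt.Printf("correlation matrix =\n%v\n", mat.Formatted(corr))
}

stat.CovarianceMatrix has the same call shape and produces the variance-covariance matrix shown earlier.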
demo (if I have it)
Classification Methods in Data Science
Presentation : https://bit.ly/4dpeF0u
Dimensionality Reduction
The Curse of Dimensionality
Strictly speaking, purely linear modeling is not always sufficient, because the data often have high dimensionality. The methods we have seen so far feel “ad hoc” for dealing with the higher dimensionality in the data.
By definition, dimensionality reduction is the transformation of data from a high-dimensional space into a lower dimensional space.
High-dimensional data is difficult to process and visualize. There are several approaches, including Principal Component Analysis (PCA), Independent Component Analysis (ICA), SVD, Factor Analysis, cluster analysis, manifold learning (see Manifold_learning_algorithms), autoencoders, and restricted Boltzmann machines.
We will cover using SVD for PCA
SVD and PCA
Singular Value Decomposition (SVD) is one of the most fundamental matrix transformations for data science.
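As a sketch of where this is heading, the following uses gonum's mat.SVD to compute principal components of a small, made-up data matrix: center the columns, factorize, and read off the singular values and principal directions.

package main

import (
    "fmt"
    "log"

    "gonum.org/v1/gonum/mat"
    "gonum.org/v1/gonum/stat"
)

func main() {
    // Placeholder data: 5 observations of 3 variables.
    data := mat.NewDense(5, 3, []float64{
        2.5, 2.4, 0.5,
        0.5, 0.7, 1.9,
        2.2, 2.9, 0.8,
        1.9, 2.2, 1.1,
        3.1, 3.0, 0.3,
    })

    // Center each column (subtract the column mean), as PCA requires.
    r, c := data.Dims()
    centered := mat.NewDense(r, c, nil)
    for j := 0; j < c; j++ {
        col := mat.Col(nil, j, data)
        m := stat.Mean(col, nil)
        for i := 0; i < r; i++ {
            centered.Set(i, j, data.At(i, j)-m)
        }
    }

    // Thin SVD of the centered data: centered = U * diag(s) * V^T.
    var svd mat.SVD
    if ok := svd.Factorize(centered, mat.SVDThin); !ok {
        log.Fatal("SVD factorization failed")
    }

    // The squared singular values, divided by n-1, are the variances of the
    // principal components; the columns of V are the principal directions.
    s := svd.Values(nil)
    var v mat.Dense
    svd.VTo(&v)

    fmt.Println("singular values:", s)
    fmt.Printf("principal directions V =\n%v\n", mat.Formatted(&v))
}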
Go MLX
https://github.com/gomlx/gomlx