1 of 41

Neural Networks

and Deep Learning

DATA 621

Cristiano Fanelli

08/29/2024 - Lecture 1

2 of 41

Outline


  • Welcome
    • Introductions
  • Course Introduction
    • Syllabus
      • Structure and grading
    • Schedule
      • By the end of the course
  • Brief History of Neural Networks and Deep Learning

3 of 41


Introduction

Welcome everybody to DATA 621 - Neural Networks and Deep Learning!

Prof Cristiano Fanelli

Our research group works at the nexus of data science and physical sciences, more info: https://cristianofanelli.com

Bayesian uncertainty quantification in ML/DL, anomaly detection, particle identification, fast simulation, AI-assisted design, multi-objective optimization, autonomous experimental control, calibration/alignment

4 of 41


These Lectures

All material can be found at

https://cfteach.github.io/NNDL_DATA621

Lectures

Tutorials

Assignments

Supplemental Material

5 of 41


Relationship with Other Courses @ W&M/DS

DATA 201 - Programming for Data Science

DATA 301 - Applied Machine Learning

DATA 442 (DATA 621 - for graduate) - Neural Networks & Deep Learning

DATA 462 (DATA 622 - for graduate) - Generative AI

6 of 41

Grading


    • Assignments (4 Total) (60%)
      • Comprising problem sets, programming tasks, and data analysis projects. Each assignment contributes 15% to the final grade.
      • An additional submission (Assignment 0) will give you bonus points towards the final grade
    • Final Project (40%)
      • Several aspects will be considered, including the introduction of novel concepts, the use of new methods, the novelty of the application, etc.

7 of 41

Final Project


    • Final Report (40%) (~3 to 5 pages, without references)
      • Introduction and Motivation
      • Methodological Approach
      • Results and Discussion
      • Conclusions and Future Work
      • References and Appendices (excluded from the 5-page limit)
    • Presentation (60%) ([13]* mins + 5 mins questions)
      • Clarity of presentation and answers
    • “Originality” (bonus 10%)
      • Several aspects will be considered, including the introduction of novel concepts, the use of new methods, the novelty of the application, etc.
    • Active Participation during Final (3h) — peers (bonus 5%)

For collaborative projects: please specify your individual contributions to the project, adhering to standard scientific practice. The presentation can exceed [13] minutes, but it should not last longer than the number of participants multiplied by 13 minutes. Theoretical background and clarity of presentation will be assessed individually.

Please note that you will receive questions from your peers, and evaluations will also be based on the clarity of your answers (everyone in a collaborative project is encouraged to answer questions).

Aspects that will influence grading

  • Theoretical understanding
  • Demonstrated / potential application(s)

*13 minutes is tentative

8 of 41


If you want to discuss the NN & Deep Learning course, or other topics, I am in ISC 1265 (office hours: Friday, 9:30-11:30)

Syllabus

More Information can be found in the course Syllabus (link here)

9 of 41


What do we mean

by Deep Learning?

10 of 41

Taxonomy


Data Mining: The process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems.

Big Data: A term that describes the large volume of data – both structured and unstructured – that inundates a business on a day-to-day basis.

Predictive Analytics: The use of data, statistical algorithms and machine learning techniques to identify the likelihood of future outcomes based on historical data.

Natural Language Processing (NLP): A branch of AI that helps computers understand, interpret and manipulate human language.

11 of 41

[Figure: Venn-style diagram of the three ML paradigms: supervised, unsupervised, and reinforcement learning]

Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples

Reinforcement learning is concerned with how intelligent agents ought to take actions in an environment in order to maximize the notion of cumulative reward and make informed choices.

Unsupervised learning is a type of machine learning in which the algorithm is not provided with any pre-assigned labels or scores for the training data. Unsupervised learning algorithms must first self-discover any naturally occurring patterns in that training data set.

R. S. Sutton and A. G. Barto (1998), Reinforcement Learning: An Introduction, Vol. 1 (MIT Press, Cambridge)


12 of 41

[Figure: taxonomy diagram of the supervised, unsupervised, and reinforcement paradigms]

13 of 41

NIPS 2016: “If intelligence is a cake,

the bulk of the cake is unsupervised learning,

the icing on the cake is supervised learning,

and the cherry on the cake is reinforcement learning”

LeCun, Turing award 2018

VP and Chief AI Scientist, Facebook

DeepMind: Deep Q-learning playing Atari Breakout

Mnih et al., arXiv:1312.5602; Nature 518.7540 (2015)

The agent creates a hole to penetrate the barrier, then keeps throwing the ball there, taking advantage of multiple bounces

14 of 41

Deep Learning


DL contains many layers of neurons

15 of 41

We stand at the height of some of the greatest accomplishments in DL

(2019)

Meta-learning [3]

Autopilot [2]

Natural Language Processing [1]

Video to video synthesis [4]

CF, INFN Machine Learning School, Camogli, 2019


16 of 41


(2024)

17 of 41

(today) AI/ML is ubiquitous


And many more applications (healthcare, finance, social media, cybersecurity, agriculture, etc.)

Smartphone assistance

Home automation

Entertainment

E-commerce

Autonomous vehicles

..and even dining

18 of 41

Foundation Models


Foundation models are large-scale machine learning models that are pre-trained on a wide range of data (Internet text), resulting in a model that can be adapted to a wide range of downstream tasks

Generative Pre-training Transformer (GPT), developed by OpenAI, is a leading example of a foundation model. It uses a transformer architecture to understand and generate human-like text

Foundation models like GPT have revolutionized AI, enabling more nuanced and context-aware applications. They also pose ethical challenges, such as potential bias in responses and the difficulty of controlling their output

In the broader NP and HEP community:

1st Large Language Models in Physics Symposium (LIPS) in Hamburg (DESY campus) from Feb 21 – 23, 2024.

https://indico.desy.de/e/lips

19 of 41

Why is ML ubiquitous?


  • Last three decades characterized by unprecedented ability to generate / analyze large data sets: big data revolution spurred by exponential increase of computing power and memory
  • Computations that were unthinkable can now be routinely performed on laptops.
  • Specialized computing machines (e.g., GPU-based) are continuing this trend towards cheap, large scale computation.

A. Radovic, et al. "Machine learning at the energy and intensity frontiers of particle physics." Nature 560.7716 (2018): 41-48.

20 of 41


How did it all start?

A (non-exhaustive) list of milestones in chronological order to add more context

Many ideas behind neural networks are relatively old, but they have been revitalized and popularized in more recent years

21 of 41

(1958) Perceptron & Artificial Neurons


  • The perceptron, a fundamental unit in artificial neural networks, is inspired by the basic functioning of biological neurons: it receives signals, has a computational unit (similar to the cell body) to process the signals, and an output node (the axon) that transmits the result
  • The first concept of a simplified brain cell, the so-called McCulloch-Pitts neuron, was introduced in 1943; the perceptron, an algorithm for supervised learning of binary classifiers, built on this idea.
  • The first hardware implementation by F. Rosenblatt was achieved in 1957-1958.
  • The term machine learning was coined in 1959 by Arthur Samuel (IBM)

[Figure: biological neuron processing chemical and electrical signals; impulses are carried away from the cell body along the axon]

[Figure: Mark I Perceptron machine, connected to a camera with 20x20 photocells (a 400-pixel image) and used for image recognition]

22 of 41

(1958) Perceptron & Artificial Neurons


Binary classification, linear decision boundary

Activation: step function

Training

Perceptron is a “simple version of deep learning”
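The ingredients above (binary classification, step activation, the perceptron training rule) can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the course materials; names and hyperparameters are our own choices:

```python
import numpy as np

def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return np.where(z >= 0, 1, 0)

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Perceptron learning rule: nudge w, b whenever a sample is misclassified."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = yi - step(xi @ w + b)  # 0 if correct, +/-1 if wrong
            w += lr * err * xi
            b += lr * err
    return w, b

# Linearly separable toy problem: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)
preds = step(X @ w + b)  # -> [0, 0, 0, 1] once a separating line is found
```

Because the AND data are linearly separable, the update rule converges to a separating line after a handful of epochs; for non-separable data the perceptron never settles, which is one motivation for the smoother losses used later in the course.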

23 of 41

Key Components of DL Architectures


Example: multi-layer perceptron

These fundamental components will be revisited during the course

Core elements:

  • Activation Functions
  • Regularization (dropout, L1, L2)

Optimization methods:

  • Stochastic Gradient Descent (Backpropagation, accelerated descent methods)
  • Hyperparameter optimization
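A minimal NumPy sketch of how these pieces fit together (illustrative only; sizes, names, and the single gradient step are our own choices, and dropout is omitted): a one-hidden-layer MLP with ReLU activation, L2 regularization in the cost, and gradients computed by backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))          # 8 samples, 3 features
y = rng.normal(size=(8, 1))

W1, b1 = rng.normal(size=(3, 4)) * 0.1, np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)) * 0.1, np.zeros(1)
lr, lam = 0.01, 1e-3                 # learning rate, L2 strength

# Forward pass
h = np.maximum(0, X @ W1 + b1)       # ReLU activation
pred = h @ W2 + b2
loss = np.mean((pred - y) ** 2) + lam * (np.sum(W1**2) + np.sum(W2**2))

# Backpropagation: manual gradients of the MSE + L2 cost
g_pred = 2 * (pred - y) / len(X)
g_W2 = h.T @ g_pred + 2 * lam * W2
g_b2 = g_pred.sum(axis=0)
g_h = g_pred @ W2.T
g_h[X @ W1 + b1 <= 0] = 0            # ReLU gradient mask
g_W1 = X.T @ g_h + 2 * lam * W1
g_b1 = g_h.sum(axis=0)

# One gradient-descent step
W1 -= lr * g_W1; b1 -= lr * g_b1
W2 -= lr * g_W2; b2 -= lr * g_b2

h2 = np.maximum(0, X @ W1 + b1)
new_loss = np.mean((h2 @ W2 + b2 - y) ** 2) + lam * (np.sum(W1**2) + np.sum(W2**2))
```

A single small step along the negative gradient lowers the regularized cost; repeating this over mini-batches is stochastic gradient descent, which we will revisit in detail.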

Image by Alec Radford

24 of 41

1960-1985 Backpropagation


  • Backpropagation of errors to train deep models was lacking at this point. Backpropagation was derived already in the early 1960s but in an inefficient and incomplete form.
  • The modern form was first derived by Linnainmaa in his 1970 master's thesis, which included FORTRAN code for backpropagation but did not mention its application to neural networks. Even at this point backpropagation was relatively unknown, and very few documented applications existed before the early 1980s (e.g., Werbos in 1982).
  • Rumelhart, Hinton, and Williams showed in 1985 that backpropagation in neural networks could yield interesting distributed representations.
  • The first practical application of backpropagation was demonstrated by LeCun in 1989 at Bell Labs, where he used convolutional networks with backpropagation to classify handwritten ZIP code digits (a precursor of the MNIST dataset). [paper, video]

G. Hinton: “I have never claimed that I invented backpropagation. David Rumelhart invented it independently long after people in other fields had invented it. It is true that when we first published we did not know the history so there were previous inventors that we failed to cite. What I have claimed is that I was the person to clearly demonstrate that backpropagation could learn interesting internal representations and that this is what made it popular.” (source1, source2)

G. Hinton: “My view is throw it all away and start again. The future depends on some graduate student who is deeply suspicious about everything I’ve said”. (source3)

25 of 41

1979-1982 Introduction of CNN and RNN


  • The earliest convolutional networks were used by Fukushima in 1979. Fukushima’s networks had multiple convolutional and pooling layers similar to modern networks, but the network was trained by using a reinforcement scheme where a trail of strong activation in multiple layers was increased over time. Additionally, one would assign important features of each image by hand by increasing the weight on certain connections.
    • The real boom of CNNs came after 2012, when the AlexNet architecture, a deep CNN developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, won the ImageNet Large Scale Visual Recognition Challenge (NIPS Paper)
  • Recurrent Neural Networks, a class of networks for sequential data processing, were first introduced by J. Hopfield in 1982 with the development of Hopfield Networks, which laid the groundwork for recurrent neural architectures.
    • The more general and widely recognized form of RNNs was developed by David Rumelhart, Geoffrey Hinton, and Ronald J. Williams in 1986; these learn dependencies through time (speech recognition, language modeling, time series)
    • Challenges (e.g., vanishing gradients) were later addressed by more advanced variants like LSTM in 1997 (paper)

26 of 41

AI Winter ~ 80’s


Early Hype:

  • Rapid ML advances led to strong AI promises and funding.
  • Parallels with today's deep learning enthusiasm.

1970s Downturn:

  • Unrealistic promises led to drastic funding cuts.
  • AI research nearly marginalized; few persevered.

Revival:

  • Persistent research led to the deep learning resurgence.

Takeaway:

  • Excessive hype risks another AI winter; caution is essential.

27 of 41

(2009) ImageNet


  • Mid-2000s: the term deep learning gained popularity following a paper by Hinton and Salakhutdinov
  • In 2009, Fei-Fei Li, an AI professor at Stanford, introduced ImageNet, a free database containing over 14 million labeled images. The Internet was, and still is, filled with countless unlabeled images, but labeled images were essential for "training" neural networks. Professor Li's vision was that big data would revolutionize the way machine learning works: data drives learning.

AlexNet is a convolutional network architecture named after Alex Krizhevsky, developed under the supervision of Hinton, which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012. It utilized dropout for improved generalization.

28 of 41

Convolutional Neural Network


  • CNN architecture is inspired by the human visual cortex. An image can be thought of as a grid of numbers, each describing an intensity value, which can be fed into a NN that outputs a classification.

CNN “scans the image”

  • The neurons in a CNN identify local examples of translationally invariant features. This is achieved by using convolutional filters to detect patterns, creating maps of simple features, which are then combined through multiple layers to build more complex feature representations
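How a filter "scans the image" can be made concrete with a small sketch (illustrative; the function and the toy image are our own, not from the slides): a vertical-edge filter slid over a tiny intensity grid produces a feature map that responds only where the edge is.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid-mode 2D cross-correlation (the operation used in CNN layers)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy image: dark left half, bright right half (a vertical edge)
image = np.zeros((5, 6))
image[:, 3:] = 1.0

# Sobel-like vertical edge detector
kernel = np.array([[1, 0, -1],
                   [2, 0, -2],
                   [1, 0, -1]], dtype=float)

fmap = conv2d(image, kernel)  # nonzero response localized at the edge
```

The same 3x3 weights are reused at every position, which is what makes the detected feature translationally invariant; a CNN learns these weights instead of hand-coding them, and stacks many such maps across layers.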

29 of 41

http://scs.ryerson.ca/~aharley/vis/conv/

30 of 41

(2014) Generative Adversarial Networks


  • GANs were created by Ian Goodfellow: two neural networks engage in a competitive game. One network attempts to generate images that closely resemble real photos, while the other network, acting as the adversary, seeks to detect any flaws. The game continues until the generated images are nearly indistinguishable from real ones. GANs offer a powerful method for refining products, though they have also been exploited by scammers. [paper]

GANs have raised ethical concerns in the area of deep fakes

31 of 41

2009-2016 More Recent Milestones


  • 2009: A NIPS Workshop on Deep Learning for Speech Recognition reveals that with a large enough dataset, neural networks can bypass pre-training, leading to significantly lower error rates.
  • 2012: Artificial pattern-recognition algorithms reach human-level performance on specific tasks, and Google's deep learning algorithm famously identifies cats. [blog.google]
  • 2014: Google acquires the UK-based artificial intelligence startup DeepMind for £400 million.
  • 2015: Facebook implements deep learning technology called DeepFace to automatically tag and identify users in photos, using a face-recognition model with 120 million parameters.
  • 2016: leveraging RL, Google DeepMind's algorithm AlphaGo masters the complex board game Go and defeats professional player Lee Sedol in a highly publicized tournament in Seoul.

32 of 41


A Typical Problem

(Bias/Variance)

33 of 41

Typical Problem


  • Ingredients:
    • Dataset D(X,y): X matrix of independent variables, y dependent variables
    • Model f(X;θ) where f: X → y is a function of the parameters θ
    • Cost function C(y,f(X;θ)) to judge how well the model performs on the observations y
  • The model is fit to find θ that minimize the cost function; commonly used cost is squared error (method of least squares)
  • Recipe for prediction problems:
  1. Randomly divide the dataset D into mutually exclusive Dtrain (typically ~80%) and Dtest (~20%)
  2. Model is fit on training data: θ* = argmin_θ {C(ytrain, f(Xtrain; θ))}
  3. The performance of the model is evaluated on C(ytest,f(Xtest;θ*))
  4. Splitting data provides an unbiased estimate for the predictive performance (known as cross-validation)
  5. In-sample error: Ein = C(ytrain, f(Xtrain; θ*)); out-of-sample error: Eout = C(ytest, f(Xtest; θ*))
  6. Eout is typically larger than Ein: Eout ≥ Ein
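The recipe above can be sketched with NumPy (an illustrative sketch; the data, split, and seed are synthetic choices of ours, and the model is a degree-1 polynomial fit by least squares):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=200)
y = 2 * X + rng.normal(scale=0.1, size=200)  # noisy linear data

# 1. Random 80/20 split into D_train and D_test
idx = rng.permutation(len(X))
train, test = idx[:160], idx[160:]

# 2. Fit theta by least squares on the training data
theta = np.polyfit(X[train], y[train], deg=1)

# 3. Evaluate the squared-error cost on both splits
def cost(x, t, th):
    return np.mean((t - np.polyval(th, x)) ** 2)

E_in = cost(X[train], y[train], theta)   # in-sample error
E_out = cost(X[test], y[test], theta)    # out-of-sample error
```

Only `E_out`, computed on data the model never saw, estimates predictive performance; `E_in` alone can always be driven down by a flexible enough model.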

34 of 41

Bias/Variance


  • The model that provides the best explanation for the current dataset will probably not provide the best explanation for future datasets
  • The discrepancy between Ein and Eout grows with the complexity of our data and of our model (increased model parameters, high-dimensional space, curse of dimensionality)
  • For these reasons (and for complicated models), predicting and fitting can be different things. Need to pay attention to out-of-sample performance. Fitting existing data well is fundamentally different from making predictions about new data.
  • Let’s see this starting from simple one-dimensional problem: we want to fit data with polynomials of different orders.
  • Our ability to predict depends on the number of data points, the noise in the data, and our prior knowledge about the system

35 of 41

Fitting vs Predicting


  • We utilize a test interval [0,1.2] which is larger than the training interval [0,1.0]
  • Data sampled from
    1. f(x) = 2x
    2. f(x) = 2x - 10x^5 + 15x^10
  • In the absence of noise, even with a small training set (Ntrain = 10 < Ntest = 20), the model class that generated the data provides the best fit and also the best out-of-sample prediction.

[Figure: simple regression examples. (a) Data with a linear dependence; (b) data from the polynomial of order 10. Training and test data are shown with linear and order-10 polynomial fits.]

36 of 41



  • We utilize a test interval [0,1.2] which is larger than the training interval [0,1.0]
  • Data sampled from
    • f(x) = 2x
    • f(x) = 2x - 10x^5 + 15x^10
  • Noise σ = 1; training set (Ntrain = 100 > Ntest = 20); even when the model class that generated the data is a 10th-order polynomial, the linear and 3rd-order polynomials give better out-of-sample predictions.
  • At small sample sizes, noise can create fluctuations in the data that look like genuine patterns.

[Figure: training and test data with the corresponding fits for cases (a) and (b)]

Fitting vs Predicting

37 of 41


  • We utilize a test interval [0,1.2] which is larger than the training interval [0,1.0]
  • Data sampled from
    • f(x) = 2x
    • f(x) = 2x - 10x^5 + 15x^10
  • Noise σ = 1; let’s increase the training set to Ntrain = 10^4
  • At small sample sizes, noise can create fluctuations in the data that look like genuine patterns.

[Figure: training and test data sampled from the order-10 polynomial, with fits]

  • The order-10 polynomial model gives both the best fit and the most predictive power over the entire training range [0,1], and actually slightly beyond, up to ~1.05, but then the predictive power quickly degrades
  • This is our first encounter with the bias-variance tradeoff: when the amount of data is limited, we often get better predictive performance by using a less expressive model (a lower-order polynomial). The simpler model has more bias but is less dependent on the particular realization of the training set, i.e., it has less variance.
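The experiment on these slides can be reproduced in a few lines (a sketch under our own seed and sample sizes, so the numbers will differ from the figures): sample noisy data from the order-10 polynomial on [0, 1], fit polynomials of increasing order, and compare training error against test error on [0, 1.2].

```python
import numpy as np

def f(x):
    """Ground truth from the slides: f(x) = 2x - 10x^5 + 15x^10."""
    return 2 * x - 10 * x**5 + 15 * x**10

rng = np.random.default_rng(0)
x_tr = rng.uniform(0, 1.0, size=20)                       # small training set
y_tr = f(x_tr) + rng.normal(scale=1.0, size=x_tr.size)    # noise sigma = 1
x_te = np.linspace(0, 1.2, 50)                            # wider test interval
y_te = f(x_te)

train_err, test_err = {}, {}
for deg in (1, 3, 10):
    theta = np.polyfit(x_tr, y_tr, deg)
    train_err[deg] = np.mean((y_tr - np.polyval(theta, x_tr)) ** 2)
    test_err[deg] = np.mean((y_te - np.polyval(theta, x_te)) ** 2)
```

Training error can only decrease as the polynomial order grows (the models are nested), while test error on the extrapolation interval typically behaves very differently: that gap between fitting and predicting is the bias-variance tradeoff in action.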

Fitting vs Predicting

38 of 41

ML can be difficult (way more than that ;)


  • We are good at interpolating but not at extrapolating...

  • Fitting is not predicting.
    • Fitting existing data well is fundamentally different from making predictions about new data
  • Using a complex model can result in overfitting
    • Better results on training data; when the data size is small and the data are noisy, this results in overfitting and degrades predictive performance
  • For complex datasets and small training sets, simple models can be better at predicting than complex ones due to the bias-variance tradeoff
    • Even though the correct model (less bias) has better predictive performance for an infinite amount of training data, the training errors stemming from finite-size sampling (variance) can cause simpler models to outperform the more complex model
  • It is difficult to generalize beyond what is seen in the training dataset.

39 of 41


Goals of DATA-621

  • Deep Understanding of “What’s going on with these networks”
    • You should be able to create, debug, train, test, and tweak MLPs, CNNs, RNNs, GANs, GNNs, etc.
  • Applied
    • Teach you the advantages and disadvantages of different strategies for fitting networks, software packages, hardware architectures, and more knowledge required to successfully create your own nets.
  • Insights
    • We’ll be showing some of the “newer” innovations, and giving you the tools to experiment and get insights (we will see with concrete examples how to explain/interpret results)

Disclaimer: Uncertainty Quantification will be often mentioned but is not the main focus of the course. A good starting point for a more rigorous UQ treatment is the course BRDS

40 of 41


That’s It for Today!

41 of 41


Spares