1 of 281

2 of 281

PoW Series #1 - RNNs

3 of 281

Sequence data

Time series

4 of 281

Traditional vs RNN Architecture

Multilayer perceptron

RNN

5 of 281

6 of 281

Forward Propagation

In the following example, the prediction ŷ3 uses information from the input x3 as well as the activations carried forward from x1 and x2, as shown by the green path.

activation function (hidden state): tanh / ReLU

activation function (output): softmax, sigmoid (binary classification), etc.
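To make the forward pass concrete, here is a minimal NumPy sketch of one recurrent step; the weight names (Wax, Waa, Wya) and the dimensions are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def rnn_step(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """One RNN time step: tanh for the hidden activation, softmax for the output."""
    a_t = np.tanh(Wax @ x_t + Waa @ a_prev + ba)   # hidden state carries past information
    logits = Wya @ a_t + by
    logits -= logits.max()                         # numerical stability
    y_t = np.exp(logits) / np.exp(logits).sum()    # softmax over output classes
    return a_t, y_t

# Illustrative (assumed) sizes: 10-dim inputs, 16-dim hidden state, 3 output classes
rng = np.random.default_rng(0)
Wax, Waa, Wya = rng.normal(size=(16, 10)), rng.normal(size=(16, 16)), rng.normal(size=(3, 16))
ba, by = np.zeros(16), np.zeros(3)

a = np.zeros(16)                                   # a0
for x_t in rng.normal(size=(3, 10)):               # x1, x2, x3
    a, y = rnn_step(x_t, a, Wax, Waa, Wya, ba, by)
print(y)  # y3 depends on x3 and, through the hidden state, on x1 and x2
```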

7 of 281

Back Propagation

“Often, the programming framework will automatically take care of backpropagation.” Conceptually, backpropagation proceeds from right to left through the unrolled network using the loss function, effectively moving backwards in time; this is called backpropagation through time (BPTT).

Model Output

Expected Output

Loss function

8 of 281

Extra - Gated Recurrent Units (GRU)

9 of 281

Long Short Term Memory (LSTMs)

10 of 281

Bidirectional architecture

11 of 281

Different types of RNNs

The main RNN configurations are one-to-one (a standard neural network), one-to-many (e.g. sequence generation), many-to-one (e.g. sentiment classification), and many-to-many (e.g. an encoder-decoder, which allows the input and output sequences to have different lengths).

12 of 281

Sequence to Sequence (seq2seq)

Encoder - Decoder

13 of 281

14 of 281

PoW Series #2 - Transformers

15 of 281

Transformers

RNN

LSTM

GRU

  • RNNs suffer from short-term memory
  • LSTMs and GRUs have a longer memory, but still not enough for long contexts

Attention mechanism

Enough compute resources

Being capable of using the entire context of the text

16 of 281

Transformers

17 of 281

Positional Encoding

  • Equip the input words with their positional information

“Positional Embedding”

Dimension_model(d)= 512
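For reference, the sinusoidal positional encoding from the original Transformer paper (with d_model = 512 as on this slide) is:

```latex
PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)
```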

18 of 281

Positional Encoding

Word embeddings

FAQ

Why sum and not concat?

Same dimension

(A x B x C)

“Positional Embedding”

19 of 281

Positional Encoding

20 of 281

Encoder

  • The encoder is composed of a stack of N = 6 identical layers.

  • The first Encoder in the stack receives its input from the Embedding and Position Encoding.

  • The other Encoders in the stack receive their input from the previous Encoder.

  • Input → Multi-head Self attention → Feed Forward → Output

Dimension_model= 512

21 of 281

Attention

  • Attention enables the model to focus on other words in the input that are closely related to that word.
  • The Transformer architecture uses self-attention by relating every word in the input sequence to every other word.
  • It doesn’t suffer from short term memory.

22 of 281

Attention

  1. Self-attention in the Encoder — the source sequence pays attention to itself
  2. Self-attention in the Decoder — the target sequence pays attention to itself
  3. Encoder-Decoder-attention in the Decoder — the target sequence pays attention to the source sequence

23 of 281

Attention

  • The Attention layer takes its input in the form of three parameters, known as the Query, Key, and Value.
  • All three parameters are similar in structure, with each word in the sequence represented by a vector.

For example, when you search for videos on Youtube, the search engine will map your query (text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in their database, then present you the best matched videos (values).

24 of 281

Multi-head attention

  • The Transformer calls each Attention processor an Attention Head and repeats it several times in parallel
  • Module in the transformer network that computes the attention weights for the input and produces an output vector with encoded information on how each word should attend to all other words in the sequence

25 of 281

Multi-head attention

26 of 281

Multi-head attention

27 of 281

Multi-head attention

dk= dimension of query and key
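The scaled dot-product attention computed inside each head, with d_k the dimension of the queries and keys:

```latex
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
```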

28 of 281

Multi-head attention

Take the softmax of the scaled score to get the attention weights, which gives you probability values between 0 and 1

29 of 281

Multi-head attention

  • Higher softmax scores keep the values of the words the model learns are more important.
  • Lower scores drown out the irrelevant words.

30 of 281

Multi-head attention

Concat

“2 stacks”

Perform the attention function in parallel
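A minimal NumPy sketch of running several attention heads in parallel and concatenating their outputs; the head count, dimensions, and weight names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scaled dot-product scores
    return softmax(scores) @ V               # weighted sum of the values

def multi_head_attention(X, heads, Wo):
    # Each head has its own projections (Wq, Wk, Wv); head outputs are concatenated
    outs = [attention(X @ Wq, X @ Wk, X @ Wv) for Wq, Wk, Wv in heads]
    return np.concatenate(outs, axis=-1) @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 4, 512, 8
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) * 0.02 for _ in range(3)) for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model)) * 0.02
print(multi_head_attention(X, heads, Wo).shape)  # (4, 512)
```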

31 of 281

Recap

32 of 281

Add & Norm

33 of 281

Batch Normalization

Add & Norm are in fact two separate steps.

The add step is a residual connection

It means that we sum the output of a layer with its input: F(x) + x. The idea was introduced by He et al. (2015) with the ResNet model, and it is one of the remedies for the vanishing gradient problem.

The norm step is layer normalization (Ba et al., 2016), an alternative way of normalizing activations. TL;DR: it is one of many computational tricks that make training the model easier, improving both performance and training time.
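A minimal sketch of the Add & Norm step described above: a residual connection F(x) + x followed by layer normalization applied per token; the post-norm arrangement and the toy sublayer are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer):
    # "Add": residual connection F(x) + x, "Norm": layer normalization
    return layer_norm(x + sublayer(x))

x = np.random.randn(4, 512)                      # (seq_len, d_model)
feed_forward = lambda h: np.maximum(0, h) * 0.5  # stand-in for a sublayer
print(add_and_norm(x, feed_forward).shape)       # (4, 512)
```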

Add Residual Connection

Add & Norm

34 of 281

Feed Forward

35 of 281

Decoder

  • The decoder’s job is to generate text sequences
  • Target sequence = output embedding + position encoding
  • The first Decoder then produces an encoded representation for each word in the target sequence, which now incorporates the attention scores for each word as well.

36 of 281

Decoder - Masked Multi Head Attention

  • When computing attention scores on the word “am”, you should not have access to the word “fine”, because that word is a future word that was generated after.
  • Masking: method to prevent computing attention scores for future words
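A minimal sketch of this masking: future positions are set to −∞ before the softmax, so the word "am" cannot attend to "fine"; the scores and sequence here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention_weights(scores):
    # scores: (seq_len, seq_len) scaled dot-product scores
    seq_len = scores.shape[0]
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)  # True above the diagonal = future tokens
    scores = np.where(mask, -np.inf, scores)
    return softmax(scores)

scores = np.random.randn(4, 4)       # e.g. tokens: "<s> I am fine"
w = masked_attention_weights(scores)
print(np.round(w, 2))                # each row only attends to itself and earlier tokens
```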

37 of 281

Decoder

Mask(Dec.)

Only for the first attention layer of the decoder!!

38 of 281

Encoder - Decoder Attention

39 of 281

Encoder - Decoder Attention

We take the index of the highest probability score → predicted word

40 of 281

41 of 281

PoW Series #3

BERT

He is actually called Blas (Bert's name in the Spanish version of Sesame Street)

42 of 281

BERT and GPT are transformer-based architectures, while ELMo is a bi-LSTM language model. BERT is purely bidirectional, GPT is unidirectional, and ELMo is semi-bidirectional.

43 of 281

44 of 281

GPT-2 vs BERT

Stacking Encoders

45 of 281

BERT

  • In a left-to-right architecture, every token can only attend to previous tokens in the self-attention layers of the Transformer.
  • This is very harmful when applying fine-tuning based approaches to token-level tasks such as question answering, where it is crucial to incorporate context from both directions.
  • BERT alleviates the previously mentioned unidirectionality constraint by using a "masked language model" (MLM) pre-training objective.
  • MLM enables the representation to fuse the left and the right context, which allows us to pre-train a deep bidirectional Transformer.

46 of 281

BERT

Two steps: pre-training and fine-tuning

Pre-training: the model is trained on unlabeled data over different pre-training tasks.

Fine-tuning: the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.

47 of 281

BERT input representation

NEW: segment embeddings, used to differentiate which sentence each token belongs to

48 of 281

49 of 281

Pre-training

Masked LM(50%)

Train a bidirectional representation.

  • Mask random words from input.

  • Predict the masked tokens. *Only the masked words are predicted, not the entire input.
  • A downside is that this creates a mismatch between pre-training and fine-tuning, since the [MASK] token does not appear during fine-tuning.

NSP(50%)

Next sentence prediction.

  • Understand relationship between two sentences (used in QA & NLI).

  • Pre-train for a binarized next sentence prediction task (retrieved from any monolingual corpus).

  • In training, 50% correct sentence pairs are mixed in with 50% random sentence pairs to help BERT increase next sentence prediction accuracy.

50 of 281

Masked Language Model (MLM)

  • Enables/enforces bidirectional learning from text by masking (hiding) a word in a sentence and forcing BERT to bidirectionally use the words on either side of the covered word to predict the masked word

The bidirectional methodology you did to fill in the [blank] word above is similar to how BERT attains state-of-the-art accuracy. A random 15% of tokenized words are hidden during training and BERT’s job is to correctly predict the hidden words
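A minimal sketch of the 15% random masking described above; the token ids and the [MASK] id are illustrative assumptions, and -100 is just a conventional "do not predict here" marker.

```python
import numpy as np

MASK_ID = 103  # assumed id of the [MASK] token

def mask_tokens(token_ids, mask_prob=0.15, rng=np.random.default_rng(0)):
    token_ids = np.array(token_ids)
    labels = np.full_like(token_ids, -100)            # -100 = position not predicted
    chosen = rng.random(token_ids.shape) < mask_prob  # pick ~15% of positions
    labels[chosen] = token_ids[chosen]                # only the masked words are predicted
    token_ids[chosen] = MASK_ID                       # hide them from the model
    return token_ids, labels

inputs, labels = mask_tokens([2023, 2003, 1037, 7953, 6251, 1012])
print(inputs, labels)
```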

51 of 281

Masked Language Model (MLM)

Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension

52 of 281

Fine-Tuning

Train a bidirectional representation.

  • Uses Transformer Attention mechanism by swapping out the appropriate inputs and outputs.

  • Uses the self-attention mechanism to unify two stages: encoding text pairs and applying bidirectional cross attention.

53 of 281

54 of 281

BERT Size & Architecture

55 of 281

PoW Series #4

ViT

16x16

56 of 281

Some quick facts…

  • outperforms SOTA CNNs while requiring roughly 4x fewer computational resources
  • Google Research's GitHub: https://github.com/google-research/vision_transformer
  • Applications:
    • classic computer vision tasks
      • object detection
      • image segmentation
      • image classification
      • action recognition
    • Generative modelling
    • multi-modal tasks (visual grounding, visual question-answering, visual reasoning)
  • Dataset used: ImageNet

Previous work used as a source of inspiration:

  • Vaswani et al. (2017) - Transformer
  • Devlin et al. (2019) - BERT
  • Radford et al. (2018), Brown et al. (2020) - GPT

Attention Mechanisms:

  • Parmar et al (2018): self-attention in local neighborhoods for each query pixel
  • Child et al (2019) Sparse Transformers
  • Cordonnier et al (2020): Extract patches of size 2x2 from input image

57 of 281

Architecture

The ViT Encoder: essentially identical to the one in Attention Is All You Need.

  • Linear Projection: Standardizes the input

  • Multi-Head Self-Attention (MSA): e.g. saliency maps, alpha matting

  • Multi-Layer Perceptron (MLP): two layer classification network with GELU (Gaussian Error Linear Unit) (by leveraging classification tokens - as seen in BERT)

58 of 281

Models just want attention 0.0

Handling 2D Images:

  • Prepend a learnable [class] embedding to the sequence of patch embeddings
  • Positional Encoding: learned vectors with the same dimensionality as the patch embeddings

59 of 281

Linear Transformation

The problem with transformers: quadratic complexity when computing the Attention Matrix of an entire image. (patches split the images like words in a sentence)

Example: for a 28×28 MNIST image, flattening it gives 784 pixels, so we would still have to deal with a 784×784 attention matrix to see which pixels attend to one another.

Steps taken:

  1. Convert image into square patches
  2. Flatten Images into Vector Representations (xp)

Dimensions Explanation
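A minimal sketch of the two steps listed above (split into square patches, then flatten each patch into a vector x_p); the 28×28 image, the 4×4 patch size, and the projection dimension D are illustrative, matching the MNIST example.

```python
import numpy as np

def patchify(image, patch):
    """Split an HxW image into (H/patch * W/patch) flattened patches."""
    H, W = image.shape
    patches = image.reshape(H // patch, patch, W // patch, patch)
    patches = patches.transpose(0, 2, 1, 3).reshape(-1, patch * patch)
    return patches  # each row is one flattened patch x_p

image = np.random.rand(28, 28)          # e.g. an MNIST image
x_p = patchify(image, patch=4)          # 49 patches of 16 pixels each
print(x_p.shape)                        # (49, 16) -> attention over 49 tokens, not 784 pixels

# Trainable linear projection to the model dimension D (assumed D = 64 here)
E = np.random.randn(16, 64) * 0.02
tokens = x_p @ E                        # (49, 64)
```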

60 of 281

Learnable Embeddings (Classification)

61 of 281

Positional Encoding

62 of 281

Mathematical Intuition

  1. The Transformer uses constant latent vector size D through all of its layers, so we flatten the patches and map them to D dimensions with a trainable linear projection.

  • Multi-Headed Self Attention
  • Multi-Layer Perceptron
  • Similar to BERT's [class] token, we prepend a learnable embedding to the sequence of embedded patches, whose state at the output of the Transformer encoder serves as the image representation y.

63 of 281

SOTA

64 of 281

Pending Questions

  1. How does the positional encoding work in relation to local(patch) vs global(full image) pixels?
    1. In terms of the similarity matrix…
  2. Attention techniques for pixels:
    • Prepare example with code

65 of 281

AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

66 of 281

How is it working?

67 of 281

Types of Machine Learning Papers

68 of 281

CLIP: Connecting text and images

69 of 281

Types of Machine Learning Papers

70 of 281

LiT ‎️‍🔥: Zero-Shot Transfer with Locked-image text Tuning

4 billion images for training

71 of 281

Types of Machine Learning Papers

72 of 281

Simple Open-Vocabulary Object Detection with Vision Transformers

Vision Transformer for Open-World Localization, or OWL-ViT for short

73 of 281

Types of Machine Learning Papers

74 of 281

Scaling Vision Transformers to 22 Billion Parameters

75 of 281

VISION TRANSFORMERS

76 of 281

Types of Machine Learning Papers

77 of 281

78 of 281

PoW Series #5

CLIP

79 of 281

Introduction

CLIP stands for Contrastive Language-Image Pretraining:

CLIP is an open source, multi-modal, zero-shot model. Given an image and text descriptions, the model can predict the most relevant text description for that image, without optimizing for a particular task.

Contrastive Language: with this technique, CLIP is trained to understand that similar representations should be close in the latent space, while dissimilar ones should be far apart. This will become more clear later with an example.

80 of 281

Contrastive pre-training

N images paired with their text: (image1, text1… imageN, textN)

Contrastive Pre-training aims to jointly train an Image and a Text Encoder that produce image embeddings [I1, I2 … IN] and text embeddings [T1, T2 … TN], in a way that:

The cosine similarities of the correct <image-text> embedding pairs <I1,T1>, <I2,T2> (where i=j) are maximized.

In a contrastive fashion, the cosine similarities of dissimilar pairs <I1,T2>, <I1,T3>… <Ii,Tj> (where i≠j) are minimized.

81 of 281

Image encoder: ResNet or ViT

Text encoder: standard Transformer model with GPT-2 style modifications

82 of 281

Architecture

Code

83 of 281

(pseudocode)
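The paper's pseudocode is not reproduced on this slide; below is a minimal NumPy sketch of the symmetric contrastive objective described on the previous slides. The temperature, dimensions, and toy embeddings are assumptions, not the paper's implementation.

```python
import numpy as np

def logsumexp(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over cosine similarities of N image-text pairs."""
    I = l2_normalize(image_emb)                       # (N, d) image embeddings
    T = l2_normalize(text_emb)                        # (N, d) text embeddings
    logits = I @ T.T / temperature                    # (N, N) scaled cosine similarities
    idx = np.arange(len(I))                           # the correct pair <Ii, Ti> is on the diagonal
    log_p_img = logits - logsumexp(logits, axis=1)    # image -> text direction
    log_p_txt = logits - logsumexp(logits, axis=0)    # text -> image direction
    return -(log_p_img[idx, idx].mean() + log_p_txt[idx, idx].mean()) / 2

rng = np.random.default_rng(0)
N, d = 8, 512
print(clip_contrastive_loss(rng.normal(size=(N, d)), rng.normal(size=(N, d))))
```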

84 of 281

Dataset - WIT for WebImageText

CLIP’s architecture and data format allows scale.

400M Image-text pairs are all over the internet and don’t have to be labeled.

Learning many words here (purple background, jumping, dog, agile…)

This is what allows it to perform so well on zero-shot benchmarks.

What does zero-shot mean? “We find that CLIP, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others.”

85 of 281

Zero-shot vs Few-shot

86 of 281

Background: “Robustness”

Normally: Train on ImageNet. Report accuracy on test set. "My model is 80% accurate…"

ImageNet Adversarial

Dataset of examples which fooled a “standard” ResNet50 at the time.

…on this data ↑

Not even MNIST is robust 😱😱😱


87 of 281

Background: “Robustness”

Normally: Train on ImageNet. Report accuracy on test set. "My model is 80% accurate…"

ImageNet V2

ImageNet data collected in the same way as the original ImageNet.

…on this data ↑

88 of 281

Background: “Robustness”

Normally: Train on ImageNet. Report accuracy on test set. "My model is 80% accurate…"

ImageNet Rendition

ImageNet classes but animated/drawn/sketched.

…on this data ↑

89 of 281

Background: “Robustness”

Normally: Train on ImageNet. Report accuracy on test set. "My model is 80% accurate…"

ImageNet Corrupted (snow, defocus blur)

ImageNet validation data, with artificial corruptions applied (many variants e.g. blur types, frost, with different severities)

…on this data ↑

90 of 281

Bias

91 of 281

CLIP "opening its eyes progressively" allows faster pre-training

92 of 281

PoW Series #6

Vicuna

93 of 281

70K user-shared conversations from ShareGPT (a website where users can share their ChatGPT conversations), trained with PyTorch FSDP on 8 A100 GPUs in one day.

Training and Evaluation

94 of 281

70K

With PyTorch FSDP on 8 A100 GPUs in one day.

Training

95 of 281

Training

Similar to Alpaca + improvements:

Memory Optimizations: max_context_length (512 → 2048), which requires more GPU memory. We tackle the memory pressure by utilizing gradient checkpointing and flash attention.

Multi-round conversations: We adjust the training loss to account for multi-round conversations and compute the fine-tuning loss solely on the chatbot’s output.

Cost Reduction via Spot Instances: the 40x larger dataset and 4x longer sequence length pose a considerable challenge in training expenses. We use SkyPilot managed spot instances to reduce the cost, leveraging cheaper spot instances with auto-recovery for preemptions and automatic zone switching.

7B model → ($500 to $140)

13B model → ($1K to $300)
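Related to the multi-round conversation point above: a minimal sketch of computing the fine-tuning loss only on the chatbot's output, by marking all non-assistant tokens with an ignore label. The toy tokenizer, names, and the -100 ignore-index convention are illustrative assumptions, not the Vicuna training code.

```python
# Minimal sketch: build labels that ignore everything except the assistant's replies.
IGNORE = -100  # conventional "ignore this position" label in many training setups

def build_labels(turns, tokenize):
    """turns: list of (role, text); returns (input_ids, labels)."""
    input_ids, labels = [], []
    for role, text in turns:
        ids = tokenize(text)
        input_ids.extend(ids)
        # only the chatbot's own tokens are supervised
        labels.extend(ids if role == "assistant" else [IGNORE] * len(ids))
    return input_ids, labels

# toy whitespace "tokenizer" for illustration
tokenize = lambda s: [hash(w) % 32000 for w in s.split()]
conv = [("user", "How do I sort a list in Python?"),
        ("assistant", "Use the built-in sorted() function."),
        ("user", "And in place?"),
        ("assistant", "Call list.sort() on the list itself.")]
ids, labels = build_labels(conv, tokenize)
print(len(ids), labels[:8])
```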

96 of 281

Flash Attention

Multi-round conversations

Enhance the training scripts provided by Alpaca to better handle multi-round conversations and long sequences.

Spot Instance

97 of 281

70K

ShareGPT: a website where users can share their ChatGPT conversations

With PyTorch FSDP on 8 A100 GPUs in one day.

80 questions

Evaluation

98 of 281

To compare two different models, we combine the outputs from each model into a single prompt for each question. The prompts are then sent to GPT-4, which assesses which model provides better responses. A detailed comparison of LLaMA, Alpaca, ChatGPT, and Vicuna is shown in Table 1 below.

Evaluation

99 of 281

Evaluation

Careful prompt Engineering

Generate diverse and challenging questions

Select 10 questions per category

100 of 281

Evaluation

We ask GPT-4 to rate the quality of their answers based on helpfulness, relevance, accuracy, and level of detail. GPT-4 can produce not only relatively consistent scores but also detailed explanations of why such scores are given.

Caveat: GPT-4 is not good at coding/math evaluation.

101 of 281

Comparison with other LLMs

102 of 281

PoW Series #7

Optical flow-based odometry for trajectory estimation of drones

https://www.imavs.org/papers/2022/8.pdf

103 of 281

Definitions

ODOMETRY: estimating the motion of a robot by measuring changes in its position over time.

OPTICAL FLOW: the spatial distribution of the apparent motion of pixels in an image.

KALMAN FILTERS: a recursive algorithm to estimate the state of a system from noisy measurements. EKFs (Extended Kalman Filters) use non-linear functions to describe the system dynamics and measurement equations.

104 of 281

Odometry & Oscillations

(Figure: decomposition of the hexarotor's motion: total velocity, horizontal velocity, vertical velocity, the vertical axis, and the height of the hexarotor above the ground.)

±𝜱 = optical flow sensor orientation

D = optical flow sensors' distance to the ground

ω(±𝜱) = optical flow magnitudes

(Figure: ground footprints of the two sensors when the hexarotor goes up vs. when it goes down.)

* The angle 𝜱 does not change when the hexarotor goes up and down, but the distance on the ground between the sensors does change: d1 < d2.

105 of 281

Optical Flow

Optical flow is the pattern of apparent motion of image objects between two consecutive frames caused by the movement of the object or the camera (in other words, by the relative motion of the camera). It is a 2D vector field where each vector is a displacement vector showing the movement of points from the first frame to the second.

Ceteris Paribus:

  1. The pixel intensities of an object do not change between consecutive frames.
  2. Neighbouring pixels have similar motion.

Pixel correspondence problem.

  • Motion is small
  • The appearance (of the pixel) doesn’t change from t to t+1

Starting from the brightness constancy constraint, a Taylor series expansion of the right-hand side (dropping the time index) describes a tracked pixel with unknown velocities u and v as spatial motion induced by the camera; plugging this back into the brightness constancy constraint gives one linear equation per pixel.
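Reconstructing the equations implied above: the brightness constancy constraint, its first-order Taylor expansion, and the resulting optical flow constraint with unknown velocities u and v (standard derivation, not copied verbatim from the slides):

```latex
I(x, y, t) = I(x + \delta x,\; y + \delta y,\; t + \delta t)
\;\approx\; I(x, y, t) + I_x\,\delta x + I_y\,\delta y + I_t\,\delta t
\;\;\Longrightarrow\;\;
I_x u + I_y v + I_t = 0, \qquad u = \frac{\delta x}{\delta t},\; v = \frac{\delta y}{\delta t}
```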

Understanding

106 of 281

Optical Flow

Understanding

From the derivation above, we have an equation that describes a tracked pixel with unknown velocities u and v as spatial motion induced by the camera. What happens now?

Limitations:

  • 2 unknowns
  • 1 equation

The solution:

  • Track a collection of colocated pixels assuming they have the same u and v.

  • Now we can capture multiple equations and solve for u and v.

Least Squares Method

*these two are the same equation

Eigenvalue interpretation (the matrix represents the image gradients):

  • Small eigenvalues: small gradients in all directions
  • Large eigenvalues: large gradients
  • Large eigenvalue ratios: a combination of both
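A minimal sketch of the least-squares step: stack one constraint per pixel in a small patch (all assumed to share the same u and v) and solve for the flow; the finite-difference gradients and synthetic frames are illustrative choices.

```python
import numpy as np

def lucas_kanade_patch(I0, I1):
    """Estimate one (u, v) for a small patch, assuming all its pixels share the motion."""
    Ix = np.gradient(I0, axis=1).ravel()      # spatial gradients
    Iy = np.gradient(I0, axis=0).ravel()
    It = (I1 - I0).ravel()                    # temporal gradient between frames
    A = np.stack([Ix, Iy], axis=1)            # one equation Ix*u + Iy*v = -It per pixel
    b = -It
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    # eigenvalues of A^T A tell us whether the gradients constrain the solution well
    eigvals = np.linalg.eigvalsh(A.T @ A)
    return (u, v), eigvals

I0 = np.random.rand(8, 8)
I1 = np.roll(I0, shift=1, axis=1)             # patch shifted right by one pixel
flow, eigvals = lucas_kanade_patch(I0, I1)
print(flow, eigvals)
```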

107 of 281

Optical Flow

Translation Optical Flow:

  • The pattern generated on the optic flow vector field by the translational motion of a drone flying above the ground.

Optic Flow Divergence:

  • The series of contractions and expansions generated in the optic flow vector field by up-and-down oscillatory motions.

*Optical Flow Vector Field:

108 of 281

Optical Flow

Translation Optical Flow:

Hexarotor with two sensors:

Hexarotor with four sensors:

Three translational optic flow cues can be measured as:

  • The sum of the two optic flow magnitudes perceived by the two optic flow sensors (S1&S2) set along the longitudinal axis x.

  • The sum of the two optic flow magnitudes perceived on the x axis by the two optic flow sensors (S3&S4) set along the lateral axis y.

  • The median of the four optic flow magnitudes considered, projected on the hexarotor’s vertical axis by a 1/cos(ϕ) factor.

(Figure: hexarotor equipped with four optic flow sensors: S1 and S2 along the longitudinal axis x, S3 and S4 along the lateral axis y.)

http://hyperphysics.phy-astr.gsu.edu/hbase/rotq.html

109 of 281

Optical Flow

Optic Flow Divergence:

Hexarotor with two sensors:

Hexarotor with four sensors:

Two optical flow divergence cues can be measured as:

  • The subtraction between the two optic flow magnitudes perceived by the two optic flow sensors (S1&S2) set along the longitudinal axis x.

  • The subtraction between the two optic flow magnitudes perceived by the two optic flow sensors (S3&S4) set along the lateral axis y.

(Figure: hexarotor equipped with four optic flow sensors: S1 and S2 along the longitudinal axis x, S3 and S4 along the lateral axis y.)

110 of 281

Flow fields generated by different invariants

Result from translational movements

Result from vertical movements

111 of 281

Kalman Filters

“Kalman filtering, also known as linear quadratic estimation (LQE), is an algorithm that uses a series of measurements observed over time, containing statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone, by using Bayesian inference and estimating a joint probability distribution over the variables for each timeframe.”

112 of 281

Kalman Filters - Optimal estimation algorithm

113 of 281

Kalman Filters - Optimal estimation algorithm

114 of 281

Kalman Filters - Optimal estimation algorithm

115 of 281

Extended Kalman Filters

The Extended Kalman Filter (EKF) is a mathematical algorithm that is used to estimate the state of a nonlinear system in the presence of noisy sensor measurements. It is an extension of the standard Kalman Filter, which is designed to work with linear systems.

The EKF works by linearizing the system dynamics and measurement equations around the current estimate of the state, and then applying the Kalman Filter algorithm to the resulting linearized equations. This process is repeated iteratively to estimate the state of the system over time

The EKF is particularly useful in cases where the system dynamics are nonlinear and cannot be modeled accurately using linear models.
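For intuition, a minimal sketch of one linear Kalman filter predict/update cycle; the EKF described above replaces F and H with Jacobians of the nonlinear models around the current estimate. The matrices here model an illustrative 1D constant-velocity system, not the paper's hexarotor.

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    # Predict: propagate the state and covariance through the (linear) dynamics
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the noisy measurement z
    y = z - H @ x_pred                         # innovation
    S = H @ P_pred @ H.T + R                   # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)        # Kalman gain
    x_new = x_pred + K @ y
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Illustrative 1D constant-velocity model: state = [position, velocity]
dt = 0.1
F = np.array([[1, dt], [0, 1]])
H = np.array([[1.0, 0.0]])                     # we only measure position
Q, R = 1e-3 * np.eye(2), np.array([[0.05]])
x, P = np.zeros(2), np.eye(2)
for t in range(50):
    z = np.array([0.5 * t * dt + np.random.randn() * 0.2])  # noisy position measurement
    x, P = kalman_step(x, P, z, F, H, Q, R)
print(x)  # estimated [position, velocity]
```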

116 of 281

Components of the Paper

Optical Flow Cues:

Translation Optical Flow:

  • Used for visual odometry and localisation

Optical Flow Divergence:

  • Coming from the self-oscillatory motion which generates a series of expansions and contractions in the optic flow field

Extended Kalman Filter (EKF)

  • The local optic flow divergence was used to estimate the local distance between the chariot and the moving panorama using EKF.

117 of 281

Components of the Paper

  • Velocity:
    • V = Vh + Vx
    • If Vh is positive, the optical flow divergence component is a contraction
    • If Vh is negative the optical flow divergence is an expansion
    • The contraction or expansion of the optical flow is superimposed in the central optical flow vector field on the translational optic flow

  • Optical Flow Sensors:
    • Set at angles ±φ
      • Optical Flow Magnitudes:
        • ω(φ) and ω(−φ)

divergence

translation

118 of 281

Components of the Paper

Measurement of the local translational and divergence optic flow cues

Minimalistic Visual Odometer Method

Description of the hexarotor and the optic flow sensors used

Odometry Process based on the raw measurements of 2 optic flow sensors used

Sensor fusion odometry processing based on 4 optic flow sensors (both with a precise and with a rough prior knowledge of optic flow variations)

Sensors fusion strategies based on the knowledge of optical flow and how they increase the measurement accuracy of the local optic flow cues by comparing the three methods

119 of 281

Measurement of the local translational and divergence optic flow cues

Optic Flow Divergence: series of contractions and expansions generated in the optic flow vector field by up-and-down oscillatory motions. When a drone flies forward while oscillating up-and-down above the ground, in the optic flow vector field the optic flow divergence is superimposed on the translational optic flow.

the theoretical local optic flow divergence

local optic flow divergence

Scaling to 4 optical flow sensors

120 of 281

Minimalistic Visual Odometer Method (SOFIa)

SOFIa (Self-scaled Optic Flow time-based Integration model): integration of the local translational optic flow ωT, scaled by the estimated distance with respect to the ground ĥ.

ĥ was estimated by means of an EKF, taking as input the honeybee's wing stroke amplitude and as measurement the local optic flow divergence, computed as the ratio between Vh and h.

FUN FACT: The SOFIa model was found to be about 10 times more accurate than the raw mathematical integration of optic flow.
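A minimal sketch of the idea described above: the travelled distance is obtained by integrating the translational optic flow ωT scaled by the estimated height ĥ at each time step. The signals below are synthetic placeholders, not the paper's data or its exact model.

```python
import numpy as np

def visual_odometer(omega_T, h_hat, dt):
    """Integrate translational optic flow scaled by estimated height to get distance."""
    # ground speed v ≈ ω_T * ĥ, so distance ≈ Σ ω_T(t) * ĥ(t) * dt
    return np.sum(omega_T * h_hat * dt)

dt = 0.01
t = np.arange(0, 10, dt)
true_height = 1.5 + 0.3 * np.sin(2 * np.pi * 0.28 * t)   # up-and-down oscillation
true_speed = 0.8                                          # constant forward speed (m/s)
omega_T = true_speed / true_height                        # ideal translational optic flow (rad/s)
h_hat = true_height + np.random.randn(len(t)) * 0.02      # noisy height estimate (e.g. from an EKF)

print(visual_odometer(omega_T, h_hat, dt))  # ≈ 8 m travelled
```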

121 of 281

Description of the hexarotor and the optic flow sensors used

Hexarotor equipped with 4 optic flow sensors oriented towards the ground flying along a bouncing circular trajectory in the Marseille’s flying arena.

2 optic flow sensors (pixart PAW903) set along longitudinal axis x at angles φ = ±30◦ with respect to hexarotor's vertical axis z

2 optic flow sensors set along lateral axis y at angles φ = ±30◦ with respect to hexarotor's vertical axis z

Example of a test flight trajectory over 53 m at an oscillation frequency of 0.28 Hz.

A trajectory tracking algorithm was used to perform up-and-down oscillating circular trajectories: https://github.com/gipsa-lab-uav/trajectory control

122 of 281

Description of the hexarotor and the optic flow sensors used

  • Position and orientation used in the hexarotor’s control were taken from the motion-capture (MoCap) system installed in the Mediterranean Flying Arena.

  • The flying arena was equipped with 17 motion-capture cameras covering a 6 × 8 × 6 m volume using a VICONTM system.

  • Datasets including the optic flow measurements were recorded via the Robot Operating System (ROS) and processed with the Matlab/Simulink 2022 software.

123 of 281

Description of the hexarotor and the optic flow sensors used

State space representation used for the EKF

To estimate the hexarotor’s flight height ˆh, we chose to model the hexarotor’s system as a double integrator receiving as input the acceleration az on the vertical axis z given by the drone’s IMU.

124 of 281

Odometry Process based on the raw measurements of 2 optic flow sensors used

State space representation used for the EKF

NO PRIOR KNOWLEDGE (NPK)

  • input: the acceleration of the drone az
  • measurement: the local optic flow divergence

125 of 281

Sensor fusion odometry processing based on 4 optic flow sensors (both with a precise and with a rough prior knowledge of optic flow variations)

Precise Prior Knowledge

(PPK)

126 of 281

Sensor fusion odometry processing based on 4 optic flow sensors

Precise Prior Knowledge

(PPK)

Precise Prior Knowledge

(PPK)

Rough Prior Knowledge

(RPK)

Precise prior knowledge refers to a complete and accurate understanding of the system dynamics, including the physical laws that govern the motion of the objects in the scene, the noise characteristics of the sensors, and the characteristics of the imaging system. This level of knowledge allows for highly accurate predictions of the state of the system, which can be used to improve the accuracy of the optical flow estimation.

Rough prior knowledge, on the other hand, refers to a more limited understanding of the system dynamics. This could include knowledge of the approximate motion of the objects in the scene, but without a detailed understanding of the underlying physics or noise characteristics of the sensors. Rough prior knowledge is still useful for making predictions about the state of the system, but these predictions may be less accurate than those based on precise prior knowledge.

127 of 281

PoW Series # 8

GANs

128 of 281

General Architecture

129 of 281

Generative Adversarial Networks (G vs D)

pg = pd

pg = distribution of generated data

pd = distribution of real data

130 of 281

The Discriminator

  • Is simply a classifier.
  • It tries to distinguish real data from the data created by the generator. It could use any network architecture appropriate to the type of data it's classifying.

TRAINING

  • The discriminator classifies both real data and fake data from the generator.
  • The discriminator loss penalizes the discriminator for misclassifying a real instance as fake or a fake instance as real.
  • The discriminator updates its weights through backpropagation from the discriminator loss through the discriminator network.

131 of 281

The Generator

  • Learns to create fake data by incorporating feedback from the discriminator.
  • It learns to make the discriminator classify its output as REAL.

TRAINING

  1. Sample random noise.
  2. Produce generator output from sampled random noise.
  3. Get discriminator "Real" or "Fake" classification for generator output.
  4. Calculate loss from discriminator classification.
  5. Backpropagate through both the discriminator and generator to obtain gradients.
  6. Use gradients to change only the generator weights.
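A minimal PyTorch-style sketch of the training steps listed above and the alternating scheme on the next slide (discriminator step, then generator step); the tiny models, toy data, and non-saturating generator loss are illustrative choices, not the exact algorithm of the original paper.

```python
import torch
from torch import nn

G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))   # noise -> fake sample
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))    # sample -> real/fake logit
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def real_batch(n=64):                      # toy "real" data: a 2D Gaussian blob
    return torch.randn(n, 2) * 0.5 + torch.tensor([2.0, -1.0])

for step in range(1000):
    # 1) Discriminator step (generator held constant)
    x_real, z = real_batch(), torch.randn(64, 16)
    x_fake = G(z).detach()                 # detach: do not update G here
    d_loss = bce(D(x_real), torch.ones(64, 1)) + bce(D(x_fake), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) Generator step (discriminator held constant)
    x_fake = G(torch.randn(64, 16))
    g_loss = bce(D(x_fake), torch.ones(64, 1))   # non-saturating loss: make D say "real"
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

print(float(d_loss), float(g_loss))
```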

132 of 281

Alternating Training

Step 1: The discriminator trains for one or more epochs.
  • Remains constant: the generator.
  • Why? Discriminator training tries to figure out how to distinguish real data from fake, so it has to learn how to recognize the generator's flaws.

Step 2: The generator trains for one or more epochs.
  • Remains constant: the discriminator.
  • Why? Otherwise the generator would be trying to hit a moving target and might never converge.

Repeat steps 1 and 2 to continue to train the G and D networks.

133 of 281

Variable Definition

  • We train D to maximize the probability of assigning the correct label to both training examples and samples from G.
  • We simultaneously train G to minimize log(1 − D(G(z))).
  • pg: the generator's distribution over the data x
  • G(z; θg): mapping to data space, where G is a differentiable function represented by an MLP
  • pz(z): prior on the input noise variables
  • θg, θd: parameters of the MLPs G and D
  • D(x): outputs a single scalar, which represents the probability that x came from the data rather than from a generated sample

134 of 281

Min G & Max D
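The two-player minimax objective this slide refers to, from the original GAN paper:

```latex
\min_G \max_D \; V(D, G) =
\mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```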

135 of 281

Understanding the training

Figure legend: D = the discriminator; pg = the generator's (G) distribution; px = the data generating distribution.

The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution pg on transformed samples. G contracts in regions of high density and expands in regions of low density of pg

(d) After several steps of training, G and D reach a point at which both cannot improve because pg = pdata.

The discriminator is unable to differentiate between the two distributions, i.e. D(x) = 1/2.

(b) Train D

(c) Train G

After an update to G, gradient of D has guided G(z) to flow to regions that are more likely to be classified as data.

x = data domain

z = noise domain

“We alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough.”

136 of 281

for k steps learn discriminator:

Learning Algorithm

Then, update generator:

Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough.

137 of 281

Algorithm 1

sample noise and real data

maximize Discriminator

minimize Generator

sample noise and real data

138 of 281

Theoretical Results

139 of 281

Practical Results

140 of 281

Applications of GANs

  • CGANs: using labels to improve GANs
  • CycleGAN: cross-domain transfer; image-to-image translation from one domain to another
  • PixelDTGAN: pixel-level domain transfer for recommendation systems
  • SRGAN: super-resolution images from lower-resolution inputs
  • GauGAN: synthesizes photorealistic images given an input semantic layout
  • StarGAN, TP-GAN, DeblurGAN: further cross-domain / image-to-image translation GANs

141 of 281

Demo

The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x.

142 of 281

Main Challenges Overview

mode collapse

Real-life data distributions are multimodal. For example, in MNIST, there are 10 major modes from digit ‘0’ to digit ‘9’. The samples below are generated by two different GANs. The top row produces all 10 modes while the second row creates a single mode only (the digit “6”). This problem is called mode collapse when only a few modes of data are generated.

Non-�convergence

GAN is based on the zero-sum non-cooperative game. In short, if one wins the other loses. A zero-sum game is also called minimax. Your opponent wants to maximize its actions and your actions are to minimize them. In game theory, the GAN model converges when the discriminator and the generator reach a Nash equilibrium -> But in practice the model does not always converge.

Unstable Gradients

If the Discriminator becomes too good too quickly (e.g., it can perfectly distinguish real from fake samples), the generator may receive gradients that are near zero. This is because when the discriminator is confident, the outputs of the discriminator (for fake images) are close to zero. This leads to very little or no learning by the generator. The generator relies on feedback; if the feedback is weak (because of vanishing gradients), the generator cannot make meaningful updates.

143 of 281

PoW Series

Variational Autoencoders

144 of 281

Firstly…

Autoencoders

145 of 281

Latent vector

146 of 281

Inference phase

Autoencoders

147 of 281

Different distributions

DOGS

CATS

MOUSE

Previous vector

There is a high chance to pick a “garbage” vector…

148 of 281

We cannot generate images with Autoencoders

We don't know how to sample a vector from the latent distribution, but what if we did?

149 of 281

Variational Autoencoders

150 of 281

Different distributions

DOGS

CATS

MOUSE

Previous vector

Continuity: two close points in the latent space should not give two completely different contents once decoded.

Completeness: for a chosen distribution, a point sampled from the latent space should give "meaningful" content once decoded.

151 of 281

152 of 281

  1. The input is encoded as distribution over the latent space
  2. A point from the latent space is sampled from that distribution
  3. The sampled point is decoded and the reconstruction error can be computed
  4. The reconstruction error is backpropagated through the network

153 of 281

The loss function is composed of a reconstruction term (that makes the encoding-decoding scheme efficient) and a regularisation term (that makes the latent space regular).
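Written out, the loss combines a reconstruction term with a KL regularisation term that pulls the approximate posterior q(z|x) towards the prior p(z), usually a standard Gaussian:

```latex
\mathcal{L}(x) =
\underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[-\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
\;+\;
\underbrace{D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,p(z)\big)}_{\text{regularisation}}
```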

154 of 281

KL Divergence

KL Divergence in VAEs serves to regularize the learned representations in the latent space:

  • Ensures that the distribution of the latent variables stays close to the prior distribution (typically a standard Gaussian).

  • This pushes the model to use the latent space efficiently and prevents overfitting.

  • We also ensure that any point sampled from the prior distribution can be decoded into a meaningful data point, making the latent space dense with meaningful representations.

155 of 281


156 of 281

Back to VQ-VAEs

157 of 281

158 of 281

159 of 281

160 of 281

Latent Space

Steps to reproduce:

  1. Train a VAE on the MNIST dataset.

  • Pass MNIST images through the encoder part of the VAE. The encoder will produce a vector in the latent space for each image.

  • Plot these points on a 2D scatter plot. Each point corresponds to an MNIST image.

  • If the VAE learned meaningful representations, you'd expect to see clusters of the same digit grouping together.

Latent space refers to the lower-dimensional space where input data (such as images) are encoded.

FUN FACT: If two clusters are close together or overlapping, it suggests that in the latent space, the model finds those digits similar in some way. For example, the digits "4" and "9" might be closer together than "1" and "8" since their shapes have some similarities.

161 of 281

VQ-VAE vs VAE

2 main differences:

  • Discrete rather than continuous (challenging)

  • Prior distribution is learnt rather than static

Discrete

Continuous

Traditional VAE

VQ-VAE

Common Goals

2 common goals:

  • Conserve Important Features in the Latent Space
  • Even though the data is being compressed by the encoder, the vital characteristics or patterns should not be lost.

“For instance, when trained on speech we discover the latent structure of language without any supervision or prior knowledge about phonemes or words”

  • Optimize for Maximum Likelihood
  • Likelihood is a measure of how well a statistical model describes observed data.
  • In the context of generative models, optimizing for maximum log-likelihood ensures that the model generates data that's statistically similar to the training data.

162 of 281

Latent Variables and Prior Distribution

Discrete

Continuous

Traditional VAE

VQ-VAE

VQ-VAE learns the discrete latent space from the input data instead of having a static space.

This means that, instead of assuming the latent variables should follow a standard Gaussian, the VQ-VAE learns the best distribution for the latent variables based on the data it's trained on. This flexibility can provide advantages in terms of model performance and the quality of generated or reconstructed data.

For a classic VAE, the latent variables distribution is typically chosen to be a standard Gaussian (or Normal) distribution. This means that the VAE tries to make the latent variables it learns for each data point look like they've been drawn from a Gaussian distribution.

163 of 281

VQ-VAE

164 of 281

Loss function

x = model input

ze(x) = encoder output

zq(x) = decoder input

e = embedding vectors

sg = stop gradient operator (detaches variable from learning)

reconstruction loss

  • max log-likelihood
  • decoder
  • encoder

Vector Quantisation (VQ)

  • embeddings optimized

Commitment Loss

  • encoder optimized
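Putting the three terms together with the notation above (sg = stop-gradient operator, e = the selected embedding vector, β = the commitment weight), the VQ-VAE loss from the paper is:

```latex
L = \log p\big(x \mid z_q(x)\big)
+ \big\lVert \mathrm{sg}\big[z_e(x)\big] - e \big\rVert_2^2
+ \beta \big\lVert z_e(x) - \mathrm{sg}[e] \big\rVert_2^2
```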

165 of 281

Reducing Space

166 of 281

Audio

The decoder is conditioned on both the latents and a one-hot embedding for the speaker

167 of 281

PoW Series #15

Stable Diffusion

High-Resolution Image Synthesis with Latent Diffusion Models

168 of 281

High Level Overview


DALL-E 1 uses:

DALL-E 2 uses:

  • CLIP embeddings directly
  • decodes images via diffusion, similar to GLIDE

Stable Diffusion

https://arxiv.org/pdf/2112.10752.pdf

169 of 281

Latent Diffusion - Abstract


By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations.

To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity

Diffusion Models work great but are too computationally expensive

We apply them in the latent space of pre-trained autoencoders and reach optimality between complexity reduction and detail preservation

Reached SOTA scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including text-to-image synthesis, unconditional image generation and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.

170 of 281

Too many inputs here

In a fully connected network, each input contributes equally to each output

This approach doesn't make sense, because pixels that make up an edge are more important than the background

171 of 281

172 of 281

173 of 281

residual connections

Network extracts more features by increasing kernel depth and scaling image down to increase field of view of kernel

174 of 281

residual connections

Residual connection

175 of 281

residual connections

Positional encoding is a type of embedding.

Embedding: converting discrete variables into continuous vectors

We train the model on images with varying levels of noise
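A minimal sketch of training on "images with varying levels of noise": sample a noise level, corrupt the image, and train a network to predict the added noise. The linear schedule and the tiny model are illustrative assumptions, not the Stable Diffusion code.

```python
import torch
from torch import nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                 # noise schedule (assumed linear)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # cumulative signal fraction

model = nn.Sequential(nn.Linear(28 * 28 + 1, 256), nn.ReLU(), nn.Linear(256, 28 * 28))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x0):                                    # x0: (batch, 784) clean images
    t = torch.randint(0, T, (x0.shape[0],))            # random noise level per image
    a = alpha_bar[t].unsqueeze(1)
    eps = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps          # image with that level of noise
    eps_pred = model(torch.cat([x_t, t.unsqueeze(1) / T], dim=1))
    loss = ((eps_pred - eps) ** 2).mean()               # learn to predict the added noise
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(train_step(torch.rand(32, 28 * 28)))
```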

176 of 281

When we try to jump from complete noise to the actual image it turns out blurry, so we have to do it partially

Since we trained on images with varying levels of noise, we can start with complete noise and work our way up towards an actual image

177 of 281

Remember VAEs?

178 of 281

residual connections

Residual connection

Latent diffusion model

179 of 281

180 of 281

181 of 281

residual connections

Residual connection

what if encoding the images and the text gave us the same embedding vectors?

182 of 281

183 of 281

184 of 281

185 of 281

186 of 281

187 of 281

PoW Series #16

Gaka-Chu

188 of 281

The road that led to the first self-employed autonomous robot

IE Robotics and AI Club - 19/10/2023

189 of 281

190 of 281

191 of 281

192 of 281

193 of 281

194 of 281

195 of 281

196 of 281

197 of 281

Photo source: medium.com

198 of 281

Robot Therapy for Autistic Children

199 of 281

(Diagram: educators, public institutions, and doctors, each holding raw data.)

200 of 281

(Diagram: educators, public institutions, and doctors, each holding raw data, connected to a central data server via queries.)

201 of 281

(Diagram: educators, public institutions, and doctors, each holding raw data, send queries to the data server and receive aggregated data in return.)

202 of 281

(Diagram: question/answer flow through the system, with numbered steps 1-8: a background data service and a background data auditing service, an IM learning and sharing service, vetted algorithms, logging & verification, blockchain transaction management, a blockchain, a local repository where ML training is carried out locally, a data server, and aggregated info.)

203 of 281

(Diagram: a network of HUB nodes organized in tiers I, II, and III.)

204 of 281


205 of 281

206 of 281

207 of 281

Swarm coordination through smart contracts:

  1. Robots connect to the AUX node
  2. Consensus starts
  3. AUX node mines the contract
  4. Robots register with the contract
  5. Robots publish/subscribe to the contract

208 of 281

Blockchain-based smart contracts for securing robot swarms

Managing Byzantine Robots via Blockchain Technology in a Swarm Robotics Collective Decision Making Scenario, (AAMAS 2018).

Robust consensus achievement

209 of 281

Swarm robotics systems are robust and fault tolerant.

However: redundancy comes at a cost -> the plan needs to be distributed (decreasing an attack's cost).

Research question: is it possible to provide the "blueprint" of a mission without describing the mission itself?

210 of 281

Merkle Tree (MT): separates data verification from the data itself.

(Diagram: a blockchain section containing a transaction "A -> B, 1 BTC" with hash d8e131acaf….)

211 of 281

(Diagram: a Merkle tree: the root node H1 = H(H2, H3); interior nodes H2 = H(H4, H5) and H3 = H(H6, H7); leaf nodes are the operations, alternating sensor inputs and actions.)

212 of 281

(Diagram: Merkle-proof exchange between a prover P and a verifier V: a query Q for node 1 is sent, a proof π is returned, and the proof is verified by combining the hashes received from P with those in local memory.)

213 of 281

214 of 281

215 of 281

216 of 281

Gaka-chu:

A self-employed robot artist

217 of 281

218 of 281

219 of 281

System architecture

  1. Robot paints a picture
  2. Auction starts
  3. Winner is selected
  4. Picture is sent to the winner
  5. Robot receives payment for the picture
  6. Robot buys supplies from the arts shop
  7. Robot receives supplies

220 of 281

Wallet balance (ETH) plotted against timestamps over 6 months, annotated with: experiment start, network fees, investor loans, investor repayment, auction site fees, painting sales, and consumable purchases.

221 of 281

Drawing process

Children's Day

子供の日

222 of 281

Drawing process (panels A, B, C).

223 of 281

224 of 281

Ordering supplies

If the remaining supplies fall below a threshold (< 1), the robot places an order through the shop's API (shop ETH address, 3x supplies, 0.1 ETH) and receives an OK confirmation.

225 of 281

General workflow in detail:

  1. Robot's camera grabs the image
  2. Task planner processes the image
  3. Image is preprocessed (e.g., border detection)
  4. Image processing is over
  5. Filming starts
  6. Video is recorded
  7. Video recording is over
  8. Files are uploaded to IPFS and the NFT minter starts
  9. Information is sent to the Smart Contract
  10. NFT is sent to the auction platform
  11-14. Order consumables from the shop

226 of 281

Auction process

227 of 281

228 of 281

Robotics/AI

229 of 281

Robotics/AI

Digital trust

230 of 281

Robotics/AI

Digital trust

Society

231 of 281

Robotics/AI

Digital trust

Society

Trustable Autonomy (TA)

232 of 281

Thanks!

eduardo.castello@ie.edu

233 of 281

PoW Series #17

Dreambooth

234 of 281

Dreambooth: A new approach for “personalization” of text-to-image diffusion models.

235 of 281

Text-to-image Diffusion Models

Diffusion models are probabilistic generative models that are trained to learn a data distribution by the gradual denoising of a variable sampled from a Gaussian distribution.

Specifically, we are interested in a pre-trained text-to image diffusion model

236 of 281

Personalization of Text-to-Image models

Our first task is to implant the subject instance into the output domain of the model such that we can query the model for varied novel images of the subject.

Care had to be taken when fine-tuning generative models such as GANs in a few-shot scenario, as it can cause overfitting and mode collapse, as well as failing to capture the target distribution sufficiently well.

This line of work primarily seeks to generate images that resemble the target distribution but has no requirement of subject preservation

237 of 281

Designing Prompts for Few-Shot Personalization

Our goal is to “implant” a new (unique identifier, subject) pair into the diffusion model’s “dictionary”.

We label all input images of the subject as "a [identifier] [class noun]", where [identifier] is a unique identifier linked to the subject and [class noun] is a coarse class descriptor of the subject (e.g. cat, dog, watch, etc.).

Rare-tokens:

  • We generally find existing English words (e.g. “unique”, “special”) suboptimal since the model has to learn to disentangle them from their original meaning and to re-entangle them to reference our subject.
  • Selecting random characters to generate a rare identifier (e.g. "xxy5syt00") incurs similar weaknesses to using common English words.
  • The sequence can be of variable length k, and we find that relatively short sequences of k = {1, ..., 3} work well.
  • For Imagen, we find that using uniform random sampling of tokens that correspond to 3 or fewer Unicode characters works well.

238 of 281

Class-specific Prior Preservation Loss

The best results for maximum subject fidelity are achieved by fine-tuning all layers of the model. This raises 2 problems:

  • language drift -> where a model that is pre-trained on a large text corpus and later fine-tuned for a specific task progressively loses syntactic and semantic knowledge of the language.
  • reduced output diversity -> there is a risk of reducing the amount of variability in the output poses and views of the subject (e.g. snapping to the few-shot views). We observe that this is often the case, especially when the model is trained for too long.

To mitigate the two aforementioned issues, we propose an autogenous class-specific prior preservation loss that encourages diversity and counters language drift. In essence, our method is to supervise the model with its own generated samples, in order for it to retain the prior once the few-shot fine-tuning begins.

  • We find this prior-preservation loss is effective in encouraging output diversity and in overcoming language-drift. We also find that we can train the model for more iterations without risking overfitting

239 of 281

PoW Series #18

Inverse Kinematic Analysis Of A Quadruped Robot

240 of 281

Inverse Kinematic Analysis Of A Quadruped Robot

241 of 281

What are Kinematics?

Kinematics refers to the subfield of physics that describes the movement of bodies without taking into account the forces that cause them to move.

242 of 281

Two Types of Kinematic Analysis

  • Forward Kinematic Analysis:

Given the joint angles, we are able to obtain the end position (point in space).

  • Inverse Kinematic Analysis:

Knowing the end point, we are able to calculate the joint angles required to reach that point.

243 of 281

Physical Model

The quadruped robot consists of a rigid body, rotary joints, and links between said joints.

rotary joints

rigid body

links

244 of 281

Robot Parameters

Physical Dimensions

245 of 281

Robot Parameters

Coordinate System

246 of 281

Robot Parameters

Variables

247 of 281

Robot Parameters

To sum up

248 of 281

Rotation Matrix

Each rotation matrix represents a rotation about its respective axis: Rx (roll) represents the roll about the x-axis, Ry (pitch) about the y-axis, and Rz (yaw) about the z-axis.

order matters here!!
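For reference, the three elementary rotation matrices (roll about x, pitch about y, yaw about z); remember that multiplication order matters when composing them:

```latex
R_x(\phi)=\begin{bmatrix}1&0&0\\ 0&\cos\phi&-\sin\phi\\ 0&\sin\phi&\cos\phi\end{bmatrix},\quad
R_y(\theta)=\begin{bmatrix}\cos\theta&0&\sin\theta\\ 0&1&0\\ -\sin\theta&0&\cos\theta\end{bmatrix},\quad
R_z(\psi)=\begin{bmatrix}\cos\psi&-\sin\psi&0\\ \sin\psi&\cos\psi&0\\ 0&0&1\end{bmatrix}
```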

Check it in our Colab

249 of 281

Transformation Matrix

Determines the position and orientation of the robot's center of body in the workspace.

The kinematic equation relates the center-of-body coordinate system (xm, ym, zm) to the main coordinate system of each leg (x0, y0, z0).

250 of 281

Forward Kinematics

Relationship between the positions, velocities, and accelerations of the robot links.

Denavit-Hartenberg parameters

251 of 281

Forward Kinematics

252 of 281

INVERSE KINEMATIC

Analytical closed-form vs. iterative numerical methods

253 of 281

Inverse Kinematic

Tools:

  1. Two-argument arctangent (arctan2): overcomes the sign (quadrant) ambiguity of arctan.

254 of 281

Inverse Kinematic

Tools:

2. Law of cosines
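A minimal sketch of how these two tools combine for a planar two-link leg: given a foot position (x, y) and link lengths l1, l2, solve for the two joint angles with the law of cosines and arctan2. This is a generic textbook solution under illustrative assumptions, not the paper's exact equations.

```python
import numpy as np

def two_link_ik(x, y, l1, l2, knee_up=True):
    """Inverse kinematics of a planar 2-link limb using the law of cosines + arctan2."""
    r2 = x**2 + y**2
    # Law of cosines for the knee angle: r^2 = l1^2 + l2^2 + 2*l1*l2*cos(theta2)
    c2 = (r2 - l1**2 - l2**2) / (2 * l1 * l2)
    c2 = np.clip(c2, -1.0, 1.0)                   # guard against numerical drift
    theta2 = np.arccos(c2) * (1 if knee_up else -1)
    # arctan2 resolves the quadrant of the target and of the second link's offset
    theta1 = np.arctan2(y, x) - np.arctan2(l2 * np.sin(theta2), l1 + l2 * np.cos(theta2))
    return theta1, theta2

# Forward-kinematics check: does the foot land back on the target?
l1, l2 = 0.12, 0.12
t1, t2 = two_link_ik(0.15, -0.10, l1, l2)
foot = (l1 * np.cos(t1) + l2 * np.cos(t1 + t2),
        l1 * np.sin(t1) + l2 * np.sin(t1 + t2))
print(np.round(foot, 4))   # ≈ (0.15, -0.10)
```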

255 of 281

Calculations

The equations are nonlinear; that is why legs 1 and 3 are in different configurations with respect to legs 2 and 4.

256 of 281

Relation

257 of 281

Example Rules

Check it in our Colab

258 of 281

Example Rules

259 of 281

Example Rules

260 of 281

PoW Series #19

crewAI: Framework for orchestrating role-playing, autonomous AI agents.

261 of 281

Key Features

262 of 281

The process

  • Sequential (Supported): This is the only process currently implemented in CrewAI. It ensures tasks are handled one at a time, in a given order, much like a relay race where one runner passes the baton to the next.
  • Consensual (WIP): Envisioned for a future update, the consensual process will enable agents to make joint decisions on task execution, similar to a team consensus in a meeting before proceeding.
  • Hierarchical (WIP): Also in the pipeline, this process will introduce a chain of command to task execution, where some agents may have the authority to prioritize tasks or delegate them, akin to a traditional corporate hierarchy. These additional processes, once implemented, will offer more nuanced and sophisticated ways for agents to interact and complete tasks, much like teams in complex organizational structures.
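A minimal sketch of the sequential process with the crewAI API; the roles, goals, and task descriptions are illustrative, and the exact constructor arguments may differ between crewAI versions.

```python
from crewai import Agent, Task, Crew, Process

# Two role-playing agents (roles and goals are illustrative)
researcher = Agent(
    role="Researcher",
    goal="Find the key points of the weekly paper",
    backstory="A meticulous reader of ML papers.",
)
writer = Agent(
    role="Writer",
    goal="Turn the key points into presentation slides",
    backstory="A concise technical writer.",
)

# Tasks are handled one at a time, in order, like a relay race
research_task = Task(description="Summarize the paper's main contributions.",
                     expected_output="A bullet-point summary.", agent=researcher)
writing_task = Task(description="Draft slide text from the summary.",
                    expected_output="Slide-ready bullet points.", agent=writer)

crew = Crew(agents=[researcher, writer],
            tasks=[research_task, writing_task],
            process=Process.sequential)   # the only process currently implemented

result = crew.kickoff()
print(result)
```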

263 of 281

264 of 281

265 of 281

266 of 281

Multi Agent Collaboration

267 of 281

Agent Supervisor

In this way, the supervisor can also be thought of as an agent whose tools are other agents!

268 of 281

Hierarchical Agent Teams

We call this hierarchical teams because the subagents can in a way be thought of as teams.

269 of 281

Hierarchical Agent Teams

Joao Moura put together a great example of using CrewAI with LangChain and LangGraph to automate the process of checking emails and creating drafts. CrewAI orchestrates autonomous AI agents, enabling them to collaborate and execute complex tasks efficiently.

270 of 281

PoW Series #20

DensePose from Wifi

271 of 281

What is the space of these topics?

  1. DensePose was introduced in 2018 and aims to map human pixels in an RGB image to the 3D surface of the human body.
  2. Synced has previously covered additional research on the use of WiFi signals for human pose and action recognition through walls, and the associated risks of such technologies. Their "RF-Action" AI model is an end-to-end deep neural network that recognizes human actions from wireless signals.

272 of 281

An overview of the method

WiFi-based DensePose:

  • generates UV coordinates of the human body surface using raw CSI signals, which are cleaned by amplitude and phase sanitization

  • a two-branch encoder-decoder network that translates the sanitized CSI samples to 2D feature maps that resemble images

  • and a modified DensePose-RCNN architecture that uses 2D features from the previous step to estimate a UV map representing the dense correspondence between 2D and 3D humans.

What is

Channel state information (CSI) lays the foundation of most wireless sensing techniques, including Wi-Fi sensing, LTE sensing, and so on. CSI provides physical channel measurements in subcarrier-level granularity, and it can be easily accessed from the commodity Wi-Fi network interface controller (NIC).

CSI describes the propagation process of the wireless signal and therefore contains geometric information of the propagation space. Thus, understanding the mapping relationship between CSI and spatial geometric parameters lays the foundation for feature extraction and sensing algorithm design.

definition

273 of 281

The Encoder Decoder Network

274 of 281

Modality Translation Network: two encoders extract the features from the amplitude and phase in the CSI domain. Then the features are fused and reshaped before going through an encoder-decoder network. The output is a 3 × 720 × 1280 feature map in the image domain.

275 of 281

Problem & Evaluation

The group identified two primary categories of failure cases.

(1) The WiFi-based model is biased and is likely to create faulty body parts when body positions are infrequently seen in the training set.

(2) Extracting precise information for each subject from the amplitude and phase tensors of the entire capture is more difficult for the WiFi-based approach when there are three or more contemporary subjects in one capture.

276 of 281

SORA: Text2Video

277 of 281

IDEAS

After this slide, the PoWs are suggestions and not yet fixed.

278 of 281

PoW Series

LCM + LORA

279 of 281

PoW Series

LoRA

280 of 281

PoW Series

ELMo & ULMFiT

281 of 281

Language models can explain neurons in language models