PoW Series #1 - RNNs
Sequence data
Time series
Traditional vs RNN Architecture
Multilayer perceptron
RNN
Forward Propagation
In the following example, the prediction ŷ3 gets information from the input x3 and the activations computed from x1 and x2, as you can see on the green path.
activation function: tanh/Relu
activation function: softmax, sigmoid (binary class), etc
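A minimal NumPy sketch of one RNN forward step matching the activations above (the weight names Wax, Waa, Wya are illustrative, not from the slides):

```python
import numpy as np

def rnn_step(x_t, a_prev, Wax, Waa, Wya, ba, by):
    """One RNN time step: tanh for the hidden activation, softmax for the output."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)        # hidden state carries information forward in time
    logits = Wya @ a_t + by
    y_t = np.exp(logits - logits.max()) / np.exp(logits - logits.max()).sum()  # softmax over classes
    return a_t, y_t
```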
Back Propagation
“Often, the programming framework will automatically take care of backpropagation.” However, the way it works is that backpropagation goes from right to left using a loss function, in a sense going backwards in time (backpropagation through time).
Model Output
Expected Output
Loss function
Extra - Gated Recurrent Units (GRU)
Long Short Term Memory (LSTMs)
Bidirectional architecture
Different types of RNNs
The different types of RNNs are one-to-one (a standard NN), one-to-many (sequence generation), many-to-one (e.g., sentiment classification), and many-to-many (using an encoder and decoder so that x and y can have different lengths).
Sequence to Sequence (seq2seq)
Encoder - Decoder
PoW Series #2 - Transformers
Transformers
RNN
LSTM
GRU
Attention mechanism
Enough compute resources
Being capable of using the entire context of the text
Transformers
Positional Encoding
“Positional Embedding”
Dimension_model(d)= 512
Positional Encoding
Word embeddings
FAQ
Why sum and not concat?
Same dimension
(A x B x C)
“Positional Embedding”
Positional Encoding
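A minimal NumPy sketch of the sinusoidal positional encoding (d = 512 by default); it is summed with the word embeddings, which is why both must share the same dimension:

```python
import numpy as np

def positional_encoding(max_len, d_model=512):
    """Sinusoidal positional encoding as in "Attention Is All You Need"."""
    pos = np.arange(max_len)[:, None]                       # (max_len, 1)
    i = np.arange(d_model)[None, :]                         # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle[:, 0::2])                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle[:, 1::2])                    # odd dimensions: cosine
    return pe                                               # added (not concatenated) to word embeddings
```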
Encoder
Dimension_model= 512
Attention
Attention
Attention
For example, when you search for videos on YouTube, the search engine maps your query (the text in the search bar) against a set of keys (video title, description, etc.) associated with candidate videos in its database, then presents you the best-matched videos (values).
Multi-head attention
Multi-head attention
Multi-head attention
Multi-head attention
dk= dimension of query and key
Multi-head attention
Take the softmax of the scaled score to get the attention weights, which gives you probability values between 0 and 1
Multi-head attention
Multi-head attention
Concat
“2 stacks”
Perform the attention function in parallel
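A minimal NumPy sketch of scaled dot-product attention for a single head; multi-head attention runs several of these in parallel and concatenates the results:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # scaled similarity of queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax -> attention weights between 0 and 1
    return weights @ V                               # weighted sum of the values
```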
Recap
Add & Norm
Batch Normalization
Add & Norm are in fact two separate steps.
The add step is a residual connection
It means that we sum the output of a layer with its input: F(x) + x. The idea was introduced by He et al. (2015) with the ResNet model. It is one of the solutions to the vanishing gradient problem.
The norm step is layer normalization (Ba et al., 2016), another way of normalizing. TL;DR: it is one of the many computational tricks that make training easier, improving performance and training time.
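A minimal NumPy sketch of the two steps together (the learned gain and bias of layer normalization are omitted for brevity):

```python
import numpy as np

def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual connection F(x) + x followed by layer normalization over the feature dimension."""
    y = x + sublayer_out                             # "add": residual connection
    mean = y.mean(axis=-1, keepdims=True)
    std = y.std(axis=-1, keepdims=True)
    return (y - mean) / (std + eps)                  # "norm": layer normalization
```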
Add Residual Connection
Add & Norm
Feed Forward
Decoder
Decoder - Masked Multi Head Attention
Decoder
Mask(Dec.)
Only for the first attention layer of the decoder!!
Encoder - Decoder Attention
Encoder - Decoder Attention
We take the index of the highest probability score → predicted word
PoW Series #3
BERT
Actually, his name is Blas (BERT's name in the Spanish version of Sesame Street).
BERT and GPT are transformer-based architectures, while ELMo is a bi-LSTM language model. BERT is purely bidirectional, GPT is unidirectional, and ELMo is semi-bidirectional.
GPT-2 vs BERT
Stacking Encoders
BERT
BERT
Two Steps: pre-training and fine-tuning
the model is trained on unlabeled data over different pre-training tasks.
the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks
BERT input representation
NEW: To differentiate between different representations
Pre-training
Masked LM(50%)
Train a bidirectional representation.
NSP(50%)
Next sentence prediction.
Masked Language Model (MLM)
The bidirectional methodology you used to fill in the [blank] word above is similar to how BERT attains state-of-the-art accuracy. A random 15% of tokenized words are hidden during training, and BERT’s job is to correctly predict the hidden words.
Masked Language Model (MLM)
Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension
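A simplified NumPy sketch of the masking step (real BERT replaces masked tokens with [MASK] only 80% of the time; mask_id and the -100 ignore-label are illustrative conventions, not from the slides):

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Hide a random ~15% of tokens; BERT must predict the originals at those positions."""
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    masked = rng.random(token_ids.shape) < mask_prob
    labels = np.where(masked, token_ids, -100)       # -100: positions ignored by the loss
    inputs = np.where(masked, mask_id, token_ids)    # masked positions replaced by the [MASK] id
    return inputs, labels
```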
Fine-Tuning
Train a bidirectional representation.
BERT Size & Architecture
PoW Series #4
ViT
https://arxiv.org/pdf/2010.11929v2.pdf
https://www.pinecone.io/learn/vision-transformers/
https://www.v7labs.com/blog/vision-transformer-guide
16x16
Some quick facts…
Previous work used as a source of inspiration:
Attention Mechanisms:
Architecture
The ViT encoder: identical to the one in “Attention Is All You Need”.
Models just want attention 0.0
Handling 2D Images:
Linear Transformation
The problem with transformers: quadratic complexity when computing the Attention Matrix of an entire image. (patches split the images like words in a sentence)
Example: for a 28x28 MNIST image, even if we flatten it to 784 pixels, we still have to deal with a 784x784 attention matrix to see which pixels attend to one another.
Steps taken:
Dimensions Explanation
Learnable Embeddings (Classification)
Positional Encoding
Mathematical Intuition
Flatten the patches and map them to D dimensions with a trainable linear projection.
A learnable [class] embedding is prepended to the sequence of embedded patches, whose state at the output of the Transformer encoder serves as the image representation y.
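A minimal NumPy sketch of patching, linear projection, and the [class] token; the 224x224 input, D = 768, and the random projection are placeholder assumptions (ViT-Base-like), not values from the slides:

```python
import numpy as np

def patchify(img, patch=16):
    """Split an HxWxC image into flattened PxP patches (patches act like words in a sentence)."""
    H, W, C = img.shape
    patches = img.reshape(H // patch, patch, W // patch, patch, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return patches                                   # (num_patches, patch*patch*C)

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
x = patchify(img)                                    # (196, 768)
W_proj = rng.normal(scale=0.02, size=(x.shape[1], 768))  # trainable linear projection to D dims
cls = np.zeros((1, 768))                             # learnable [class] embedding, prepended
tokens = np.concatenate([cls, x @ W_proj], axis=0)   # (197, 768) -> fed to the Transformer encoder
```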
SOTA
Pending Questions
AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE
How is it working?
Types of Machine Learning Papers
CLIP: Connecting text and images
Types of Machine Learning Papers
LiT ️🔥: Zero-Shot Transfer with Locked-image text Tuning
4 billion images for training
Types of Machine Learning Papers
Simple Open-Vocabulary Object Detection with Vision Transformers
Vision Transformer for Open-World Localization, or OWL-ViT for short
Types of Machine Learning Papers
Scaling Vision Transformers to 22 Billion Parameters
VISION TRANSFORMERS
Types of Machine Learning Papers
PoW Series #5
CLIP
Introduction
CLIP stands for Contrastive Language-Image Pre-training:
CLIP is an open source, multi-modal, zero-shot model. Given an image and text descriptions, the model can predict the most relevant text description for that image, without optimizing for a particular task.
Contrastive Language: with this technique, CLIP is trained to understand that similar representations should be close in the latent space, while dissimilar ones should be far apart. This will become clearer later with an example.
Contrastive pre-training
N images paired with their text: (image1, text1… imageN, textN)
Contrastive Pre-training aims to jointly train an Image and a Text Encoder that produce image embeddings [I1, I2 … IN] and text embeddings [T1, T2 … TN], in a way that:
The cosine similarities of the correct <image-text> embedding pairs <I1,T1>, <I2,T2> (where i=j) are maximized.
In a contrastive fashion, the cosine similarities of dissimilar pairs <I1,T2>, <I1,T3>… <Ii,Tj> (where i≠j) are minimized.
Resnet or ViT
Standard
Transformer model with GPT-2 style modifications
Architecture
Code
(pseudocode)
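Since the paper's pseudocode image is not reproduced here, a rough NumPy equivalent of the contrastive objective follows; random features stand in for the encoder outputs, and the 0.07 temperature is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 64                                  # batch of n <image, text> pairs, shared embedding dim d

I_f = rng.normal(size=(n, d))                 # stand-in for image encoder output (ResNet or ViT)
T_f = rng.normal(size=(n, d))                 # stand-in for text encoder output (GPT-2-style Transformer)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

I_e, T_e = l2_normalize(I_f), l2_normalize(T_f)
logits = I_e @ T_e.T / 0.07                   # pairwise cosine similarities, scaled by a temperature

def xent_diagonal(logits, axis):
    """Cross-entropy where the correct match for row/column i is i (the <Ii, Ti> diagonal)."""
    logp = logits - np.log(np.exp(logits).sum(axis=axis, keepdims=True))
    return -np.mean(np.diag(logp))

loss = (xent_diagonal(logits, axis=1) + xent_diagonal(logits, axis=0)) / 2   # symmetric contrastive loss
print(loss)
```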
Dataset - WIT for WebImageText
CLIP’s architecture and data format allows scale.
400M Image-text pairs are all over the internet and don’t have to be labeled.
Learning many words here (purple background, jumping, dog, agile…)
This is what allows it to perform so well on zero-shot benchmarks.
What does zero-shot mean? “We find that CLIP learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others.”
Zero-shot vs Few-shot
Background: “Robustness”
Normally: Train on ImageNet. Report accuracy on the test set. “My model is 80% accurate…”
ImageNet Adversarial
Dataset of examples which fooled a “standard” ResNet50 at the time.
…on this data ↑
Not even MNIST is robust 😱😱😱
Background: “Robustness”
Normally: Train on ImageNet. Report accuracy on the test set. “My model is 80% accurate…”
ImageNet V2
ImageNet data collected in the same way as the original ImageNet.
…on this data ↑
Background: “Robustness”
Normally: Train on ImageNet. Report accuracy on the test set. “My model is 80% accurate…”
ImageNet Rendition
ImageNet classes but animated/drawn/sketched.
…on this data ↑
Background: “Robustness”
Normally: Train on ImageNet. Report accuracy on the test set. “My model is 80% accurate…”
ImageNet Corrupted (snow, defocus blur)
ImageNet validation data, with artificial corruptions applied (many variants e.g. blur types, frost, with different severities)
…on this data ↑
Bias
CLIP “opening its eyes progressively” allows faster pre-training
PoW Series #6
Vicuna
70K
ShareGPT: a website where users can share their ChatGPT conversations
With PyTorch FSDP on 8 A100 GPUs in one day.
Training and Evaluation
70K
With PyTorch FSDP on 8 A100 GPUs in one day.
Training
Training
Similar to Alpaca + improvements:
Memory Optimizations: the max context length is expanded from 512 to 2048, which requires more GPU memory. We tackle the memory pressure by utilizing gradient checkpointing and flash attention.
Multi-round conversations: We adjust the training loss to account for multi-round conversations and compute the fine-tuning loss solely on the chatbot’s output.
Cost Reduction via Spot Instances: the 40x larger dataset and 4x longer sequences pose a considerable challenge in training expenses. SkyPilot managed spot instances reduce the cost by leveraging cheaper spot instances, with auto-recovery for preemptions and automatic zone switching.
7B model → $500 to $140
13B model → $1K to $300
Flash Attention
Multi-round conversations
Enhance the training scripts provided by Alpaca to better handle multi-round conversations and long sequences.
Spot Instance
70K
ShareGPT: a website where users can share their ChatGPT conversations
With PyTorch FSDP on 8 A100 GPUs in one day.
80 questions
Evaluation
To compare two different models, we combine the outputs from each model into a single prompt for each question. The prompts are then sent to GPT-4, which assesses which model provides better responses. A detailed comparison of LLaMA, Alpaca, ChatGPT, and Vicuna is shown in Table 1 below.
Evaluation
Evaluation
Careful prompt engineering
Generate diverse and challenging questions
Select 10 questions per category
Evaluation
GPT-4 is used to rate the quality of their answers based on helpfulness, relevance, accuracy, and detail. GPT-4 can produce not only relatively consistent scores but also detailed explanations of why such scores are given.
Not good at coding/math evaluation
Comparison with other LLMs
PoW Series #7
Optical flow-based odometry for trajectory estimation of drones
https://www.imavs.org/papers/2022/8.pdf
Definitions
ODOMETRY
OPTICAL FLOW
KALMAN FILTERS
Consists of estimating the motion of a robot by measuring changes in its position over time.
Spatial distribution of the apparent motion of pixels in an image.
Recursive algorithm to estimate the state of a system from noisy measurements. The EKF (Extended KF) uses non-linear functions to describe the system dynamics and measurement equations.
Odometry & Oscillations
Vertical velocity
Total velocity
Horizontal Velocity
Vertical axis
Height of the hexarotor
- 𝜱
±𝜱 = Optical flow sensor orientation
D = Optical flow sensors distance to the ground
ω(±𝜱) = Optical flow magnitudes
When it goes up:
d1
d1
d2
d2
When it goes down:
* The angle 𝜱 does not change when the hexarotor goes up and down, but the distance on the ground between the sensors does change: d1 < d2
Optical Flow
Optical flow is the pattern of apparent motion of image objects between two consecutive frames caused by the movement of the object or the camera (in other words, it is based on the relative motion of the camera). It is a 2D vector field where each vector is a displacement vector showing the movement of points from the first frame to the second.
Ceteris Paribus:
Pixel correspondence problem.
Brightness constancy constraint: I(x, y, t) = I(x + dx, y + dy, t + dt)
Taking the Taylor series expansion of the right-hand side (without the time index) and plugging it back into the brightness constancy constraint gives the optical flow equation
Ix·u + Iy·v + It = 0,
which describes a tracked pixel with unknown velocities u and v as spatial motion induced by the camera.
Understanding
Optical Flow
Understanding
We have an equation, derived previously, that describes a tracked pixel with unknown velocities u and v as spatial motion induced by the camera. What happens now?
Limitations:
The solution:
Least Squares Method
*these two are the same equation
Eigenvalue of points interpretation:
Small eigenvalues: small gradients in all directions (flat region)
Large eigenvalues: large gradients in all directions (good for tracking)
Large eigenvalue ratio: gradient dominated by one direction (edge)
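A minimal least-squares sketch, assuming Ix, Iy, It are the image gradients over a small window; the eigenvalues discussed above are those of AᵀA:

```python
import numpy as np

def lucas_kanade_flow(Ix, Iy, It):
    """Least-squares solution of Ix*u + Iy*v + It = 0 over a window of pixels."""
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # (N, 2) spatial gradients
    b = -It.ravel()                                  # (N,)  temporal gradients
    # Solves the normal equations (A^T A) [u, v]^T = A^T b; A^T A must be well conditioned.
    uv, *_ = np.linalg.lstsq(A, b, rcond=None)
    return uv                                        # estimated pixel velocities (u, v)
```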
Optical Flow
Translation Optical Flow:
Optic Flow Divergence:
*Optical Flow Vector Field:
Optical Flow
Translation Optical Flow:
Hexarotor with two sensors:
Hexarotor with four sensors:
Three translational optic flow cues can be measured as:
x
S3
S4
S2
S1
y
Hexarotor equipped with four optic flow sensors
http://hyperphysics.phy-astr.gsu.edu/hbase/rotq.html
Optical Flow
Optic Flow Divergence:
Hexarotor with two sensors:
Hexarotor with four sensors:
Two optical flow divergence cues can be measured as:
x
S3
S4
S2
S1
y
Hexarotor equipped with four optic flow sensors
Flow fields generated by different invariants
Result from translational movements
Result from vertical movements
Kalman Filters
“Kalman filtering, also known as linear quadratic estimation (LQE), is an algorithm that uses a series of measurements observed over time, containing statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone, by using Bayesian inference and estimating a joint probability distribution over the variables for each timeframe.”
Kalman Filters - Optimal estimation algorithm
Kalman Filters - Optimal estimation algorithm
Kalman Filters - Optimal estimation algorithm
Extended Kalman Filters
The Extended Kalman Filter (EKF) is a mathematical algorithm that is used to estimate the state of a nonlinear system in the presence of noisy sensor measurements. It is an extension of the standard Kalman Filter, which is designed to work with linear systems.
The EKF works by linearizing the system dynamics and measurement equations around the current estimate of the state, and then applying the Kalman Filter algorithm to the resulting linearized equations. This process is repeated iteratively to estimate the state of the system over time.
The EKF is particularly useful in cases where the system dynamics are nonlinear and cannot be modeled accurately using linear models.
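A minimal NumPy sketch of one predict/update cycle of a linear Kalman filter; the EKF follows the same pattern after linearizing the dynamics F and measurement model H around the current estimate (the matrix names here are generic, not the paper's):

```python
import numpy as np

def kalman_step(x, P, z, F, H, Q, R):
    """One predict/update cycle of a linear Kalman filter."""
    # Predict: propagate the state and its covariance through the dynamics
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: correct the prediction with the noisy measurement z
    S = H @ P_pred @ H.T + R                     # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)          # Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```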
Components of the Paper
Optical Flow Cues:
Translation Optical Flow:
Optical Flow Divergence:
Extended Kalman Filter (EKF)
Components of the Paper
divergence
translation
Components of the Paper
Measurement of the local translational and divergence optic flow cues
Minimalistic Visual Odometer Method
Description of the hexarotor and the optic flow sensors used
Odometry Process based on the raw measurements of 2 optic flow sensors used
Sensor fusion odometry processing based on 4 optic flow sensors (both with a precise and with a rough prior knowledge of optic flow variations)
Sensors fusion strategies based on the knowledge of optical flow and how they increase the measurement accuracy of the local optic flow cues by comparing the three methods
Measurement of the local translational and divergence optic flow cues
Optic Flow Divergence: series of contractions and expansions generated in the optic flow vector field by up-and-down oscillatory motions. When a drone flies forward while oscillating up-and-down above the ground, in the optic flow vector field the optic flow divergence is superimposed on the translational optic flow.
the theoretical local optic flow divergence
local optic flow divergence
Scaling to 4 optical flow sensors
Integration of the local translational optic flow ωT scaled by the estimated distance with respect to the ground ĥ
Minimalistic Visual Odometer Method (SOFIa)
SOFIa (Self-scaled Optic Flow time-based Integration model)
ĥ was estimated by means of an EKF, taking as input the honeybee’s wing-stroke amplitude and as measurement the local optic flow divergence, computed as the ratio between Vh and h.
FUN FACT: The SOFIa model was found to be about 10 times more accurate than the raw mathematical integration of optic flow.
Description of the hexarotor and the optic flow sensors used
Hexarotor equipped with 4 optic flow sensors oriented towards the ground flying along a bouncing circular trajectory in the Marseille’s flying arena.
2 optic flow sensors (pixart PAW903) set along longitudinal axis x at angles φ = ±30◦ with respect to hexarotor's vertical axis z
2 optic flow sensors set along lateral axis y at angles φ = ±30◦ with respect to hexarotor's vertical axis z
Example of a test flight trajectory over 53 m at an oscillation frequency of 0.28 Hz
used a trajectory tracking algorithm to perform up-and-down oscillating circular trajectories: https://github.com/gipsa-lab-uav/trajectory control
Description of the hexarotor and the optic flow sensors used
Description of the hexarotor and the optic flow sensors used
State space representation used for the EKF
To estimate the hexarotor’s flight height ĥ, we chose to model the hexarotor’s system as a double integrator receiving as input the acceleration az on the vertical axis z given by the drone’s IMU.
Odometry Process based on the raw measurements of 2 optic flow sensors used
State space representation used for the EKF
NO PRIOR KNOWLEDGE (NPK)
Sensor fusion odometry processing based on 4 optic flow sensors (both with a precise and with a rough prior knowledge of optic flow variations)
Precise Prior Knowledge
(PPK)
Sensor fusion odometry processing based on 4 optic flow sensors
Precise Prior Knowledge
(PPK)
Precise Prior Knowledge
(PPK)
Rough Prior Knowledge
(RPK)
Precise prior knowledge refers to a complete and accurate understanding of the system dynamics, including the physical laws that govern the motion of the objects in the scene, the noise characteristics of the sensors, and the characteristics of the imaging system. This level of knowledge allows for highly accurate predictions of the state of the system, which can be used to improve the accuracy of the optical flow estimation.
Rough prior knowledge, on the other hand, refers to a more limited understanding of the system dynamics. This could include knowledge of the approximate motion of the objects in the scene, but without a detailed understanding of the underlying physics or noise characteristics of the sensors. Rough prior knowledge is still useful for making predictions about the state of the system, but these predictions may be less accurate than those based on precise prior knowledge.
PoW Series # 8
GANs
General Architecture
Generative Adversarial Networks (G vs D)
pg = pd
pg = distribution of generated data
pd = distribution of real data
The Discriminator
TRAINING
The Generator
TRAINING
Alternating Training
| Step | Remains constant | Why? |
| --- | --- | --- |
| 1. The discriminator trains for one or more epochs. | Generator | Discriminator training tries to figure out how to distinguish real data from fake; it has to learn to recognize the generator's flaws. |
| 2. The generator trains for one or more epochs. | Discriminator | Otherwise the generator would be trying to hit a moving target and might never converge. |

Repeat steps 1 and 2 to continue training the G and D networks.
Variable Definition
Mini G & Max D
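For reference, the minimax objective from the original GAN paper, using the definitions above:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\!\left[\log D(x)\right] +
  \mathbb{E}_{z \sim p_z(z)}\!\left[\log\left(1 - D(G(z))\right)\right]
```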
Understanding the training
D: the discriminator's output D(x)
pg: the distribution of G's samples
px: the data-generating distribution
The upward arrows show how the mapping x = G(z) imposes the non-uniform distribution pg on transformed samples. G contracts in regions of high density and expands in regions of low density of pg
(d) After several steps of training, G and D reach a point at which both cannot improve because pg = pdata.
The discriminator is unable to differentiate between the two distributions, i.e. D(x) = 1/2.
(b) Train D
(c) Train G
After an update to G, gradient of D has guided G(z) to flow to regions that are more likely to be classified as data.
x = data domain
z = noise domain
“We alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough.”
for k steps learn discriminator:
Learning Algorithm
Then, update generator:
Optimizing D to completion in the inner loop of training is computationally prohibitive, and on finite datasets would result in overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. This results in D being maintained near its optimal solution, so long as G changes slowly enough.
Algorithm 1: sample noise and real data, then take a gradient step to maximize the Discriminator (repeated k times); then sample noise and take a gradient step to minimize the Generator.
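A hedged PyTorch sketch of one round of this alternating scheme; G and D are assumed user-defined modules (D ending in a sigmoid), and the non-saturating generator loss is used in place of directly minimizing log(1 − D(G(z))):

```python
import torch
import torch.nn.functional as F

def gan_round(G, D, real, opt_D, opt_G, k=1, z_dim=100):
    """k discriminator steps, then one generator step (Algorithm 1, simplified)."""
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)
    for _ in range(k):                                     # maximize D: push D(real) -> 1, D(fake) -> 0
        fake = G(torch.randn(real.size(0), z_dim)).detach()
        loss_D = F.binary_cross_entropy(D(real), ones) + F.binary_cross_entropy(D(fake), zeros)
        opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    fake = G(torch.randn(real.size(0), z_dim))             # one G step: push D(fake) -> 1
    loss_G = F.binary_cross_entropy(D(fake), ones)
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```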
Theoretical Results
Practical Results
Applications of Gans
CGANs: using labels to improve GANs
CycleGANs: cross-domain transfer GANs
StarGAN: image-to-image translation from one domain to another
PixelDTGAN: pixel-level domain transfer for recommendation systems
SRGAN: super-resolution images from lower-resolution inputs
GauGAN: synthesizes photorealistic images given an input semantic layout
TP-GAN: cross-domain transfer GANs
DeblurGAN: image-to-image translation from one domain to another
Demo
The lower horizontal line is the domain from which z is sampled, in this case uniformly. The horizontal line above is part of the domain of x.
Main Challenges Overview
mode collapse
Real-life data distributions are multimodal. For example, in MNIST, there are 10 major modes from digit ‘0’ to digit ‘9’. The samples below are generated by two different GANs. The top row produces all 10 modes while the second row creates a single mode only (the digit “6”). This problem is called mode collapse when only a few modes of data are generated.
Non-convergence
GAN is based on a zero-sum non-cooperative game; in short, if one wins, the other loses. A zero-sum game is also called minimax: your opponent wants to maximize its objective and your actions minimize it. In game theory, the GAN model converges when the discriminator and the generator reach a Nash equilibrium, but in practice the model does not always converge.
Unstable Gradients
If the discriminator becomes too good too quickly (e.g., it can perfectly distinguish real from fake samples), the generator may receive gradients that are near zero. This is because when the discriminator is confident, its outputs for fake images are close to zero. This leads to very little or no learning by the generator. The generator relies on feedback; if the feedback is weak (because of vanishing gradients), the generator cannot make meaningful updates.
PoW Series
Variational Autoencoders
Firstly…
Autoencoders
Latent vector
Inference phase
Autoencoders
Different distributions
DOGS
CATS
MOUSE
Previous vector
There is a high chance to pick a “garbage” vector…
We cannot generate images with Autoencoders
We don't know how to pick a vector from the distribution, but what if we did?
Variational Autoencoders
Different distributions
DOGS
CATS
MOUSE
Previous vector
Continuity: two close points in the latent space should not give two completely different contents once decoded.
Completeness: for a chosen distribution, a point sampled from the latent space should give “meaningful” content once decoded.
The loss function is composed of a reconstruction term (that makes the encoding-decoding scheme efficient) and a regularisation term (that makes the latent space regular).
KL Divergence
KL Divergence in VAEs serves to regularize the learned representations in the latent space:
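For reference, the usual VAE objective with a diagonal Gaussian posterior and a standard normal prior, i.e. the reconstruction term plus the KL regularisation term mentioned above:

```latex
\mathcal{L}(\theta, \phi; x) =
  \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right]
  - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x) \,\|\, \mathcal{N}(0, I)\right),
\qquad
D_{\mathrm{KL}} = -\tfrac{1}{2} \sum_{j=1}^{d} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right)
```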
Back to VQ-VAEs
Latent Space
Steps to reproduce:
Latent space refers to the lower-dimensional space where input data (such as images) are encoded.
FUN FACT: If two clusters are close together or overlapping, it suggests that in the latent space, the model finds those digits similar in some way. For example, the digits "4" and "9" might be closer together than "1" and "8" since their shapes have some similarities.
VQ-VAE vs VAE
2 main differences:
Discrete
Continuous
Traditional VAE
VQ-VAE
Common Goals
2 main differences:
“For instance, when trained on speech we discover the latent structure of language without any supervision or prior knowledge about phonemes or words”
Latent Variables and Prior Distribution
Discrete
Continuous
Traditional VAE
VQ-VAE
VQ-VAE learns the discrete latent space from the input data instead of having a static space.
This means that, instead of assuming the latent variables should follow a standard Gaussian, the VQ-VAE learns the best distribution for the latent variables based on the data it's trained on. This flexibility can provide advantages in terms of model performance and the quality of generated or reconstructed data.
For a classic VAE, the latent variables distribution is typically chosen to be a standard Gaussian (or Normal) distribution. This means that the VAE tries to make the latent variables it learns for each data point look like they've been drawn from a Gaussian distribution.
VQ-VAE
Loss function
x = model input
ze(x) = encoder output
zq(x) = decoder input
e = embedding vectors
sg = stop gradient operator (detaches variable from learning)
reconstruction loss
Vector Quantisation (VQ)
Commitment Loss
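Putting the three terms together with the notation above (sg is the stop-gradient operator and β weights the commitment term):

```latex
L = \log p\!\left(x \mid z_q(x)\right)
  + \left\| \mathrm{sg}\!\left[z_e(x)\right] - e \right\|_2^2
  + \beta \left\| z_e(x) - \mathrm{sg}\!\left[e\right] \right\|_2^2
```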
Reducing Space
Audio
The decoder is conditioned on both the latents and a one-hot embedding for the speaker
PoW Series #15
Stable Diffusion
High-Resolution Image Synthesis with Latent Diffusion Models
High Level Overview
Stable Diffusion
https://arxiv.org/pdf/2112.10752.pdf
Latent Diffusion - Abstract
By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations.
To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity.
Diffusion Models work great but are too computationally expensive
We apply them in the latent space of pre-trained autoencoders and reach optimality between complexity reduction and detail preservation
Reached SOTA scores for image inpainting and class-conditional image synthesis and highly competitive performance on various tasks, including text-to-image synthesis, unconditional image generation and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.
Too many inputs here
In a fully connected network, each input contributes equally to each output
This approach doesn't make sense, because pixels that make up an edge are more important than the background
residual connections
The network extracts more features by increasing kernel depth and scaling the image down to increase the kernel's field of view.
residual connections
Residual connection
residual connections
Positional encoding is a type of embedding.
Embedding: converting discrete variables into continuous vectors
We train the model on images with varying levels of noise
When we try to jump from complete noise to the actual image it turns out blurry, so we have to do it partially
Since we trained on images with varying levels of noise, we can start with complete noise and work our way up towards an actual image
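A rough NumPy sketch of the idea, with an assumed DDPM-style noise schedule and a hypothetical model / denoise_step for the sampling loop:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)                  # assumed noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def add_noise(x0, t):
    """Forward process: make a training example with noise level t (the model learns to predict eps)."""
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return xt, eps

# Sampling: start from pure noise and remove a little noise at every step;
# jumping straight from noise to the image is what gives a blurry result.
# x = rng.normal(size=shape)
# for t in reversed(range(T)):
#     x = denoise_step(model, x, t)   # hypothetical model / denoise_step
```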
Remember VAEs?
residual connections
Residual connection
Latent diffusion model
residual connections
Residual connection
What if encoding the images and the text gave us the same embedding vectors?
PoW Series #16
Gaka-Chu
The road that led to the first self-employed autonomous robot
IE Robotics and AI Club - 19/10/2023
Photo source: medium.com
Robot Therapy for Autistic Children
[Diagram: educators, public institutions, and doctors send raw data to a data server; the server answers their queries with aggregated data.]
[Architecture diagram, steps 1-8: Data Server with aggregated info; Background Data Service; Background Data Auditing Service; IM Learning and Sharing Service; Vetted Algorithms; Logging & Verification; Blockchain transaction management; Local repository; ML training carried out locally; queries (Q) and answers (A) exchanged with the Data Server.]
[Diagram: a grid of HUB nodes labelled I, II, and III, plus two AUX nodes.]
Swarm coordination through smart contracts:
1. Robots connect to AUX node
2. Consensus starts
3. AUX node mines contract
4. Robots register with contract
5. Robots publish/subscribe to the contract
Blockchain-based smart contracts for securing robot swarms
Managing Byzantine Robots via Blockchain Technology in a Swarm Robotics Collective Decision Making Scenario, (AAMAS 2018).
Robust consensus achievement
Swarm robotics systems are robust and fault tolerant.
However, redundancy comes at a cost: the plan needs to be distributed (decreasing an attack's cost).
Research question: is it possible to provide the “blueprint” of a mission without describing the mission itself?
Blockchain section:
Merkle Tree (MT): separates data verification from the data itself.
[Diagram: a transaction (A -> B, 1 BTC) is hashed (d8e131acaf…) into leaf nodes (operations); interior nodes hash their children up to the root node: H1 = H(H2, H3), H2 = H(H4, H5), H3 = H(H6, H7).]
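A minimal Python sketch of how such a root is built from leaf operations (the hash choice and the example transactions are illustrative):

```python
import hashlib

def H(*parts: bytes) -> bytes:
    return hashlib.sha256(b"".join(parts)).digest()

def merkle_root(leaves):
    """Build interior nodes by hashing pairs of children, up to the root (H1 = H(H2, H3), ...)."""
    level = [H(leaf) for leaf in leaves]              # leaf nodes: hashes of the operations
    while len(level) > 1:
        if len(level) % 2:                            # duplicate the last node if the level is odd
            level.append(level[-1])
        level = [H(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]                                   # root node

root = merkle_root([b"A -> B : 1 BTC", b"B -> C : 2 BTC", b"C -> A : 0.5 BTC", b"A -> D : 3 BTC"])
print(root.hex())
```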
[Diagram: robots map sensor inputs to actions; a verifier V sends a query (Q) to a prover P (node 1), P sends back a proof (π), and V verifies the proof by combining what it received from P with its local memory.]
Gaka-chu:
A self-employed robot artist
System architecture
1. Robot paints picture
2. Auction starts
3. Winner is selected
4. Picture is sent to winner
5. Robot receives payment for picture
6. Robot buys supplies from arts shop
7. Robot receives supplies
[Plot: wallet balance (ETH) over 6 months of timestamps; annotated events: experiment start, network fees, investor loans, investor repayment, auction site fees, painting sales, consumable purchases.]
Drawing process (panels A-C): example calligraphy of “Children's Day” (子供の日).
Ordering supplies: if < 1, then send (Shop ETH Addr, 3x Supplies, 0.1 ETH) via the API → OK!
General workflow in detail:
1. Robot's camera grabs the image
2. Task planner processes the image
3. Image is preprocessed (e.g., border detection)
4. Image processing is over
5. Filming starts
6. Video is recorded
7. Video recording is over
8. Files are uploaded to IPFS and the NFT minter starts
9. Information is sent to the smart contract
10. NFT is sent to the auction platform
11-14. Auction process; order consumables from the shop
[Diagram: overlapping circles of Robotics/AI, Digital trust, and Society.]
Trustable Autonomy (TA)
Thanks!
eduardo.castello@ie.edu
PoW Series #17
Dreambooth
Dreambooth: A new approach for “personalization” of text-to-image diffusion models.
Text-to-image Diffusion Models
Diffusion models are probabilistic generative models that are trained to learn a data distribution by the gradual denoising of a variable sampled from a Gaussian distribution.
Specifically, we are interested in a pre-trained text-to-image diffusion model.
Personalization of Text-to-Image models
Our first task is to implant the subject instance into the output domain of the model such that we can query the model for varied novel images of the subject.
Care must be taken when fine-tuning generative models such as GANs in a few-shot scenario, as it can cause overfitting and mode collapse, as well as failing to capture the target distribution sufficiently well.
This line of work primarily seeks to generate images that resemble the target distribution but has no requirement of subject preservation
Designing Prompts for Few-Shot Personalization
Our goal is to “implant” a new (unique identifier, subject) pair into the diffusion model’s “dictionary”.
We label all input images of the subject “a [identifier] [class noun]”, where [identifier] is a unique identifier linked to the subject and [class noun] is a coarse class descriptor of the subject (e.g. cat, dog, watch, etc.).
Rare-tokens:
Class-specific Prior Preservation Loss
The best results for maximum subject fidelity are achieved by fine-tuning all layers of the model. This raises 2 problems:
To mitigate the two aforementioned issues, we propose an autogenous class-specific prior preservation loss that encourages diversity and counters language drift. In essence, our method is to supervise the model with its own generated samples, in order for it to retain the prior once the few-shot fine-tuning begins.
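Roughly, the resulting objective adds a prior-preservation term over the model's own generations x_pr with prompt c_pr, weighted by λ (notation approximated from the paper):

```latex
\mathbb{E}_{x, c, \epsilon, \epsilon', t}\Big[
  w_t \,\big\| \hat{x}_\theta(\alpha_t x + \sigma_t \epsilon,\, c) - x \big\|_2^2
  + \lambda\, w_{t'} \,\big\| \hat{x}_\theta(\alpha_{t'} x_{\mathrm{pr}} + \sigma_{t'} \epsilon',\, c_{\mathrm{pr}}) - x_{\mathrm{pr}} \big\|_2^2
\Big]
```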
PoW Series #18
Inverse Kinematic Analysis Of A Quadruped Robot
Inverse Kinematic Analysis Of A Quadruped Robot
What are Kinematics?
Kinematics refers to the subfield of physics that describes the movement of bodies without taking into account the forces that cause them to move.
Two Types of Kinematic Analysis
Given the joint angles, we are able to obtain the end position (point in space).
Knowing the end point, we are able to calculate the angles needed to arrive at this point.
Physical Model
The quadruped robot consists of a rigid body, rotary joints, and links between said joints.
rotary joints
rigid body
links
Robot Parameters
Physical Dimensions
Robot Parameters
Coordinate System
Robot Parameters
Variables
Robot Parameters
To sum up
Rotation Matrix
The rotation matrices represent a full rotation about their respective axes: Rx (roll) represents rotation about the x-axis, Ry (pitch) about the y-axis, and Rz (yaw) about the z-axis.
order matters here!!
Check it in our Colab
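For reference, the standard axis rotation matrices (roll φ, pitch θ, yaw ψ); because matrix multiplication is not commutative, the composition order matters:

```latex
R_x(\phi) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi & \cos\phi \end{bmatrix},\quad
R_y(\theta) = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix},\quad
R_z(\psi) = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix}
```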
Transformation Matrix
Determines the position and orientation of the robot center of body in the workspace.
The kinematic equation relates the center of body's coordinate system (xm, ym, zm) and the main coordinate system of each leg (x0, y0, z0).
Forward Kinematics
Relationship between the positions, velocities, and accelerations of the robot links.
Denavit-Hartenberg parameters
Forward Kinematics
Inverse Kinematics
Analytical closed-form vs. iterative numerical methods
Inverse Kinematic
Tools:
Inverse Kinematic
Tools:
2. Law of cosines
Calculations
Nonlinear equations; that is why legs 1 and 3 are in different configurations with respect to legs 2 and 4 (see the sketch below).
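A minimal planar sketch of the law-of-cosines step for one leg; the link lengths and foot target are made-up numbers, and flipping the sign of q2 gives the second knee configuration mentioned above:

```python
import numpy as np

def two_link_ik(x, y, L1, L2, knee=+1):
    """Planar 2-link IK via the law of cosines; knee=+1/-1 picks one of the two configurations."""
    cos_q2 = (x**2 + y**2 - L1**2 - L2**2) / (2 * L1 * L2)   # law of cosines
    q2 = knee * np.arccos(np.clip(cos_q2, -1.0, 1.0))        # knee joint angle
    q1 = np.arctan2(y, x) - np.arctan2(L2 * np.sin(q2), L1 + L2 * np.cos(q2))  # hip joint angle
    return q1, q2

print(np.degrees(two_link_ik(0.10, -0.15, 0.12, 0.12)))      # made-up link lengths / foot target (metres)
```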
Relation
Example Rules
Check it in our Colab
Example Rules
Example Rules
PoW Series #19
crewAI: Framework for orchestrating role-playing, autonomous AI agents.
Key Features
The process
Multi Agent Collaboration
Agent Supervisor
In this way, the supervisor can also be thought of as an agent whose tools are other agents!
Hierarchical Agent Teams
We call this hierarchical teams because the subagents can in a way be thought of as teams.
Hierarchical Agent Teams
Joao Moura put together a great example of using CrewAI with LangChain and LangGraph to automate the process of automatically checking emails and creating drafts. CrewAI orchestrates autonomous AI agents, enabling them to collaborate and execute complex tasks efficiently.
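A minimal sketch of the crewAI pattern (role-playing agents collaborating on sequential tasks); the exact argument names may differ across crewAI versions, and the roles/tasks here are illustrative:

```python
from crewai import Agent, Task, Crew, Process

researcher = Agent(
    role="Researcher",
    goal="Summarize the key ideas of a paper",
    backstory="You read ML papers and extract the essentials.",
)
writer = Agent(
    role="Writer",
    goal="Turn research notes into a short blog post",
    backstory="You write clear, concise technical prose.",
)

research = Task(description="Summarize the attached paper.",
                expected_output="Bullet-point notes", agent=researcher)
draft = Task(description="Write a 300-word post from the notes.",
             expected_output="A short blog post", agent=writer)

crew = Crew(agents=[researcher, writer], tasks=[research, draft], process=Process.sequential)
print(crew.kickoff())   # runs the tasks in order, passing outputs along
```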
PoW Series #20
DensePose from WiFi
What is the space of these topics?
2. Synced has previously covered additional research on the use of WiFi signals for human pose and action recognition through walls and the associated risks of such technologies.
Their “RF-Action” AI model is an end-to-end deep neural network that recognizes human actions from wireless signals.
An overview of the method
WiFi-based DensePose:
What is
Channel state information (CSI) lays the foundation of most wireless sensing techniques, including Wi-Fi sensing, LTE sensing, and so on. CSI provides physical channel measurements in subcarrier-level granularity, and it can be easily accessed from the commodity Wi-Fi network interface controller (NIC).
CSI describes the propagation process of the wireless signal and therefore contains geometric information of the propagation space. Thus, understanding the mapping relationship between CSI and spatial geometric parameters lays the foundation for feature extraction and sensing algorithm design.
definition
The Encoder Decoder Network
Modality Translation Network: two encoders extract the features from the amplitude and phase in the CSI domain. Then the features are fused and reshaped before going through an encoder-decoder network. The output is a 3 × 720 × 1280 feature map in the image domain.
Problem & Evaluation
The group identified two primary categories of failure cases.
(1) The WiFi-based model is biased and is likely to create faulty body parts when body positions are infrequently seen in the training set.
(2) Extracting precise information for each subject from the amplitude and phase tensors of the entire capture is more difficult for the WiFi-based approach when there are three or more contemporary subjects in one capture.
SORA: Text2Video
IDEAS
After this slide, the PoWs are suggestions and not fixed.
PoW Series
LCM + LORA
PoW Series
Lora
PoW Series
ELMo & ULMFiT
Language models can explain neurons in language models