1 of 62

Semi-supervised learning of multiple tasks on concept neural graphs

Pîrvu Mihai Cristian

2025.04.14

2 of 62

Research goals

  • Multi-modal data
  • Deep Neural Networks
  • Graphs
  • Ensemble Learning
  • Model Distillation
  • Semi-supervised learning

3 of 62

Data

  • “Data is the fossil fuel of AI”
  • The world is inherently multi-modal
    • There are processes happening at every semantic level
    • We are limited by our ability to capture and synchronize it with sensors
  • We need models capable of handling these modalities
    • Both in space and in time
  • We need to constantly introduce new and relevant benchmarks
    • One new dataset, or an extension of an existing one, in every paper

4 of 62

Data: synthetic multi-modal UAV dataset

  • Introduced in the NGC paper [1]
  • UAV-like synthetic dataset
  • We used CARLA simulator [2]
  • 40k data points across 5 splits
    • Train (8k), validation (2k), semi-supervised 1 & 2 (10k each), test (10k)
  • 8 dense representations + pose:
    • 3 inputs & 6 outputs

5 of 62

Data: real-world UAV dataset (2 scenes)

  • Introduced in the Depth Distillation paper [3]
  • 2 real-world scenes:
    • Slanic (9k train & 5k test)
    • Herculane (9k test)
  • Aims at producing the same modality (depth) through different methods: analytical, structural, and unsupervised learning

6 of 62

Data: real-world multi-modal UAV dataset

  • Dronescapes dataset: introduced in the first Hyper-Graphs paper [3]
  • 10 real-world scenes:
    • 7 for training (12K), the same 7 for semi-supervised (11K) & 3 for testing (5.6K)
    • 556 human-annotated semantic maps, extended via label propagation with SegProp [6]
  • 8 dense representations
    • 5 inputs & 3 outputs

7 of 62

Data: real-world multi-modal Earth Observation dataset

  • NEO dataset: introduced in the second Hyper-Graphs paper [4]
  • 22 years of monthly public satellite data
    • 15 years used for training & semi-supervised learning and 7 for inference
  • 19 representations
    • 12 inputs & 7 outputs
  • Challenging dataset due to missing data from various sensors across different periods, as well as few data points overall
    • We need models capable of handling missing data and learning from few data points

8 of 62

Data: real-world multi-modal UAV dataset (extended)

  • Dronescapes-Extended dataset: introduced in the last chapter (unpublished)
  • 16 new scenes
    • 8 for multi-modal training (57K frames + 23K from v1)
    • 8 for semantic-only training (+68K frames)
  • 17 representations
    • 14 inputs & 3 outputs (same)
  • Philosophy: use only RGB-based pre-trained experts, so any video can be ingested automatically

9 of 62

Research goals - Model

  • Multi-modal data
  • Deep Neural Networks
  • Graphs
  • Ensemble Learning
  • Model Distillation
  • Semi-supervised learning

10 of 62

11 of 62

Depth Distillation

  • Plenty of research on analytic solutions for metric depth estimation
    • SfM: RGB (xN) + intrinsic parameters -> unscaled depth
      • Add absolute pose (e.g., GPS) for metric depth via rescaling
    • Odometry: RGB (x2) + optical flow + relative pose (R, t) or instantaneous change
    • Stereo: RGB (x2) + intrinsic & extrinsic params (w/ baseline in meters)
  • Plenty of (newer) research on unsupervised unscaled depth estimation:
    • SfMLearner [7] popularized this approach (photometric loss)

12 of 62

Depth Distillation: advantages and disadvantages

  • SfM
    • Globally consistent; lots of tooling around it to visualise, export, or correct the meshes
    • Slow: can take hours to days for a five-minute UAV scene. Cannot be ‘fine-tuned’
    • Produces holes in areas the global optimizer cannot solve (water, mirrors, textureless surfaces)
  • Odometry-based
    • Fast, uses only raw sensors that are usually available (RGB, GPS, camera orientation)
    • Sensitive to sensor failures and to regions around the focus of expansion (optical flow ~= 0)
      • Various thresholds must be applied, which makes it less general
    • Relies on the optical flow method (quality, runtime, etc.)
  • Unsupervised
    • Fast, data-driven, and produces visually very good results
    • Scale ambiguity, even between two consecutive frames
      • Post-processing required: we use per-frame median scaling over a window [t-w:t] (see the sketch below)
      • May require per-scene fine-tuning
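A minimal sketch of the per-frame median scaling above, assuming unscaled predictions and a metric reference depth (e.g., from SfM or GPS-scaled odometry) are available as arrays; all names are illustrative:

```python
import numpy as np

def median_scale(pred_depths, ref_depths, t, w):
    """Rescale the unscaled depth prediction at frame t using the median
    ratio between metric reference and predicted depths over [t-w, t].
    pred_depths / ref_depths: sequences of HxW maps; the reference may
    contain NaNs where no metric depth is available."""
    ratios = []
    for i in range(max(0, t - w), t + 1):
        valid = np.isfinite(ref_depths[i]) & (pred_depths[i] > 0)
        if valid.any():
            ratios.append(np.median(ref_depths[i][valid] / pred_depths[i][valid]))
    return pred_depths[t] * np.median(ratios)  # median over the window is robust to outlier frames
```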

13 of 62

Depth distillation - overview

  • Solution: combine them using ensembles and distillation
    • Methods used: Jiaw [8] (unsupervised), RAFT [9] (optical flow), Meshroom [10] (SfM), SafeUAV [11] (NN)
  • Ensemble learning works best with diverse candidates.

14 of 62

Depth distillation - results

  • Analytical methods have invalid areas (holes) which cannot be evaluated; a per-pixel masked average, sketched below, is one way to handle them
  • Simple average ensembles will have an average error as well
  • Distillation fails to generalize to other scenes when trained purely by memorization (no task-specific loss) on little data (a single scene)
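A minimal numpy sketch of a per-pixel average that skips invalid candidates, assuming holes are marked as NaN (one plausible handling, not necessarily the paper's exact formulation):

```python
import numpy as np

def masked_average_ensemble(candidates):
    """Per-pixel average over N candidate depth maps, ignoring invalid
    (NaN) pixels such as SfM holes. candidates: list of HxW arrays."""
    stacked = np.stack(candidates)      # (N, H, W)
    return np.nanmean(stacked, axis=0)  # stays NaN only where all candidates are invalid
```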

15 of 62

Depth distillation - results

16 of 62

Research goals - Model

  • Multi-modal data
  • Deep Neural Networks
  • Graphs
  • Ensemble Learning
  • Model Distillation
  • Semi-supervised learning

17 of 62

18 of 62

Neural Graph Consensus - Motivation

  • The world is inherently multi-modal. We need models capable of handling these modalities
  • We use synthetic data (CARLA) to simulate real-world interdependencies
  • NGC models pairwise relationships between input and output representations with edge neural networks

19 of 62

Neural Graph Consensus

20 of 62

Neural Graph Consensus - Inputs & Outputs

  • Create a bipartite graph (two disjoint sets): input modalities and output tasks
    • Inputs: [RGB, Optical flow, Halftone]
    • Outputs: [Depth, Semantic segmentation, Camera normals, World normals, Wireframe, Absolute pose]
  • Inputs and outputs are chosen on the principle of ‘easy acquirability’

21 of 62

Neural Graph Consensus - Edges

  • Create Edges between each Input-Output pair
    • This gives us in total 3 in * 6 out = 18 edges
  • Each Edge is backed by a neural network
    • SafeUAV-1M architecture for dense predictions (Map2Map) and a derivative (Map2Vector) for Pose
  • We train each edge on our supervised dataset
    • 8k data points for train and 2k for validation

22 of 62

Neural Graph Consensus - Two-Hop Edges

  • Create Edges between each Input-Output’-Output triplet
    • In total: 3 in * 5 out’ * 6 out = 90 edges
  • To simplify, we only use RGB as input
    • We reduce to: 1 in * 5 out’ * 6 out = 30 edges
  • We train each Two-Hop Edge on our supervised dataset
    • The Input-Output’ edge is assumed pre-trained (see the sketch below)
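A minimal PyTorch-style sketch of the composition (module names are illustrative): the pre-trained Input->Output’ edge is frozen and only the second hop is trained:

```python
import torch.nn as nn

class TwoHopEdge(nn.Module):
    """Composes a frozen, pre-trained Input->Output' edge with a
    trainable Output'->Output edge network."""
    def __init__(self, first_hop: nn.Module, second_hop: nn.Module):
        super().__init__()
        self.first_hop = first_hop.eval()        # pre-trained, kept frozen
        for p in self.first_hop.parameters():
            p.requires_grad_(False)
        self.second_hop = second_hop             # trained on the supervised set

    def forward(self, x):
        return self.second_hop(self.first_hop(x))
```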

23 of 62

Neural Graph Consensus - Ensemble Learning

  • We now have 18 Edge and 30 Two-Hop Edge trained neural networks
  • For each of the 6 output tasks:
    • 8 independent edges: 3 Single-Hop Edges & 5 Two-Hop Edges
  • Each independent prediction can be seen as an ensemble candidate
  • We aggregate them together using a simple average
  • Note: we now have 48 edges and 6 ensembles across 6 tasks to evaluate!

24 of 62

Neural Graph Consensus - Selection Algorithm

  • Is using all 8 candidates the best thing we can do?
    • Exhaustively trying all combinations leads to 2⁸ - 1 = 255 ensembles to evaluate per task
    • We use a greedy selection algorithm: rank the edges by individual performance and add one edge at a time, resulting in just 8 ensembles to check per task (sketched below)

  • We are left with 21 Edges (out of 48 in total) after selection
    • RGB is always used
    • Pose is never used as intermediate (Vector2Map)
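A minimal sketch of the greedy procedure, assuming lower scores are better and a hypothetical evaluate(subset) callback that scores the simple-average ensemble of a subset on held-out data:

```python
def greedy_selection(candidates, evaluate):
    """Rank candidates by individual performance, then grow the ensemble
    one candidate at a time, keeping the best-scoring prefix. Checks only
    len(candidates) ensembles instead of 2^N - 1."""
    ranked = sorted(candidates, key=lambda c: evaluate([c]))
    best_subset, best_score = ranked[:1], evaluate(ranked[:1])
    for k in range(2, len(ranked) + 1):
        score = evaluate(ranked[:k])
        if score < best_score:
            best_subset, best_score = ranked[:k], score
    return best_subset
```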

25 of 62

Neural Graph Consensus - Semi-supervised Learning

  • We have 4 sets:
    • 8k Train + 2k validation -> full labels
    • 10k Test -> full labels for evaluation
    • 10k Semi-supervised (x2) -> only input representations available
      • Simulates the case where we get new unlabeled data after the initial deployment
  • Generate pseudo-labels on the first semi-supervised set using the best graph (21 edges) discovered earlier during the supervised training stage
  • Create a new dataset: Supervised (8k) + Semi-supervised/Pseudo-labels (10k)
    • Retrain all the edges: Single hop and then Two-hop
    • Note: we only train the 21 edges, not the full graph of 48 edges
  • Create new ensembles leading to the second trained graph
  • Repeat one more time on the Semi-supervised #2 set (only single-hop); the full loop is sketched below
    • 8k Train + 10k semi-supervised #1 + 10k semi-supervised #2
    • We don’t regenerate old pseudo-labels
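The loop above as a hedged pseudocode-style sketch; build_graph, train_edges, ensemble_predict, with_labels and rebuild_ensembles are hypothetical helpers standing in for the actual pipeline:

```python
def iterative_semi_supervised(train_set, semi_sets, build_graph):
    """Sketch of the iterative loop (all helper APIs are hypothetical):
    train supervised, pseudo-label the next unlabeled set with the
    selected graph, retrain on the union, repeat."""
    graph = build_graph()                  # the selected 21-edge graph
    data = train_set
    graph.train_edges(data)                # supervised stage (8k train)
    for semi in semi_sets:                 # semi-supervised sets #1 and #2
        pseudo = graph.ensemble_predict(semi.inputs)
        data = data + semi.with_labels(pseudo)  # old pseudo-labels are kept
        graph.train_edges(data)            # retrain only the selected edges
        graph.rebuild_ensembles()
    return graph
```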

26 of 62

27 of 62

28 of 62

Hyper-Graphs

29 of 62

30 of 62

Hyper-Graphs: Motivation

  • The NGC research was done only on simulated data
    • It was unclear whether similar gains would be measured on real-world data
  • The edges (Single-Hop and Two-Hop) from NGC had a few blind spots:
    • No communication between input representations
    • Only pairwise communication between output representations (Two-Hop)
  • The ensembles used a simple average aggregation
    • Unable to adapt when one candidate predicts very wrong results (e.g., out of domain)

31 of 62

  • Same setup as in NGC
  • We define input and output representations creating a bipartite graph
  • Input representations are ‘easy’ to acquire
    • RGB: make a new flight or get UAV footage from the internet
  • Output representations are ‘hard’ to acquire
    • Semantic segmentation may require human annotation
  • We do supervised training, graph-pathway ensembles, pseudo-labeling, and iterative semi-supervised learning

32 of 62

Hyper-Graphs: edges

  • Single-Hop Edges and Two-Hop Edges are not changed from NGC
    • # of Edges: I * O
    • # of TH-E: I * (O - 1) * O

33 of 62

Hyper-Graph: Aggregation hyper-edge

  • The first hyper-edge concatenates all (or a subset) of the inputs before passing them through the edge neural network
  • Inputs must be concatenable channel-wise
    • Maps of the same shape: RGB + Edges
    • Map + vector is not possible: RGB + pose
    • Pose can be turned into a map as a workaround
  • Maximum theoretical number of AH:
    • (2ᴵ - I - 1) * O
    • “Minus I” because size-1 subsets are just single-hop edges (and minus 1 for the empty set)
  • In practice we just do 1 AH per output node, combining all inputs (sketched below)
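A minimal PyTorch sketch of an aggregation hyper-edge (names illustrative): compatible input maps are concatenated channel-wise and fed to a single edge network:

```python
import torch
import torch.nn as nn

class AggregationHyperEdge(nn.Module):
    """Concatenates several input maps channel-wise and passes the result
    through one edge network. All inputs must share the spatial shape."""
    def __init__(self, edge_network: nn.Module):
        super().__init__()
        self.edge_network = edge_network

    def forward(self, input_maps):           # list of (B, C_i, H, W) tensors
        x = torch.cat(input_maps, dim=1)     # (B, sum(C_i), H, W)
        return self.edge_network(x)
```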

34 of 62

Hyper-Graph: Ensemble hyper-edge

  • Ensemble hyper-edges are the generalization of the Two-Hop Edge to multiple pathways
  • We can message-pass along multiple pathways, aggregate them, and then pass the aggregated result through a final edge network

35 of 62

Hyper-Graph: Cycle hyper-edge

  • An Aggregation Hyper-edge applied at the output-node level
  • First, we have one or more pathways towards an output node
    • AH: ABC->D’
    • AH: ABC->E’
  • Then we combine these with an AH at the next step, creating a secondary edge from each output node towards itself
    • CH: ABC + D’ + E’ -> D’’
    • CH: ABC + D’ + E’ -> E’’
  • There are many possible CHs, but in practice we use just one on top of the single AH (restated below)
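Restated compactly (f and g denote the edge networks behind the AH and CH for outputs D and E):

```latex
D'  = f_{\mathrm{AH}}(A, B, C) \qquad E'  = g_{\mathrm{AH}}(A, B, C)
D'' = f_{\mathrm{CH}}(A, B, C, D', E') \qquad E'' = g_{\mathrm{CH}}(A, B, C, D', E')
```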

36 of 62

Hyper-Graph: learned ensembles

  • We have N>1 pathways towards each output node through various edges and hyper-edges
  • Before: we used a simple average
  • Problem: we needed an expensive selection algorithm to remove outliers
  • Proposal: introduce a small aggregation neural network

37 of 62

Hyper-Graph: learned ensembles

4 methods (the last is sketched below):

  • Linear regression
  • Direct mapping
  • Learned map weights
  • Learned pixel weights
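A minimal PyTorch sketch of the ‘learned pixel weights’ variant (architecture illustrative): a small conv net predicts a per-pixel softmax weight for each of the N candidates:

```python
import torch.nn as nn

class LearnedPixelWeights(nn.Module):
    """Aggregates N single-channel candidate maps with learned per-pixel
    weights: predict N weight maps, softmax across candidates, weighted sum."""
    def __init__(self, n_candidates: int, hidden: int = 32):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(n_candidates, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, n_candidates, kernel_size=3, padding=1),
        )

    def forward(self, candidates):                        # (B, N, H, W)
        weights = self.weight_net(candidates).softmax(dim=1)
        return (weights * candidates).sum(dim=1, keepdim=True)  # (B, 1, H, W)
```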

38 of 62

Experiments: Dronescapes

  • 10 scenes: 7 for training, 3 for testing
    • Train+Val: 233 human-labeled -> 12k samples in total
    • Semi-supervised: 207 human-labeled -> 11k
    • Test set: 116 human-labeled -> 5.6k
  • 5 input & 3 output representations
    • Semantic segmentation is only evaluated on human-annotated data
    • Camera Normals and Depth are trained and evaluated based on SfM data

39 of 62

Experiments: Dronescapes - Edges performance

40 of 62

Experiments: Dronescapes - Ensembles on semantic

41 of 62

Experiments: NEO

42 of 62

Experiments: NEO

  • 12 inputs & 7 outputs
    • 84 Single-Hop Edges (E) and 42 Ensemble Hyper-Edges (EH)
    • 7 Aggregation Hyper-edges (AH) and 7 Cycle Hyper-edges (CH), one CH for each AH
    • In total: 140 neural networks
  • We use “Relative Performance Improvement” (RPI) as a metric so we can aggregate all 7 tasks without scaling issues (formalized below)
  • Baseline: best performing edge for each task
  • ARPI: Average RPI across all 7 tasks
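One plausible formalization, assuming per-task errors e_t where lower is better (the authoritative definition is in [4]):

```latex
\mathrm{RPI}_t = \frac{e_t^{\mathrm{baseline}} - e_t^{\mathrm{model}}}{e_t^{\mathrm{baseline}}}
\qquad
\mathrm{ARPI} = \frac{1}{7} \sum_{t=1}^{7} \mathrm{RPI}_t
```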

43 of 62

Experiments: NEO - Ensemble results

44 of 62

Experiments: NEO - Iterative semi-supervised learning

45 of 62

Probabilistic Hyper-Graphs

46 of 62

Probabilistic hyper-graphs: Motivation

  • One of the issues with NGC and hyper-graphs is the need to explicitly model the edges
    • NGC: 48 edges
    • Hyper-graphs Dronescapes: 33 edges
    • Hyper-graphs NEO: 140 edges
  • Each edge is a neural network, and some depend on others (Two-Hop Edges depend on Edges)
    • We built a framework on top of this just to manage the dependency complexity: knowing what needs to be trained, parallel training on multiple GPUs, etc.
  • What if we could boil this complexity down into a single neural network that models the hyper-graph edges?

47 of 62

Probabilistic hyper-graphs: Motivation

48 of 62

Probabilistic hyper-graphs: MAE

  • It turns out this is possible with Masked Autoencoders (MAE)
  • Standard MAE: given an input vector X (with parts masked out), the MAE reconstructs the vector Y

49 of 62

Probabilistic hyper-graphs: Modeling Edges

  • If we define input and output features:
    • Inputs may or may not be masked
    • Outputs are always masked
    • Only 1 output is reconstructed
  • Then this MAE effectively models Edges and Aggregation Hyper-Edges

50 of 62

Probabilistic hyper-graphs: Modeling Edges

In experiments:

  • We use the same SafeUAV model
    • We mask entire views, not at the patch level
  • RGB is never masked
  • We create intermediate modalities, derived from experts (13 new + RGB)
    • They are sometimes masked; if masked, they are also reconstructed
  • Output tasks are always masked
    • All output tasks are reconstructed at once; this worked better experimentally
  • Each masking is equivalent to sampling from the space of hyper-edges (E or AH); see the sketch below
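A minimal sketch of view-level masking, assuming each modality is a tensor in a dict and masked views are zeroed out (the real model may use a learned mask token; the "rgb" key and p_mask are illustrative):

```python
import random
import torch

def sample_view_mask(batch, input_names, output_names, p_mask=0.5):
    """Masks whole modalities ('views'), not patches: RGB is never masked,
    intermediate modalities are masked with probability p_mask, and output
    tasks are always masked. Each sampled mask corresponds to one edge or
    aggregation hyper-edge of the graph."""
    masked = dict(batch)
    reconstruct = list(output_names)                      # outputs always reconstructed
    for name in input_names:
        if name != "rgb" and random.random() < p_mask:    # "rgb" key is illustrative
            masked[name] = torch.zeros_like(batch[name])  # zeroing stands in for a mask token
            reconstruct.append(name)                      # masked inputs are reconstructed too
    for name in output_names:
        masked[name] = torch.zeros_like(batch[name])
    return masked, reconstruct
```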

51 of 62

New intermediate modalities from pre-trained experts

52 of 62

Probabilistic hyper-graphs: Modeling Ensembles

  • After the MAE is trained, it can be queried multiple times for the same input
  • Each inference provides a different set of reconstructions
    • Equivalent to multiple pathways reaching the output nodes
  • These are aggregated as in ensemble learning
    • We only do average aggregation for now (sketched below)
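Ensembling by repeated stochastic queries, building on the masking sketch above (model returns a dict of per-task reconstructions; names illustrative):

```python
def mae_ensemble_predict(model, batch, input_names, output_names, n_queries=8):
    """Queries the trained MAE n_queries times with different random view
    masks and averages the per-task reconstructions (simple average)."""
    sums = {name: 0.0 for name in output_names}
    for _ in range(n_queries):
        masked, _ = sample_view_mask(batch, input_names, output_names)
        preds = model(masked)                    # dict: task name -> map
        for name in output_names:
            sums[name] = sums[name] + preds[name]
    return {name: sums[name] / n_queries for name in output_names}
```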

53 of 62

Probabilistic hyper-graphs: Dronescapes-Extended

54 of 62

Results on Multi Task Learning

55 of 62

Results on Semantic Segmentation

  • The MTL variant is trained without masking on the extended dataset
    • RGB -> 3 tasks
    • 80k data points per epoch
  • The MAE variant is evaluated the same way as the MTL one, but from a different checkpoint
  • The MAE variant shows very close results

56 of 62

Results on Semantic Segmentation with Ensembles

57 of 62

Results on Semantic Segmentation: Distillation

  • Online inference Google Colab: link

58 of 62

Qualitative results: Ensemble learning

59 of 62

Qualitative results: Ensemble learning

60 of 62

Next steps

  • Integrate Two-Hop Edges/hyper-edges into the MAE paradigm
  • Integrate selection & learned ensembles
  • Try other masking strategies
  • Try other model architectures: Transformers

61 of 62

Thank you!

62 of 62

References

[8] Jiaw

[9] RAFT

[10] Meshroom

[11] SafeUAV