1 of 62

Semi-supervised learning of multiple tasks on concept neural graphs

Pîrvu Mihai Cristian

2025.04.14

2 of 62

Research goals

  • Multi-modal data
  • Deep Neural Networks
  • Graphs
  • Ensemble Learning
  • Model Distillation
  • Semi-supervised learning

3 of 62

Data

  • “Data is the fossil fuel of AI”
  • The world is inherently multi-modal
    • There are processes happening at every semantic level
    • We are limited by our ability to capture and synchronize it with sensors
  • We need models capable of handling these modalities
    • Both in space and in time
  • We need to constantly introduce new and relevant benchmarks
    • One new dataset, or an extension of an existing one, in every paper

4 of 62

Data: synthetic multi-modal UAV dataset

  • Introduced in the NGC paper [1]
  • UAV-like synthetic dataset
  • We used CARLA simulator [2]
  • 40k data points across 5 splits
    • Train (8k), validation (2k), semi-supervised 1 & 2 (10k each), test (10k)
  • 8 dense representations + pose:
    • 3 inputs & 6 outputs

5 of 62

Data: real-world UAV dataset (2 scenes)

  • Introduced in the Depth Distillation paper [3]
  • 2 real-world scenes:
    • Slanic (9k train & 5k test)
    • Herculane (9k test)
  • Aims at producing the same modality (depth) through different methods: analytical, structural, and unsupervised learning

6 of 62

Data: real-world multi-modal UAV dataset

  • Dronescapes dataset: introduced in the first Hyper-Graphs paper [3]
  • 10 real-world scenes:
    • 7 for training (12K), the same 7 for semi-supervised (11K) & 3 for testing (5.6K)
    • 556 human-annotated semantic maps, extended via label propagation with SegProp [6]
  • 8 dense representations
    • 5 inputs & 3 outputs

7 of 62

Data: real-world multi-modal Earth Observation dataset

  • NEO dataset: introduced in the second Hyper-Graphs paper [4]
  • 22 years of monthly public satellite data
    • 15 years used for training & semi-supervised learning and 7 for inference
  • 19 representations
    • 12 inputs & 7 outputs
  • Challenging dataset due to missing data from various sensors across different periods, as well as few data points overall
    • We need models capable of handling missing data and learning from few data points

8 of 62

Data: real-world multi-modal UAV dataset (extended)

  • Dronescapes-Extended dataset: introduced in the last chapter (unpublished)
  • 16 new scenes
    • 8 for multi-modal training (57K frames + 23K from v1)
    • 8 for semantic-only training (+68K frames)
  • 17 representations
    • 14 inputs & 3 outputs (same)
  • Philosophy: use only RGB-based pre-trained experts, so any video can be ingested automatically

9 of 62

Research goals - Model

  • Multi-modal data
  • Deep Neural Networks
  • Graphs
  • Ensemble Learning
  • Model Distillation
  • Semi-supervised learning

10 of 62

11 of 62

Depth Distillation

  • Plenty of research on analytic solutions for metric depth estimation
    • SfM: RGB (xN) + intrinsic parameters -> unscaled depth
      • Add absolute pose (e.g., GPS) for metric depth via rescaling
    • Odometry: RGB (x2) + optical flow + relative pose (R, t) or instantaneous change
    • Stereo: RGB (x2) + intrinsic & extrinsic params (w/ baseline in meters)
  • Plenty of (newer) research on unsupervised unscaled depth estimation:
    • SfMLearner [7] popularized this approach (photometric loss)

12 of 62

Depth Distillation: advantages and disadvantages

  • SfM
    • Globally consistent; lots of tooling around it to visualise, export, or correct the meshes
    • Slow: can take hours to days for a five-minute UAV scene. Cannot be ‘fine-tuned’
    • Produces holes in areas the global optimizer cannot solve (water, mirrors, textureless surfaces)
  • Odometry-based
    • Fast, uses only raw sensors that are usually available (RGB, GPS, camera orientation)
    • Sensitive to sensor failures and to regions around the focus of expansion (optical flow ~= 0)
      • Various thresholds must be applied, which makes it less general
    • Relies on the optical flow method (quality, runtime, etc.)
  • Unsupervised
    • Fast, data-driven, and produces visually very good results
    • Scale ambiguity, even between two consecutive frames
      • Post-processing required: we use per-frame median scaling over a window [t-w:t] (see the sketch below)
      • May require per-scene fine-tuning
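A minimal sketch of the per-frame median scaling above, assuming unscaled predictions and a metric reference depth (e.g., from SfM or GPS-scaled odometry) are available as arrays; all names are illustrative:

```python
import numpy as np

def median_scale(pred_depths, ref_depths, t, w):
    """Rescale the unscaled depth prediction at frame t using the median
    ratio between metric reference and predicted depths over [t-w, t].
    pred_depths / ref_depths: sequences of HxW maps; the reference may
    contain NaNs where no metric depth is available."""
    ratios = []
    for i in range(max(0, t - w), t + 1):
        valid = np.isfinite(ref_depths[i]) & (pred_depths[i] > 0)
        if valid.any():
            ratios.append(np.median(ref_depths[i][valid] / pred_depths[i][valid]))
    return pred_depths[t] * np.median(ratios)  # median over the window is robust to outlier frames
```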

13 of 62

Depth distillation - overview

  • Solution: combine them using ensembles and distillation
    • Methods used: Jiaw [8] (unsupervised), RAFT [9] (optical flow), Meshroom [10] (SfM), SafeUAV [11] (NN)
  • Ensemble learning works best with diverse candidates.

14 of 62

Depth distillation - results

  • Analytical methods have invalid areas (holes) which cannot be evaluated; a per-pixel masked average, sketched below, is one way to handle them
  • Simple average ensembles will have an average error as well
  • Distillation fails to generalize to other scenes when trained purely by memorization (no task-specific loss) on little data (a single scene)
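A minimal numpy sketch of a per-pixel average that skips invalid candidates, assuming holes are marked as NaN (one plausible handling, not necessarily the paper's exact formulation):

```python
import numpy as np

def masked_average_ensemble(candidates):
    """Per-pixel average over N candidate depth maps, ignoring invalid
    (NaN) pixels such as SfM holes. candidates: list of HxW arrays."""
    stacked = np.stack(candidates)      # (N, H, W)
    return np.nanmean(stacked, axis=0)  # stays NaN only where all candidates are invalid
```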

15 of 62

Depth distillation - results

16 of 62

Research goals - Model

  • Multi-modal data
  • Deep Neural Networks
  • Graphs
  • Ensemble Learning
  • Model Distillation
  • Semi-supervised learning

17 of 62

18 of 62

Neural Graph Consensus - Motivation

  • The world is inherently multi-modal. We need models capable of handling these modalities
  • We use synthetic data (CARLA) to simulate real-world interdependencies
  • NGC models pairwise relationships between input and output representations with edge neural networks

19 of 62

Neural Graph Consensus

20 of 62

Neural Graph Consensus - Inputs & Outputs

  • Create a bipartite graph (two disjoint sets): input modalities and output tasks
    • Inputs: [RGB, Optical flow, Halftone]
    • Outputs: [Depth, Semantic segmentation, Camera normals, World normals, Wireframe, Absolute pose]
  • Inputs and outputs are chosen on the principle of ‘easy acquirability’

21 of 62

Neural Graph Consensus - Edges

  • Create Edges between each Input-Output pair
    • This gives us in total 3 in * 6 out = 18 edges
  • Each Edge is backed by a neural network
    • SafeUAV-1M architecture for dense predictions (Map2Map) and a derivative (Map2Vector) for Pose
  • We train each edge on our supervised dataset
    • 8k data points for train and 2k for validation

22 of 62

Neural Graph Consensus - Two-Hop Edges

  • Create Edges between each Input-Output’-Output triplet
    • In total: 3 in * 5 out’ * 6 out = 90 edges
  • To simplify, we only use RGB as input
    • We reduce to: 1 in * 5 out’ * 6 out = 30 edges
  • We train each Two-Hop Edge on our supervised dataset
    • The Input-Output’ edge is assumed pre-trained (see the sketch below)
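A minimal PyTorch-style sketch of the composition (module names are illustrative): the pre-trained Input->Output’ edge is frozen and only the second hop is trained:

```python
import torch.nn as nn

class TwoHopEdge(nn.Module):
    """Composes a frozen, pre-trained Input->Output' edge with a
    trainable Output'->Output edge network."""
    def __init__(self, first_hop: nn.Module, second_hop: nn.Module):
        super().__init__()
        self.first_hop = first_hop.eval()        # pre-trained, kept frozen
        for p in self.first_hop.parameters():
            p.requires_grad_(False)
        self.second_hop = second_hop             # trained on the supervised set

    def forward(self, x):
        return self.second_hop(self.first_hop(x))
```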

23 of 62

Neural Graph Consensus - Ensemble Learning

  • We now have 18 Edge and 30 Two-Hop Edge trained neural networks
  • For each of the 6 output tasks:
    • 8 independent edges: 3 Single-Hop Edges & 5 Two-Hop Edges
  • Each independent prediction can be seen as an ensemble candidate
  • We aggregate them together using a simple average
  • Note: we now have 48 edges and 6 ensembles across 6 tasks to evaluate!

24 of 62

Neural Graph Consensus - Selection Algorithm

  • Is using all 8 candidates the best thing we can do?
    • Exhaustively trying all combinations leads to 2⁸ - 1 = 255 ensembles to evaluate per task
    • We use a greedy selection algorithm: rank the edges by individual performance and add one edge at a time, resulting in just 8 ensembles to check per task (sketched below)

  • We are left with 21 Edges (out of 48 in total) after selection
    • RGB is always used
    • Pose is never used as intermediate (Vector2Map)
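A minimal sketch of the greedy procedure, assuming lower scores are better and a hypothetical evaluate(subset) callback that scores the simple-average ensemble of a subset on held-out data:

```python
def greedy_selection(candidates, evaluate):
    """Rank candidates by individual performance, then grow the ensemble
    one candidate at a time, keeping the best-scoring prefix. Checks only
    len(candidates) ensembles instead of 2^N - 1."""
    ranked = sorted(candidates, key=lambda c: evaluate([c]))
    best_subset, best_score = ranked[:1], evaluate(ranked[:1])
    for k in range(2, len(ranked) + 1):
        score = evaluate(ranked[:k])
        if score < best_score:
            best_subset, best_score = ranked[:k], score
    return best_subset
```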

25 of 62

Neural Graph Consensus - Semi-supervised Learning

  • We have 4 sets:
    • 8k Train + 2k validation -> full labels
    • 10k Test -> full labels for evaluation
    • 10k Semi-supervised (x2) -> only input representations available
      • Simulates the case where we get new unlabeled data after the initial deployment
  • Generate pseudo-labels on the first semi-supervised set using the best graph (21 edges) discovered earlier during the supervised training stage
  • Create a new dataset: Supervised (8k) + Semi-supervised/Pseudo-labels (10k)
    • Retrain all the edges: Single hop and then Two-hop
    • Note: we only train the 21 edges, not the full graph of 48 edges
  • Create new ensembles leading to the second trained graph
  • Repeat one more time on the Semi-supervised #2 set (only single-hop); the full loop is sketched below
    • 8k Train + 10k semi-supervised #1 + 10k semi-supervised #2
    • We don’t regenerate old pseudo-labels
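The loop above as a hedged pseudocode-style sketch; build_graph, train_edges, ensemble_predict, with_labels and rebuild_ensembles are hypothetical helpers standing in for the actual pipeline:

```python
def iterative_semi_supervised(train_set, semi_sets, build_graph):
    """Sketch of the iterative loop (all helper APIs are hypothetical):
    train supervised, pseudo-label the next unlabeled set with the
    selected graph, retrain on the union, repeat."""
    graph = build_graph()                  # the selected 21-edge graph
    data = train_set
    graph.train_edges(data)                # supervised stage (8k train)
    for semi in semi_sets:                 # semi-supervised sets #1 and #2
        pseudo = graph.ensemble_predict(semi.inputs)
        data = data + semi.with_labels(pseudo)  # old pseudo-labels are kept
        graph.train_edges(data)            # retrain only the selected edges
        graph.rebuild_ensembles()
    return graph
```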

26 of 62

27 of 62

28 of 62

Hyper-Graphs

29 of 62

30 of 62

Hyper-Graphs: Motivation

  • The NGC research was done only on simulated data
    • It was unclear whether similar gains would be measured on real-world data
  • The edges (Single-Hop and Two-Hop) from NGC had a few blind spots:
    • No communication between input representations
    • Only pairwise communication between output representations (Two-Hop)
  • The ensembles used a simple average aggregation
    • Unable to adapt when one candidate predicts very wrong results (e.g., out of domain)

31 of 62

  • Same setup as in NGC
  • We define input and output representations creating a bipartite graph
  • Input representations are ‘easy’ to acquire
    • RGB: make a new flight or get UAV footage from the internet
  • Output representations are ‘hard’ to acquire
    • Semantic segmentation may require human annotation
  • We do supervised training, graph-pathway ensembles, pseudo-labeling, and iterative semi-supervised learning

32 of 62

Hyper-Graphs: edges

  • Single-Hop Edges and Two-Hop Edges are not changed from NGC
    • # of Edges: I * O
    • # of TH-E: I * (O - 1) * O

33 of 62

Hyper-Graph: Aggregation hyper-edge

  • The first hyper-edge concatenates all (or a subset) of the inputs before passing them through the edge neural network
  • Inputs must be concatenable channel-wise
    • Maps of the same shape: RGB + Edges
    • Map + vector is not possible: RGB + pose
    • Pose can be turned into a map as a workaround
  • Maximum theoretical number of AH:
    • (2ᴵ - I - 1) * O
    • “Minus I” because size-1 subsets are just single-hop edges (and minus 1 for the empty set)
  • In practice we just do 1 AH per output node, combining all inputs (sketched below)
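A minimal PyTorch sketch of an aggregation hyper-edge (names illustrative): compatible input maps are concatenated channel-wise and fed to a single edge network:

```python
import torch
import torch.nn as nn

class AggregationHyperEdge(nn.Module):
    """Concatenates several input maps channel-wise and passes the result
    through one edge network. All inputs must share the spatial shape."""
    def __init__(self, edge_network: nn.Module):
        super().__init__()
        self.edge_network = edge_network

    def forward(self, input_maps):           # list of (B, C_i, H, W) tensors
        x = torch.cat(input_maps, dim=1)     # (B, sum(C_i), H, W)
        return self.edge_network(x)
```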

34 of 62

Hyper-Graph: Ensemble hyper-edge

  • Ensemble hyper-edges are the generalization of the Two-Hop Edge to multiple pathways
  • We can message-pass along multiple pathways, aggregate them, and then pass the aggregated result through a final edge network

35 of 62

Hyper-Graph: Cycle hyper-edge

  • An Aggregation Hyper-edge applied at the output-node level
  • First, we have one or more pathways towards an output node
    • AH: ABC->D’
    • AH: ABC->E’
  • Then we combine these with an AH at the next step, creating a secondary edge from each output node towards itself
    • CH: ABC + D’ + E’ -> D’’
    • CH: ABC + D’ + E’ -> E’’
  • There are many possible CHs, but in practice we use just one on top of the single AH (restated below)
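Restated compactly (f and g denote the edge networks behind the AH and CH for outputs D and E):

```latex
D'  = f_{\mathrm{AH}}(A, B, C) \qquad E'  = g_{\mathrm{AH}}(A, B, C)
D'' = f_{\mathrm{CH}}(A, B, C, D', E') \qquad E'' = g_{\mathrm{CH}}(A, B, C, D', E')
```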

36 of 62

Hyper-Graph: learned ensembles

  • We have N>1 pathways towards each output node through various edges and hyper-edges
  • Before: we used a simple average
  • Problem: we needed an expensive selection algorithm to remove outliers
  • Proposal: introduce a small aggregation neural network

37 of 62

Hyper-Graph: learned ensembles

4 methods (the last is sketched below):

  • Linear regression
  • Direct mapping
  • Learned map weights
  • Learned pixel weights
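A minimal PyTorch sketch of the ‘learned pixel weights’ variant (architecture illustrative): a small conv net predicts a per-pixel softmax weight for each of the N candidates:

```python
import torch.nn as nn

class LearnedPixelWeights(nn.Module):
    """Aggregates N single-channel candidate maps with learned per-pixel
    weights: predict N weight maps, softmax across candidates, weighted sum."""
    def __init__(self, n_candidates: int, hidden: int = 32):
        super().__init__()
        self.weight_net = nn.Sequential(
            nn.Conv2d(n_candidates, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, n_candidates, kernel_size=3, padding=1),
        )

    def forward(self, candidates):                        # (B, N, H, W)
        weights = self.weight_net(candidates).softmax(dim=1)
        return (weights * candidates).sum(dim=1, keepdim=True)  # (B, 1, H, W)
```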

38 of 62

Experiments: Dronescapes

  • 10 scenes: 7 for training, 3 for testing
    • Train+Val: 233 human-labeled -> 12k samples in total
    • Semi-supervised: 207 human-labeled -> 11k
    • Test set: 116 human-labeled -> 5.6k
  • 5 input & 3 output representations
    • Semantic segmentation is only evaluated on human-annotated data
    • Camera Normals and Depth are trained and evaluated based on SfM data

39 of 62

Experiments: Dronescapes - Edges performance

40 of 62

Experiments: Dronescapes - Ensembles on semantic

41 of 62

Experiments: NEO

42 of 62

Experiments: NEO

  • 12 inputs & 7 outputs
    • 84 Single-Hop Edges (E) and 42 Ensemble Hyper-Edges (EH)
    • 7 Aggregation Hyper-edges (AH) and 7 Cycle Hyper-edges (CH), one CH for each AH
    • In total: 140 neural networks
  • We use “Relative Performance Improvement” (RPI) as a metric so we can aggregate all 7 tasks without scaling issues (formalized below)
  • Baseline: best performing edge for each task
  • ARPI: Average RPI across all 7 tasks
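One plausible formalization, assuming per-task errors e_t where lower is better (the authoritative definition is in [4]):

```latex
\mathrm{RPI}_t = \frac{e_t^{\mathrm{baseline}} - e_t^{\mathrm{model}}}{e_t^{\mathrm{baseline}}}
\qquad
\mathrm{ARPI} = \frac{1}{7} \sum_{t=1}^{7} \mathrm{RPI}_t
```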

43 of 62

Experiments: NEO - Ensemble results

44 of 62

Experiments: NEO - Iterative semi-supervised learning

45 of 62

Probabilistic Hyper-Graphs

46 of 62

Probabilistic hyper-graphs: Motivation

  • One of the issues with NGC and hyper-graphs is the need to explicitly model the edges
    • NGC: 48 edges
    • Hyper-graphs Dronescapes: 33 edges
    • Hyper-graphs NEO: 140 edges
  • Each edge is a neural network, and some depend on others (Two-Hop Edges depend on Edges)
    • We built a framework on top of this just to manage the dependency complexity: knowing what needs to be trained, parallel training on multiple GPUs, etc.
  • What if we could boil this complexity down into a single neural network that models the hyper-graph edges?

47 of 62

Probabilistic hyper-graphs: Motivation

48 of 62

Probabilistic hyper-graphs: MAE

  • It turns out this is possible with Masked Autoencoders (MAE)
  • Standard MAE: given an input vector X (with parts masked out), the MAE reconstructs the vector Y

49 of 62

Probabilistic hyper-graphs: Modeling Edges

  • If we define input and output features:
    • Inputs may or may not be masked
    • Outputs are always masked
    • Only 1 output is reconstructed
  • Then this MAE effectively models Edges and Aggregation Hyper-Edges

50 of 62

Probabilistic hyper-graphs: Modeling Edges

In experiments:

  • We use the same SafeUAV model
    • We mask entire views, not at the patch level
  • RGB is never masked
  • We create intermediate modalities, derived from experts (13 new + RGB)
    • They are sometimes masked; if masked, they are also reconstructed
  • Output tasks are always masked
    • All output tasks are reconstructed at once; this worked better experimentally
  • Each masking is equivalent to sampling from the space of hyper-edges (E or AH); see the sketch below
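A minimal sketch of view-level masking, assuming each modality is a tensor in a dict and masked views are zeroed out (the real model may use a learned mask token; the "rgb" key and p_mask are illustrative):

```python
import random
import torch

def sample_view_mask(batch, input_names, output_names, p_mask=0.5):
    """Masks whole modalities ('views'), not patches: RGB is never masked,
    intermediate modalities are masked with probability p_mask, and output
    tasks are always masked. Each sampled mask corresponds to one edge or
    aggregation hyper-edge of the graph."""
    masked = dict(batch)
    reconstruct = list(output_names)                      # outputs always reconstructed
    for name in input_names:
        if name != "rgb" and random.random() < p_mask:    # "rgb" key is illustrative
            masked[name] = torch.zeros_like(batch[name])  # zeroing stands in for a mask token
            reconstruct.append(name)                      # masked inputs are reconstructed too
    for name in output_names:
        masked[name] = torch.zeros_like(batch[name])
    return masked, reconstruct
```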

51 of 62

New intermediate modalities from pre-trained experts

52 of 62

Probabilistic hyper-graphs: Modeling Ensembles

  • After the MAE is trained, it can be queried multiple times for the same input
  • Each inference provides a different set of reconstructions
    • Equivalent to multiple pathways reaching the output nodes
  • These are aggregated as in ensemble learning
    • We only do average aggregation for now (sketched below)
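Ensembling by repeated stochastic queries, building on the masking sketch above (model returns a dict of per-task reconstructions; names illustrative):

```python
def mae_ensemble_predict(model, batch, input_names, output_names, n_queries=8):
    """Queries the trained MAE n_queries times with different random view
    masks and averages the per-task reconstructions (simple average)."""
    sums = {name: 0.0 for name in output_names}
    for _ in range(n_queries):
        masked, _ = sample_view_mask(batch, input_names, output_names)
        preds = model(masked)                    # dict: task name -> map
        for name in output_names:
            sums[name] = sums[name] + preds[name]
    return {name: sums[name] / n_queries for name in output_names}
```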

53 of 62

Probabilistic hyper-graphs: Dronescapes-Extended

54 of 62

Results on Multi Task Learning

55 of 62

Results on Semantic Segmentation

  • The MTL variant is trained without masking on the extended dataset
    • RGB -> 3 tasks
    • 80k data points per epoch
  • The MAE variant is evaluated the same way as the MTL one, but from a different checkpoint
  • The MAE variant shows very close results

56 of 62

Results on Semantic Segmentation with Ensembles

57 of 62

Results on Semantic Segmentation: Distillation

  • Online inference Google Colab: link

58 of 62

Qualitative results: Ensemble learning

59 of 62

Qualitative results: Ensemble learning

60 of 62

Next steps

  • Integrate Two-Hop Edges/hyper-edges into the MAE paradigm
  • Integrate selection & learned ensembles
  • Try other masking strategies
  • Try other model architectures: Transformers

61 of 62

Thank you!

62 of 62

References

[8] Jiaw

[9] RAFT

[10] Meshroom

[11] SafeUAV