1 of 20

Week 2: Know thy data

Daniel Buscombe

Marda Science & USGS Pacific Coastal and Marine Science Center

“ML Mondays”

Oct. 2020

2 of 20

Label data classes

Some label data is “messy”, which often means it contains significant error, high intra-class variability, or low inter-class variability

A lot of label data is messy or class imbalanced:

  • Visualize the data and classes
  • Recategorize if necessary
  • Garbage in = Garbage out
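One way to look at the classes before training is to count how often each label occurs. The sketch below is a minimal illustration, not part of the original slides; the class names and counts are made up, and the "imbalance ratio" (largest class count divided by smallest) is just one convenient summary.

```python
from collections import Counter


def class_balance(labels):
    """Return per-class counts, per-class fractions, and an imbalance ratio."""
    counts = Counter(labels)
    total = sum(counts.values())
    fractions = {name: n / total for name, n in counts.items()}
    # Imbalance ratio: count of the most common class / count of the rarest class
    ratio = max(counts.values()) / min(counts.values())
    return counts, fractions, ratio


# Hypothetical labels, for illustration only
labels = ["water"] * 80 + ["sand"] * 15 + ["vegetation"] * 5

counts, fractions, ratio = class_balance(labels)

# Crude text histogram: one row per class, most common first
for name, n in counts.most_common():
    bar = "#" * round(40 * fractions[name])
    print(f"{name:<12}{n:>5}  {bar}")
print(f"imbalance ratio: {ratio:.1f}")
```

A large ratio is a prompt to collect more examples of the rare classes, reweight them, or recategorize.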

[Figure: intra- (within) class variability, low vs. high; inter- (between) class variability, high vs. low]

A lot of the time (most!) on an ML project is spent on the data and the data delivery pipeline

3 of 20

Class definition is usually iterative

Lump? Or split?

6 classes

4 classes

What are the project implications?
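Lumping can be tried cheaply by keeping the fine labels and applying a lookup table, so the decision stays reversible. The sketch below assumes a hypothetical 6-class scheme merged into 4 classes; the class names are invented for illustration and are not the classes shown on the slide.

```python
from collections import Counter

# Hypothetical mapping: 6 fine classes lumped into 4 coarse classes
LUMP = {
    "deep_water": "water",
    "shallow_water": "water",
    "wet_sand": "sand",
    "dry_sand": "sand",
    "vegetation": "vegetation",
    "development": "development",
}


def lump(labels, mapping):
    """Map each fine label to its coarse class, failing loudly on unknown labels."""
    missing = set(labels) - set(mapping)
    if missing:
        raise KeyError(f"no mapping for classes: {sorted(missing)}")
    return [mapping[label] for label in labels]


# Hypothetical fine-grained labels
labels6 = ["deep_water", "shallow_water", "wet_sand", "dry_sand",
           "vegetation", "development", "shallow_water"]
labels4 = lump(labels6, LUMP)

print(Counter(labels6))
print(Counter(labels4))
```

Splitting goes the other way and cannot be done with a lookup table: it needs relabelling, which is one of the project implications to weigh.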

4 of 20

FAIR data

5 of 20

OpenML.org

6 of 20

Google Datasets

7 of 20

OpenImages V6 (9,178,275 images + labels)

8 of 20

LILA-BC

http://lila.science/datasets

9 of 20

Earth Observatory

https://earthobservatory.nasa.gov/map#2/0.0/-0.2

10 of 20

Committee on Earth Observation Satellites (CEOS)

https://coverage.ceos.org/

11 of 20

SpaceNet7

https://medium.com/the-downlinq/the-spacenet-7-multi-temporal-urban-development-challenge-dataset-release-9e6e5f65c8d5

12 of 20

LandCoverNet

https://www.mlhub.earth/

https://medium.com/radiant-earth-insights/radiant-earth-foundation-releases-the-benchmark-training-data-landcovernet-for-africa-7e8906e846a3

13 of 20

14 of 20

makesense.ai

  • No login/account required
  • Doesn’t store your data or labels - no privacy concerns
  • Bounding boxes for object recognition
  • Polygons (convert to label images for image segmentation)
  • Points and lines
  • New “image recognition” function - discrete labels of images

15 of 20

This week’s blog post

How to create label images from JSON polygons downloaded from makesense.ai
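The core idea of turning polygons into a label image can be sketched in plain Python as below. This is not the blog post's code, and the input structure (a list of dicts with hypothetical "label" and "points" keys) is an assumption for illustration; the JSON exported by makesense.ai would first need to be parsed into this shape. Each pixel centre is tested against each polygon with an even-odd ray-casting rule.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd rule: count polygon edges crossed by a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygons_to_label_image(shapes, class_ids, width, height, background=0):
    """Return a height x width grid of integer class codes (list of rows)."""
    image = [[background] * width for _ in range(height)]
    for shape in shapes:  # later shapes overwrite earlier ones where they overlap
        value = class_ids[shape["label"]]
        for row in range(height):
            for col in range(width):
                # Test the pixel centre, not its corner
                if point_in_polygon(col + 0.5, row + 0.5, shape["points"]):
                    image[row][col] = value
    return image


# Hypothetical polygons in pixel coordinates (x, y)
shapes = [
    {"label": "water", "points": [(0, 0), (4, 0), (4, 4), (0, 4)]},
    {"label": "sand", "points": [(4, 2), (8, 2), (8, 6), (4, 6)]},
]
class_ids = {"water": 1, "sand": 2}  # 0 is reserved for unlabelled background

label_image = polygons_to_label_image(shapes, class_ids, width=8, height=6)
for row in label_image:
    print("".join(str(v) for v in row))
```

The per-pixel loop is slow on real imagery; an image library's polygon-fill routine would normally replace it, but the output (one integer class code per pixel) is the same kind of label image a segmentation model trains on.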

16 of 20

Microsoft VoTT

https://github.com/microsoft/VoTT

17 of 20

Doodler

https://dbuscombe-usgs.github.io/doodle_labeller/

For rapid segmentation of natural scenes (primarily)

“Human in the loop” segmentation - takes clues from you, and uses them to complete the scene, with uncertainty estimates

Work in progress - contributions welcome!

18 of 20

DotDotGoose

Just counts?

https://github.com/persts/DotDotGoose

(that might be an image regression problem)

19 of 20

Data labelling

Most ML (and DL) is currently designed for supervised learning, but there is a growing need for unsupervised and semi-supervised methodologies because:

  • There is far more unlabelled data than label data
  • A lot of label data is messy
  • The classes (the discrete set of label choices) themselves may not be optimal for the data

https://amitness.com/2020/02/illustrated-self-supervised-learning/

20 of 20

Pseudo-labelling

The most straightforward approach, but it relies absolutely on a model that is very accurate and generalizes well; otherwise errors may propagate

Used well, should only enhance a good model

https://datawhatnow.com/pseudo-labeling-semi-supervised-learning/
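A common way to limit error propagation is to accept a pseudo-label only when the model is confident. The sketch below assumes the model outputs one probability per class for each unlabelled sample; the 0.95 threshold, the class names, and the probabilities are illustrative assumptions, not values from the slides.

```python
def pseudo_label(probabilities, class_names, threshold=0.95):
    """Return (sample_index, class_name) for samples predicted with high confidence."""
    accepted = []
    for index, probs in enumerate(probabilities):
        # Index of the most probable class for this sample
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            accepted.append((index, class_names[best]))
    return accepted


class_names = ["water", "sand", "vegetation"]

# Hypothetical model outputs for three unlabelled samples
probabilities = [
    [0.98, 0.01, 0.01],  # confident: accepted as "water"
    [0.50, 0.45, 0.05],  # ambiguous: left unlabelled
    [0.02, 0.01, 0.97],  # confident: accepted as "vegetation"
]

accepted = pseudo_label(probabilities, class_names)
print(accepted)
```

The accepted samples would be added to the training set and the model retrained. A confident model can still be confidently wrong, so the threshold reduces, but does not remove, the risk described above.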