Week 2: Know thy data
Daniel Buscombe
Marda Science & USGS Pacific Coastal and Marine Science Center
“ML Mondays”
Oct. 2020
Label data classes
Some label data is “messy”, which often means it contains significant error, high intra-class variability, or low inter-class variability
A lot of label data is messy or class imbalanced:
[Figure: intra- (within) class variability, low vs. high; inter- (between) class variability, high vs. low]
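One common way to quantify and compensate for class imbalance is to weight each class by the inverse of its frequency. The sketch below is illustrative only: the label names and counts are made up, and the weighting formula (n_samples / (n_classes * count)) is one widely used "balanced" heuristic, not a method prescribed by these slides.

```python
from collections import Counter


def class_weights(labels):
    """Return inverse-frequency weights: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {name: n_samples / (n_classes * k) for name, k in counts.items()}


# Hypothetical, deliberately imbalanced label set (names are assumptions).
labels = ["water"] * 6 + ["sand"] * 3 + ["vegetation"] * 1

weights = class_weights(labels)
for name, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name}: {weight:.3f}")
```

Rare classes receive larger weights, so errors on them count for more if the weights are passed to a weighted loss during training.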
A lot of the time (most!) on an ML project is spent on the data and the data delivery pipeline
Class definition is usually iterative
Lump? Or split?
6 classes
4 classes
What are the project implications?
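One practical implication of lumping versus splitting can be sketched in code: going from a fine scheme to a coarse one is a simple lookup, whereas going the other way requires relabelling the data. The class names and the 6-to-4 mapping below are hypothetical and chosen only for illustration.

```python
# Hypothetical 6-class scheme and a mapping that lumps it into 4 classes.
LUMP = {
    "deep_water": "water",
    "shallow_water": "water",
    "wet_sand": "sand",
    "dry_sand": "sand",
    "vegetation": "vegetation",
    "development": "development",
}


def lump_labels(labels, mapping):
    """Relabel each item; raises KeyError for a class the mapping misses."""
    return [mapping[label] for label in labels]


fine = ["deep_water", "wet_sand", "dry_sand",
        "vegetation", "shallow_water", "development"]
coarse = lump_labels(fine, LUMP)
print(coarse)
```

Because the mapping is many-to-one, labels created with the split scheme can always be lumped later, which is one argument for starting with finer classes when the labelling budget allows.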
FAIR data
Openml.org
Google Datasets
OpenImages V6 (9,178,275 images + labels)
LILA-BC
http://lila.science/datasets
Earth Observatory
https://earthobservatory.nasa.gov/map#2/0.0/-0.2
Committee on Earth Observation Satellites (CEOS)
https://coverage.ceos.org/
SpaceNet7
https://medium.com/the-downlinq/the-spacenet-7-multi-temporal-urban-development-challenge-dataset-release-9e6e5f65c8d5
LandCoverNet
https://www.mlhub.earth/
https://medium.com/radiant-earth-insights/radiant-earth-foundation-releases-the-benchmark-training-data-landcovernet-for-africa-7e8906e846a3
makesense.ai
This week’s blog post
How to create label images from JSON polygons downloaded from makesense.ai
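The idea of turning polygon annotations into a label image can be sketched with the standard library alone. This is not the blog post's code. The JSON layout below (a `regions` mapping with `all_points_x`, `all_points_y` and a `label` attribute) is an assumption modelled on VGG-style polygon exports, so check it against the file you actually download. The file name, class name and image size are made up.

```python
import json


def point_in_polygon(x, y, xs, ys):
    """Even-odd ray casting test of point (x, y) against polygon vertices."""
    inside = False
    j = len(xs) - 1
    for i in range(len(xs)):
        # Does edge (j -> i) straddle the horizontal line through y?
        if (ys[i] > y) != (ys[j] > y):
            x_cross = xs[i] + (y - ys[i]) * (xs[j] - xs[i]) / (ys[j] - ys[i])
            if x < x_cross:
                inside = not inside
        j = i
    return inside


def rasterize(regions, class_ids, width, height):
    """Return a height x width grid of class ids (0 = unlabelled)."""
    label = [[0] * width for _ in range(height)]
    for region in regions:
        shape = region["shape_attributes"]
        class_id = class_ids[region["region_attributes"]["label"]]
        xs, ys = shape["all_points_x"], shape["all_points_y"]
        for row in range(height):
            for col in range(width):
                # Test the pixel centre against the polygon.
                if point_in_polygon(col + 0.5, row + 0.5, xs, ys):
                    label[row][col] = class_id
    return label


# Assumed export structure, with one rectangular "water" polygon.
export = json.loads("""
{"example.jpg": {"regions": {"0": {
    "shape_attributes": {"name": "polygon",
                         "all_points_x": [1, 4, 4, 1],
                         "all_points_y": [1, 1, 3, 3]},
    "region_attributes": {"label": "water"}}}}}
""")

regions = export["example.jpg"]["regions"].values()
label_image = rasterize(regions, {"water": 1}, width=6, height=5)
for row in label_image:
    print(row)
```

The per-pixel loop is slow for real imagery. In practice a polygon-filling routine from an imaging library (for example Pillow's `ImageDraw.polygon`) would do the same job much faster.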
Microsoft VOTT
https://github.com/microsoft/VoTT
Doodler
https://dbuscombe-usgs.github.io/doodle_labeller/
For rapid segmentation of natural scenes (primarily)
“Human in the loop” segmentation - takes clues from you, and uses them to complete the scene, with uncertainty estimates
Work in progress - contributions welcome!
DotDotGoose
Data labelling
Most ML (and DL) is currently designed for supervised learning, but there is a growing need for unsupervised and semi-supervised methodologies because:
https://amitness.com/2020/02/illustrated-self-supervised-learning/
Pseudo-labelling
The most straightforward approach, but it relies entirely on a model that generalizes well and predicts very accurately; otherwise errors may propagate
Used well, it should only enhance an already good model
https://datawhatnow.com/pseudo-labeling-semi-supervised-learning/
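The selection step in pseudo-labelling can be sketched as a confidence filter: only predictions the model is very sure about are kept as new labels, which limits (but does not remove) the risk of propagating errors. The probabilities and the 0.95 threshold below are hypothetical, and the function name is made up for illustration.

```python
def pseudo_label(probabilities, threshold=0.95):
    """Return (sample index, class index) pairs at or above the threshold."""
    selected = []
    for i, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            selected.append((i, best))
    return selected


# Hypothetical softmax outputs of a trained model on four unlabelled samples.
unlabelled_probs = [
    [0.98, 0.01, 0.01],
    [0.40, 0.35, 0.25],
    [0.02, 0.97, 0.01],
    [0.60, 0.30, 0.10],
]

confident = pseudo_label(unlabelled_probs, threshold=0.95)
print(confident)
```

The confident samples would then be added to the training set with their predicted classes and the model retrained. The ambiguous samples stay unlabelled.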