1 of 24

Distributed Data Grids & Federation: The CyVerse Data Store

Nirav Merchant

nirav@arizona.edu

PI, CyVerse

Dir. Data Science Institute

University of Arizona, Tucson

2 of 24

3 of 24

4 of 24

How does CyVerse work?

5 of 24

Science use cases using CyVerse

Figure labels: 22.1 m; 394 m; 2,000 kg; 480 V 3-phase AC; 30 ton

Photo: Jesse Rieser for The Wall Street Journal

6 of 24

Processing a whole lot of data

  • RGB: ~550 GB
  • Thermal: ~5.5 GB
  • Fluorescence: ~80 GB
  • 3D: ~300 GB
  • Hyperspectral: ~600 GB

  • The world’s biggest scanalyzer
  • Data volume
    • Max capacity of 10 TB/day
    • Typical performance of 1.5 TB/day
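As a quick sanity check, the per-sensor volumes listed above can be summed and compared with the typical figure of 1.5 TB/day. This sketch assumes the per-sensor numbers describe the same time window as the daily figure and that 1 TB = 1000 GB; neither is stated on the slide.

```python
# Per-sensor data volumes from the slide, in gigabytes.
sensor_volumes_gb = {
    "RGB": 550.0,
    "Thermal": 5.5,
    "Fluorescence": 80.0,
    "3D": 300.0,
    "Hyperspectral": 600.0,
}

# Assumption: decimal units, 1 TB = 1000 GB.
total_gb = sum(sensor_volumes_gb.values())
total_tb = total_gb / 1000

print(f"Total across sensors: {total_gb:.1f} GB ({total_tb:.2f} TB)")
```

Under those assumptions the sensors together account for roughly 1.5 TB, in line with the typical daily volume.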

7 of 24

Plant Phenotyping Data transfer and computation

  • Compression
  • Checksums
  • Collect data
  • 1-10TB/day
  • Storage (raw + processed)

  • HPC
  • Workflows
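The compression and checksum steps in this pipeline might look like the sketch below. It is illustrative only: the slide does not say which compression format or hash the pipeline uses, so gzip and SHA-256 are assumptions, and the function names are hypothetical.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large scans never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compress_with_checksum(src):
    """Gzip `src` next to the original and return (compressed path, checksum)."""
    src = Path(src)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst, sha256_of(dst)


def verify(path, expected_checksum):
    """Re-hash the file after transfer and compare with the recorded checksum."""
    return sha256_of(path) == expected_checksum
```

The checksum is computed on the compressed file, so the receiving side can verify the transfer before spending time on decompression.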

DataStore

Cache Server

UA HPC

Algorithm

  • Prototyping
  • Development
  • Testing
  • Collaboration/Publication

Collaborators and Public

8 of 24

Event Horizon Telescope

  • Compression
  • Checksums
  • Worldwide network of observatories
  • Storage (raw + processed)

  • Distributed Compute
  • Workflows

DataStore

Cache Server

Open Science Grid

Algorithm

  • Prototyping
  • Development
  • Testing
  • Collaboration/Publication

Collaborators and Public

9 of 24

From Citizen-Science to Your Phone: Insect Classification and Detection Using Self-Supervised Learning Methods

Zi Deng

PhD Student

Electrical Engineering

University of Arizona

Data Science Institute

Shivani Chiranjeevi

PhD Student

Mechanical Engineering

Iowa State University

AIIRA

Arti Singh

Assistant Professor

Department of Agronomy

Iowa State University

10 of 24

Goal: Mobile Insect Identification App

AIIRA

Mobile web app to accurately identify 142 agriculturally important insect-pest species (extended to 2526 species)

Key Features:

  • Differentiate beneficial and harmful insects
  • Ability to detect insect-pest species at different stages
  • Ability to differentiate fine-grained species
  • Provides scientific to common name mapping

11 of 24

iNaturalist Open Dataset

AIIRA

Metadata Columns

  1. Observations
    1. observation_uuid
    2. observer_id
    3. latitude
    4. longitude
    5. positional_accuracy
    6. taxon_id
    7. quality_grade
    8. observed_on
  2. Observers
    • observer_id
    • login
    • name
  3. Photos
    • photo_uuid
    • photo_id
    • observation_uuid
    • observer_id
    • extension
    • license
    • width
    • height
    • position
  4. Taxa
    • taxon_id
    • ancestry
    • rank_level
    • rank
    • name
    • active

  • Observation Access:

http://inaturalist-open-data.s3.amazonaws.com/photos/[photo_id]/[size].jpg

  • Original: 2048px
  • Large: 1024px
  • Medium: 500px
  • Small: 240px
  • Thumb: 100px
  • Square: 75px x 75px
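The URL pattern above can be filled in programmatically. A minimal sketch: the helper name `photo_url` is hypothetical, the lowercase size names are an assumption about how the bucket spells them, and the `extension` parameter assumes the value comes from the `extension` column of the Photos table (the slide shows only `.jpg`).

```python
# Size names from the slide; lowercase spelling in the URL is an assumption.
VALID_SIZES = ("original", "large", "medium", "small", "thumb", "square")

BASE_URL = "http://inaturalist-open-data.s3.amazonaws.com/photos"


def photo_url(photo_id, size="medium", extension="jpg"):
    """Build the photo URL for one photo_id at the requested size."""
    if size not in VALID_SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {VALID_SIZES}")
    return f"{BASE_URL}/{photo_id}/{size}.{extension}"


print(photo_url(12345, size="small"))
```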

12 of 24

Data Extraction for Classification

AIIRA

Challenges:

    • Depth of hierarchy varies for different species, e.g. some levels are missing in the phylogenetic tree for certain species
    • Image-by-image querying from iNaturalist website
        • Very time-consuming; could take months to years
        • Not feasible for dataset size in the range of millions of images
    • Bulk download by species is not available
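One way around the lack of per-species bulk download is to join the metadata tables locally and group photo IDs by taxon. The sketch below does this on small in-memory rows; it is an illustration of the idea, not necessarily how iNatSD works. The column names come from the metadata listing, while the function name and the toy rows are hypothetical.

```python
from collections import defaultdict


def photos_by_taxon(observations, photos, taxa, rank="species"):
    """Group photo_ids by taxon_id, keeping only taxa at the given rank."""
    wanted = {t["taxon_id"] for t in taxa if t["rank"] == rank}
    # Map each observation to its taxon, skipping taxa we do not want.
    obs_to_taxon = {
        o["observation_uuid"]: o["taxon_id"]
        for o in observations
        if o["taxon_id"] in wanted
    }
    grouped = defaultdict(list)
    for p in photos:
        taxon_id = obs_to_taxon.get(p["observation_uuid"])
        if taxon_id is not None:
            grouped[taxon_id].append(p["photo_id"])
    return dict(grouped)


# Toy rows for illustration only; real tables hold millions of records.
taxa = [
    {"taxon_id": 101, "rank": "species", "name": "Example species"},
    {"taxon_id": 50, "rank": "genus", "name": "Example genus"},
]
observations = [
    {"observation_uuid": "a", "taxon_id": 101},
    {"observation_uuid": "b", "taxon_id": 50},
]
photos = [
    {"photo_id": 1, "observation_uuid": "a"},
    {"photo_id": 2, "observation_uuid": "a"},
    {"photo_id": 3, "observation_uuid": "b"},
]

species_photos = photos_by_taxon(observations, photos, taxa)
```

Each entry in `species_photos` is then a self-contained list of photo IDs that can be downloaded without querying the website image by image.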

13 of 24

Insecta Dataset

AIIRA

  • The Insecta dataset contains ~14 million images across ~95,000 species, totaling ~5.7 terabytes.
  • Acquisition of this dataset would not have been feasible without iNatSD.

14 of 24

AIIRA

Scaling and Parallelization

  • Scalable Design
  • AWS Open Data Sponsorship covers all costs
  • Limited only by your own architecture
  • Each species download considered a separate job
  • Parallelization through Snakemake
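The "one job per species" design can be illustrated in plain Python. This sketch uses `concurrent.futures` as a stand-in for Snakemake, which the slide names as the actual scheduler; `download_species` is a hypothetical placeholder that only counts photos instead of fetching them.

```python
from concurrent.futures import ThreadPoolExecutor


def download_species(species, photo_ids):
    """Placeholder job: a real job would fetch every photo for this species."""
    return species, len(photo_ids)


def run_jobs(jobs, max_workers=4):
    """Run one independent job per species and collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(download_species, species, photo_ids)
            for species, photo_ids in jobs.items()
        ]
        return dict(future.result() for future in futures)


# Hypothetical job list: species name mapped to its photo IDs.
results = run_jobs({"species_a": [1, 2, 3], "species_b": [4]})
```

Because the jobs share no state, a failed species can be retried on its own, which is the same property a workflow manager such as Snakemake relies on.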

15 of 24

AIIRA

Dataflow

Repeat for each species

  • For large datasets:
    • Utilize CyVerse for data store
    • Utilize HPC or other computing resources for computation

16 of 24

AIIRA

Additional Features

A sunburst plot is a data visualization technique that displays hierarchical data in a radial layout.

iNaturalist Insect Top 100 Species interactive visualization

iNaturalist is constantly updated. iNatSD includes an update feature to repeat downloads.

17 of 24

What is the community building?

18 of 24

New, improved workflow using ML

Train machine learning model to detect leaves

Annotate images

Build Streamlit app

(Hosted on CyVerse as VICE app)

Researcher uploads new data to CyVerse

Researcher can use Streamlit app to run ML model on new data

Add unique QR codes for each plant

19 of 24

QR codes

  • Automatically identify each plant
  • Can use this information to rename files automatically or attach metadata
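Once a QR code has been decoded to a plant identifier, the renaming step could be as simple as the sketch below. It assumes the decoding has already happened in a separate step (the slides do not name a decoder), and both the function name and the naming scheme are hypothetical.

```python
from pathlib import Path


def renamed_path(image_path, plant_id):
    """Return a new path that prefixes the file name with the plant ID."""
    image_path = Path(image_path)
    # Keep only characters that are safe in file names.
    safe_id = "".join(
        c if c.isalnum() or c in "-_" else "_" for c in plant_id.strip()
    )
    return image_path.with_name(f"{safe_id}_{image_path.name}")


new_path = renamed_path("photos/IMG_0001.jpg", "plant 42")
# To apply the rename on disk: Path("photos/IMG_0001.jpg").rename(new_path)
```

Keeping the original file name as a suffix preserves the link back to the camera's own numbering.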

20 of 24

Annotate images

  • Open source software for labeling data for machine learning
  • Allows many people to collaborate on labeling data
  • Supports many types of data: audio, text, time series, and images

https://labelstud.io

21 of 24

Machine Learning Model

  • Use labeled data to “train” an ML model
  • We are using: Mask R-CNN model w/ Detectron2 (object detection library built on PyTorch)

Final output is a .pth file

(PyTorch ML model weights)
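Loading the trained .pth weights for inference with Detectron2 might look like the sketch below. The base config file, the single "leaf" class, the score threshold, and the function name are all assumptions; the slides state only that Mask R-CNN and Detectron2 are used. The Detectron2 imports sit inside the function so the module can be imported even where Detectron2 is not installed.

```python
def build_leaf_predictor(weights_path, num_classes=1, score_threshold=0.5,
                         device="cpu"):
    """Create a Detectron2 predictor from trained Mask R-CNN weights."""
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    # Assumption: training started from this model-zoo Mask R-CNN config.
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = num_classes  # assumed: one "leaf" class
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = score_threshold
    cfg.MODEL.WEIGHTS = weights_path  # path to the .pth file
    cfg.MODEL.DEVICE = device
    return DefaultPredictor(cfg)
```

Calling the returned predictor on an image array (BGR channel order by default) gives a dictionary whose `"instances"` entry holds the predicted masks, boxes, and scores.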

22 of 24

So we have a model... now what?

23 of 24

New, improved workflow using ML

Train machine learning model to detect leaves

Annotate images

Build Streamlit app

(CyVerse VICE app)

Researcher uploads new data to CyVerse

Researcher uses the app in the DE (Discovery Environment) to run the ML model on new data

Add unique QR codes for each plant

Model weights

24 of 24

Where to host your app?

  • VICE App: launch it when you need it
  • Can easily share your app with collaborators
  • Built-in authentication
  • Connects directly to data store

  • Cost: Subscription model (free basic tier)

  • App always available
  • Can deploy unlimited public apps, but only one private app

  • Previously $250/month, no paid options available currently