1 of 56

DRS Alignment with Beacon and Search

GA4GH Connect 2021

Chairs: Max Barkley, Brian O’Connor, Miro Cupak

ga4gh.org

2 of 56

Welcome!

Agenda

  • Search API Introduction (Miro Cupak)
  • DRS Alignment with Search (Ian Fore)
  • DRS Alignment with Beacon V2 (Jordi Rambla)
  • Networks of DRS Services? (Jordi Rambla)
  • DRS+Passports Summary (Max Barkley)

3 of 56

Search API Introduction

Miro Cupak, Jonathan Fuerth

4 of 56

Search

  • API specification for a simple, uniform mechanism to publish, discover, and query biomedical data.
    • Works with any data that fits into rows and columns, including nesting.
  • The API is composed of two principal components:
    • Tables API
      • Exposes tabular data and its data model.
      • Tables are paginated arrays of JSON (“rows” don’t have to be flat)
    • Query API
      • Provides SQL queries over data.

5 of 56

Features

  • Preserves and communicates semantics
    • Provides a rich framework for describing the meaning of the data, without prescribing a fixed data model.
  • Flexible
    • Allows data controllers to make their data available without extensive ETL.
  • Minimal by design
    • The API is purposely kept minimal so that the barriers to publishing existing data are as small as possible.
  • Supports federation
    • Serves as a general-purpose framework for building federatable search-based applications across multiple implementations. Federations reference common schemas and properties.
  • Backend agnostic
    • It is possible to implement the API across a large variety of backend datastores.
  • General purpose
    • Can be used to support existing use cases (e.g. Beacon, Matchmaker, data exploration), admits use cases that have not yet been thought of.

6 of 56

Existing ecosystem

GA4GH Search

  • GET /tables
  • GET /table/{tableName}/info
  • GET /table/{tableName}/data
  • POST /search (optional)

[Figure: ecosystem diagram around GA4GH Search. Labels: Relational database; JSON files in a bucket; VCF+TBI files; Google Sheets; Phenopackets; CSV/TSV files with data dictionaries; Other vaguely rectangular files or APIs; Data Explorer; Jupyter Notebook; Command Line Interface; Beacon; GA4GH Search (federation); Google BigQuery; Other applications; R data frame; FHIR]

7 of 56

Summary

  • Status
    • 1.0.0 ready.
    • Currently going through extended approval process.
  • Materials

8 of 56

API overview

  • Simple API based on REST and JSON (Schema).
  • Built around the concept of a “Table”.
    • Array of JSON objects.
  • 4 endpoints:
    • GET /tables
      • Retrieve a paginated list of tables.
    • GET /table/{id}/info
      • Retrieve the data model (JSON Schema) associated with the given table.
    • GET /table/{id}/data
      • Retrieve paginated data from the given table.
    • POST /search
      • Execute the given SQL query and return the results as a Table. (Optional operation)

9 of 56

Specification overview: /tables

GET /tables

{
  "tables": [
    {
      "name": "drs",
      "data_model": {
        "$ref": "https://example.com/table/drs/info"
      }
    },
    // more tables
  ],
  "pagination": {
    "next_page_url": "https://example.com/tables/catalog/search_drs"
  }
}
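
The `pagination.next_page_url` field in the response above suggests a simple client loop. A minimal sketch, assuming the caller supplies a `fetch_json` function that returns the parsed JSON for a URL; `iter_tables`, `PAGES`, and the second page URL are made up for illustration and are not part of the Search specification.

```python
from typing import Callable, Iterator


def iter_tables(fetch_json: Callable[[str], dict], first_url: str) -> Iterator[dict]:
    """Yield every table entry, following pagination.next_page_url until it is absent."""
    url = first_url
    seen = set()  # guard against a server that links back to an earlier page
    while url and url not in seen:
        seen.add(url)
        page = fetch_json(url)
        yield from page.get("tables", [])
        url = (page.get("pagination") or {}).get("next_page_url")


# Canned pages standing in for a real server (hypothetical URLs and tables).
PAGES = {
    "https://example.com/tables": {
        "tables": [{"name": "drs"}],
        "pagination": {"next_page_url": "https://example.com/tables?page=2"},
    },
    "https://example.com/tables?page=2": {"tables": [{"name": "subjects"}]},
}

names = [t["name"] for t in iter_tables(PAGES.__getitem__, "https://example.com/tables")]
print(names)
```

Passing the fetch function in keeps the loop independent of any particular HTTP library, so the same code can be exercised against canned pages.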

10 of 56

Specification overview: /table/{id}/info

GET `/table/drs/info`

{
  "name": "drs",
  "description": "Table / directory containing DRS links",
  "data_model": {
    "$id": "https://example.com/table/drs/info",
    "$schema": "http://json-schema.org/draft-07/schema#",
    "properties": {
      "id": {
        "type": "string",
        "description": "An identifier specific for this DRS object"
      },
      "file": {
        "$ref": "https://ga4gh.github.io/data-repository-service-schemas/preview/release/drs-1.1.0/swagger.json#/definitions/DrsObject"
      }
      // more attributes
    }
  }
}
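
A client can read the column layout straight from the `data_model`. A minimal sketch, assuming the response has already been parsed into a dict; `describe_columns` is a hypothetical helper and the `$ref` below is a placeholder, not the real DRS schema URL.

```python
def describe_columns(table_info: dict) -> dict:
    """Map each property name to its declared type, or to its $ref target."""
    properties = table_info.get("data_model", {}).get("properties", {})
    return {
        name: spec.get("type") or spec.get("$ref", "unknown")
        for name, spec in properties.items()
    }


# Trimmed-down version of the /table/drs/info response shown above.
TABLE_INFO = {
    "name": "drs",
    "data_model": {
        "properties": {
            "id": {"type": "string"},
            "file": {"$ref": "https://example.com/schemas/DrsObject"},  # placeholder
        }
    },
}

columns = describe_columns(TABLE_INFO)
print(columns)
```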

11 of 56

Specification overview: /table/{id}/data

GET `/table/drs/data`

{
  "data_model": // skipped for brevity, see previous slide
  "data": [
    {
      "id": "file-001",
      "name": "file-001.txt",
      "size": 100,
      "created": "2019-01-01T12:00:01Z",
      "checksums": [],
      "access_methods": [
        {
          "type": "https"
        }
      ]
    }, // more rows
  ]
}
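
Rows do not have to be flat (`access_methods` above is a list of objects), so tools that expect plain columns need a flattening step. A minimal sketch of one possible convention, dotted paths; `flatten_row` is a hypothetical helper and the dotted naming is an assumption, not something the Search specification prescribes.

```python
def flatten_row(value, prefix=""):
    """Flatten nested dicts and lists into one dict keyed by dotted paths.

    Empty lists and dicts contribute no keys, so they vanish from the result.
    """
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        items = enumerate(value)
    else:
        return {prefix: value}
    flat = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten_row(child, path))
    return flat


row = {
    "id": "file-001",
    "size": 100,
    "checksums": [],
    "access_methods": [{"type": "https"}],
}
flat = flatten_row(row)
print(flat)
```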

12 of 56

Specification overview: /search

POST /search

Request body:

{
  "query": "SELECT * FROM drs"
}

Response:

{
  "data": [
    {
      "id": "file-001",
      "name": "file-001.txt",
      "size": 100,
      ...
    },
    // more rows
  ],
  "pagination": {
    "next_page_url": ...
  }
}
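
The request side of `/search` can be sketched with the Python standard library. This only builds the request and does not send it; `build_search_request` and the `https://example.com` base URL are assumptions for illustration.

```python
import json
import urllib.request


def build_search_request(base_url: str, sql: str) -> urllib.request.Request:
    """Build a POST /search request carrying the SQL query as a JSON body."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("https://example.com", "SELECT * FROM drs")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the response can then be paged through via `pagination.next_page_url` in the same way as `/tables`.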

13 of 56

How can Search complement DRS?

[Figure: stages Data Discovery, Cohort Selection, Workflows, Results]

  • Search API: data that includes DRS URLs somewhere; publish non-blob outputs & run stats as Search Tables
  • Cohort: a Search table that contains DRS URLs and relevant Subject/Sample info
  • DRS Server(s): resolve DRS URLs to data

  • By helping to discover DRS URLs for files of interest
  • For grouping, storing, and submitting DRS URLs into workflows accompanied by relevant non-blob data
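
The hand-off from a cohort table to a workflow might look like the sketch below. It assumes hostname-based DRS URIs of the form `drs://<hostname>/<id>`; the column names `subject_id` and `drs_id`, the host `drs.example.org`, and both helper functions are hypothetical.

```python
def drs_uri(hostname: str, object_id: str) -> str:
    """Build a hostname-based DRS URI for one object."""
    return f"drs://{hostname}/{object_id}"


def workflow_inputs(cohort_rows, hostname, id_column="drs_id"):
    """Pair each subject in a cohort table with the DRS URI of its file."""
    inputs = []
    for row in cohort_rows:
        object_id = row.get(id_column)
        if object_id:  # rows without a file are skipped
            inputs.append(
                {"subject": row.get("subject_id"), "file": drs_uri(hostname, object_id)}
            )
    return inputs


cohort = [
    {"subject_id": "S1", "drs_id": "file-001"},
    {"subject_id": "S2", "drs_id": None},
    {"subject_id": "S3", "drs_id": "file-003"},
]
inputs = workflow_inputs(cohort, "drs.example.org")
print(inputs)
```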

14 of 56

15 of 56

DRS Alignment with Search

Ian Fore

16 of 56

DRS Scaling - GitHub Tickets under 342

  • 337 - Bundling and other approaches to mapping accessions/logical entities to DRS/ physical objects
  • 286 - Improve support for containers that contain *lots* of Objects
  • 334 - DRS bulk requests
  • 325 - DRS paging
  • 323 - Clarify DRS bundle translates into a directory on disk

17 of 56

Topics

  • DRS as Physical level protocol vs Logical
  • Bundling/Search - Imaging
  • Data for a DRS id
  • Research Objects
  • FASP & Process

18 of 56

Physical vs Logical - and bundling

  • Hackathon Jan 21
    • Kurt Rodarmer, Kenneth Durbrow, Digant Shah, Ian Fore

19 of 56

Conclusions - Physical vs Logical

  • DRS has primary value as a low level physical protocol
    • Rather than for logical level constructs - project, experiment run etc.
    • Generally valid (see imaging examples)
  • Higher level, application, questions should use schemas and models specific to the domain being supported
    • That follows much existing practice amongst GA4GH participants
    • This does not exclude that the higher level schemas might be referenced by or even included within DRS bundles
  • Bulk operations and pagination in DRS are needed
    • At the fundamental level, not to handle application/logical level concepts.

20 of 56

“The thin middle”

Bob Grossman - Data Commons Framework Services

Carole Goble - Research Objects

21 of 56

Based on: Carole Goble - Research Objects

DRS

SRA Model

Imaging models

Search

22 of 56

Metadata for a DRS ID

  • Search and other capabilities

  • Fasp-scripts /notebooks/drs/DRS File Data

23 of 56

Conclusions - Metadata for a DRS ID

  • Separation of concerns

  • The data you use to Search has the same origin as that which can provide metadata

  • How we define Schema is key
    • The metaschema!

  • Ask whether the use case is real
    • For: I need metadata about this DRS id
    • Is it not the case DRS is used 99% of the time in a context where the other data is already known?

24 of 56

Research Objects - RO

RO-Crate Presentation

25 of 56

26 of 56

27 of 56

28 of 56

29 of 56

Research Objects Relevance to DRS and Search

  • RO have addressed areas identified as needs in DRS
    • Manifests if needed
    • Typing
  • Schema overlap - Search and SchemaBlocks
  • RO is opinionated about Schema
  • Broader applicability than Genomics and Health

30 of 56

What is FASP?

31 of 56

Summary

  • How does Search complement DRS?
    • Search over arbitrary data, get back a table (search result) with subject/sample info and DRS ids in it.
    • Identify specific objects of interest - vs unpacking a bundle
    • Ability to provide metadata about a DRS id.
  • What does Search need from DRS?
    • Canonical link for the types in the DRS API (JSON schema)
      • SchemaBlocks?
    • See DRS ids in Search documentation
    • A reusable approach to Passport integration for GA4GH APIs
  • What does DRS need from Search?
    • We’re listening...

32 of 56

DRS Alignment with Beacon V2

Jordi Rambla

33 of 56

How Beacon has addressed the link to DRS

Handover mechanism

    • Included in Beacon v1.1

Declaring a handover type, label, URL, notes (details and considerations)

Available at different levels

    • For the whole Beacon
    • Per each Dataset
    • Per query results
    • Per variant

34 of 56

Handover in the Beacon+ UI

35 of 56

A handover example in the JSON response

36 of 56

Handover definition

Beacon 1.1

handoverType:
    $ref: '#/components/schemas/HandoverType'
note: string
    description: An optional text including considerations on the handover link provided.
    example: "This handover link provides access to a summarized VCF. To access the VCF containing the details for each sample filling an application is required. See Beacon contact information details."
url: string
    description: URL endpoint to where the handover process could progress (in RFC 3986 format).
    example: "https://api.mygenomeservice.org/handover/9dcc48d7/"

HandoverType:
    description: Handover type, as an Ontology_term object with CURIE syntax for the "id" value.
    id: string
        Use “CUSTOM” for the "id" when no ontology is available.
        example: "EFO:0004157"
    label: string
        This would be the "preferred Label" in the case of an ontology term.
        example: "BAM format"
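
As an illustration of how a Beacon could point at a DRS object with these fields, here is a minimal sketch. The class and function names are hypothetical, the label "DRS object" is an assumption (the definition only says to use "CUSTOM" as the id when no ontology term is available), and the host `drs.example.org` is made up.

```python
from dataclasses import asdict, dataclass


@dataclass
class HandoverType:
    id: str     # CURIE of an ontology term, or "CUSTOM"
    label: str  # preferred label of the term


@dataclass
class Handover:
    handoverType: HandoverType
    url: str
    note: str = ""  # optional free text


def drs_handover(drs_url: str, note: str = "") -> dict:
    """Build a handover entry that points at a DRS object."""
    kind = HandoverType(id="CUSTOM", label="DRS object")  # assumed label
    return asdict(Handover(handoverType=kind, url=drs_url, note=note))


handover = drs_handover(
    "https://drs.example.org/ga4gh/drs/v1/objects/file-001",
    note="Access requires a data access application.",
)
print(handover)
```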

37 of 56

One example leveraging Phenopacket response

38 of 56

How Beacon has addressed the link to DRS - revisited

Handover mechanism

    • Included in Beacon v1.1

Declaring a handover type (DRS type), label, URL, notes (details and considerations)

Available at different levels

    • For the whole Beacon
    • Per each Dataset << DRS object bundle
    • Per query results << DRS file
    • Per variant

39 of 56

Networks of DRS Services?

Jordi Rambla

40 of 56

Question brainstorming

The Discovery goal is to allow the discovery of resources (e.g. variants, cases…)

There are other resources like files, execution services, etc.

In the same way that Beacon shines in the context of a network… could other GA4GH service types benefit from a network

...and in particular from a smart one?

41 of 56

Screen capture courtesy of Jonathan Dursi - CanDiG

42 of 56

ELIXIR Beacon Network specification

  • In the context of 2018 and 2019 ELIXIR Beacon Implementation Studies, the Network specification has been generated …
    • Able to support different topologies: flat, hierarchical, peer-to-peer…
    • Allowing private and public services
    • Able to self-administer securely and with optimized human intervention
    • and agnostic to the specific type of service, thus applicable to any (e.g. AA, DRS, Refget…)
  • The network tasks have been carried out by ELIXIR-ES, ELIXIR-FI, ELIXIR-SE and EBI

[Figure: three network topologies. Hierarchy: Registry 1, Beacon Aggr. 1, B1, R2, BA2, B2, R3, BA3, B3. Flat: Registry 1, Beacon Aggr. 1, B1 to B4. P2P: R1, BA1, B1, R2, BA2, B2, R3, BA3, B3.]

43 of 56

ELIXIR Beacon DP and GA4GH

  • The ELIXIR specification was designed to allow the inclusion of evolving GA4GH standards
  • As a GA4GH Driver Project (DP), ELIXIR Beacon contributes to the Discovery Network requirements as part of the team

[Figure: flat topology with Registry 1, Beacon Aggr. 1, and B1 to B4]

44 of 56

The role of the Aggregator

  • Dispatching the queries and aggregating (sensu lato) the results
  • Customizing the query for heterogeneous servers (e.g. different versions of the standard)
  • Harmonization for heterogeneous responses
  • Single AutN/AutZ point

[Figure: flat topology with Registry 1, Beacon Aggr. 1, and B1 to B4]

45 of 56

DRS+Passport Summary

Max Barkley

46 of 56

The Reality Today

[Figure: a Biomedical Platform UI with Analysis and Results, and steps numbered 1 to 4]

How do we authenticate for API access?

47 of 56

Next Step

We want this flexibility (and more)

This is harder to do if each arrow uses different authorization

[Figure labels: Datasets; Analysis; Results; Workflow; API; Standard Client; arrows numbered 1 to 4]

48 of 56

Authorization in DRS

DRS, like some other GA4GH standards, recommends using OAuth2 bearer tokens (since they predated the Passports spec)

It does not prescribe:

  • How you discover an authorization server
  • How you discover what OAuth scopes are required
  • How you would use GA4GH passports, in cases of federated authorization

It’s good that we started off simple! But now there are use cases that require answering these questions in an automated way

49 of 56

DRS-Passport Use Case

A DRS server holds objects from multiple datasets

Datasets have different authorities granting access

Need a passport with particular visas signed by particular authorities

Possibly also from a particular broker

[Figure labels: DRS with Dataset 1 and Dataset 2; Authority 1, Authority 2, Authority 3; Passport Broker 1, Passport Broker 2]

Where do I get a Passport?

50 of 56

DRS-Passport: What we need

We need to expand the API so a client knows where to go for passports

[Figure labels: DRS with Dataset 1 and Dataset 2; Authority 1, Authority 2, Authority 3; Passport Broker 1, Passport Broker 2]

Go to broker 1 and ask for...

51 of 56

Considerations

Some issues need to be weighed before crafting a solution:

  1. Issues of Scale
    1. Passport Access Token Size
    2. Large Selections of Objects
  2. Scope of Credentials
  3. OAuth and Existing Systems

52 of 56

Issues of Scale: Selections

Requesting DRS objects individually doesn’t scale to workloads on large datasets

We need to coalesce these requests into a single “selection”, that can encompass many objects
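
One way to picture coalescing is to group object ids into a few bounded batches rather than issuing one request per object. A minimal sketch; the `object_ids` payload shape and the default batch size are hypothetical and do not come from the DRS specification.

```python
def batch_selections(object_ids, batch_size=1000):
    """Group object ids into selection payloads of at most batch_size ids each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    ids = list(object_ids)
    return [
        {"object_ids": ids[start:start + batch_size]}  # hypothetical payload shape
        for start in range(0, len(ids), batch_size)
    ]


selections = batch_selections((f"obj{i}" for i in range(2500)), batch_size=1000)
print([len(s["object_ids"]) for s in selections])
```

With 2,500 objects this yields three requests instead of 2,500, at the cost of larger payloads per request.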

[Figure labels: DRS with Dataset 1 and Dataset 2; Authority 1, Authority 2, Authority 3; Passport Broker 1, Passport Broker 2; two "..." markers]

53 of 56

Issues of Scale: Selections

Even with coalescing requests, selection lists can be large

We need to be careful about where we send these payloads and how verbose their descriptions are

[Figure labels: DRS with Dataset 1 and Dataset 2; Authority 1, Authority 2, Authority 3; Passport Broker 1, Passport Broker 2; a selection list /ob1, /ob2, /obj100000; a "?" marker]

54 of 56

Issues of Scale: Token Size

HTTP headers typically have 4K or 8K limits

Passport access tokens containing visas can exceed these limits

We need a proposal that can use Passport access tokens in the body of requests when they exceed header limits
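
The header-versus-body choice could be sketched as follows. This is an illustration of the idea only: the `passports` body field and the function name are hypothetical, the size check looks at the Authorization header alone (servers usually limit the total header size), and the request is built but not sent.

```python
import json
import urllib.request

HEADER_LIMIT_BYTES = 4096  # the lower of the typical limits quoted above


def build_object_request(url, token, header_limit=HEADER_LIMIT_BYTES):
    """Use a bearer header when the token fits, otherwise move it to the body."""
    header_value = f"Bearer {token}"
    if len(header_value.encode("utf-8")) <= header_limit:
        return urllib.request.Request(
            url, headers={"Authorization": header_value}, method="GET"
        )
    body = json.dumps({"passports": [token]}).encode("utf-8")  # hypothetical shape
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}, method="POST"
    )


object_url = "https://drs.example.org/ga4gh/drs/v1/objects/file-001"
small = build_object_request(object_url, "abc")
large = build_object_request(object_url, "x" * 10000)
print(small.get_method(), large.get_method())
```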

55 of 56

Scope of Credentials

Users may end up with passports containing many visas

They may need to down-scope by either:

  1. Getting a passport with a subset of visas (broker down-scoping)
  2. Getting a passport with visas that grant lesser permissions than the originals (visa authority down-scoping)
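
Option 1 (broker down-scoping) amounts to selecting a subset of visas before a new passport is issued. A minimal sketch over already-decoded visa claims; the dataset URLs and the `downscope` helper are hypothetical, and a real broker would also have to issue and sign the resulting passport, which is not shown.

```python
def downscope(visas, wanted_values):
    """Keep only the visas whose value matches one of the wanted datasets."""
    wanted = set(wanted_values)
    return [visa for visa in visas if visa.get("value") in wanted]


# Decoded visa claims with made-up dataset and authority URLs.
visas = [
    {"type": "ControlledAccessGrants", "value": "https://example.org/datasets/1",
     "source": "https://authority1.example.org"},
    {"type": "ControlledAccessGrants", "value": "https://example.org/datasets/2",
     "source": "https://authority2.example.org"},
]
subset = downscope(visas, ["https://example.org/datasets/1"])
print(len(subset), subset[0]["source"])
```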

[Figure labels: DRS with Dataset 1 and Dataset 2; Authority 1, Authority 2, Authority 3; Passport Broker 1, Passport Broker 2; a second DRS with Dataset 3; Token]

I only want to access dataset 1?

56 of 56

OAuth2 and Existing Systems

There are implementations of DRS that accept OAuth2 bearer tokens

Organizations exposing data over HTTP will have pre-existing authorization systems

Designing solutions compatible with existing standards, implementations, and libraries (where possible) makes them easier to adopt

[Figure labels: Token; Authorization Domain; DRS with Dataset 1 and Dataset 2; Auth Server]

How do I handle these tokens in my existing auth system?
