1 of 28

Welcome to CoFest 2024!

  • Free, collaborative work event (not just coding!)
  • Held before or after BOSC since 2010
  • This year: July 17-18, hybrid (UQAM and online)

open-bio.org/events/bosc-2024/obf-bosc-collaborationfest-2024/

2 of 28

Land Acknowledgement

We respectfully acknowledge that we are currently meeting on unceded Indigenous lands. The Kanien’kehà:ka Nation is recognized as the custodians of the lands and waters on which we gather today.

3 of 28

Logistics

  • Official opening hours: 9:00 AM-6:00 PM (Montreal time) on both days
  • Onsite: coffee and pizza will be provided to in-person participants (thanks to BOSC sponsors)
  • The toilets do not require a badge to open
  • Let us know if you do not want to appear in any photos

4 of 28

Online channels

ZOOM ROOM FOR DAY 1

ZOOM ROOM FOR DAY 2

SLACK #COFEST @OBF-BOSC

https://app.slack.com/client/T01M4QNJKAL/C01M9N7B02E

5 of 28

Round of introductions

6 of 28

Project/Discussions/Collab Reports, day 1

7 of 28

Project: Hervé

We worked on

You can see/learn more about it on https://www.pizzapizza.ca ;)

We'd very much like your help/feedback with cleaning up the room tonight

Tomorrow, I plan to order more pizza/coffee and work on codefair.io.

8 of 28

Project: CWL v1.3

9 of 28

Project: Taking over the world with data frames

  • Michael Heuer
    • Got up to speed with the oxbow stack; it is very cool!
    • Wrote up an example of BAM files read through oxbow + polars + duckdb and written to genome-partitioned Parquet files (see the sketch after this list)
    • Several nf-core modules for benchmarking FASTA to Parquet completed, incl. seqkit, dsh-bio
    • Learned how to use the nf-test tool for running tests of nf-core modules
    • Ran into several Docker image issues, some of which are still unresolved
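
Below is a minimal sketch of that BAM-to-Parquet pipeline. It assumes a py-oxbow read_bam entry point that returns Arrow IPC bytes; the exact function signature, column names (e.g. rname), and file paths are assumptions for illustration, not the code written during the session.

import io
import os

import duckdb
import oxbow as ox
import polars as pl

# Read one region of a BAM file into Arrow IPC bytes, then into polars.
ipc = ox.read_bam("sample.bam", "chr1:1-1000000")
df = pl.read_ipc(io.BytesIO(ipc))

# The same frame can be queried from DuckDB (replacement scan of in-scope frames).
duckdb.sql("SELECT rname, count(*) AS n_reads FROM df GROUP BY rname").show()

# Write genome-partitioned Parquet: one file per reference sequence.
os.makedirs("out", exist_ok=True)
for part in df.partition_by("rname"):
    rname = part["rname"][0]
    part.write_parquet(f"out/reads-{rname}.parquet")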


10 of 28

Project: Taking over the world with data frames

  • Alejandro (remote) + Nezar
    • py-oxbow currently relies on export of Arrow IPC from Rust
    • Working towards zero-copy export from Rust to Python
    • Want streaming RecordBatches
    • Learning new protocols and abstractions
      • Ecosystem is still evolving and maturing
      • PyCapsules is now the way (arrow-rs, pyo3-arrow)
    • Goal: feed into a polars LazyFrame + DuckDB (see the consumer-side sketch below)
    • Profit!
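
A consumer-side sketch of that handoff, with a toy producer standing in for the Rust exporter. Assumptions: pyarrow >= 14 (for RecordBatchReader.from_stream and the __arrow_c_stream__ PyCapsule method); the producer class and column names are illustrative, not py-oxbow's actual interface.

import duckdb
import polars as pl
import pyarrow as pa

class ToyStreamProducer:
    # Stand-in for a Rust-side exporter: any object whose __arrow_c_stream__
    # returns an Arrow C stream PyCapsule can be consumed without copying.
    def __init__(self, schema, batches):
        self._reader = pa.RecordBatchReader.from_batches(schema, batches)

    def __arrow_c_stream__(self, requested_schema=None):
        return self._reader.__arrow_c_stream__(requested_schema)

schema = pa.schema([("chrom", pa.string()), ("pos", pa.int64())])
batch = pa.record_batch(
    [pa.array(["chr1", "chr1", "chr2"]), pa.array([100, 200, 50], type=pa.int64())],
    schema=schema,
)
producer = ToyStreamProducer(schema, [batch])

# Import the stream via the PyCapsule protocol (no copy at this step), then
# hand it to polars and DuckDB. read_all() materializes here for simplicity;
# the actual goal is streaming RecordBatches.
reader = pa.RecordBatchReader.from_stream(producer)
df = pl.from_arrow(reader.read_all())
duckdb.sql("SELECT chrom, count(*) AS n FROM df GROUP BY chrom").show()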

11 of 28

JBrowse2 Tripal Integration

People:

  • Lacey-Anne Sanderson (Tripal)
  • Colin Diesh (JBrowse)

Goal: Create a JBrowse2 Authentication plugin that uses the Drupal REST API to authenticate users.

Progress:

  • Functional Plugin skeleton
  • Dockerized development
  • Test JBrowse configured
  • Plugin is findable + functional on the JBrowse side
  • SO CLOSE but we are fighting a bit with CORS to use the Drupal REST API

12 of 28

The Saga Continues: JBrowse2 Tripal Embedding

People:

  • Carolyn Caron (Tripal JBrowse)
  • Lacey-Anne Sanderson (Tripal)
  • Colin Diesh (JBrowse)

Goal: Continue the ongoing collaboration to embed a JBrowse instance in a Tripal site

Progress:

  • Set up a Docker image with Tripal, JBrowse, and the embedding module
  • Colin took time to interact with the integration to find outstanding CSS clashes
  • Good next step plans for continuing this collaboration!

13 of 28

Project: Workflow Benchmarking

People:

  • Michael Crusoe
  • Geraldine Van der Auwera

Goal:

Discussed a collaborative effort to provide reference workflows for selected use cases, to serve as a benchmarking and Rosetta Stone resource that helps users understand the differences between workflow languages/frameworks.

14 of 28

Combining JBrowse2 and iCn3D

People:

  • Jiyao Wang
  • Francois Belleau
  • Colin Diesh
  • Philippe Youkharibache
  • Ravi Abrol

Goal

Create a protein features plugin to visualize protein features in JBrowse, inspired by the mygene.info plugin.

Create an Elasticsearch index of mygene.info and RefSeq data to be consumed by the plugin.

15 of 28

Displaying Protein-Ligand Interactions in iCn3D

Jiyao Wang, Ravi Abrol, Philippe Youkharibache

[Diagram: a ligand from PDB entry 3GVU converted from PDB format to an SVG depiction with Open Babel]
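
A rough sketch of the conversion step in the diagram above, assuming the Open Babel Python bindings (pybel) are installed; the file names are placeholders.

from openbabel import pybel

# Read the ligand (previously extracted from PDB entry 3GVU) and write a 2D
# SVG depiction of it; "ligand_3gvu.pdb" / "ligand_3gvu.svg" are placeholders.
mol = next(pybel.readfile("pdb", "ligand_3gvu.pdb"))
mol.write("svg", "ligand_3gvu.svg", overwrite=True)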

16 of 28

Protein-Ligand Interaction in iCn3D

[Screenshots: current display vs. planned future display of protein-ligand interactions in iCn3D]

17 of 28

Project: End-to-End-Open Biomedical AI

We made a Llamafile (running on a local GPU) available on the public Internet.

You can learn more by emailing Sasha (awaitz@post.harvard.edu) or saying hello in the CollaborationFest Slack

We'd very much enjoy working with other teams that want to evaluate their own biomedical LLMs (prompt / model / inference engine / GPU) in a standards-based way, for example as sketched below.
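
As an illustration of what such a standards-based evaluation call looks like, here is a minimal sketch against an OpenAI-compatible chat completions endpoint like the one a llamafile serves; the URL, port, model name, and prompt are placeholders.

import requests

# Placeholder endpoint and model name: point these at the actual llamafile.
API_URL = "http://localhost:8080/v1/chat/completions"

response = requests.post(
    API_URL,
    json={
        "model": "biomedical-llm",  # hypothetical model identifier
        "messages": [
            {"role": "user",
             "content": "Summarize the clinical significance of BRCA1 variants."},
        ],
        "temperature": 0.2,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])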

Tomorrow we plan to try a concrete example with CWL

18 of 28

Describing/Scanning workflows using LLMs

David

Sasha

Nuria

Tazro

$ echo -e "Here is a provenance information of a computational workflow. Please describe the potential security risks of this workflow runs.\n\n$(cat ./ro-crate-metadata.json)" \
    | yllm --api-url http://localhost:11434/v1/chat/completions --model llama3 --stdin

A computational workflow provenance!

After analyzing the provided information, I've identified some potential security risks:

1. **Untrusted Docker Images**: The workflow uses Docker images from untrusted sources (e.g., `ghcr.io/inutano/download-sra:a827d85` and `ncbi/sra-tools:3.0.0`). This could lead to the execution of malicious code or data exfiltration.

2. **Privilege Escalation**: The workflow runs as user `0:0`, which is equivalent to root privileges on Linux systems. This allows the workflow to perform actions that would normally require elevated permissions, potentially leading to privilege escalation attacks.

3. **Data Exposure**: The workflow writes files to `/var/lib/cwl/stg0c84bfcb-9196-4d1e-ac8d-3d2941b157c8/SRR1274307.sra` and other locations, which could lead to sensitive data exposure if not properly secured.

4. **Unvalidated User Input**: The workflow uses user input (e.g., `SRR1274307`) without proper validation or sanitization, which could lead to injection attacks or other security vulnerabilities.

5. **Insecure File Mounting**: The workflow mounts files from untrusted sources using the `--mount` flag, which could lead to file system tampering or data exfiltration.

6. **Unsecured Environment Variables**: The workflow sets environment variables (e.g., `TMPDIR=/tmp`, `HOME=/root`) without proper validation or sanitization, which could lead to security vulnerabilities.

To mitigate these risks, it's essential to:

1. Use trusted Docker images and ensure they are properly secured.

2. Limit the privileges of the workflow to a least-privileged user account.

3. Ensure that sensitive data is properly encrypted and access-controlled.

4. Validate and sanitize all user input before processing.

5. Use secure file mounting mechanisms, such as FUSE or aufs, instead of `--mount`.

6. Set environment variables only for necessary components and ensure they are properly validated and sanitized.

Remember to regularly review and update your workflow's security configuration to prevent potential vulnerabilities.

19 of 28

Day 2

20 of 28

Describing/Scanning workflows using LLMs

Nuria

Using LLMs to describe datasets and potential biases in them

more is available here:

https://scrapbox.io/cofest2024/CoFest_2024

21 of 28

Tataki and the nightmare of file formats

Who

  • Masaki (lead developer of tataki)
  • Hervé
  • Tazro

Goal

  • More automatically generated attributes in a workflow provenance
  • We don’t trust file extensions, and we are nervous enough to want to check that everything really is fine even when a workflow run finished successfully (see the toy sketch at the end of this slide)

What we did

  • tested the implementations with more example inputs
  • started to create an “awesome list” of bioinformatics formats with formal specifications
    • https://github.com/sapporo-wes/tataki/tree/main/awesome-bioinformatics-file-spec
    • curl -sL https://github.com/edamontology/edamontology/releases/download/1.25/EDAM_1.25.tsv | awk 'BEGIN{ FS=OFS="\t" } NR == 1 { print $1,$2,$13 } $1 ~ /format/ && $13 != "" { print $1,$2,$13 }'
      • 230 formats with documentation
  • meaning? i.e., do these documentation links actually point to formal specifications? (checked on the next slide)
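
A toy illustration of the "don't trust extensions" idea (not tataki's implementation): instead of believing a .bam extension, sniff the magic bytes, since a real BAM file is BGZF/gzip-compressed and its decompressed stream starts with the bytes BAM\x01. The file name is a placeholder.

import gzip

def looks_like_bam(path):
    # BGZF is gzip-compatible, so a BAM file must start with the gzip magic...
    with open(path, "rb") as fh:
        if fh.read(2) != b"\x1f\x8b":
            return False
    # ...and its decompressed stream must start with the BAM magic.
    with gzip.open(path, "rb") as fh:
        return fh.read(4) == b"BAM\x01"

print(looks_like_bam("suspicious_output.bam"))  # placeholder file name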

22 of 28

Tataki and the nightmare of file formats (continued)

For each of the first ten EDAM format entries that have a documentation link, this asks llama3 whether the linked page is a formal specification of that format:

$ curl -sL https://github.com/edamontology/edamontology/releases/download/1.25/EDAM_1.25.tsv \
    | awk 'BEGIN{ FS=OFS="\t" } $1 ~ /format/ && $13 != "" { print $1,$2,$13 }' \
    | head \
    | while read line; do
        edam_id=$(echo $line | cut -f1)
        plabel=$(echo $line | cut -f2)
        url=$(echo $line | cut -f3)
        echo $edam_id
        echo "Here is a http source of a web page.\n\n$(curl -sL $url)\n\nIs this web page describing the formal specification of the file format ${plabel}? If your answer is yes, say ${plabel}. If the answer is no, just say no. No explanation is needed.\n\n" \
          | yllm --api-url http://localhost:11434/v1/chat/completions --model llama3 --stdin
      done

http://edamontology.org/format_1196

no

http://edamontology.org/format_1630

No

http://edamontology.org/format_1631

EXP

http://edamontology.org/format_1632

SCF

http://edamontology.org/format_1633

no

http://edamontology.org/format_1960

Staden format.

http://edamontology.org/format_1961

Stockholm format

http://edamontology.org/format_1974

No

http://edamontology.org/format_1975

No

http://edamontology.org/format_1997

PHYLIP format.

23 of 28

Project: CWL v1.3

schema-salad: new release to support CWL v1.3.0-dev1 features

  • Started implementation of the language cleanups

Discussed and advanced many proposals

24 of 28

Project: Codefair

  1. Addressed Hervé’s comments on the documentation: https://docs.codefair.io

  2. Preparing a tutorial for adding a new feature (draft here)

  3. Implemented a beta version of a CWL file validator (see the validation sketch below)
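
One way such validation can be done from Python is to shell out to cwltool --validate, as in the sketch below; this is an assumption for illustration, not necessarily how the Codefair beta validator is implemented, and the workflow path is a placeholder.

import subprocess

def validate_cwl(path):
    # cwltool --validate parses and checks the CWL document without running it.
    result = subprocess.run(
        ["cwltool", "--validate", path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

ok, log = validate_cwl("workflow.cwl")  # placeholder path
print("valid" if ok else f"invalid:\n{log}")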

25 of 28

Project: Codefair

Good CWL file [screenshot]

26 of 28

Project: Codefair

Bad CWL file [screenshot]

27 of 28

Project: Codefair

Bad CWL file [screenshot]

28 of 28

Project: Codefair

Future work:

  • Simplify external contributions to Codefair and add tutorial in documentation

  • Improve the CWL validator and release it in the main app:
    • Support for private repositories
    • Validating all CWL files
    • Use Tataki to detect CWL files?