1 of 28

Welcome to CoFest 2024!

  • Free, collaborative work event (not just coding!)
  • Held before or after BOSC since 2010
  • This year: July 17-18, hybrid (UQAM and online)

open-bio.org/events/bosc-2024/obf-bosc-collaborationfest-2024/

2 of 28

Land Acknowledgement

We respectfully acknowledge that we are currently meeting on unceded Indigenous lands. The Kanien’kehà:ka Nation is recognized as the custodians of the lands and waters on which we gather today.

3 of 28

Logistics

  • Official opening hours: 9:00 AM-6:00 PM (Montreal time) on both days
  • Onsite: coffee and pizza will be provided to in-person participants (thanks to BOSC sponsors)
  • The toilets do not require a badge to open
  • Let us know if you do not want to appear in any photos

4 of 28

Online channels

ZOOM ROOM FOR DAY 1

ZOOM ROOM FOR DAY 2

SLACK #COFEST @OBF-BOSC

https://app.slack.com/client/T01M4QNJKAL/C01M9N7B02E

5 of 28

Round of introductions

6 of 28

Project/Discussions/Collab Reports, day 1

7 of 28

Project: Hervé

We worked on

You can see/learn more about it on https://www.pizzapizza.ca ;)

We'd very much like your help/feedback with cleaning up the room tonight

Tomorrow, I plan to order more pizza/coffee and work on codefair.io.

8 of 28

Project: CWL v1.3

9 of 28

Project: Taking over the world with data frames

  • Michael Heuer
    • Got up to speed with the oxbow stack; it is very cool!
    • Wrote up an example of BAM files read through oxbow + polars + duckdb and written to genome-partitioned Parquet files (see the sketch after this list)
    • Several nf-core modules for benchmarking FASTA to Parquet completed, incl. seqkit, dsh-bio
    • Learned how to use the nf-test tool for running tests of nf-core modules
    • Ran into several Docker image issues, some of which are still unresolved
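
Below is a minimal sketch of that BAM-to-Parquet pipeline. It assumes a py-oxbow read_bam entry point that returns Arrow IPC bytes; the exact function signature, column names (e.g. rname), and file paths are assumptions for illustration, not the code written during the session.

import io
import os

import duckdb
import oxbow as ox
import polars as pl

# Read one region of a BAM file into Arrow IPC bytes, then into polars.
ipc = ox.read_bam("sample.bam", "chr1:1-1000000")
df = pl.read_ipc(io.BytesIO(ipc))

# The same frame can be queried from DuckDB (replacement scan of in-scope frames).
duckdb.sql("SELECT rname, count(*) AS n_reads FROM df GROUP BY rname").show()

# Write genome-partitioned Parquet: one file per reference sequence.
os.makedirs("out", exist_ok=True)
for part in df.partition_by("rname"):
    rname = part["rname"][0]
    part.write_parquet(f"out/reads-{rname}.parquet")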


10 of 28

Project: Taking over the world with data frames

  • Alejandro (remote) + Nezar
    • py-oxbow currently relies on export of Arrow IPC from Rust
    • Working towards zero-copy export from Rust to Python
    • Want streaming RecordBatches
    • Learning new protocols and abstractions
      • Ecosystem is still evolving and maturing
      • PyCapsules is now the way (arrow-rs, pyo3-arrow)
    • Goal: feed into a polars LazyFrame + DuckDB (see the consumer-side sketch below)
    • Profit!
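
A consumer-side sketch of that handoff, with a toy producer standing in for the Rust exporter. Assumptions: pyarrow >= 14 (for RecordBatchReader.from_stream and the __arrow_c_stream__ PyCapsule method); the producer class and column names are illustrative, not py-oxbow's actual interface.

import duckdb
import polars as pl
import pyarrow as pa

class ToyStreamProducer:
    # Stand-in for a Rust-side exporter: any object whose __arrow_c_stream__
    # returns an Arrow C stream PyCapsule can be consumed without copying.
    def __init__(self, schema, batches):
        self._reader = pa.RecordBatchReader.from_batches(schema, batches)

    def __arrow_c_stream__(self, requested_schema=None):
        return self._reader.__arrow_c_stream__(requested_schema)

schema = pa.schema([("chrom", pa.string()), ("pos", pa.int64())])
batch = pa.record_batch(
    [pa.array(["chr1", "chr1", "chr2"]), pa.array([100, 200, 50], type=pa.int64())],
    schema=schema,
)
producer = ToyStreamProducer(schema, [batch])

# Import the stream via the PyCapsule protocol (no copy at this step), then
# hand it to polars and DuckDB. read_all() materializes here for simplicity;
# the actual goal is streaming RecordBatches.
reader = pa.RecordBatchReader.from_stream(producer)
df = pl.from_arrow(reader.read_all())
duckdb.sql("SELECT chrom, count(*) AS n FROM df GROUP BY chrom").show()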

11 of 28

JBrowse2 Tripal Integration

People:

  • Lacey-Anne Sanderson (Tripal)
  • Colin Diesh (JBrowse)

Goal: Create a JBrowse2 Authentication plugin that uses the Drupal REST API to authenticate users.

Progress:

  • Functional Plugin skeleton
  • Dockerized development
  • Test JBrowse configured
  • Plugin is findable + functional on the JBrowse side
  • SO CLOSE but we are fighting a bit with CORS to use the Drupal REST API

12 of 28

The Saga Continues: JBrowse2 Tripal Embedding

People:

  • Carolyn Caron (Tripal JBrowse)
  • Lacey-Anne Sanderson (Tripal)
  • Colin Diesh (JBrowse)

Goal: Continue the ongoing collaboration to embed a JBrowse instance in a Tripal site

Progress:

  • Set up a Docker image with Tripal, JBrowse, and the embedding module
  • Colin took time to interact with the integration to find outstanding CSS clashes
  • Good next step plans for continuing this collaboration!

13 of 28

Project: Workflow Benchmarking

People:

  • Michael Crusoe
  • Geraldine Van der Auwera

Goal:

Discussed a collaborative effort to provide reference workflows for selected use cases, to serve as a benchmarking and Rosetta Stone resource that helps users understand the differences between workflow languages/frameworks.

14 of 28

Combining JBrowse2 and iCn3D

People:

  • Jiyao Wang
  • Francois Belleau
  • Colin Diesh
  • Philippe Youkharibache
  • Ravi Abrol

Goal

Create a protein features plugin to visualize protein features in JBrowse, inspired by the mygene.info plugin.

Create an Elasticsearch index of mygene.info and RefSeq data to be consumed by the plugin.

15 of 28

Displaying Protein-Ligand Interactions in iCn3D

Jiyao Wang, Ravi Abrol, Philippe Youkharibache

[Diagram: a ligand from PDB entry 3GVU converted from PDB format to an SVG depiction with Open Babel]
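
A rough sketch of the conversion step in the diagram above, assuming the Open Babel Python bindings (pybel) are installed; the file names are placeholders.

from openbabel import pybel

# Read the ligand (previously extracted from PDB entry 3GVU) and write a 2D
# SVG depiction of it; "ligand_3gvu.pdb" / "ligand_3gvu.svg" are placeholders.
mol = next(pybel.readfile("pdb", "ligand_3gvu.pdb"))
mol.write("svg", "ligand_3gvu.svg", overwrite=True)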

16 of 28

Protein-Ligand Interaction in iCn3D

[Screenshots: current display vs. planned future display of protein-ligand interactions in iCn3D]

17 of 28

Project: End-to-End-Open Biomedical AI

We made a Llamafile (running on a local GPU) available on the public Internet.

You can learn more by emailing Sasha (awaitz@post.harvard.edu) or saying hello in the CollaborationFest Slack

We'd very much enjoy working with other teams that want to evaluate their own biomedical LLMs (prompt / model / inference engine / GPU) in a standards-based way, for example as sketched below.
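
As an illustration of what such a standards-based evaluation call looks like, here is a minimal sketch against an OpenAI-compatible chat completions endpoint like the one a llamafile serves; the URL, port, model name, and prompt are placeholders.

import requests

# Placeholder endpoint and model name: point these at the actual llamafile.
API_URL = "http://localhost:8080/v1/chat/completions"

response = requests.post(
    API_URL,
    json={
        "model": "biomedical-llm",  # hypothetical model identifier
        "messages": [
            {"role": "user",
             "content": "Summarize the clinical significance of BRCA1 variants."},
        ],
        "temperature": 0.2,
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])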

Tomorrow we plan to try a concrete example with CWL

18 of 28

Describing/Scanning workflows using LLMs

David

Sasha

Nuria

Tazro

$ echo -e "Here is a provenance information of a computational workflow. Please describe the potential security risks of this workflow runs.\n\n$(cat ./ro-crate-metadata.json)" \
    | yllm --api-url http://localhost:11434/v1/chat/completions --model llama3 --stdin

A computational workflow provenance!

After analyzing the provided information, I've identified some potential security risks:

1. **Untrusted Docker Images**: The workflow uses Docker images from untrusted sources (e.g., `ghcr.io/inutano/download-sra:a827d85` and `ncbi/sra-tools:3.0.0`). This could lead to the execution of malicious code or data exfiltration.

2. **Privilege Escalation**: The workflow runs as user `0:0`, which is equivalent to root privileges on Linux systems. This allows the workflow to perform actions that would normally require elevated permissions, potentially leading to privilege escalation attacks.

3. **Data Exposure**: The workflow writes files to `/var/lib/cwl/stg0c84bfcb-9196-4d1e-ac8d-3d2941b157c8/SRR1274307.sra` and other locations, which could lead to sensitive data exposure if not properly secured.

4. **Unvalidated User Input**: The workflow uses user input (e.g., `SRR1274307`) without proper validation or sanitization, which could lead to injection attacks or other security vulnerabilities.

5. **Insecure File Mounting**: The workflow mounts files from untrusted sources using the `--mount` flag, which could lead to file system tampering or data exfiltration.

6. **Unsecured Environment Variables**: The workflow sets environment variables (e.g., `TMPDIR=/tmp`, `HOME=/root`) without proper validation or sanitization, which could lead to security vulnerabilities.

To mitigate these risks, it's essential to:

1. Use trusted Docker images and ensure they are properly secured.

2. Limit the privileges of the workflow to a least-privileged user account.

3. Ensure that sensitive data is properly encrypted and access-controlled.

4. Validate and sanitize all user input before processing.

5. Use secure file mounting mechanisms, such as FUSE or aufs, instead of `--mount`.

6. Set environment variables only for necessary components and ensure they are properly validated and sanitized.

Remember to regularly review and update your workflow's security configuration to prevent potential vulnerabilities.

19 of 28

Day 2

20 of 28

Describing/Scanning workflows using LLMs

Nuria

Using LLMs to describe datasets and potential biases in them

more is available here:

https://scrapbox.io/cofest2024/CoFest_2024

21 of 28

Tataki and the nightmare of file formats

Who

  • Masaki (lead developer of tataki)
  • Hervé
  • Tazro

Goal

  • More automatically generated attributes in a workflow provenance
  • We don’t trust file extensions, and we are nervous enough to want to check that everything really is fine even when a workflow run finished successfully (see the toy sketch at the end of this slide)

What we did

  • tested the implementations with more example inputs
  • started to create an “awesome list” of bioinformatics formats with formal specifications
    • https://github.com/sapporo-wes/tataki/tree/main/awesome-bioinformatics-file-spec
    • curl -sL https://github.com/edamontology/edamontology/releases/download/1.25/EDAM_1.25.tsv | awk 'BEGIN{ FS=OFS="\t" } NR == 1 { print $1,$2,$13 } $1 ~ /format/ && $13 != "" { print $1,$2,$13 }'
      • 230 formats with documentation
  • meaning? i.e., do these documentation links actually point to formal specifications? (checked on the next slide)
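
A toy illustration of the "don't trust extensions" idea (not tataki's implementation): instead of believing a .bam extension, sniff the magic bytes, since a real BAM file is BGZF/gzip-compressed and its decompressed stream starts with the bytes BAM\x01. The file name is a placeholder.

import gzip

def looks_like_bam(path):
    # BGZF is gzip-compatible, so a BAM file must start with the gzip magic...
    with open(path, "rb") as fh:
        if fh.read(2) != b"\x1f\x8b":
            return False
    # ...and its decompressed stream must start with the BAM magic.
    with gzip.open(path, "rb") as fh:
        return fh.read(4) == b"BAM\x01"

print(looks_like_bam("suspicious_output.bam"))  # placeholder file name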

22 of 28

Tataki and the nightmare of file formats (continued)

For each of the first ten EDAM format entries that have a documentation link, this asks llama3 whether the linked page is a formal specification of that format:

$ curl -sL https://github.com/edamontology/edamontology/releases/download/1.25/EDAM_1.25.tsv \
    | awk 'BEGIN{ FS=OFS="\t" } $1 ~ /format/ && $13 != "" { print $1,$2,$13 }' \
    | head \
    | while read line; do
        edam_id=$(echo $line | cut -f1)
        plabel=$(echo $line | cut -f2)
        url=$(echo $line | cut -f3)
        echo $edam_id
        echo "Here is a http source of a web page.\n\n$(curl -sL $url)\n\nIs this web page describing the formal specification of the file format ${plabel}? If your answer is yes, say ${plabel}. If the answer is no, just say no. No explanation is needed.\n\n" \
          | yllm --api-url http://localhost:11434/v1/chat/completions --model llama3 --stdin
      done

http://edamontology.org/format_1196

no

http://edamontology.org/format_1630

No

http://edamontology.org/format_1631

EXP

http://edamontology.org/format_1632

SCF

http://edamontology.org/format_1633

no

http://edamontology.org/format_1960

Staden format.

http://edamontology.org/format_1961

Stockholm format

http://edamontology.org/format_1974

No

http://edamontology.org/format_1975

No

http://edamontology.org/format_1997

PHYLIP format.

23 of 28

Project: CWL v1.3

schema-salad: new release to support CWL v1.3.0-dev1 features

  • Started implementation of the language cleanups

Discussed and advanced many proposals

24 of 28

Project: Codefair

  1. Addressed Hervé’s comments on the documentation: https://docs.codefair.io

  2. Preparing a tutorial for adding a new feature (draft here)

  3. Implemented a beta version of a CWL file validator (see the validation sketch below)
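
One way such validation can be done from Python is to shell out to cwltool --validate, as in the sketch below; this is an assumption for illustration, not necessarily how the Codefair beta validator is implemented, and the workflow path is a placeholder.

import subprocess

def validate_cwl(path):
    # cwltool --validate parses and checks the CWL document without running it.
    result = subprocess.run(
        ["cwltool", "--validate", path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stdout + result.stderr

ok, log = validate_cwl("workflow.cwl")  # placeholder path
print("valid" if ok else f"invalid:\n{log}")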

25 of 28

Project: Codefair

Good CWL file [screenshot]

26 of 28

Project: Codefair

Bad CWL file [screenshot]

27 of 28

Project: Codefair

Bad CWL file [screenshot]

28 of 28

Project: Codefair

Future work:

  • Simplify external contributions to Codefair and add tutorial in documentation

  • Improve the CWL validator and release it in the main app:
    • Support for private repositories
    • Validating all CWL files
    • Use Tataki to detect CWL files?