Hacking groups
BioHackathon 2024
Group topic
Participants
Description
Use this slide as a template
← Put your names here to participate in
Use the bold font for group lead
← Describe objectives, requirements etc.
(Group leader is responsible for this and presentation)
We will start presentations and reviews of the hacking projects at 11:00
Suggested target domains
R1
Human genotype to phenotype
Genome variation
Participants
Motivation
Widespread use of sequencers has flooded human genome analysis with data. Simple variations such as SNVs and indels are being integrated in TogoVar, but structural variation has not even been standardized yet. Meanwhile, pangenome graphs have emerged as a powerful tool for integrating multiple haplotypes. During this hackathon, we would like to discuss how to handle this heterogeneous data in a unified manner.
Description (Write down your proposal here.)
Join the #genome-variation channel on Slack.
Pangenome Graphs Database (PGD)
Participants
Description
Integrating facial analysis into PubCaseFinder
Slack Channel: #pubcasefinder_gestaltmatcher
Participants
Description
HPO suggest
Participants
Description
Objectives
- Given one or more HPO terms, suggest one or more additional HPO terms based on the log of PubCaseFinder queries
- Analyze PubCaseFinder queries for any biases or usage patterns to better understand users
- bias in branches of HPO being searched for, JP vs EN, IP/geography, terms in particular order
Methods
- Data cleaning (deduplicate sequential queries from same user?)
- Use date, time, or IPs related to each query
- Create a matrix of the co-occurrences between HPO terms
- Calculate conditional probabilities given the frequency of each HPO and its combinations.
- Measure performance against searches from other time periods
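The co-occurrence and conditional-probability steps above can be sketched as follows. This is a minimal illustration, not the group's implementation: the query log is made-up toy data, the function names `build_counts` and `suggest` are hypothetical, and HPO IDs are treated as opaque strings.

```python
from collections import Counter
from itertools import combinations


def build_counts(queries):
    """Count single-term frequencies and pairwise co-occurrences."""
    term_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        terms = sorted(set(query))  # ignore repeated terms within one query
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    return term_counts, pair_counts


def suggest(given, term_counts, pair_counts, top_n=3):
    """Rank candidates by P(candidate | given) = count(both) / count(given)."""
    scores = {}
    for (a, b), n in pair_counts.items():
        if a == given:
            scores[b] = n / term_counts[given]
        elif b == given:
            scores[a] = n / term_counts[given]
    # Highest probability first; ties broken by term ID for stable output
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Toy query log (assumption): each query is a list of HPO term IDs.
queries = [
    ["HP:0001250", "HP:0001263"],
    ["HP:0001250", "HP:0001263", "HP:0000252"],
    ["HP:0001250", "HP:0000252"],
    ["HP:0001263"],
]
term_counts, pair_counts = build_counts(queries)
suggestions = suggest("HP:0001250", term_counts, pair_counts)
print(suggestions)
```

Evaluating against another time period would then amount to building the counts from one period and checking how often the suggested terms appear in queries from the other.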
Note: We are tracking our activities and progress in this Google doc
Hidden-Rad ontology
Participants
Description
Connecting healthcare data
Participants
→ Can we connect this to clinical trials, using a clinical trial (CT) as the starting point and expanding from it?
Description
Annotations of clinical trials
Participants
→ Can we connect this to “Connecting healthcare data” by looking at how we can expand from the basic elements of a clinical trial (CT): drug, indication, symptoms, phenotypes? For instance, what pathways are clinical trials about? What environmental factors link to a disease?
Description
Brainstorming gdoc here and brainstorming Slack channel: #clinical_trials
Visualization for cohort data
Participants
Description
Visualization for HPO and MP
Participants
Description
Objectives:
- To visualize HPO and MP in a much easier way
Methods
- Data extraction from the RIKEN MetaDatabase SPARQL endpoint
https://knowledge.brc.riken.jp/bioresource/sparql
- Create Visualization and WebApp
https://drive.google.com/drive/folders/1XmcCRT1iwGOfRL9QZOmY6F4UknK3uGEf
Achievement (Both HPO and MP)
- Data extraction
For each node within each subclass (first-tier category):
① leaf or not? ② number of layers ③ path to the node ④ overlapping nodes
- Visualization in a tree structure
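The four per-node properties listed above can be computed with a short traversal. The sketch below is only an illustration on a toy hierarchy: the term names are hypothetical, the hierarchy is assumed to be acyclic, and "overlapping" is interpreted here as "reachable through more than one path", which is an assumption about the slide's meaning.

```python
def node_metrics(children, root):
    """Collect leaf status, depth, paths and overlap for every node under root."""
    paths = {}  # node -> list of paths, each a list of node names from the root

    def walk(node, path):
        path = path + [node]
        paths.setdefault(node, []).append(path)
        for child in children.get(node, []):
            walk(child, path)

    walk(root, [])
    return {
        node: {
            "is_leaf": not children.get(node),
            "depth": min(len(p) - 1 for p in node_paths),  # shortest distance from root
            "paths": node_paths,
            "overlapped": len(node_paths) > 1,  # reached via more than one path
        }
        for node, node_paths in paths.items()
    }


# Toy hierarchy (assumption): parent -> list of children; "D" has two parents.
children = {
    "Root": ["A", "B"],
    "A": ["C", "D"],
    "B": ["D"],
}
metrics = node_metrics(children, "Root")
for name, info in metrics.items():
    print(name, info["is_leaf"], info["depth"], info["overlapped"])
```

In practice the `children` mapping would be filled from subclass relations returned by the SPARQL endpoint rather than written by hand.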
Future Direction
- Develop WebApp for Ontology curation
(Selecting or Adding term)
→1:For mapping HPO and MP
→2:For curation of HPO with clinicians
R2
Other organisms
Viral phylogenomics
Participants
Problem : Species trees are usually built using sets of universal marker genes. Viruses don’t have universal genes!
Proposal : Cluster gene trees by topology, build species trees for taxonomic groups with compatible gene trees.
Motivation : Species trees are the starting point for studying recombination among viruses and their hosts, testing models of species concepts in viruses, illuminating the origin of cellular and viral life, and many other things.
Dataset : IMG/VR (https://img.jgi.doe.gov/cgi-bin/vr/main.cgi) is the largest collection of viral genomes, with 5,576,197 genomes and MAGs in 2,917,521 vOTUs spanning all clades of the viral world.
Workflow : 木槌 (kizuchi) (https://github.com/ryneches/kizuchi/) uses prodigal-gv, hmmer, mafft, trimal, and fasttree to generate gene trees.
e.g. : https://ggdc.dsmz.de/victor.php
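As a rough illustration of the proposal to cluster gene trees by topology, the sketch below groups trees whose rooted, clade-based Robinson-Foulds distance is within a threshold. It is not taken from kizuchi: it assumes rooted trees over the same set of genomes, written as nested tuples, whereas real gene trees cover different genome sets and are often unrooted. All names (`clades`, `rf_distance`, `cluster_by_topology`, the `v1`..`v4` labels) are hypothetical.

```python
def clades(tree):
    """Return the non-trivial clades of a rooted tree as frozensets of leaf names.

    A tree is a nested tuple; a leaf is a string.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by every tree
    return found


def rf_distance(t1, t2):
    """Rooted Robinson-Foulds distance: clades present in exactly one tree."""
    return len(clades(t1) ^ clades(t2))


def cluster_by_topology(trees, max_distance=0):
    """Greedily group trees whose distance to a cluster's first member is small."""
    clusters = []  # each cluster is a list of tree names
    for name, tree in trees.items():
        for cluster in clusters:
            if rf_distance(trees[cluster[0]], tree) <= max_distance:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Toy gene trees (assumption): geneA and geneB differ only in child order.
trees = {
    "geneA": ((("v1", "v2"), "v3"), "v4"),
    "geneB": (("v3", ("v2", "v1")), "v4"),
    "geneC": ((("v1", "v3"), "v2"), "v4"),
}
clusters = cluster_by_topology(trees)
print(clusters)
```

Each resulting cluster would be a candidate set of compatible gene trees from which a species tree for that taxonomic group could be built.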
Cultivation media & phenotypic traits
Participants
Description
R3
Broader life sciences
GLYCO and all things sweet!
Participants
Description
Human Glycome Atlas (HGA)
Participants
Description
PubChem ⇔ Nikkaji Alignment
Participants
Description
Plant Breeding Ontology(PBO)
Participants
Description
Japanese Food Ontology
Participants
Description
BH24 Wikiblitz (fun and sidetopic)
Participants
What is a Wikiblitz?
A Wikiblitz combines Wikidata/Commons with a Bioblitz:
• A Bioblitz is a communal effort to record as many species as possible within a specific location and time.
Why Participate?
• Your observations, under an open license, can be reused.
• Using Wikidata, we link these observations to the semantic web.
• You might even discover a species not yet observed!
Join Us!
• Slack: #wikiblitz
• iNaturalist: Biohackathon 2024 Project
Let’s explore and contribute together! (Maybe interesting? https://www.earthmetabolome.org/)
R4
Data analysis and methods
Hindsight/best practices
Participants
Description
Getting shapes from large RDF inputs
Participants
Description:
We aim to automatically extract RDF shapes (ShEx, SHACL) from large data sources. To address scalability challenges, we've developed a solution that involves splitting the input source into manageable slices and then merging the resulting schemas. However, 1) this is just one approach, and 2) we need to enhance the subsetting process to ensure the subgraphs are as complete as possible while remaining manageable by commodity hardware.
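The slice-then-merge idea can be sketched with a deliberately simplified notion of a shape: each class is mapped to the set of predicates its instances use. This is an assumption-laden toy, not the group's solution and not how any particular shape extractor works; the function names and the `ex:` data are hypothetical. Slicing by subject keeps all triples of a node together, which is one simple way to keep each subgraph complete for the nodes it contains.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"


def slice_by_subject(triples, n_slices):
    """Split triples into slices, keeping all triples of one subject together."""
    slices = [[] for _ in range(n_slices)]
    slot = {}  # subject -> slice index, assigned round-robin on first sight
    for s, p, o in triples:
        if s not in slot:
            slot[s] = len(slot) % n_slices
        slices[slot[s]].append((s, p, o))
    return slices


def infer_shapes(triples):
    """Toy shape inference: class -> predicates used by its instances."""
    types = defaultdict(set)
    preds = defaultdict(set)
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s].add(o)
        else:
            preds[s].add(p)
    shapes = defaultdict(set)
    for s, classes in types.items():
        for cls in classes:
            shapes[cls] |= preds[s]
    return dict(shapes)


def merge_shapes(schemas):
    """Merge per-slice schemas by taking the union of predicates per class."""
    merged = defaultdict(set)
    for schema in schemas:
        for cls, predicates in schema.items():
            merged[cls] |= predicates
    return dict(merged)


# Toy data (assumption): a few triples written as plain tuples.
triples = [
    ("ex:g1", RDF_TYPE, "ex:Gene"),
    ("ex:g1", "ex:label", "gene one"),
    ("ex:g2", RDF_TYPE, "ex:Gene"),
    ("ex:g2", "ex:locatedOn", "ex:chr1"),
    ("ex:p1", RDF_TYPE, "ex:Protein"),
    ("ex:p1", "ex:label", "protein one"),
]
slices = slice_by_subject(triples, n_slices=2)
merged = merge_shapes(infer_shapes(part) for part in slices)
print(merged)
```

For this toy definition the merged result equals the schema inferred from the whole input; with real shape features such as cardinalities or acceptance thresholds, merging is less straightforward.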
We would appreciate assistance with:
docker run -p 5000:5000 gm-api
Visualize sheXer results
Participants
Introducing sheXer: Automate Your RDF Schema Inference!
• What is sheXer?
A Python library that automatically infers RDF schemas.
• Current Features:
• Outputs ShEx, SHACL, and PlantUML visualizations.
• Our Goal:
• Enhance schema visualizations with new and diverse visualization backends.
Slack: #shexer
Using (discovered) schema
Participants
Description
LLM-SPARQL
We propose to develop an LLM-assisted SPARQL query answering system
Tasks:
Participants
SPARQL - Schema conversions
Building blocks identified as part of the LLM SPARQL project
Challenge 1: ShEx → NL Question + SPARQL :
To compare with the other direction: NL Question → SPARQL
SPARQL query benchmark
Challenge 2: SPARQL → ShEx:
Goal: Schema extraction
Challenge 3: Compare between ShEx schemas
Goal: Schema mapping
Challenge 4: Visualize ShEx schemas
Goal: Help humans understand the schemas
Participants:
Jose Labra,
Hikaru Nagazumi,
Eric Prud’hommeaux,
Claude Nanjo
Andra
Yasunori Yamamoto
Dani
Gos
…???
[Diagram labels: Schema (ShEx); RDF Data Endpoint; SPARQL queries; Generated Schemas (ShEx); "= ?"; NL queries]
Slack channel:
#sparql_schema_conversions
LLM-assisted BioSample curation
Participants
Description
{
  "accession": "SAMN15915146",
  "Matrigel_Passages": "0",
  "isolate": "SW480",
  "organism": "Homo sapiens",
  "replicate": "1",
  "tissue": "cell line",
  "title": "Human sample from Homo sapiens"
}
{
  "cell line": "SW480"
}
Characterize Biology use of LLMs
The interaction with LLMs presents a new way of using computers and should be studied directly.
The WildChat dataset includes 1 million real-world uses of ChatGPT, including many biological questions. Let’s find out what people used it for (and whether it was any good!)
Participants:
David
Hirokazu
Tazro [interested]
Toshiaki [interested]
Susumu [interested]
Pitiporn (Sam) [interested]
Ruby coding with help of LLMs
Participants
Description
Slack channel: #ruby
UMAP all the APIs
Make a visual representation of metadata about DBCLS services
Participants:
Data quality
Participants
Description
Enhancement of the Bioresource Retrieval System using ChatGPT
Participants
Objectives:
Tasks:
Expected Results:
Mass Spectrum Viewer
Participants
Description
https://github.com/masspp
Workflow and Container helpdesk
Participants
Description
and workflows!
Help !!
Slack channel: #workflows
Revisiting SRAmetadb.sqlite
Participants: Nishad
Description
SRAmetadb.sqlite is an offline SQLite database that assembles Sequence Read Archive (SRA) metadata. It is used by the SRAdb R package and the pysradb Python CLI tool to query SRA metadata. However, it is updated infrequently (the last update was in late 2023), and no public tools are available for rebuilding or updating it. This project aims to create an open-source pipeline for generating and updating a similar SRAmetadb.sqlite database from SRA metadata.
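A minimal sketch of the build-and-update step, using only Python's standard `sqlite3` module, might look like the following. The single `run` table, its columns, and the accessions are simplified placeholders for illustration, not the actual SRAmetadb.sqlite schema or real records.

```python
import sqlite3


def create_schema(conn):
    """Create a simplified, hypothetical run-level metadata table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS run ("
        "run_accession TEXT PRIMARY KEY, "
        "experiment_accession TEXT, "
        "organism TEXT)"
    )


def load_records(conn, records):
    """Insert new records; replace rows whose run_accession already exists."""
    conn.executemany(
        "INSERT OR REPLACE INTO run "
        "VALUES (:run_accession, :experiment_accession, :organism)",
        records,
    )
    conn.commit()


# Placeholder metadata batches (assumption), e.g. parsed from periodic SRA dumps.
first_batch = [
    {"run_accession": "SRR0000001", "experiment_accession": "SRX0000001",
     "organism": "unknown"},
]
second_batch = [
    {"run_accession": "SRR0000001", "experiment_accession": "SRX0000001",
     "organism": "Homo sapiens"},
    {"run_accession": "SRR0000002", "experiment_accession": "SRX0000002",
     "organism": "Mus musculus"},
]

conn = sqlite3.connect(":memory:")  # use a file path to produce a real .sqlite file
create_schema(conn)
load_records(conn, first_batch)
load_records(conn, second_batch)  # updates one existing row and adds a new one
rows = conn.execute(
    "SELECT run_accession, organism FROM run ORDER BY run_accession"
).fetchall()
print(rows)
```

Keying on the accession means the same loader serves both the initial build and later incremental updates.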
Rationale:
Directions:
Additionals
Request for tutorials
Group photo
Let’s take a group photo in YUMORI before lunch!