1 of 14

Improving Large Molecular Language Models via Relation-aware Multimodal Collaboration

Jinyoung Park, Minseong Bae, Jeehye Na, Hyunwoo J. Kim

Korea Advanced Institute of Science and Technology


MLV Lab

2 of 14


Introduction

> Large Language Models

3 of 14

Introduction

Molecule description generation

Could you give me a brief overview of this molecule?

IUPAC name prediction

What is the IUPAC name of the molecule?

Chemical reaction prediction

Please suggest a potential product based on the given reactants and reagents.

Molecule description generation

The molecule is the hydrogenmaleate salt of O-(cyclohexanecarbonyl)lysergol…

Property prediction

Please provide the energy separation between the highest occupied and lowest unoccupied molecular orbitals (HOMO-LUMO gap) of this molecule.

IUPAC name prediction

The molecule's IUPAC name is 2-amino-1-phenylethanol.

Property prediction

0.1913

Chemical reaction prediction

O=[N+1]([O-1])C1=CC(CO)=C(F)C=C1F

LMLM

> Assistant for molecular reasoning

4 of 14

Introduction

Prior models typically process molecular data (1D strings, 2D graphs, 3D conformations) in isolation or fuse them shallowly, which prevents them from leveraging the complementary properties of different modalities.

Existing evaluation frameworks for molecule-language models generally rely on generic text metrics such as BLEU and ROUGE, which are "molecule-agnostic" and thus fail to assess chemical correctness.

5 of 14

Method

CoLLaMo integrates 1D sequences, 2D graphs, and 3D conformations into a shared token space using relation-aware attention to capture structural and spatial details of molecules.
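The idea of relation-aware attention can be sketched as biasing the attention logits between molecule tokens with pairwise relation scores (e.g., derived from 2D bond connectivity or 3D inter-atomic distances). The sketch below is a minimal single-head illustration under that assumption; the function name and the use of tokens as their own values are illustrative choices, not the paper's implementation.

```python
import math

def relation_aware_attention(tokens, relation_bias):
    """Single-head attention where a pairwise relation score (e.g., from
    2D bond connectivity or 3D distances) is added to the dot-product
    logits before the softmax. Illustrative sketch, not CoLLaMo's code."""
    n, d = len(tokens), len(tokens[0])
    out = []
    for i in range(n):
        # Attention logits: scaled dot product plus the relation bias term.
        logits = [
            sum(tokens[i][k] * tokens[j][k] for k in range(d)) / math.sqrt(d)
            + relation_bias[i][j]
            for j in range(n)
        ]
        # Numerically stable softmax over the logits.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of value vectors (values = tokens here, for brevity).
        out.append([sum(w * tokens[j][k] for j, w in enumerate(weights))
                    for k in range(d)])
    return out
```

A strongly positive bias toward one token steers every query's output toward that token's vector, which is how structural relations can guide cross-modal information exchange.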

6 of 14

Method

CHARM/RCHARM quantifies the proportion of mentioned molecular entities that are factually incorrect or not grounded in the input molecule.
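The core of such a metric can be illustrated as a set comparison: the fraction of entities mentioned in a generated description that are not grounded in the input molecule. This is only a sketch of the proportion computation under that assumption; the actual CHARM/RCHARM metric (including how chemical entities are extracted and matched) is defined in the paper, and the function name is illustrative.

```python
def hallucination_rate(mentioned_entities, grounded_entities):
    """Fraction of mentioned entities not grounded in the input molecule,
    using case-insensitive exact matching. A CHARM-style illustration,
    not the official metric."""
    mentioned = {e.strip().lower() for e in mentioned_entities}
    grounded = {e.strip().lower() for e in grounded_entities}
    if not mentioned:
        return 0.0  # nothing mentioned, so nothing can be hallucinated
    return len(mentioned - grounded) / len(mentioned)
```

For example, a caption mentioning a hydroxyl group, a benzene ring, and a nitro group, for a molecule that actually contains only the hydroxyl and nitro groups, would score 1/3.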

GPT-based evaluation verifies the generated output based on two criteria: factual informativeness and alignment with the ground truth.

7 of 14

Experiment

Quantitative Results

CoLLaMo outperforms the baseline models, including GPT-based models such as GPT-4, GPT-4o, and o1-mini.

8 of 14

Experiment

Quantitative Results

Furthermore, CoLLaMo is effective on multiple property QA benchmark datasets, owing to its capability to capture 1D, 2D, and 3D molecular modalities.

9 of 14

Experiment

Quantitative Results

CoLLaMo achieves the highest scores under both automatic and LLM-based evaluations, demonstrating superior molecule understanding and generation quality.

10 of 14

Experiment

Quantitative Results

The table demonstrates that integrating 1D, 2D, and 3D modalities through the modality-collaborative projector consistently improves performance across all tasks except motif counting.

11 of 14

Experiment

Quantitative Results

The ablation study shows that every component contributes to CoLLaMo's performance improvement.

12 of 14

Experiment

Quantitative Results

The results reveal that even when only a single modality is available at inference, CoLLaMo maintains strong performance, demonstrating its robustness, whereas the model without the Co-Projector degrades significantly.

13 of 14

Qualitative Results

CoLLaMo accurately identifies the molecule as 3-hydroxy fatty acyl-CoA(4−) by integrating complementary cues from 1D, 2D, and 3D modalities, whereas single-modality models produce incorrect descriptions.

14 of 14

Summary

  • We propose a modality-collaborative projector equipped with relation-aware modality-collaborative attention, which facilitates relation-guided information exchange by integrating 2D structural and 3D spatial relations between atoms.

  • We introduce CoLLaMo, a large molecular language model based on a modality-collaborative projector, which integrates molecular multimodal representations (1D SELFIES, 2D graphs, and 3D conformations) into unified molecule tokens to fully leverage diverse aspects of molecule information.

  • We present a molecule-centric evaluation framework for LMLMs, including an automatic hallucination assessment metric and a GPT-based caption quality evaluator, addressing the limitations of conventional token-based metrics (e.g., BLEU).
