1 of 45

Navigating the ESG landscape with LLMs


Find clarity in a world of noise

© Datamaran, Ltd. — www.datamaran.com — All rights reserved.

2 of 45

Who are we? What is Datamaran?


3 of 45

Who are we?


Vincent Rizzo

Senior Engineer

Martin Quesada Z.

Senior Data Scientist

Mantis NLP team


4 of 45

Environmental, Social & Governance (ESG)

Environmental challenges, scientists sounding the alarm, social unrest and greenwashing

Corporate reports are scrutinized now more than ever to make sure companies hold to their commitments.


5 of 45

What is Datamaran?

Data-driven & dynamic

Endorsed as best practice by regulators (EFRAG and US SEC) and standard setters (ISSB), Datamaran’s patented technology uses AI to help the C-suite validate ESG priorities, both current and emerging.


6 of 45

What is Datamaran?

Data-driven & dynamic


Embed ESG into the DNA of every major company in the world


7 of 45

Our clients

“Datamaran is the most advanced solution in the market, providing key market intelligence to look around corners and identify risks and opportunities.”

Our Partners:


8 of 45

CSRD:

A use case for Language Models in ESG


  • Context: The Double Materiality
  • Impacts, Risks & Opportunities
    • Proof of Concept
    • Adding Context
    • Curse of Dimensionality
    • Combining Steps
    • Refining Quality
  • Conclusions


9 of 45

Double Materiality


50,000 companies to align disclosures with CSRD in 2024


10 of 45

Double Materiality


11 of 45

How to generate interesting potential Impacts, Risks & Opportunities (IROs)?


12 of 45

What is the shortest path to building our IRO feature?

Step #1

Proof of Concept


13 of 45

Building a minimum-viable IRO with LLMs

Prompt template: “Generate a 15-word-maximum sentence detailing a risk related to {$topic} [...]”

Topic description: “Non-greenhouse gas air emissions that impact air quality, atmospheric conditions and/or human health. [...]”

Output: IRO
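As a rough illustration of the template step above, the sketch below fills the `{$topic}` placeholder with a topic description and checks a candidate sentence against the 15-word limit. The function names are hypothetical, the elided part of the template ("[...]") is left out rather than guessed, and the call to an actual LLM is deliberately omitted.

```python
# Visible part of the slide's prompt template; the elided "[...]" is not reconstructed.
PROMPT_TEMPLATE = (
    "Generate a 15-word-maximum sentence detailing a risk related to {$topic}"
)


def build_prompt(topic_description: str, template: str = PROMPT_TEMPLATE) -> str:
    """Substitute the topic description for the {$topic} placeholder."""
    return template.replace("{$topic}", topic_description)


def within_word_limit(sentence: str, max_words: int = 15) -> bool:
    """Check a generated IRO against the word limit requested in the prompt."""
    return len(sentence.split()) <= max_words


topic = (
    "Non-greenhouse gas air emissions that impact air quality, "
    "atmospheric conditions and/or human health."
)
prompt = build_prompt(topic)
print(prompt)
```

A check such as `within_word_limit` is useful because an LLM does not always respect length instructions; sentences that fail it could be regenerated or flagged for review.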


14 of 45

Building a minimum-viable IRO with LLMs


argilla

Minimum-viable IRO evaluation


15 of 45

Step #2

Adding Context

How can we leverage existing company reports to improve IRO generation?


16 of 45

Adding Context

Retrieval Augmented Generation (RAG)

Company reports → report sentences tagged with topics:

  • “Sea level changes may have a long-term adverse impact on our business, and climate change disclosure requirements may reduce demand on our exchanges.” (topic: Climate Change Risks & Management)
  • “Our aim is to increase renewable energy use by our primary foundry manufacturing suppliers by 2x from 2020-2025.” (topic: Energy use, conservation & reductions)


17 of 45

Adding Context

Report sentences:

  • “We are on track and ahead on our goals to increase product energy efficiency 10X for client and server microprocessors, respectively, by 2030.”
  • “Our aim is to increase renewable energy use by our primary foundry manufacturing suppliers by 2x from 2020-2025.”
  • “Sea level changes may have a long-term adverse impact on our business, and climate change disclosure requirements may reduce demand on our exchanges.”
  • “Climate change may have a long-term adverse impact on our business, and climate change disclosure requirements may reduce demand on our exchanges.”
  • “Increase hiring of veterans by at least 23% and military spouses by at least 15% in 2022” (topic: Fair & inclusive workplace)

“Generate a 15-word-maximum sentence detailing a risk related to {$topic}. The risk should use the following sentences as relevant context on the issue from the company’s peers: {$sentences}”

  • Too many relevant sentences to include them all: we may sample and inject some in the LLM prompt.
  • But this leaves us with a very narrow view of the company’s context.
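The sampling approach described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name, the sample size `k` and the fixed seed are assumptions, and the retrieval of topic-tagged sentences is assumed to have happened already.

```python
import random

# Prompt template from the slide, with straight quotes for the code.
RAG_TEMPLATE = (
    "Generate a 15-word-maximum sentence detailing a risk related to {$topic}. "
    "The risk should use the following sentences as relevant context on the "
    "issue from the company's peers: {$sentences}"
)


def build_rag_prompt(topic, sentences, k=3, seed=0):
    """Sample up to k tagged sentences and inject them into the prompt.

    Returns the prompt and the sampled sentences so the sample can be inspected.
    """
    rng = random.Random(seed)  # fixed seed (assumption) for reproducible samples
    sample = rng.sample(sentences, min(k, len(sentences)))
    context = " ".join(f'"{s}"' for s in sample)
    prompt = RAG_TEMPLATE.replace("{$topic}", topic).replace("{$sentences}", context)
    return prompt, sample


peer_sentences = [
    "Sea level changes may have a long-term adverse impact on our business.",
    "Climate change disclosure requirements may reduce demand on our exchanges.",
    "Our aim is to increase renewable energy use by our suppliers.",
    "We are on track and ahead on our goals to increase product energy efficiency.",
]
prompt, sample = build_rag_prompt("Climate Change Risks & Management", peer_sentences)
print(prompt)
```

Because only `k` sentences reach the model, each generation sees a small slice of the available context, which is the limitation the second bullet points out.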

Energy use, conservation & reductions

Climate Change Risks & Management


18 of 45

Adding Context

We want to cluster report sentences into groups and summarize those groups.

Requires vectorization (transformers, openai):

“Sea level changes may have a long-term adverse impact on our business, and climate change disclosure requirements may reduce demand on our exchanges.” (topic: Climate Change Risks & Management) → [.8, .2, .3, ...]


19 of 45

Adding Context


McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017

HDBSCAN over other clustering methods:

  • Easier hyperparameter tuning compared to DBSCAN and K-Means.
  • It prioritizes dense clusters.

hdbscan

Condensed HDBSCAN cluster tree. https://hdbscan.readthedocs.io/.


20 of 45

Step #3

Curse of Dimensionality

How can we group together large text embeddings without getting lost in their many dimensions?


21 of 45

Curse of dimensionality


HDBSCAN does not work well with more than 100 dimensions

Embedding vector [.1 .3 -.5 ... .9 -.7 .8]: 2000 dimensions, against a limit of about 100.
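One common way to see why high-dimensional embeddings are hard to cluster is distance concentration: as dimensions grow, the nearest and farthest neighbours of a point end up almost equally far away, which weakens any density-based notion of a cluster. The sketch below illustrates this general effect on uniformly random points (an assumption made for simplicity; real sentence embeddings are not uniform, and this is not a test of HDBSCAN itself).

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (max - min) / min over distances from one query point to random points.

    Values near zero mean that all points look roughly equidistant.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return (max(distances) - min(distances)) / min(distances)


for dim in (2, 100, 2000):
    print(dim, round(relative_contrast(dim), 3))
```

The printed contrast shrinks as the dimension increases, which is the motivation for reducing the embeddings before clustering.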

hdbscan documentation https://hdbscan.readthedocs.io/en/latest/faq.html#q-i-am-not-getting-the-claimed-performance-why-not

Monika Zagrobelna. How to draw a realistic lion, with Monika Zagrobelna. Published June 2, 2024 in https://community.wacom.com/how-to-draw-lion-monika-zagrobelna/.


22 of 45

Matryoshka Representation Learning

Curse of dimensionality


Tom Aarsen, Xenova, Omar Sanseviero. Introduction to Matryoshka Embedding Models. Published February 23, 2024 in https://huggingface.co/blog/matryoshka.

  • We use Matryoshka embeddings with 100 features, which minimize dimensions while preserving most of the information present in the original 2000-dimensional embeddings.
  • These embeddings are then further reduced to 5 dimensions by UMAP.
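With a model trained for Matryoshka representations, a smaller embedding is typically obtained by keeping the leading components of the full vector and re-normalizing. The sketch below shows that truncation step on a stand-in vector; the function name is hypothetical, the random vector only stands in for a real 2000-dimensional embedding, and the approach is only meaningful for models trained with Matryoshka Representation Learning.

```python
import math
import random


def truncate_embedding(vector, dim=100):
    """Keep the first `dim` components and L2-normalize the result."""
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to normalize
    return [x / norm for x in head]


rng = random.Random(0)
full_embedding = [rng.uniform(-1.0, 1.0) for _ in range(2000)]  # stand-in for a model output
small_embedding = truncate_embedding(full_embedding, dim=100)
print(len(full_embedding), "->", len(small_embedding))
```

Re-normalizing keeps cosine similarity well defined on the shortened vectors, so downstream steps such as UMAP can treat them like any other embedding.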

Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., ... & Farhadi, A. (2022). Matryoshka representation learning. Advances in Neural Information Processing Systems, 35, 30233-30249. https://arxiv.org/abs/2205.13147


23 of 45

More standard techniques

Curse of dimensionality

  • To use HDBSCAN effectively, we want to reduce our embeddings from 2000 dimensions to 5. To this end, we use a dimension reduction method: UMAP.
  • UMAP is particularly good at preserving both local and global structure.
  • However, even methods like UMAP often struggle with embeddings that have thousands of features.

Leland McInnes, John Healy, James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. 2018.

umap-learn

Without Matryoshka: [.8 .2 .3 ...] (d=2000) → [.5 .2 .9 .5 .3] (d=5)

With Matryoshka 🪆: [.8 .2 .3 ...] (d=2000) → [.3 .7 .1 ...] (d=100) → [.5 .2 .9 .5 .3] (d=5)


24 of 45

More standard techniques

Curse of dimensionality


Topic: Water management for the Financial sector in Europe - Datamaran


25 of 45

Step #4

Combining Steps

An overview of the IRO pipeline and BERTopic


26 of 45

Combining steps


27 of 45

Combining steps with BERTopic


bertopic

https://maartengr.github.io/BERTopic

  • Modular

  • Comes with default hyperparameters


31 of 45

Step #5

Refining Quality

Refining IRO quality through data quality and human evaluation


32 of 45

Refining Quality


argilla

IRO pipeline evaluation


33 of 45

Refining Quality


1. Dates, numbers, organizations and products are being used by HDBSCAN to determine clusters

  • We apply anonymization using the Named Entity Recognition (NER) framework SpanMarker.

  • Prior normalization + separation of punctuation marks.

span_marker model: tomaarsen/span-marker-roberta-large-ontonotes5

Anonymization example:

“In 2023, we acquired Activision Blizzard. [...] with the Candy Crush franchise representing 77.3% of global earnings.”

“In YEAR, we acquired ORG. [...] with the PRODUCT franchise representing PERCENTAGE of global earnings.”

Punctuation separation example:

In 2023, we acquired Activision Blizzard.

In 2023 , we acquired Activision Blizzard .
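The two text transformations shown above can be sketched with the standard library alone. In this illustration the entity spans are supplied by hand as hypothetical `(start, end, placeholder)` tuples; in the pipeline they would come from the NER model, whose output format and label names are not reproduced here. The rule that punctuation between two digits stays attached (so "77.3%" is not split) is also an assumption.

```python
import re

# Punctuation is separated unless it sits between two digits (e.g. "77.3").
_PUNCT = re.compile(r"(?<![0-9])([.,;:!?])|([.,;:!?])(?![0-9])")


def separate_punctuation(text):
    """Put spaces around punctuation marks and collapse repeated whitespace."""
    spaced = _PUNCT.sub(r" \1\2 ", text)
    return " ".join(spaced.split())


def anonymize(text, entities):
    """Replace each (start, end, placeholder) span with its placeholder.

    Spans are applied from the end of the text so earlier offsets stay valid.
    """
    for start, end, placeholder in sorted(entities, reverse=True):
        text = text[:start] + placeholder + text[end:]
    return text


sentence = "In 2023, we acquired Activision Blizzard."
year_start = sentence.index("2023")
org_start = sentence.index("Activision Blizzard")
spans = [
    (year_start, year_start + len("2023"), "YEAR"),
    (org_start, org_start + len("Activision Blizzard"), "ORG"),
]
print(separate_punctuation(sentence))
print(anonymize(sentence, spans))
```

Replacing entities with generic placeholders removes the surface cues (dates, numbers, company and product names) that would otherwise pull unrelated sentences into the same cluster.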


34 of 45

Refining Quality

2. Many of the sentences are quasi-duplicates

  • We removed duplicates using pairwise Levenshtein distance for the incoming sentences.

Example quasi-duplicates:

“Sea level changes may have a long-term adverse impact on our business, and climate change disclosure requirements may reduce demand on our exchanges.”

“As mentioned previously, sea level changes may have an adverse impact on our business, and climate change disclosure requirements could reduce demand on our exchanges.”

Figure: pairwise comparison matrix of Sentence #1, #2, #3, #4, ..., reduced to Sentence #1, ..., Sentence #4, ...
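A minimal sketch of deduplication with pairwise Levenshtein distance is shown below. The distance is normalized by the longer sentence's length, and the threshold `max_ratio` is an assumption chosen for illustration; the value used in the actual pipeline is not stated on the slide.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def deduplicate(sentences, max_ratio=0.2):
    """Keep a sentence only if it is not within max_ratio of an already kept one."""
    kept = []
    for sentence in sentences:
        is_duplicate = any(
            levenshtein(sentence, other) / max(len(sentence), len(other), 1) <= max_ratio
            for other in kept
        )
        if not is_duplicate:
            kept.append(sentence)
    return kept


first = (
    "Sea level changes may have a long-term adverse impact on our business, and "
    "climate change disclosure requirements may reduce demand on our exchanges."
)
second = (
    "As mentioned previously, sea level changes may have an adverse impact on our "
    "business, and climate change disclosure requirements could reduce demand on "
    "our exchanges."
)
ratio = levenshtein(first, second) / max(len(first), len(second))
print(round(ratio, 3))
```

The pairwise comparison is quadratic in the number of sentences, so for large batches it is worth comparing only within a topic or company, or switching to an optimized library.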


35 of 45

Human expertise is at the core of data quality

Dependency chart (items are linked by "requires" arrows on the slide):

  • Good IRO
  • Understanding what a good IRO is
  • Good clusters
  • Fine-tuned parameters
  • Good input sentences
  • Good definition of what a good candidate sentence is
  • Good prompts for the LLM generation


36 of 45

Conclusions


37 of 45

Conclusions

🤖 LLMs prove to be very valuable in many scenarios.

📊 However, they are often best utilized in combination with other traditional NLP approaches.

Understanding the performance of LLMs requires proper evaluation metrics and benchmarks.

👩‍💼 Human expertise and criteria are highly valuable!

🔤 Annotation tools (Argilla) help consolidate internal knowledge.

Best-in-class models might not always be what you need.

🪆 Smaller or scalable models like Matryoshka can perform well while reducing compute costs.


38 of 45

Going further


Performance optimization

Environmental footprint of LLMs.

Scaling up infrastructure.

Speed up dimension reduction.

Pre-calculate what is possible.

🔭 Better context

Include information about the company’s operations, as well as environmental and social issues from relevant areas of operation.

Take into consideration scientific projections.

🔧 Input data quality

Prior classification of IRO sentences.

Use of sentence quality metrics to pre-select sentences.

Sentence deduplication at the company level.


39 of 45

Questions?


London, HQ

Valencia

New York

info@datamaran.com

New York City

Leeuwarden


40 of 45

Issues with UMAP

The more output dimensions, the slower UMAP is. The number of input dimensions also affects speed, but not so drastically.

UMAP uses Numba’s JIT (just-in-time) compilation, so the first call takes more time (orange: JIT first time) but subsequent calls do not (green: JIT subsequent times).

Experiment available here.

Figure: Impact of UMAP output dimensions on time. Axes: output UMAP dimensions vs. time (seconds). Series: 1 thread (slow, deterministic) and n threads (quick, non-deterministic).

Figure: Benchmark of Numba JIT first time / each time (orange) vs. after warm-up (green) vs. no Numba (blue). Axes: size of the input data of f(x) vs. time (seconds).


41 of 45

Alternatives to UMAP

  • Can we skip it by using only Matryoshka or just normalizing the data?

This removes the UMAP time, but it hurts quality and HDBSCAN struggles to converge.

  • Can we replace it with, for example, PCA?

PCA gives a 90% speed improvement compared to UMAP with an average quality impact. Unlike UMAP, PCA does not keep the original topology of the data points.

  • Clusters retrieved by UMAP

From a technical perspective, UMAP seems able to capture different semantic areas very well, at the cost of speed.

Multithread

GPU


42 of 45

Times of HDBSCAN

HDBSCAN seems well optimized.

For a total of 1000 sentences, we measured clustering times using 5 to 200 dimensions.

Although more dimensions take more time, with fewer than 200 dimensions clustering takes only about 0.1 seconds.

We conclude that, although fewer dimensions are better, the impact on speed is small.

Consequently, considering speed alone, we found little motivation to move to any alternative.

Multithread

GPU


43 of 45

Matryoshka Loss Function


Aditya Kusupati, Gantavya Bhatt, et al. Matryoshka Representation Learning. 2022.
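For reference, the training objective in the cited paper can be written roughly as below. The notation is a paraphrase and should be checked against the paper: $\mathcal{M}$ is the set of nested embedding sizes, $c_m$ is the weight given to size $m$, $F(x_i;\theta_F)_{1:m}$ denotes the first $m$ components of the embedding of input $x_i$, $W^{(m)}$ is the classifier applied to that prefix, and $\mathcal{L}$ is the classification loss against label $y_i$.

```latex
\min_{\{W^{(m)}\}_{m \in \mathcal{M}},\ \theta_F}
\ \frac{1}{N} \sum_{i \in [N]} \sum_{m \in \mathcal{M}}
c_m \cdot \mathcal{L}\!\left( W^{(m)} \cdot F(x_i; \theta_F)_{1:m} \,;\ y_i \right)
```

Because every prefix length in $\mathcal{M}$ is trained to be useful on its own, truncating an embedding to one of those lengths keeps most of its information.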


45 of 45

BERTopic experiments in Streamlit

  • UMAP, PCA or no reduction?
  • HDBSCAN, DBSCAN or K-Means?
  • How does the number of dimensions affect the results?
  • Which hyperparameter configuration (merging force, strategies, algorithms) is optimal?
  • Easily analyze and compare experiments in Streamlit
