Navigating the ESG landscape with LLMs
Find clarity in a world of noise
© Datamaran, Ltd. — www.datamaran.com — All rights reserved.
Who are we? What is Datamaran?
Who are we?
Vincent Rizzo
Senior Engineer
Martin Quesada Z.
Senior Data Scientist
Mantis NLP team
Environmental, Social & Governance (ESG)
Environmental challenges, scientists sounding the alarm, social unrest and greenwashing
Corporate reports are scrutinized now more than ever to make sure companies hold to their commitments.
What is Datamaran?
Data-driven & dynamic
Endorsed as best practice by regulators (EFRAG and US SEC) and standard setters (ISSB), Datamaran’s patented technology uses AI to help C-Suite validate ESG priorities - current and emerging.
What is Datamaran?
Data-driven & dynamic
Embed ESG into the DNA of every major company in the world
Our clients
“Datamaran is the most advanced solution in the market,
providing key market intelligence to look around corners and identify risks and opportunities.”
Our Partners:
CSRD:
A use case for Language Models in ESG
Double Materiality
50,000 companies to align disclosures with CSRD in 2024
Source: European Union
How to generate interesting potential Impacts, Risks & Opportunities (IROs)?
What is the shortest path to building our IRO feature?
Step #1
Proof of Concept
Building a minimum-viable IRO with LLMs
Prompt template: “Generate a 15-word-maximum sentence detailing a risk related to {$topic} [...]”
Topic description: “Non-greenhouse gas air emissions that impact air quality, atmospheric conditions and/or human health. [...]”
Output: IRO
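A minimal sketch of how the prompt template above could be filled in and how a generated IRO could be checked against the 15-word limit. The helper names (`build_prompt`, `within_word_limit`) and the topic name "Air quality" are hypothetical, passing the topic description alongside the name is an assumption, and the elided "[...]" part of the prompt is deliberately left out. The LLM call itself is not shown.

```python
# Prompt text from the slide; the elided "[...]" part is intentionally omitted.
PROMPT_TEMPLATE = (
    "Generate a 15-word-maximum sentence detailing a risk related to {$topic}"
)


def build_prompt(topic: str, template: str = PROMPT_TEMPLATE) -> str:
    """Substitute the {$topic} placeholder with a topic string."""
    return template.replace("{$topic}", topic)


def within_word_limit(sentence: str, max_words: int = 15) -> bool:
    """Check the length constraint that the prompt asks the LLM to respect."""
    return len(sentence.split()) <= max_words


# Hypothetical topic name, followed by the topic description from the slide.
topic = (
    "Air quality: Non-greenhouse gas air emissions that impact air quality, "
    "atmospheric conditions and/or human health."
)
prompt = build_prompt(topic)
print(prompt)
```

The word-limit check is useful because an LLM may ignore length instructions; failing outputs can be regenerated or flagged for review.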
Building a minimum-viable IRO with LLMs
argilla
Minimum-viable IRO evaluation
Step #2
Adding Context
How can we leverage existing company reports to improve IRO generation?
Adding Context
Retrieval Augmented Generation (RAG)
Company reports
Report sentences tagged with topics:
“Sea level changes may have a long-term adverse impact on our business, and climate change disclosure requirements may reduce demand on our exchanges.” (Climate Change Risks & Management)
“Our aim is to increase renewable energy use by our primary foundry manufacturing suppliers by 2x from 2020-2025.” (Energy use, conservation & reductions)
Adding Context
“We are on track and ahead on our goals to increase product energy efficiency 10X for client and server microprocessors, respectively, by 2030.”
“Our aim is to increase renewable energy use by our primary foundry manufacturing suppliers by 2x from 2020-2025.”
“Sea level changes may have a long-term adverse impact on our business, and climate change disclosure requirements may reduce demand on our exchanges.”
“Climate change may have a long-term adverse impact on our business, and climate change disclosure requirements may reduce demand on our exchanges.”
“Increase hiring of veterans by at least 23% and military spouses by at least 15% in 2022” (Fair & inclusive workplace)
“Generate a 15-word-maximum sentence detailing a risk related to {$topic}. The risk should use the following sentences as relevant context on the issue from the company’s peers: {$sentences}”
Energy use, conservation & reductions
Climate Change Risks & Management
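A minimal sketch of the retrieval step, assuming retrieval is a plain filter on the topic tag of each report sentence. The `TaggedSentence` type, the function names and the `limit` parameter are hypothetical, and the topic assigned to each example sentence is our reading of the slide. The prompt text is the one shown above.

```python
from typing import List, NamedTuple


class TaggedSentence(NamedTuple):
    text: str
    topic: str


RAG_TEMPLATE = (
    "Generate a 15-word-maximum sentence detailing a risk related to {$topic}. "
    "The risk should use the following sentences as relevant context on the "
    "issue from the company's peers: {$sentences}"
)


def retrieve(corpus: List[TaggedSentence], topic: str, limit: int = 5) -> List[str]:
    """Return up to `limit` sentences whose tag matches the requested topic."""
    return [s.text for s in corpus if s.topic == topic][:limit]


def build_rag_prompt(corpus: List[TaggedSentence], topic: str, limit: int = 5) -> str:
    """Fill both placeholders of the template with the topic and its context."""
    context = " ".join(f'"{text}"' for text in retrieve(corpus, topic, limit))
    return RAG_TEMPLATE.replace("{$topic}", topic).replace("{$sentences}", context)


# Example sentences from the slides; the topic assignments are assumptions.
corpus = [
    TaggedSentence(
        "Sea level changes may have a long-term adverse impact on our business.",
        "Climate Change Risks & Management",
    ),
    TaggedSentence(
        "Our aim is to increase renewable energy use by our primary foundry "
        "manufacturing suppliers by 2x from 2020-2025.",
        "Energy use, conservation & reductions",
    ),
    TaggedSentence(
        "Increase hiring of veterans by at least 23% in 2022",
        "Fair & inclusive workplace",
    ),
]
prompt = build_rag_prompt(corpus, "Climate Change Risks & Management")
print(prompt)
```

Passing every matching sentence does not scale to thousands of report sentences, which is what motivates clustering and summarizing them first.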
Adding Context
We want to cluster report sentences in groups and summarize those
Requires vectorization (transformers, openai)
“Sea level changes may have a long-term adverse impact on our business, and climate change disclosure requirements may reduce demand on our exchanges.” (Climate Change Risks & Management) → [.8, .2, .3, ...]
Adding Context
McInnes L, Healy J. Accelerated Hierarchical Density Based Clustering In: 2017 IEEE International Conference on Data Mining Workshops (ICDMW), IEEE, pp 33-42. 2017
HDBSCAN over other clustering methods:
to DBSCAN and K-Means.
hdbscan
Condensed HDBSCAN cluster tree. https://hdbscan.readthedocs.io/.
Step #3
Curse of Dimensionality
How can we group together large text embeddings without getting lost in their many dimensions?
Curse of dimensionality
HDBSCAN does not work well with more than 100 dimensions
Embedding [.1, .3, -.5, ..., .9, -.7, .8]: d=2000 (HDBSCAN limit: d=100)
hdbscan documentation https://hdbscan.readthedocs.io/en/latest/faq.html#q-i-am-not-getting-the-claimed-performance-why-not
Monika Zagrobelna. How to draw a realistic lion, with Monika Zagrobelna. Published June 2, 2024 in https://community.wacom.com/how-to-draw-lion-monika-zagrobelna/.
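One way to see why distance-based and density-based clustering degrade in high dimensions is distance concentration: the nearest and farthest neighbours of a point end up almost equally far away. The sketch below illustrates this on uniformly random points, which is an assumption made for illustration; it does not run HDBSCAN, and real text embeddings are not uniformly distributed. The function name `relative_contrast` is ours.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 100, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from one point to all others."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = points[0]
    dists = [math.dist(query, p) for p in points[1:]]
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the number of dimensions grows.
for dim in (2, 100, 2000):
    print(f"d={dim}: relative contrast = {relative_contrast(dim):.2f}")
```

When the contrast approaches zero, "dense" and "sparse" regions become hard to tell apart, which is the motivation for reducing dimensions before clustering.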
Matryoshka Representation Learning
Curse of dimensionality
Tom Aarsen, Xenova, Omar Sanseviero. Introduction to Matryoshka Embedding Models. Published February 23, 2024 in https://huggingface.co/blog/matryoshka.
Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., ... & Farhadi, A. (2022). Matryoshka representation learning. Advances in Neural Information Processing Systems, 35, 30233-30249. https://arxiv.org/abs/2205.13147
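With a model trained using Matryoshka Representation Learning, a shorter embedding is obtained by keeping the first components of the full vector and re-normalizing. A minimal sketch follows; the function name and the 8-dimensional example vector are hypothetical, and the truncation is only meaningful for models trained this way.

```python
import math
from typing import List, Sequence


def truncate_embedding(vector: Sequence[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(vector):
        raise ValueError("dim must be between 1 and len(vector)")
    head = list(vector[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    # An all-zero prefix cannot be normalized; return it unchanged.
    return head if norm == 0 else [x / norm for x in head]


# Hypothetical 8-dimensional embedding standing in for a 2000-dimensional one.
full = [0.8, 0.2, 0.3, -0.1, 0.05, 0.4, -0.6, 0.1]
short = truncate_embedding(full, 4)
print(len(short), short)
```

Re-normalizing keeps cosine similarity and dot product equivalent on the truncated vectors.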
More standard techniques
Curse of dimensionality
Leland McInnes, John Healy, James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. 2018.
umap-learn
⚡ UMAP only: [.8, .2, .3, ...] (d=2000) → [.5, .2, .9, .5, .3] (d=5)
🪆 Matryoshka, then UMAP: [.8, .2, .3, ...] (d=2000) → [.3, .7, .1, ...] (d=100) → [.5, .2, .9, .5, .3] (d=5)
More standard techniques
Curse of dimensionality
umap-learn
Topic: Water management for the Financial sector in Europe - Datamaran
Step #4
Combining Steps
An overview of the IRO pipeline and BERTopic
Combining steps
Combining steps with BERTopic
bertopic
https://maartengr.github.io/BERTopic
Step #5
Refining Quality
Refining IRO quality through data quality and human evaluation
Refining Quality
argilla
IRO pipeline evaluation
Refining Quality
1. Dates, numbers, organizations and products are being used by HDBSCAN to determine clusters
span_marker
tomaarsen/span-marker-roberta-large-ontonotes5
“In 2023, we acquired Activision Blizzard. [...] with the Candy Crush franchise representing 77.3% of global earnings.”
“In YEAR, we acquired ORG. [...] with the PRODUCT franchise representing PERCENTAGE of global earnings.”
In 2023, we acquired Activision Blizzard
In 2023 , we acquired Activision Blizzard .
✅
❌
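The masking step shown above can be sketched as a replacement of entity spans by their labels. The spans below are written by hand and stand in for the output of a named-entity model; the `Span` layout and the helper names are hypothetical, and the labels follow the slide (YEAR, ORG).

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, label)


def find_span(text: str, surface: str, label: str) -> Span:
    """Locate the first occurrence of `surface`; stand-in for an NER prediction."""
    start = text.index(surface)
    return (start, start + len(surface), label)


def mask_entities(text: str, spans: List[Span]) -> str:
    """Replace each span with its label, right to left so offsets stay valid."""
    masked = text
    for start, end, label in sorted(spans, reverse=True):
        masked = masked[:start] + label + masked[end:]
    return masked


text = "In 2023, we acquired Activision Blizzard."
spans = [
    find_span(text, "2023", "YEAR"),
    find_span(text, "Activision Blizzard", "ORG"),
]
masked = mask_entities(text, spans)
print(masked)  # In YEAR, we acquired ORG.
```

Working on character offsets of the original string, rather than on re-joined tokens, avoids introducing extra spaces around punctuation.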
Refining Quality
2. Many of the sentences are quasi-duplicates
[Figure: pairwise comparison matrix of Sentence #1, Sentence #2, Sentence #3, Sentence #4, …]
“Sea level changes may have a long-term adverse impact on our business, and climate change disclosure requirements may reduce demand on our exchanges.”
“As mentioned previously, sea level changes may have an adverse impact on our business, and climate change disclosure requirements could reduce demand on our exchanges.”
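A minimal sketch of quasi-duplicate removal through pairwise comparison. The similarity measure (Jaccard overlap of lowercase word sets) and the 0.6 threshold are assumptions chosen for illustration; embedding cosine similarity would fit the same loop. The function names are hypothetical.

```python
import re
from typing import List, Set


def tokens(sentence: str) -> Set[str]:
    """Lowercase alphanumeric word set of a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard(a: str, b: str) -> float:
    """Share of words the two sentences have in common."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def deduplicate(sentences: List[str], threshold: float = 0.6) -> List[str]:
    """Keep a sentence only if it is not too similar to one already kept."""
    kept: List[str] = []
    for sentence in sentences:
        if all(jaccard(sentence, k) < threshold for k in kept):
            kept.append(sentence)
    return kept


first = (
    "Sea level changes may have a long-term adverse impact on our business, and "
    "climate change disclosure requirements may reduce demand on our exchanges."
)
second = (
    "As mentioned previously, sea level changes may have an adverse impact on our "
    "business, and climate change disclosure requirements could reduce demand on "
    "our exchanges."
)
other = (
    "Our aim is to increase renewable energy use by our primary foundry "
    "manufacturing suppliers by 2x from 2020-2025."
)
unique = deduplicate([first, second, other])
print(len(unique))
```

The loop is quadratic in the number of sentences, so for large corpora it is usually applied per company or per topic.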
Human expertise is at the core of data quality
Good IRO
requires
Understanding what a good IRO is
Good clusters
requires
Fine-tuned parameters
requires
Good input sentences
Good definition of what a good candidate sentence is
requires
Good prompts for the LLM generation
Conclusions
🤖 LLMs prove to be very valuable in many scenarios.
📊 However, they are often best used in combination with traditional NLP approaches.
Understanding LLM performance requires proper evaluation metrics and benchmarks.
👩‍💼 Human expertise and criteria are highly valuable!
🔤 Annotation tools (Argilla) help consolidate internal knowledge.
Best-in-class models might not always be what you need.
🪆 Smaller or scalable models like Matryoshka can perform well while reducing compute costs.
Going further
⚡
Performance optimization
Environmental footprint of LLMs.
Scaling up infrastructure.
Speed up dimension reduction.
Pre-calculate what is possible.
🔭
Better context
Include information about the company’s operations, as well as environmental and social issues from relevant areas of operation.
Take into consideration scientific projections.
🔧
Input data quality
Prior classification of IRO sentences.
Use of sentence quality metrics to pre-select sentences.
Sentence deduplication at the company level.
Questions?
London, HQ
Valencia
New York
info@datamaran.com
New York City
Leeuwarden
Issues with UMAP
The more output dimensions, the slower UMAP is. The number of input dimensions also affects run time, but not as drastically.
UMAP uses Numba’s JIT (just-in-time) compilation, so the first call takes longer (orange: JIT first time) than subsequent calls (green: JIT subsequent times).
Experiment available here.
[Figure: Impact of UMAP output dimensions in time. x-axis: output UMAP dimensions; y-axis: time (seconds). Series: 1 thread (slow, deterministic) vs n threads (quick, non-deterministic).]
[Figure: Benchmark of Numba JIT first time / each time (orange) vs after warm-up (green) vs no Numba (blue). x-axis: size of the input data of f(x); y-axis: time (seconds).]
Alternatives to UMAP
Just normalizing the data? It removes the dimension-reduction time, but it hurts quality and HDBSCAN struggles to converge.
PCA gives a 90% speed improvement compared to UMAP, with an average quality impact. Unlike UMAP, PCA doesn’t keep the original topology of the data points.
From a technical perspective, UMAP seems able to capture different semantic areas very well, at the cost of speed.
Multithread
GPU
Times of HDBSCAN
HDBSCAN seems well optimized.
For a total of 1,000 sentences, we measured run times using 5 to 200 dimensions.
Although more dimensions take more time, for dim < 200 clustering takes only around 0.1 seconds.
Fewer dimensions are better, but the impact on run time is small.
Consequently, on speed alone, we found little motivation to move to an alternative.
Multithread
GPU
Matryoshka Loss Function
Aditya Kusupati, Gantavya Bhatt, et al. Matryoshka Representation Learning. 2022.
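For reference, the training objective as we read it in Kusupati et al. (2022): the same encoder output is truncated to each nested size m in a set M, and the per-size losses are summed. Here F is the encoder with parameters θ_F, W^(m) is the classifier for size m, c_m is a per-size weight, and L is the usual classification loss.

```latex
\min_{\{W^{(m)}\}_{m \in \mathcal{M}},\ \theta_F}
\ \frac{1}{N} \sum_{i \in [N]} \sum_{m \in \mathcal{M}}
c_m \cdot \mathcal{L}\!\left( W^{(m)} \cdot F(x_i;\, \theta_F)_{1:m} \,;\ y_i \right)
```

Because every prefix F(x)_{1:m} is trained to be useful on its own, truncating an embedding at inference time keeps most of its quality.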
BERTopic experiments in Streamlit