1 of 32

Understanding The Semantic Toolkit

Steps:

  1. Converting PDF reports
  2. Creating a semantic dictionary
  3. Annotating HTML with the dictionary

2 of 32

Using Semantic Toolkit for Climate

Slides 1-2: Summary

Slides 3-11: Understanding Climate Reports

Slides 12-13: Problems with PDF

Slides 14-20: Overview of the semantic toolkit

Slide 21: Advantages of Colab notebooks

Slides 22-29: DIY

3 of 32

IPCC

4 of 32

What is the IPCC?

Regular assessments for policymakers, providing an in-depth evaluation of the current state of knowledge on climate change through assessment reports.

5 of 32

IPCC Sixth Assessment Report

50 Chapters

10,000 Pages

PDF

6 of 32

IPCC AR6 Reports

7 of 32

  • IPCC/ar6/wg1 report with 12 chapters
  • IPCC/ar6/wg2 report with 18 chapters
  • IPCC/ar6/wg3 report with 17 chapters

Chapters in different working groups

8 of 32

Structure of the IPCC Data Corpus

  • Each of the seven reports in AR6 further contains chapters plus a Technical Summary (TS), a Summary for Policymakers (SPM), a Glossary, etc.

  • Each chapter is the size of a small book;

  • Chapter 8 of WG3, for instance, has about 92 pages.

9 of 32

WGIII Report

10 of 32

Example Chapter

Chapter 8: Urban Systems and Other Settlements

11 of 32

Typical IPCC Chapter

Chapter

Executive Summary

FAQ

Text

Figures

Boxes

Tables

References

Executive Summary (15) and FAQs (3)

12 of 32

Information to save the world is locked in PDF reports

  • 10,000 pages;
  • technical
  • not editable

But it’s the most important document ever

I can’t read all of that!

Text Data-Mining can help!

IPCC Reports

13 of 32

Machines can help, but…

Really?

They can’t make sense of PDFs

Hey, #SemanticClimate can help!

We need tools!

14 of 32

The #semanticClimate team has developed tools to liberate knowledge from climate reports

  • py4ami (old version) / pyamihtmlx (new version)

(py4ami/pyamihtmlx: this package converts PDF into HTML, runs search queries, and annotates HTML)

  • pygetpapers (for research papers)

(pygetpapers: a tool to assist text miners. It makes requests to open-access scientific text repositories, analyses the hits, and systematically downloads the articles without further interaction.)

  • docanalysis

(docanalysis: a command-line tool that ingests corpora (CProjects) and carries out text analysis of documents, including:

  • sectioning
  • NLP/text-mining
  • dictionary generation)
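As a toy illustration of the kind of sectioning docanalysis performs, the sketch below splits an HTML chapter into sections keyed by its headings. This is not the docanalysis implementation, just a minimal stdlib example of the idea.

```python
from html.parser import HTMLParser

class Sectioner(HTMLParser):
    """Split an HTML document into sections keyed by <h2> headings."""
    def __init__(self):
        super().__init__()
        self.sections = {}       # heading text -> body text
        self.current = None      # heading we are collecting under
        self.in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_heading = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_heading = False

    def handle_data(self, data):
        if self.in_heading:
            self.current = data.strip()
            self.sections[self.current] = ""
        elif self.current:
            self.sections[self.current] += data

html = "<h2>Executive Summary</h2><p>Cities matter.</p><h2>FAQ</h2><p>Q1.</p>"
parser = Sectioner()
parser.feed(html)
print(sorted(parser.sections))  # ['Executive Summary', 'FAQ']
```

On a real converted chapter, each section (Executive Summary, FAQ, references, ...) can then be mined separately.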

15 of 32

py4ami

docanalysis

IPCC chapters,

IPBES reports,

HTML

How our TDM tools liberate knowledge

…despite mitigation efforts 8 including those in near-universal nationally determined contributions (NDCs)…

docanalysis

py4ami

Hegde SN, Garg A, Murray-Rust P, Mietchen D (2022) Mining the literature for ethics statements: a step towards standardizing research ethics. ARPHA Preprints. https://doi.org/10.3897/arphapreprints.e94687

Peter Murray-Rust

16 of 32

py4ami

docanalysis

Abbreviation dictionary

Climate-related terms dictionary

IPCC chapters,

IPBES reports,

HTML

How our TDM tools liberate knowledge

…despite mitigation efforts 8 including those in near-universal nationally determined contributions (NDCs)…

17 of 32

DICTIONARIES

MANUAL DICTIONARY

KEYWORD/KEYPHRASES

ABBREVIATION DICTIONARY

HTML

TEXT

Annotated HTML

py4ami

rake/yake, gensim

docanalysis

py4ami

Dictionary: Collection of words and their meanings
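The keyword/keyphrase route in the diagram uses rake/yake/gensim. The sketch below shows only the underlying idea with a crude frequency ranking (it is not rake or yake, and the stopword list is a made-up minimal one):

```python
from collections import Counter
import re

# tiny illustrative stopword list, not a real NLP resource
STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "for", "on"}

def keywords(text, n=3):
    """Rank words by frequency after dropping stopwords --
    a crude stand-in for rake/yake/gensim keyphrase extraction."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]

text = ("Urban systems shape emissions. Urban mitigation options "
        "reduce emissions in urban settlements.")
print(keywords(text))
```

The top-ranked terms are candidates for dictionary entries, which are then reviewed by a human.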

18 of 32

Manual dictionary: This is done by manually reading the PDF
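A manually curated dictionary can be recorded as XML in the ami dictionary style; the fragment below is illustrative only (attribute names and the Wikidata ID should be checked against the semanticClimate repositories, though Q7942 is, to our knowledge, Wikidata's item for global warming):

```xml
<dictionary title="climate_terms">
  <!-- each entry pairs a term with its Wikidata identifier -->
  <entry term="global warming" wikidataID="Q7942"/>
  <entry term="net zero emissions" wikidataID="Q..."/>
</dictionary>
```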

19 of 32

20 of 32

Abbreviation dictionaries: Automated by docanalysis using spaCy

Output: Example from Chapter 8 / IPCC Reports
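docanalysis does this with spaCy; as a rough illustration of the "long form (ABB)" pattern it looks for, here is a regex-only sketch (not the docanalysis pipeline) applied to the NDC sentence from the earlier slide:

```python
import re

def find_abbreviations(text):
    """Crude 'long form (ABB)' matcher: for each parenthesised
    capitalised token, take as many preceding words as the
    abbreviation has letters. Not docanalysis's spaCy pipeline."""
    pairs = {}
    for m in re.finditer(r"\(([A-Z]{2,})s?\)", text):
        abbrev = m.group(1)
        words = text[:m.start()].split()
        longform = " ".join(words[-len(abbrev):])
        pairs[abbrev] = longform
    return pairs

sentence = ("...despite mitigation efforts including those in "
            "near-universal nationally determined contributions (NDCs)...")
print(find_abbreviations(sentence))
# {'NDC': 'nationally determined contributions'}
```

Each extracted pair becomes one entry in the abbreviation dictionary (e.g. abb_chapter08.xml).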

21 of 32

Introduction to Google Colab Notebook

An open Jupyter notebook environment

No setup pain

Human- and machine-friendly

Supports interactive programming

Easy to learn and explore new tools

22 of 32

Let’s begin the Action

Click here

https://colab.research.google.com/drive/13J-5kXKYUAMWGoSJGAANb-Ws70k7bvPs?usp=sharing

URL for the Colab notebook for Chapter 8

URL for the Colab notebook for Chapter 17 analysis

https://colab.research.google.com/github/petermr/semanticClimate/blob/main/outreach/climate_knowledge_hunt_hackathon/Hackathon_Notebook/Chapter17_Analysis_Notebook.ipynb#scrollTo=TJTQ4c1mCq5a

URL for the Colab notebook for a literature search on a specific keyword and making a word cloud

https://colab.research.google.com/drive/12ixmez8zh288hBGzwWmBaj29leeuQRD7#scrollTo=THyLJArMaLDi

23 of 32

Click Steps 1 and 2 to install tools/packages

Step 3: Making directory

Step 5: HTML annotation with dict

Google Colab steps

Click Step 3: PDF to HTML conversion (py4ami)

Step 4: Abbreviation extraction (docanalysis)

Click on code cell to run each step

24 of 32

Steps 1 and 2: setting up the environment

Step 1

Step 2

Click the cell

Click the cell

The first two steps will take some time: 10-12 minutes

25 of 32

Step 3

Click the cell

26 of 32

Step 4

Click the cell

Step 5

Click the cell

Result

Input file: cleaned HTML

27 of 32

Typical dictionary output (abb_chapter08.xml)

28 of 32

Step 6

Click the cell

Result

29 of 32

This is the annotated Chapter 8 HTML produced using the abbreviation dictionary

30 of 32

Use of annotated HTML in Research

The annotated HTML is linked via the dictionary to Wikidata IDs, which helps provide more information about each dictionary term in the literature

Clicking on the link will show wikidata information

Wikidata: a free and open knowledge base that can be read and edited by both humans and machines.
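The annotation step amounts to wrapping each dictionary term in a link to its Wikidata item. A minimal sketch of that transformation (illustrative dictionary; the Wikidata ID is an example, not taken from a real output file):

```python
import re

# toy dictionary: term -> Wikidata ID (illustrative)
DICT = {"global warming": "Q7942"}

def annotate(html_text, dictionary):
    """Wrap each dictionary term found in the HTML in a link to
    its Wikidata item -- a sketch of dictionary-based annotation."""
    for term, qid in dictionary.items():
        url = f"https://www.wikidata.org/wiki/{qid}"
        pattern = re.compile(re.escape(term), re.IGNORECASE)
        html_text = pattern.sub(
            lambda m: f'<a href="{url}">{m.group(0)}</a>', html_text)
    return html_text

print(annotate("<p>Global warming accelerates.</p>", DICT))
```

Clicking such a link in the annotated chapter opens the Wikidata page for the term.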

31 of 32

32 of 32

GitHub

Our dictionaries and code are all openly available for anybody to browse and re-use

docanalysis

pyami

pygetpapers

pyamiimage

semanticClimate