Understanding The Semantic Toolkit
Steps:
Using Semantic Toolkit for Climate
Slide 1-2: Summary
Slide 3-11: Understanding Climate Reports
Slide 12-13: Problems with PDF
Slide 14-20: Overview of the semantic toolkit
Slide 21: Advantages of the Colab notebook
Slide 22-29: DIY
IPCC
What is IPCC?
Regular assessments for policymakers, providing an in-depth evaluation of the current state of knowledge on climate change through assessment reports
IPCC Sixth Assessment Report
50 Chapters
10,000 Pages
IPCC AR6 Reports
Chapters in different working groups
Structure of the IPCC Data Corpus
WGIII Report
Example Chapter
Chapter 8: Urban Systems and Other Settlements
Typical IPCC Chapter
Chapter
Executive Summary
FAQ
Text
Figures
Boxes
Tables
References
Executive Summary (15) and FAQs (3)
Information to save the world is locked in PDF reports
But it’s the most important document ever
I can’t read all of that!
Text Data-Mining can help!
IPCC Reports
Machines can help, but…
Really?
They can’t make sense of PDFs
Hey, #SemanticClimate can help!
We need tools!
The #semanticClimate team has developed tools to liberate knowledge from climate reports
(py4ami / pyamihtmlx: this package converts PDF into HTML, runs search queries, and annotates HTML)
(pygetpapers: a tool to assist text miners. It makes requests to open access scientific text repositories, analyses the hits, and systematically downloads the articles without further interaction.)
(docanalysis: docanalysis is a Command Line Tool that ingests corpora (CProjects) and carries out text-analysis of documents, including
How our TDM tools liberate knowledge
[Figure: pipeline diagram. Inputs: IPCC chapters, IPBES reports, …; tools: py4ami, docanalysis; output: HTML]
…despite mitigation efforts 8 including those in near-universal nationally determined contributions (NDCs)…
Hegde SN, Garg A, Murray-Rust P, Mietchen D (2022) Mining the literature for ethics statements: a step towards standardizing research ethics. ARPHA Preprints. https://doi.org/10.3897/arphapreprints.e94687
Peter Murray-Rust
[Figure: pipeline diagram. Inputs: IPCC chapters, IPBES reports, …; tools: py4ami, docanalysis; outputs: HTML, abbreviation dictionary, climate-related terms dictionary]
[Figure: dictionary workflow. Dictionaries: manual dictionary, keywords/keyphrases, abbreviation dictionary; data: HTML, text, annotated HTML; tools: py4ami, rake/yake, gensim, docanalysis]
Dictionary: a collection of words and their meanings
Manual dictionary: compiled by manually reading the PDF
Abbreviation dictionary: generated automatically by docanalysis using spaCy
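To make the idea of an abbreviation dictionary concrete, here is a minimal sketch of abbreviation detection using only the Python standard library. It is not how docanalysis or spaCy do it; it simply assumes abbreviations appear as "long form (ABBR)" and that each letter of the abbreviation matches the first letter of one preceding word. The function name find_abbreviations is hypothetical.

```python
import re


def find_abbreviations(text):
    """Return {abbreviation: long form} for patterns like 'long form (ABBR)'.

    Simplifying assumption: each letter of the abbreviation is the initial
    of one of the words directly before the parenthesis.
    """
    found = {}
    # Two or more capital letters in parentheses, with an optional plural "s".
    for match in re.finditer(r"\(([A-Z]{2,})s?\)", text):
        abbr = match.group(1)
        words = text[:match.start()].split()
        # Take as many preceding words as the abbreviation has letters.
        candidate = words[-len(abbr):]
        initials = "".join(word[0].upper() for word in candidate)
        if len(candidate) == len(abbr) and initials == abbr:
            found[abbr] = " ".join(candidate)
    return found


sample = ("despite mitigation efforts including those in near-universal "
          "nationally determined contributions (NDCs)")
abbreviations = find_abbreviations(sample)
print(abbreviations)  # {'NDC': 'nationally determined contributions'}
```

Real reports contain abbreviations that break this rule (skipped words, hyphens, mixed case), which is one reason a dedicated tool is used in the workflow.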
Output: Example from Chapter 8 / IPCC Reports
Introduction to Google Colab Notebook
An open Jupyter notebook environment
No painful setup
Human- and machine-friendly
Supports interactive programming
Easy to learn and explore new tools
Let’s begin the action
Click here
https://colab.research.google.com/drive/13J-5kXKYUAMWGoSJGAANb-Ws70k7bvPs?usp=sharing
URL for the Colab notebook for Chapter 08
URL for the Colab notebook for the Chapter 17 analysis
https://colab.research.google.com/github/petermr/semanticClimate/blob/main/outreach/climate_knowledge_hunt_hackathon/Hackathon_Notebook/Chapter17_Analysis_Notebook.ipynb#scrollTo=TJTQ4c1mCq5a
URL for the Colab notebook for literature search on a specific keyword and making a word cloud
https://colab.research.google.com/drive/12ixmez8zh288hBGzwWmBaj29leeuQRD7#scrollTo=THyLJArMaLDi
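The notebook itself is not reproduced here. As a rough sketch of what sits behind a word cloud, the snippet below counts word frequencies in a piece of text using only the standard library. The stopword list, the sample sentence and the function name word_frequencies are illustrative assumptions, not taken from the notebook.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; real tools ship much longer ones.
STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "for", "on"}


def word_frequencies(text, top_n=5):
    """Return the top_n most frequent non-stopword words as (word, count) pairs."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return counts.most_common(top_n)


sample = ("Climate change mitigation and climate adaptation in urban systems: "
          "urban climate policy.")
top_words = word_frequencies(sample, top_n=2)
print(top_words)  # [('climate', 3), ('urban', 2)]
```

A word cloud then draws each word at a size proportional to its count.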
Google Colab steps
Click on a code cell to run each step
Steps 1 and 2: Install tools / packages
Step 3: Making directory
Step 3: PDF to HTML conversion (py4ami)
Step 4: Abbreviation extraction (docanalysis)
Step 5: HTML annotation with dictionary
Steps 1 and 2: Setting up the environment
Step 1
Step 2
Click the cell
Click the cell
The first two steps will take some time: about 10-12 minutes
Step 3
Click the cell
Step 4
Click the cell
Step 5
Click the cell
Result
Input file: cleaned HTML
Typical dictionary output (abb_chapter08.xml)
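The actual layout of abb_chapter08.xml is not shown in this text. As a hedged illustration of what an XML dictionary might look like, the sketch below writes abbreviations into an assumed layout: a dictionary root element with a title attribute and one entry element per abbreviation, carrying name and term attributes. These element and attribute names, and the function build_dictionary_xml, are assumptions for illustration and may differ from what docanalysis produces.

```python
import xml.etree.ElementTree as ET


def build_dictionary_xml(title, abbreviations):
    """Serialize {abbreviation: long form} into an assumed dictionary XML layout."""
    root = ET.Element("dictionary", {"title": title})
    for abbr, long_form in sorted(abbreviations.items()):
        # One <entry> per abbreviation; attribute names are hypothetical.
        ET.SubElement(root, "entry", {"name": abbr, "term": long_form})
    return ET.tostring(root, encoding="unicode")


xml_text = build_dictionary_xml(
    "abb_chapter08",
    {"NDC": "nationally determined contributions"},
)
print(xml_text)
```

Keeping the dictionary in XML means the same file can be read by people and reused by the annotation step.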
Step 6
Click the cell
Result
This is the annotated Chapter 08 HTML produced using the abbreviation dictionary
Use of annotated HTML in Research
The dictionary-annotated HTML links each term to its Wikidata ID, which helps provide more information about the dictionary terms found in the literature
Clicking on a link shows the Wikidata information
Wikidata: Wikidata is a free and open knowledge base that can be read and edited by both humans and machines.
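The linking idea can be sketched in a few lines of standard-library Python. This is not the py4ami implementation: it assumes a plain mapping from term to Wikidata ID and wraps each whole-word occurrence in a link to the Wikidata page. The ID Q0000000 is a placeholder, not a real identifier, and the function name annotate is hypothetical.

```python
import html
import re

WIKIDATA_URL = "https://www.wikidata.org/wiki/"


def annotate(text, dictionary):
    """Return HTML in which each dictionary term links to its Wikidata page."""
    if not dictionary:
        return html.escape(text)
    # Longest terms first, so a longer term wins over a shorter one inside it.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    parts = []
    last = 0
    for match in pattern.finditer(text):
        # Escape the untouched text between matches, then add the link.
        parts.append(html.escape(text[last:match.start()]))
        term = match.group(1)
        url = WIKIDATA_URL + dictionary[term]
        parts.append(f'<a href="{url}">{html.escape(term)}</a>')
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


# "Q0000000" is a placeholder ID used only for this example.
annotated = annotate("Many countries have submitted an NDC.", {"NDC": "Q0000000"})
print(annotated)
```

Matching in a single pass over the original text avoids accidentally annotating text inside links that were already inserted.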
GitHub
Our dictionaries and code are all openly available for anybody to browse and re-use
docanalysis
pyami
pygetpapers
pyamiimage
semanticClimate