UMSI Data4Good
AI Products & Demo: Using AI to find and analyze NGO PDF reports
Professor Edward Happ
Sidra Effendi
October 2024
Agenda
Introductions
Intro
The Data4Good center is working on projects that are at the center of the UMSI mission: “We create and share knowledge so that people will use information — with technology — to build a better world.”
Our History:
Our Values:
Thank you for joining us. We look forward to your feedback and insights.
Team Members
Products
Current Data4Good Products
Hangul
Why Hangul? - UN ReliefWeb
(12/1970 - 10/2024)
Why Hangul? - ReliefWeb
Why Hangul?
Global Humanitarian Aid Forecast
Using an exponential trendline, the projected increase is 4x in 10 years and 14x in 20 years.
GHA Report PDF: https://devinit.org/resources/global-humanitarian-assistance-report-2023/
UN HDX Data Source: https://data.humdata.org/dataset/gha-report-2022
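As a rough illustration of how an exponential trendline yields growth multiples like the ones quoted in the forecast, here is a minimal sketch. The series below is synthetic (a hypothetical 14% annual growth rate chosen only for illustration) and is not the GHA or HDX data; the function names are also assumptions. Under a pure exponential trend, the 20-year multiple is the square of the 10-year multiple, so 14x over 20 years corresponds to roughly 3.7x over 10 years, which rounds to the 4x figure.

```python
import math

def fit_exponential(years, values):
    """Least-squares fit of values ~ a * exp(b * year) using a log-linear regression."""
    n = len(years)
    logs = [math.log(v) for v in values]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    b = sxy / sxx                      # continuous growth rate per year
    a = math.exp(mean_y - b * mean_x)  # fitted level at year 0
    return a, b

def growth_multiple(b, horizon_years):
    """How many times larger the trendline is after `horizon_years`."""
    return math.exp(b * horizon_years)

# Synthetic series (NOT real funding data): 14% growth per year for 10 years.
years = list(range(0, 11))
values = [100 * 1.14 ** t for t in years]

a, b = fit_exponential(years, values)
ten_year = growth_multiple(b, 10)
twenty_year = growth_multiple(b, 20)
print(f"10-year multiple: {ten_year:.1f}x, 20-year multiple: {twenty_year:.1f}x")
```

With real data, `years` and `values` would be replaced by the observed series, and the fitted multiples would depend on how closely that series follows an exponential curve.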
Hangul Objective
Hangul
Send a PDF and your email address in the chat for a demo.
Hangul
We Used a Continuum of Methods
Algorithms
LLM
ML Models
In-built functions
Hangul
Location Detection
Hangul
Customized ML Models
Hangul
Model Performance Summary
Model | Baseline Hangul 1.0 | Linear SVC | Naive Bayes | Logistic Regression | Neural Network | LogReg Classifier Chain |
Theme Detection Model | 0.0 (not being returned) | 0.90 (Binary Relevance) | | 0.89 (Binary Relevance) | | |
Disaster Detection Model | 0.40 | 0.71 | 0.67 | 0.71 | 0.74 | 0.72 |
Average Micro F1 Scores Across Categories (with balanced dataset)
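The slides do not show how the scores above were computed, so the following is only a minimal sketch of the standard micro-averaged F1 definition for multi-label predictions: true positives, false positives, and false negatives are pooled across all labels before F1 is calculated. The label matrices are made-up toy values, not the project's data.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data given as rows of 0/1 indicators."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1          # label correctly predicted
            elif t == 0 and p == 1:
                fp += 1          # label predicted but not present
            elif t == 1 and p == 0:
                fn += 1          # label present but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example: 3 documents, 3 hypothetical labels each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]

score = micro_f1(y_true, y_pred)
print(f"micro F1 = {score:.2f}")
```

Because counts are pooled, frequent labels weigh more heavily than rare ones, which is one reason a balanced dataset matters when comparing models this way.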
Hangul Summary
Summarize the non-profit reports
The goal is to generate a summary that is
Hangul Summary
Method | Date tested | By | Pros | Cons |
BERT | Mar. 2021 | Shivika | Publicly available; free | Poor narrative; not very readable |
Paragraph parsing | Nov. 2021 | Sidra | D4G algorithm | Finding the most relevant paragraph proved difficult; the number of paragraphs in the dataset is too large |
First 4 pages extraction | Oct. 2022 | Sidra | ReliefWeb approach | Includes irrelevant text; problem of scope (not comprehensive) |
Find relevant page | Dec. 2022 | Sidra | Most straightforward and readable | Does not summarize the document. Useful for search. |
Find summary in the doc (e.g., abstract, cover letter, intro, etc.) | May 2022 | Sidra | A summary-like statement in the doc is most accurate | Difficult to find an appropriate summary type that applies across PDFs |
Hangul Summary
Method | Date tested | By | Pros | Cons |
ChatGPT | Jun-Jul. 2023 | Prithvi, Ed | Readable, high quality summaries | Expensive via API |
Google Bard | Jun. 2023 | Ed | Readable, high quality summaries | No API; in beta and not available to everyone |
ChatPDF | Aug-Sep. 2023 | Takao, Ed | Readable, high quality summaries, more affordable | Random summary types, slow API, limited entry-level plans |
Facebook BART | Sep. 2023 | Prithvi | Readable, high quality summaries, more affordable | Slow. Requires download and local GPU machine |
Hangul Summary
Readable, affordable and fast summaries for non-profit reports
Summary: BERTScore
Reference: Extracted existing abstract/summary
Generated: TextRank+LLM
Precision = 0.84
Recall = 0.82
Note: Comparing executive summary vs. generated summary.
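To make the precision and recall figures above easier to interpret, here is a minimal sketch of the greedy matching idea behind BERTScore: each generated token is matched to its most similar reference token (precision), and each reference token to its most similar generated token (recall). The two-dimensional vectors are hypothetical stand-ins for contextual embeddings, and the sketch omits the optional idf weighting and baseline rescaling; it is not the evaluation code used for the reported scores.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def greedy_match_scores(reference_vecs, generated_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Recall: how well each reference token is covered by the generated text.
    recall = sum(
        max(cosine(r, g) for g in generated_vecs) for r in reference_vecs
    ) / len(reference_vecs)
    # Precision: how well each generated token is supported by the reference.
    precision = sum(
        max(cosine(g, r) for r in reference_vecs) for g in generated_vecs
    ) / len(generated_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy embeddings (hypothetical): two reference tokens, two generated tokens.
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
generated_vecs = [[1.0, 0.0], [1.0, 1.0]]

precision, recall, f1 = greedy_match_scores(reference_vecs, generated_vecs)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this reading, a precision of 0.84 suggests most of the generated summary is supported by the reference, and a recall of 0.82 suggests most of the reference content is covered by the generated summary.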
Hangul System Design
Objective:
Constraints:
Hangul System Design
Metadata Generation Time
Summary Generation Time
Hangul Demo
D4G - Challenges
Next Steps
Thank you!
For more information, contact:
Edward G. Happ | Sidra Effendi |
Executive Fellow Emeritus, UMSI; Leadership Fellow, NetHope; Email: ehapp@umich.edu; Website: www.eghapp.com |