1 of 32

UMSI Data4Good

AI Products & Demo: Using AI to find and analyze NGO PDF reports

Professor Edward Happ

Sidra Effendi

October 2024

2 of 32

Agenda

  1. Introductions
  2. Products
  3. Hangul
  4. Upload a PDF
  5. Hangul Summary Generation
  6. Hangul System Design
  7. Demo w/User Input
  8. Challenges
  9. Next Steps
  10. Q&A

3 of 32

Introductions

4 of 32

Intro

The Data4Good center is working on projects that are at the center of the UMSI mission: “We create and share knowledge so that people will use information — with technology — to build a better world.”

Our History:

  • Founded in 2017, the UMSI Data4Good (D4G) center is a volunteer-run organization.
  • Empowers nonprofits with comprehensive data to enhance decision making.
  • Currently on its 9th cohort of students, who are also responsible for hiring their successors.

Our Values:

  • Transparency and equal access to information.
  • Utilizing data and AI analysis to positively impact lives through nonprofit organizations.

Thank you for joining us. We look forward to your feedback and insights.

5 of 32

Team Members

6 of 32

Products

7 of 32

Current Data4Good Products

  • Chetah: a search engine for nonprofit reports published on the web. It locates reports using a custom search algorithm.

  • Hangul: an NLP-based assistant that helps digital curators at ReliefWeb handle a larger volume of documents.

8 of 32

Hangul

9 of 32

Why Hangul? - UN ReliefWeb

  • ReliefWeb is a humanitarian information service provided by UN OCHA.
  • Ultimate Non-Profit Datasource
  • Report count: 888,000 (737,000 in English)
  • Publication date range: 12/1970 - 10/2024

  • Categories:
    • 6 Org. types
    • 20 Themes
    • 20 Disaster types
  • 8 Document formats/types

10 of 32

Why Hangul? - ReliefWeb

  • ReliefWeb has 20-30 editors who manually process the PDFs produced by UN and NGOs.
  • It takes about 10-15 minutes for an editor to process each PDF.
  • Previous automation attempts failed.

11 of 32

Why Hangul?

Global Humanitarian Aid Forecast

Using an exponential trendline, the increase is 4x in 10 years and 14x in 20 years.

  • According to the Global Humanitarian Assistance Report 2023, funding needs will increase exponentially over the coming decades due to the downstream impacts of global crises such as climate change. This will lead to a proportional increase in the number of reports generated each year for nonprofit programs.
  • Hangul is designed to increase editor productivity by ~14 times.
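
As a rough check on the trendline figures above, the following sketch assumes growth compounds at a constant annual rate. That is an assumption: the slide gives only the 10-year and 20-year multiples. The function names are illustrative.

```python
def annual_growth_rate(multiple: float, years: float) -> float:
    """Constant annual growth rate that yields `multiple` after `years`."""
    return multiple ** (1.0 / years) - 1.0


def growth_multiple(rate: float, years: float) -> float:
    """Multiple reached after `years` of compounding at `rate`."""
    return (1.0 + rate) ** years


# Rate implied by the 20-year figure (14x), assuming constant compounding.
rate = annual_growth_rate(14.0, 20)
# What that same rate gives after 10 years.
ten_year = growth_multiple(rate, 10)

print(f"annual rate: {rate:.1%}, 10-year multiple: {ten_year:.2f}x")
```

Under this assumption the 10-year multiple comes out at about 3.7x, which is consistent with the slide's rounded 4x.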

12 of 32

Hangul Objective

  • Automate extraction of the content and metadata that UN ReliefWeb editors need from PDFs.
  • Reduce the time taken to process PDFs by employing NLP techniques to extract metadata.
  • Create a user-friendly web interface for easy access to the system.
  • Leverage expertise of the ReliefWeb editors by keeping them in the loop.

13 of 32

Hangul

Document specification:

  • Non-financial reports
  • Maximum 25 pages

seffendi@umich.edu

Send a PDF and your email in the Chat for demo.

14 of 32

Hangul

We Used a Continuum of Methods

Algorithms

  • NER
  • POS Tagging
  • Text Filtering

LLM

  • Facebook BART

ML Models

  • Neural Networks
  • Binary Relevance Model
  • Linear SVC

In-built functions

  • Apache Tika
  • YAKE
  • Markdown

15 of 32

Hangul

Location Detection

16 of 32

Hangul

Customized ML Models

  • Created an algorithm to retrieve the text of 92,000 unique reports from the UN ReliefWeb API
  • Cleaned the dataset and stored it in CSV format to train models
  • Multi-label classifier for disaster type detection
  • Multi-label classification model for UN Theme detection
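
The deck names Binary Relevance as one of the multi-label approaches. A minimal sketch of that idea follows: train one independent yes/no classifier per label and return every label whose classifier says yes. The toy token-voting learner, the class names, and the four training sentences are all made up for illustration; they stand in for the real models and the ReliefWeb dataset, which the slides do not detail.

```python
from collections import Counter


def tokenize(text):
    """Lowercased set of whitespace-separated tokens."""
    return set(text.lower().split())


class TokenVoteClassifier:
    """Toy binary learner (illustrative only): each token votes by how much
    more often it appears in positive documents than in negative ones."""

    def fit(self, texts, targets):
        pos, neg = Counter(), Counter()
        n_pos = sum(targets)
        n_neg = len(targets) - n_pos
        for text, target in zip(texts, targets):
            (pos if target else neg).update(tokenize(text))
        self.weights = {
            tok: pos[tok] / max(n_pos, 1) - neg[tok] / max(n_neg, 1)
            for tok in set(pos) | set(neg)
        }
        return self

    def predict(self, text):
        return sum(self.weights.get(tok, 0.0) for tok in tokenize(text)) > 0


class BinaryRelevance:
    """One independent binary classifier per label."""

    def __init__(self, labels, base=TokenVoteClassifier):
        self.labels = list(labels)
        self.base = base
        self.models = {}

    def fit(self, texts, label_sets):
        for label in self.labels:
            targets = [label in labels for labels in label_sets]
            self.models[label] = self.base().fit(texts, targets)
        return self

    def predict(self, text):
        return {label for label, m in self.models.items() if m.predict(text)}


# Made-up training examples, not ReliefWeb data.
texts = [
    "flood waters displaced families",
    "earthquake destroyed homes",
    "flood after earthquake damaged roads",
    "drought reduced harvest",
]
label_sets = [{"Flood"}, {"Earthquake"}, {"Flood", "Earthquake"}, {"Drought"}]

model = BinaryRelevance(["Flood", "Earthquake", "Drought"]).fit(texts, label_sets)
print(sorted(model.predict("severe flood and earthquake")))
```

Because each label is learned separately, any binary model (for example a linear SVC or logistic regression) can be swapped in as the base learner; the trade-off is that correlations between labels are ignored.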

17 of 32

Hangul

Model Performance Summary

Average Micro F1 Scores Across Categories (with balanced dataset)

Theme Detection Model

  • 0.0 (not being returned)
  • 0.90 (Binary Relevance)
  • 0.89 (Binary Relevance)

Disaster Detection Model

  • Baseline Hangul 1.0: 0.40
  • Linear SVC: 0.71
  • Naive Bayes: 0.67
  • Logistic Regression: 0.71
  • Neural Network: 0.74
  • LogReg Classifier Chain: 0.72
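
For readers unfamiliar with the metric, here is a minimal sketch of how a micro F1 score is computed for multi-label predictions: true positives, false positives, and false negatives are pooled over all documents and labels before precision and recall are taken. The example labels are made up, and how the scores reported above were averaged across categories is not specified in the slides.

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over a list of true and predicted label sets."""
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not true
        fn += len(truth - pred)   # true labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: three documents with illustrative disaster labels.
truth = [{"Flood"}, {"Earthquake", "Flood"}, {"Drought"}]
pred = [{"Flood"}, {"Earthquake"}, {"Drought", "Flood"}]

print(round(micro_f1(truth, pred), 2))  # 3 TP, 1 FP, 1 FN -> 0.75
```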

18 of 32

Hangul Summary

19 of 32

Hangul Summary

Summarize the non-profit reports

The goal is to generate a summary that is:

  • Readable
  • Affordable
  • Fast to generate

20 of 32

Hangul Summary

Methods tested (method, date tested, tested by, pros, cons):

BERT (tested Mar. 2021 by Shivika)

  • Pros: Publicly available; free
  • Cons: Not a good narrative; not very readable

Paragraph parsing (tested Nov. 2021 by Sidra)

  • Pros: D4G algorithm
  • Cons: Finding the most relevant paragraph proved difficult; the number of paragraphs in the dataset is too large

First 4 pages extraction (tested Oct. 2022 by Sidra)

  • Pros: ReliefWeb approach
  • Cons: Includes irrelevant text; problem of scope (not comprehensive)

Find relevant page (tested Dec. 2022 by Sidra)

  • Pros: Most straightforward and readable
  • Cons: Does not summarize the document. Useful for search.

Find summary in the doc, e.g., abstract, cover letter, intro, etc. (tested May 2022 by Sidra)

  • Pros: A summary-like statement in the doc is most accurate
  • Cons: Difficult to find an appropriate summary type that applies across PDFs
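
The "find summary in the doc" method listed above could look roughly like the sketch below: scan the extracted text for a heading that names a summary-like section and return the text up to the next heading. The heading list, the short-line heading test, and the sample report are assumptions for illustration and are not the D4G implementation.

```python
import re

# Hypothetical list of section titles treated as "summary-like".
SUMMARY_HEADINGS = ("executive summary", "abstract", "summary", "introduction", "overview")


def looks_like_heading(line: str) -> bool:
    """Heuristic: a short, non-empty line that does not end with a period."""
    stripped = line.strip()
    return 0 < len(stripped) <= 60 and not stripped.endswith(".")


def find_summary_section(text: str):
    """Return the body of the first summary-like section, or None."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        # Drop leading section numbering such as "1." or "2.3 ".
        title = re.sub(r"^[\d.\s]+", "", line).strip().lower()
        if looks_like_heading(line) and title in SUMMARY_HEADINGS:
            body = []
            for following in lines[i + 1:]:
                if looks_like_heading(following):
                    break  # next heading ends the section
                if following.strip():
                    body.append(following.strip())
            if body:
                return " ".join(body)
    return None


# Made-up report text.
sample = """Annual Report 2023

1. Executive Summary
The programme reached 12,000 households.
Funding gaps remain in the north.

2. Background
The region has faced repeated flooding.
"""

print(find_summary_section(sample))
```

The function returns `None` when no matching heading exists, which mirrors the drawback noted above: no single summary type applies across all PDFs.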

21 of 32

Hangul Summary

Methods tested (method, date tested, tested by, pros, cons):

ChatGPT (tested Jun-Jul. 2023 by Prithvi, Ed)

  • Pros: Readable, high quality summaries
  • Cons: Expensive via API

Google Bard (tested Jun. 2023 by Ed)

  • Pros: Readable, high quality summaries
  • Cons: No API - in beta version, not available to everyone

ChatPDF (tested Aug-Sep. 2023 by Takao, Ed)

  • Pros: Readable, high quality summaries, more affordable
  • Cons: Random summary types, slow API, limited entry-level plans

Facebook BART (tested Sep. 2023 by Prithvi)

  • Pros: Readable, high quality summaries, more affordable
  • Cons: Slow. Requires download and local GPU machine

22 of 32

Hangul Summary

Readable, affordable and fast summaries for non-profit reports

23 of 32

Summary: BERTScore

Reference: Extracted existing abstract/summary

Generated: TextRank+LLM

Precision = 0.84

Recall = 0.82
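
The slide reports precision and recall only. Assuming the usual harmonic-mean definition of F1, the combined score implied by these two numbers can be worked out as follows (the function name is illustrative):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported on the slide.
bertscore_f1 = f1_score(0.84, 0.82)
print(f"{bertscore_f1:.2f}")  # about 0.83
```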

Note: compares the report's executive summary with the generated summary.
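
The slide names TextRank+LLM as the generation method without giving details. As a minimal sketch of the extractive half only, the code below ranks sentences with a TextRank-style algorithm: sentences are graph nodes, edge weights are word overlap, and a PageRank-style iteration scores them. The tokenizer, the damping factor of 0.85, the iteration count, and the sample text are assumptions; the LLM step that would rewrite the selected sentences is omitted.

```python
import math
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def overlap(a, b):
    """Word-overlap similarity, normalized by the log of the sentence sizes."""
    denom = math.log(len(a)) + math.log(len(b)) if a and b else 0.0
    return len(a & b) / denom if denom > 0 else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a weighted PageRank-style iteration."""
    n = len(sentences)
    tokens = [set(re.findall(r"[a-z0-9']+", s.lower())) for s in sentences]
    weights = [
        [overlap(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def top_sentences(text, k=2):
    """Return the k highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    scores = textrank(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]


# Made-up text: two related sentences and one off-topic sentence.
text = (
    "Floods displaced thousands of families in the region. "
    "Relief agencies delivered food to families displaced by floods. "
    "The weather was sunny on Tuesday."
)
print(top_sentences(text, k=2))
```

Selecting a handful of central sentences first keeps the text handed to a language model short, which fits the goal of affordable and fast summaries.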

24 of 32

Hangul System Design

25 of 32

Hangul System Design

Objective:

  • Reliability
  • Functionality
  • Maintainability
  • Performance

Constraints:

  • 3GB RAM
  • Limited horizontal scaling
  • Long PDF reports
  • Cost

26 of 32

Hangul System Design

27 of 32

Metadata Generation Time

28 of 32

Summary Generation Time

29 of 32

Hangul Demo

30 of 32

D4G - Challenges

  • The work is targeted towards the Non-Profit sector where technology comes with many resource restrictions

  • Computational restrictions - No GPUs, limited storage memory

  • For document summarization in particular, we will keep investigating how to get good results from modern AI models and tools within these constraints, without compromising much on quality.

31 of 32

Next Steps

  • Hangul 2.2 and API 1.0 per ReliefWeb requests
  • Chetah 2.0 launch
  • ReliefWeb cross-PDF chatbot MVP
  • Personas Project
  • AI/DS project ideas from you!

32 of 32

Thank you!

For more information, contact:

Edward G. Happ
Executive Fellow Emeritus, UMSI
Leadership Fellow, NetHope
Email: ehapp@umich.edu
Website: www.eghapp.com
Blog: http://eghapp.blogspot.com/

Sidra Effendi
Research Assistant, University of Michigan
Email: seffendi@umich.edu
Website: www.sidraeffendi.com
Blog: https://medium.com/@sidra.effendi