1 of 75

Data for Pangolins

Pili Hu (Lecturer, HKBU)

Roy Tang (JOUR+CS grad., HKBU)

2019.06.01 @HKU

2 of 75

Workshop facilitators

Pili Hu

Mr. Hu teaches data journalism and media technology at HKBU. Before joining HKBU, he served as the Chief Technology Officer of Initium Media, responsible for developing the technologies that power the fast-growing digital media outlet. He founded Initium Lab, whose data-driven news report on Hong Kong Legislative Council voting patterns won the SOPA "Excellence in Information Graphics" award in 2016. He was co-founder and CEO of HyperLab, which produced a cross-social-network search engine. He holds an MPhil in Information Engineering from CUHK. Before moving to Hong Kong, he worked for Baidu in Beijing as a core algorithm R&D engineer, responsible for user behaviour and link analysis.

Roy Tang

Roy is a recent graduate in Journalism and Computer Science from Hong Kong Baptist University. With a strong interest in data-driven reporting, he has participated in multiple data journalism projects and competitions. Roy studies and practises data analysis, data visualisation, and web front-end development intensively. He took part in the School of Communication's research on the Hong Kong Digital Media Report 2018 as a research assistant and web developer.

3 of 75

Agenda

  • 30 min - A primer on data
    • Reach a rough consensus of data collaboration
  • 30 min - overview of current pangolin data
  • 30 min - (the true) workshop!
    • Identify data opportunities

[Diagram: Questions of public interest; Data answers; Journalism answers. Labels: GIJC, today]

4 of 75

Crash Course on Data

5 of 75

Hierarchy of details: Numbers/ stats/ data

Numbers -- Obtained by interviewing experts, quoting from other articles, … Only useful for writing articles.

Statistics -- Usually obtained from research reports/ surveys. You can analyse trends, extract knowledge and draw infographics.

Data -- Many data points (plural of "datum"). Cannot be consumed by humans before processing. Useful for data mining, data visualisation, generative art.

6 of 75

Common Types of (Table) Dataset

Flattened dataset

  • Normalised Dataset
  • List of records
  • Long table

* Normalisation is a more stringent concept in database design. A flattened dataset is not necessarily normalised (e.g. it can be a join of multiple tables)

Crosstab

  • Unnormalised Dataset
  • Pivoted table
  • Wide table

* Unnormalised datasets come in many formats and are not necessarily crosstabs. In practice, crosstabs are the most commonly seen form, especially in releases from government census departments
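A minimal sketch of how the two shapes relate: pivoting a flattened list of records into a crosstab. The field names and values are invented for illustration and are not taken from the election dataset.

```python
from collections import defaultdict

# Flattened (long) dataset: one record per row. Values are invented.
records = [
    {"year": 2011, "camp": "A", "candidates": 12},
    {"year": 2011, "camp": "B", "candidates": 7},
    {"year": 2015, "camp": "A", "candidates": 9},
    {"year": 2015, "camp": "B", "candidates": 11},
]

def to_crosstab(rows, row_key, col_key, value_key):
    """Pivot long records into {row label: {column label: summed value}}."""
    table = defaultdict(lambda: defaultdict(int))
    for r in rows:
        table[r[row_key]][r[col_key]] += r[value_key]
    return {k: dict(v) for k, v in table.items()}

# Crosstab (wide) dataset: years as rows, camps as columns.
crosstab = to_crosstab(records, "year", "camp", "candidates")
for year, columns in sorted(crosstab.items()):
    print(year, columns)
```

Going the other way (crosstab back to records) is usually the first cleaning step when a source only publishes wide tables.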

7 of 75

Flattened dataset

A fraction of 4300+ records of district council election candidates from 1999 to 2015 (source: Initium)

8 of 75

Crosstab dataset

Median age by (year, gender) X (camp)

<40 highlighted in red;

> 48 highlighted in green

9 of 75

Where do we find stats?

Tabulation services:

(large and high-dimensional dataset sliced/ aggregated into smaller tables)

https://www.statcompiler.com/en/

http://www.fao.org/faostat/en/#home

https://datausa.io/

Stats/ charts search engine:

(large collection of small and maybe unrelated stats/charts)

https://www.theatlas.com/

https://www.statista.com/

10 of 75

Logical layers in telling a data story

  • Observation (factual)
    • Citing extreme cases
      • "I know a woman who is paid only half as much as a male colleague at the same rank"
    • Citing counter example
      • "Women are not necessarily paid less. For example, who..."
  • Correlation (statistical)
    • "In general, women are paid less"
      • For M/F variables, we can test the significance (p-value)
  • Causality (philosophical/ empirical)
    • "Being female causes her to be paid less"

11 of 75

Pattern recognition & Anomaly Detection

  • Prove “common sense”:
    • Find patterns
    • Discover trends
    • Interpolate history
    • Predict future
    • “Explanatory News”
  • Disprove “common sense”:
    • Anomaly detection
    • “News”

Web

FB

Traffic distribution on web and on Facebook by channels. Real data from Initium Media, anonymised. (Pili Hu @ DNN, Dec 2015)

12 of 75

Data-driven vs. story-driven

  • Story-driven:
    • From news to data news
    • Visualisation is mostly an “add-on”
    • Most visualisations are charting
    • Data is mostly useful for background/ overview
    • Key: start from the original work
  • Data-driven:
    • From research report to mass communication campaign
      • Key: master the methodology for social research
    • From data to insights to story
      • Key: master the methodology of data mining; exploratory analysis

13 of 75

Berkeley admission record

Exercise:

  • Calculate the by-gender admission rate at department level
  • Calculate the by-gender admission rate at university level

How do we interpret this phenomenon:

  • It is a matter of the preference in choosing majors
    • Distribution of population
  • Simpson's Paradox
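A minimal sketch of the exercise, using invented numbers rather than the real Berkeley records. It shows how women can have the higher admission rate in every department and still the lower rate overall, because most women applied to the more selective department.

```python
# applications[department][gender] = (admitted, applied). Invented numbers.
applications = {
    "Dept X": {"male": (80, 100), "female": (18, 20)},
    "Dept Y": {"male": (4, 20), "female": (30, 100)},
}

def rate(admitted, applied):
    return admitted / applied

# Department-level admission rate by gender.
dept_rates = {
    dept: {gender: rate(*counts) for gender, counts in by_gender.items()}
    for dept, by_gender in applications.items()
}

# University-level admission rate by gender: pool the counts first.
totals = {}
for by_gender in applications.values():
    for gender, (admitted, applied) in by_gender.items():
        a, n = totals.get(gender, (0, 0))
        totals[gender] = (a + admitted, n + applied)
overall_rates = {gender: rate(*counts) for gender, counts in totals.items()}

for dept, rates in dept_rates.items():
    print(dept, rates)
print("Overall", overall_rates)
```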

Data literacy lab 101 dataset: download here .

"5 data literacy cases" (Chinese)

14 of 75

Same dataset: two different stories

  • Gender bias against men?
  • Gender bias against women?

15 of 75

Anscombe's quartet (Correlation and visual signs)

Exercise:

  • Calculate basic stats:
    • =AVERAGE()
    • =VAR()
  • Use Data Analysis Pack to calculate the correlation
  • Visualise by scatter plot
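For those who prefer code to a spreadsheet, a minimal sketch of the same calculation for the first two sets of the quartet. The values are typed in from the commonly published version of Anscombe's data; check them against your own copy. `variance` is the sample variance, which is what `=VAR()` computes.

```python
from math import sqrt
from statistics import mean, variance

# Sets I and II of Anscombe's quartet share the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    return cov / sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))

for name, y in (("I", y1), ("II", y2)):
    print(name, round(mean(y), 2), round(variance(y), 2), round(pearson(x, y), 3))
```

The summary numbers come out nearly identical for both sets, which is why the scatter plot is the step that matters.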

16 of 75

Motivation for visualisation

  • Although seeing is not necessarily believing, it is still better to see more
    • More data points
    • More details
    • More facets
    • More angles
  • Statistical arguments are hard in general
    • Even if you can make one, the causal relationship is still uncertain
  • Common approach: visualise the data and leverage empirical / logical reasoning to reach (actionable) insights.

17 of 75

The accuracy of visual elements

J. D. Mackinlay, 1986, “Automating the Design of Graphical Presentations of Relational Information” http://www2.parc.com/istl/groups/uir/publications/items/UIR-1986-02-Mackinlay-TOG-Automating.pdf

18 of 75

19 of 75

20 of 75

21 of 75

Pangolin data - status quo

22 of 75

Status update: 20190601

  • Data Catalog: https://github.com/Roytangrb/pangolin
  • Summarised 6 datasets: worldwide, Hong Kong, China, Nepal
    • More are welcome!
  • Loaded the 2G CITES data into a SQL DB (may be put on the cloud later)
  • Exploratory analysis of CITES

23 of 75

Catalogue overview

Please feel free to send anything to @Roy Tang.

He will keep things in order

24 of 75

Datasets summary

Dataset             | Overview/ Nature                               | Participation
CITES               | 2G, 20M+ records, structured                   | Data analyst
EIA Seizures        | Raw data behind the map exists but is not open | Request data
China Judgements DB | Can be downloaded in a batch                   | Coder
HKU Seizures        | Google News (manual scrape; coding)            | Scraper / Coder
HK Customs          | Scraped                                        | Coder
Nepal               | Ad hoc information                             | Data Researcher

25 of 75

Today's objective: Identify data opportunities

Data-driven:

  • Exploratory analysis/ visualisation of single dataset
    • Stats/ charts brainstorm in the latter part of the workshop
  • Cross-DB opportunities
  • Enrichment of existing dataset
    • Coverage (add data points): Geography, time, etc
    • Attribute (add variables): may need research, transformation, etc

Story-driven:

  • Visualise numbers/ stats in existing articles
  • Enrich numbers/ stats in existing articles

26 of 75

CITES

A multilateral treaty to protect endangered plants and animals

https://en.wikipedia.org/wiki/CITES

27 of 75

CITES

https://trade.cites.org/

  • Full DB
  • 2.4G in CSV
  • 20 million+ records

28 of 75

Column overview

  • Column name:
    • Year
    • Appendix
      • All eight pangolin species were uplisted to CITES Appendix I in 2016
    • taxon, class, order, family
    • term, quantity, unit
    • importer, exporter, origin, purpose, source
    • reporter type (Importer/Exporter)
  • Data format:
    • Year [date]
    • Quantity [number]
    • Term: {scales, leather, skins, meat...}
    • Unit: {kg, g, cm, m, box, carton ...}
    • Others: Code representation

29 of 75

Compare two sources

  • Top: 2G raw data format
    • Reporter Type
  • Bottom: tabulation output from website (TBC)
    • Aggregate by keys
    • Compare importer/ exporter

30 of 75

Manidae subset from CITES

  • Total records in CITES: 20 million+
  • Subset of CITES dataset: Manidae
    • 3631 Manidae records
      • Remark: data downloaded via the CITES query API is not raw and contains only 1600 records
    • View/ Download: https://github.com/Roytangrb/pangolin/blob/master/CITIES%20Analysis/manidae.csv
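A minimal sketch of how such a subset can be cut from the full CSV. The header names (`Year`, `Family`, `Taxon`, `Term`, `Quantity`) are assumptions based on the column overview, and the sample rows are invented; adjust the names to the real header.

```python
import csv
import io

# Tiny invented stand-in for the full CSV export.
SAMPLE_CSV = """Year,Family,Taxon,Term,Quantity
2000,Manidae,Manis javanica,skins,40
2000,Elephantidae,Loxodonta africana,ivory carvings,2
2012,Manidae,Manis spp.,scales,15
"""

def filter_family(lines, family="Manidae", column="Family"):
    """Yield rows whose family column matches, ignoring case and spaces."""
    for row in csv.DictReader(lines):
        if row[column].strip().lower() == family.lower():
            yield row

manidae_rows = list(filter_family(io.StringIO(SAMPLE_CSV)))
print(len(manidae_rows), "Manidae records")
```

For the real file, pass `open(path, newline="")` instead of the `StringIO` object. The reader streams row by row, so the full file does not need to fit in memory.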

31 of 75

Difficulty: Vocabulary

  • Class: Mammalia (mammals)
  • Order: Pholidota (pangolins)
  • Family: Manidae (pangolin family)
  • Genus: Manis (pangolin genus)
  • Taxon (species)
    • 10 species in the dataset

32 of 75

Data: Filtered records for pangolins (with different species names)

Species most reported:

  • Manis javanica (Malayan pangolin, also called Sunda pangolin)
  • Manis spp. (several species; uncategorized)
  • Manis pentadactyla (Chinese pangolin)

33 of 75

Chart: CITES overview

  • Different species status?
  • E.g.:
    • Compare # of records containing ‘ivory’ and ‘manidae’

34 of 75

Chart: Reported # of records of import/export by year

Observations and discussions:

  • Why is reported import always larger?
    • Some import records may have no corresponding export record
    • Reporting issues?
    • Need to further confirm by comparing total quantities of import and export
      • Record matching and aggregation (find anomaly)

  • Who are the main importers and exporters?
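A minimal sketch of the comparison suggested above: count records and sum quantities per year, split by reporter type. The records are invented and the field names are assumptions. Quantity totals are only meaningful within a single term and unit, so filter first on real data.

```python
from collections import defaultdict

# Invented sample; reporter_type is "I" (importer report) or "E" (exporter report).
records = [
    {"year": 2000, "reporter_type": "I", "quantity": 40.0},
    {"year": 2000, "reporter_type": "E", "quantity": 40.0},
    {"year": 2000, "reporter_type": "I", "quantity": 10.0},
    {"year": 2001, "reporter_type": "I", "quantity": 5.0},
]

def summarise(rows):
    """Return {year: {"I": [n_records, total_qty], "E": [n_records, total_qty]}}."""
    summary = defaultdict(lambda: {"I": [0, 0.0], "E": [0, 0.0]})
    for r in rows:
        cell = summary[r["year"]][r["reporter_type"]]
        cell[0] += 1
        cell[1] += r["quantity"]
    return dict(summary)

summary = summarise(records)
for year in sorted(summary):
    print(year, "import:", summary[year]["I"], "export:", summary[year]["E"])
```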

35 of 75

Chart: Multi-line chart for reported records of import of different countries by year

Top 10:

Observation:

  • ‘Largest’ importer is US
  • China/HK have rather small # of import records
  • For scales (later in the slides), China and HK are the top importers.
  • In connection with Janet’s report: the US, JP and other countries besides CN could be responsible for pangolin trade as well

Future idea:

  • Import activity/ quantity by term (*quantity)
    • Map visualisation
    • Distribution of different terms to different countries

36 of 75

Chart: Multi-line chart for reported records of export of different countries by year

Top 10

IT: Italy, SG: Singapore, TG: Togo

  • Many countries report imports more often than exports
  • By comparison, Japan reports exports much more often than imports

  • US export terms

37 of 75

Chart: Multi-line chart for # of reported records of import of different pangolins by year

  • Reported import records of Manis javanica flatten at some point between 1995 and 2000
  • Manis pentadactyla was most reported between 1980 and 1985
    • What happened at these points in time?
  • Manis spp. (uncategorized) imports increase after 2005
    • Which species are the Manis spp.?
    • Could other species be reported as Manis spp.?

38 of 75

Chart: Multi-line chart for reported records of export of different pangolins by year

  • The growth trend of import records resembles that of export records, except for reports of Manis spp.
    • Which exporting countries report Manis spp.? [TBC]

39 of 75

Chart: import by terms

Top 10

  • Skins import has the largest number of records
  • Shoes import takes the second place
  • The number of scales import reports has been increasing faster since 2010

40 of 75

Chart: export by terms

Top 10

  • Similar trend as import

41 of 75

Difficulty: Multiple Units

  • E.g.: the total amount of skins is hard to calculate
  • *term<->individual calculator [paul]

* If no unit is shown, the figure represents the total number of specimens
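A minimal sketch of one way to cope with mixed units: convert what can be converted (g to kg) and keep everything else in separate buckets instead of adding unlike units together. The rows are invented, and the conversion table covers mass only as an assumption; lengths, boxes and cartons are left untouched.

```python
# Conversion factors into kilograms. Only mass units can be merged safely.
TO_KG = {"kg": 1.0, "g": 0.001}

def total_by_unit(rows):
    """Sum (quantity, unit) pairs, merging mass units into kg."""
    totals = {}
    for quantity, unit in rows:
        if unit in TO_KG:
            key, value = "kg", quantity * TO_KG[unit]
        else:
            # A missing unit means the figure is a number of specimens.
            key, value = unit or "specimens", quantity
        totals[key] = totals.get(key, 0.0) + value
    return totals

rows = [(1500.0, "g"), (2.5, "kg"), (40.0, None), (3.0, "carton")]
totals = total_by_unit(rows)
print(totals)
```

Turning cartons or skins into a number of individual animals needs an external conversion rule, which this sketch does not attempt.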

42 of 75

Focus on scales

They are also in great demand in southern China and Vietnam because their meat is considered a delicacy and some believe that pangolin scales have medicinal qualities

(wiki)

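Traditional Chinese medicine holds that pangolin scales promote blood circulation and menstruation, or lactation in new mothers, but this has not been proven; like rhinoceros horn, it is only traditional superstition. The main component of the scales is keratin, similar to human fingernails. Moreover, during illegal poaching, pangolins are injected with various sedatives, stimulants, heavy metals and the like for ease of transport and other reasons, which damages the liver and kidney function of those who eat them.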

(wiki)

43 of 75

Chart: Import/Export of Scales (kg)

  • # of export reports is larger than the # of import reports until 2015
    • Which countries are these?
  • A large quantity of scales imports has been reported since 2015
    • By whom?
    • From where to where? [TBC]

44 of 75

Charts: Scales import quantity by country

  • Singapore, Hong Kong and China were the main importers around 1995
  • China and Hong Kong import most of the scales after 2010

45 of 75

Charts: Scales export quantity by country

  • Hong Kong and China have been importing but not exporting since 2010, so they could be consumers
  • Singapore exports but does not import around 2010
  • Malaysia and Singapore were the main exporters around 1995
  • Uganda is a new exporter

46 of 75

Chart: by purpose study

  • T - Commercial
  • P - Personal
  • S - Scientific
  • E - Educational
  • Z - Zoo
  • L - Law enforcement / judicial / forensic
  • Q - Circus or travelling exhibition
  • M - Medical (including biomedical research)
  • H - Hunting trophy
  • B - Breeding in captivity or artificial propagation

47 of 75

Charts: Scales import quantity (kg) by purpose

  • Legend:
    • T - Commercial
    • P - Personal
    • L - Law
    • S - Scientific
  • Questions:
    • Why the quiet period from the late 90s to 2015?
  • Observation:
    • Mostly for commercial use

48 of 75

Charts: Scales export quantity (kg) by purpose

  • Legend:
    • T - Commercial
    • E - Educational
    • L - Law
    • S - Scientific
  • Questions:
    • Why the quiet period from the late 90s to 2015?
  • Observation:
    • Mostly for commercial use

49 of 75

Data: match/ mismatch between importer/ exporter report

  • How do we define a matching case?
  • 331 completely matching cases (depth=1 search; all columns match except reporter type), e.g.:

year | term      | quantity | unit | importer | exporter | origin | purpose | source | reporter_type_x | reporter_type_y
2000 | specimens | 40.0     | NaN  | US       | CF       | NaN    | S       | W      | I               | E

  • 598 roughly matching cases found (depth=1 search; ignoring origin, purpose, and source)
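A minimal sketch of one possible matching rule, not necessarily the one used for the counts above (it does not model the "depth=1 search"). Records are grouped on a tuple of key columns, and a key counts as matched when both an importer report and an exporter report share it. The first pair of records repeats the example row; the rest are invented variations. Missing values are stored as `None` because two `NaN` values never compare equal.

```python
from collections import defaultdict

KEY_STRICT = ("year", "term", "quantity", "unit", "importer",
              "exporter", "origin", "purpose", "source")
KEY_ROUGH = ("year", "term", "quantity", "unit", "importer", "exporter")

def find_matches(records, key_fields):
    """Return the keys reported by both an importer ("I") and an exporter ("E")."""
    groups = defaultdict(set)
    for r in records:
        groups[tuple(r[f] for f in key_fields)].add(r["reporter_type"])
    return [key for key, types in groups.items() if {"I", "E"} <= types]

base = {"year": 2000, "term": "specimens", "quantity": 40.0, "unit": None,
        "importer": "US", "exporter": "CF", "origin": None,
        "purpose": "S", "source": "W"}
records = [
    {**base, "reporter_type": "I"},
    {**base, "reporter_type": "E"},                                # complete match
    {**base, "year": 2001, "reporter_type": "I"},
    {**base, "year": 2001, "purpose": "T", "reporter_type": "E"},  # rough match only
    {**base, "year": 2002, "reporter_type": "I"},                  # no counterpart
]

strict = find_matches(records, KEY_STRICT)
rough = find_matches(records, KEY_ROUGH)
print(len(strict), "complete;", len(rough), "rough")
```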

50 of 75

Data conflict between quantities reported by importer & exporter

  • Two reports, one from the importer and one from the exporter, with identical trade info except for quantity
    • For example:
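A minimal sketch of how such conflicts could be listed: group on every column except quantity, then keep the keys where both sides reported and their quantities differ. The records, including the country codes `X1` and `X2`, are invented placeholders.

```python
from collections import defaultdict

# All trade columns except quantity and reporter type.
KEY = ("year", "term", "unit", "importer", "exporter",
       "origin", "purpose", "source")

def quantity_conflicts(records):
    """Return {key: {"I": quantities, "E": quantities}} where the two sides disagree."""
    seen = defaultdict(lambda: {"I": set(), "E": set()})
    for r in records:
        seen[tuple(r[f] for f in KEY)][r["reporter_type"]].add(r["quantity"])
    return {k: v for k, v in seen.items()
            if v["I"] and v["E"] and v["I"] != v["E"]}

base = {"year": 2010, "term": "scales", "unit": "kg", "importer": "X1",
        "exporter": "X2", "origin": None, "purpose": "T", "source": "W"}
records = [
    {**base, "quantity": 100.0, "reporter_type": "I"},
    {**base, "quantity": 80.0, "reporter_type": "E"},              # conflict
    {**base, "year": 2011, "quantity": 5.0, "reporter_type": "I"},
    {**base, "year": 2011, "quantity": 5.0, "reporter_type": "E"},  # agreement
]

conflicts = quantity_conflicts(records)
for key, sides in conflicts.items():
    print(key[0], "importer says", sides["I"], "exporter says", sides["E"])
```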

51 of 75

Data: Turn data into SQL DB and queries

  • PostgreSQL DB, can query and produce subset of tabulation by:
    • Country
    • Species
    • Term
    • Purpose
    • Importer
    • Exporter
    • etc.
  • E.g.: Taiwan exports:
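A minimal sketch of this kind of query. It swaps in an in-memory SQLite database so it runs without a server; the project DB is PostgreSQL. The table name `trade`, its columns and all rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trade (year INTEGER, taxon TEXT, term TEXT, "
    "quantity REAL, unit TEXT, importer TEXT, exporter TEXT)"
)
# Invented rows; a NULL unit stands for a number of specimens.
conn.executemany(
    "INSERT INTO trade VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1990, "Manis pentadactyla", "skins", 10, None, "JP", "TW"),
        (1990, "Manis pentadactyla", "skins", 5, None, "US", "TW"),
        (1991, "Manis spp.", "scales", 2.0, "kg", "HK", "TW"),
        (1991, "Manis javanica", "skins", 7, None, "TW", "MY"),
    ],
)

# Tabulate exports of one country by year, term and unit.
rows = conn.execute(
    """
    SELECT year, term, unit, COUNT(*) AS n_records, SUM(quantity) AS total
    FROM trade
    WHERE exporter = ?
    GROUP BY year, term, unit
    ORDER BY year, term
    """,
    ("TW",),
).fetchall()

for row in rows:
    print(row)
```

Grouping by unit as well as term keeps quantities in different units from being summed together.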

52 of 75

Idea: Map: trading activities on a globe

53 of 75

Idea: identify source/ transit/ destination by data

  • From import/ export behaviour:
    • Source
    • Transit
    • Manufacturing
    • Destination
  • Different behaviours by terms/ parts

Clustering.

Features: (aggregate by year; unify unit)

  • Term_n - Import
  • Term_n - Export
  • Term_n - (Import - Export)
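A minimal sketch of how the feature vectors could be built before any clustering is applied. It assumes units are already unified and each shipment appears once; it sums across all years for brevity (add the year to the key for per-year features). Country codes `X1` to `X3` and all quantities are invented.

```python
from collections import defaultdict

def build_features(records, terms):
    """Return {country: [import, export, import - export for each term]}."""
    imports, exports = defaultdict(float), defaultdict(float)
    countries = set()
    for r in records:
        imports[(r["importer"], r["term"])] += r["quantity"]
        exports[(r["exporter"], r["term"])] += r["quantity"]
        countries.update((r["importer"], r["exporter"]))
    features = {}
    for country in sorted(countries):
        vector = []
        for term in terms:
            i, e = imports[(country, term)], exports[(country, term)]
            vector.extend([i, e, i - e])
        features[country] = vector
    return features

records = [
    {"importer": "X1", "exporter": "X2", "term": "scales", "quantity": 100},
    {"importer": "X3", "exporter": "X1", "term": "scales", "quantity": 90},
    {"importer": "X3", "exporter": "X2", "term": "skins", "quantity": 5},
]
features = build_features(records, ["scales", "skins"])
for country, vector in features.items():
    print(country, vector)
```

In this toy data X2 only exports (source-like), X3 only imports (destination-like), and X1 imports and re-exports nearly the same amount (transit-like). A clustering method would be expected to separate these profiles.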

54 of 75

Data: Appendix number II → I

  • Does the change affect the trading volume?

All eight pangolin species were uplisted to CITES Appendix I in 2016

55 of 75

Data: request/ enquire 2018?

Call for help!

  • Updates:

56 of 75

Illegal trade seizures: Pangolins

57 of 75

https://eia-international.org/wildlife/wildlife-trade-maps/illegal-trade-seizures-pangolins/

Illegal trade seizures: Pangolins

Pangolins are one of the most illegally traded species on the planet, killed for their meat and scales.

58 of 75

Data: request data from the organisation

Call for help!

59 of 75

Idea: CrossDB: Compare with "China Judgements Online Database"

60 of 75

China Judgements Online Database

中国裁判文书网

http://wenshu.court.gov.cn/

61 of 75

Overview

62 of 75

  • Can be downloaded in a batch
  • One document can include multiple case verdicts
  • NLP is needed to turn it into structured format

63 of 75

Data: status of data structuring

  • Jiaming

64 of 75

Data: research the structured service

  • 50,000+ court cases unveiling the reasons why people divorce
    • Article link
  • Judgements scraped from China Judgements Online by Fagougou
  • Fagougou is an AI-based company that provides legal service solutions

65 of 75

HKU China seizure study

66 of 75

  • 200 records
  • province/ city level of China
  • Pairwise transport record

67 of 75

Proposal: Data enrichment

68 of 75

Proposal: Data enrichment

69 of 75

HK Customs

70 of 75

Data Overview

  • 2004 - 2018 (max)
  • #: 2000+
  • Date of release
  • Article content

71 of 75

Workshop

72 of 75

Investigate/ answer questions unveiled by data

73 of 75

Generate questions together

74 of 75

Key takeaways

75 of 75

Roadmap

  • A CITES-like dataset, regarding non-trading behaviours, e.g. seizures, trafficking, poaching, etc.
    • Follow the HKU seizure format as a start
  • Data battle plan: link
    • Group 1: @All, Research
    • Group 2: @Roy, make chart templates/ samples from CITES; prepare for future use
    • Group 3: @All @Roy @Pili, share your link/ article and we can identify data opportunities