Data for Pangolins

Pili Hu (Lecturer, HKBU)

Roy Tang (JOUR+CS grad., HKBU)

2019.06.01 @HKU

Workshop facilitators

Pili Hu

Mr. Hu teaches data journalism and media technology at HKBU. Before joining HKBU, he served as the Chief Technology Officer of Initium Media, responsible for developing technologies to power the fast-growing digital media outlet. He founded Initium Lab, whose data-driven news report on Hong Kong Legislative Council voting patterns won the SOPA "Excellence in Information Graphics" award in 2016. He was co-founder and CEO of HyperLab, which produced a cross-social-network search engine. He holds an MPhil in Information Engineering from CUHK. Before moving to Hong Kong, he worked for Baidu in Beijing as a core algorithm R&D engineer, responsible for user behaviour and link analysis.

Roy Tang

Roy is a recent graduate in Journalism and Computer Science from Hong Kong Baptist University. With a strong interest in data-driven reporting, he has participated in multiple data journalism projects and competitions. Roy studies and practises data analysis, data visualisation, and web front-end development intensively. He took part in the School of Communication's research on the Hong Kong Digital Media Report 2018 as a research assistant and web developer.

  • 30 min - A primer on data
    • Reach a rough consensus of data collaboration
  • 30 min - overview of current pangolin data
  • 30 min - (the true) workshop!
    • Identify data opportunities

Questions of public interest

Data answers

Journalism answers





Crash Course on Data

Hierarchy of details: Number/ stats/ data

Numbers -- You can get them by interviewing experts, quoting from other articles, … Only useful for writing articles.

Statistics -- Usually obtained from research reports/ surveys. You can analyse trends, extract knowledge and draw infographics.

Data -- “Many datum”. Cannot be consumed by humans before processing. Useful for data mining, data visualisation, generative art.

Common Types of (Table) Dataset

Flattened dataset

  • Normalised Dataset
  • List of records
  • Long table

* Normalisation is a more stringent concept in database design. A flattened dataset is not necessarily normalised (e.g. it can be a join of multiple tables)


  • Unnormalised Dataset
  • Pivoted table
  • Wide table

* Unnormalised datasets come in many formats and are not necessarily crosstabs. In practice, crosstabs are the most commonly seen, especially from government census departments.
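The relationship between a long (flattened) table and a wide (pivoted) table can be sketched in a few lines of Python. This is only an illustration: the rows, the column names (`year`, `camp`, `median_age`) and the helper functions `pivot` and `unpivot` are made up for this example and are not taken from the workshop datasets.

```python
from collections import defaultdict

# Hypothetical long table: one record per (year, camp) observation.
long_rows = [
    {"year": 2011, "camp": "A", "median_age": 45},
    {"year": 2011, "camp": "B", "median_age": 50},
    {"year": 2015, "camp": "A", "median_age": 39},
    {"year": 2015, "camp": "B", "median_age": 49},
]


def pivot(rows, index, column, value):
    """Long -> wide: build {index_value: {column_value: value}}."""
    table = defaultdict(dict)
    for row in rows:
        table[row[index]][row[column]] = row[value]
    return dict(table)


def unpivot(table, index, column, value):
    """Wide -> long: emit one record per cell of the crosstab."""
    return [
        {index: i, column: c, value: v}
        for i, cells in table.items()
        for c, v in cells.items()
    ]


wide = pivot(long_rows, "year", "camp", "median_age")
for year, cells in wide.items():
    print(year, cells)

# Going back to the long form recovers the original records.
restored = unpivot(wide, "year", "camp", "median_age")
```

The long form is easier to filter and aggregate; the wide form is easier for a human to read, which is probably why statistical offices tend to publish crosstabs.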

Flattened dataset

A fraction of 4300+ records of district council election candidates from 1999 to 2015 (source: Initium)

Crosstab dataset

Median age by (year, gender) X (camp)

<40 highlighted in red;

> 48 highlighted in green

Where do we find stats?

Tabulation services:

(large and high-dimensional dataset sliced/ aggregated into smaller tables)




Stats/ charts search engine:

(large collection of small and maybe unrelated stats/charts)



Logical layers in telling a data story

  • Observation (factual)
    • Citing extreme cases
      • "I know a woman who is paid only half as much as a male colleague at the same rank"
    • Citing counter example
      • "Women are not necessarily paid less. For example, who..."
  • Correlation (statistical)
    • "In general, women are paid less"
      • For M/F variables, we can test the significance (p-value)
  • Causality (philosophical/ empirical)
    • "Being female causes her to be paid less"
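One way to attach a p-value to the correlation layer is a permutation test: shuffle the gender labels and check how often a pay gap at least as large as the observed one appears by chance. The sketch below is an assumption-laden illustration, not the method used in the workshop. The pay figures are invented, and the helper `permutation_p_value` is a hypothetical name. It enumerates every possible relabelling, which is only feasible for tiny samples.

```python
from itertools import combinations
from statistics import mean

# Made-up monthly pay figures (in thousands), for illustration only.
male_pay = [52, 55, 60, 58, 61]
female_pay = [45, 48, 50, 47, 53]


def permutation_p_value(group_a, group_b):
    """One-sided exact permutation test on the difference of means (a - b)."""
    observed = mean(group_a) - mean(group_b)
    pooled = group_a + group_b
    indices = range(len(pooled))
    extreme = 0
    total = 0
    # Try every way of relabelling len(group_a) of the pooled values as "a".
    for chosen in combinations(indices, len(group_a)):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if mean(a) - mean(b) >= observed - 1e-9:
            extreme += 1
    return extreme / total


p_value = permutation_p_value(male_pay, female_pay)
print(f"one-sided p-value: {p_value:.4f}")
```

A small p-value only supports the statistical claim ("in general, women are paid less" in this toy sample). It says nothing about the causal layer.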

Pattern recognition & Anomaly Detection

  • Prove “common sense”:
    • Find patterns
    • Discover trends
    • Interpolate history
    • Predict future
    • “Explanatory News”
  • Disprove “common sense”:
    • Anomaly detection
    • “News”



Traffic distribution on web and on Facebook by channels. Real data from Initium Media, anonymised. (Pili Hu @ DNN, Dec 2015)

Data-driven vs. story-driven

  • Story-driven:
    • From news to data news
    • Visualisation is mostly an “add-on”
    • Most visualisations are charting
    • Data is mostly useful for background/ overview
    • Key: start from the original work
  • Data-driven:
    • From research report to mass communication campaign
      • Key: master the methodology for social research
    • From data to insights to story
      • Key: master the methodology of data mining; exploratory analysis

Berkeley admission record


  • Calculate the by-gender admission rate at department level
  • Calculate the by-gender admission rate at university level

How do we interpret this phenomenon:

  • It is a matter of the preference in choosing majors
    • Distribution of population
  • Simpson's Paradox
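The two calculations above, and the way they can disagree, can be sketched with a toy example. The counts below are invented to make the paradox easy to see; they are NOT the Berkeley figures, and the department names are placeholders.

```python
# Made-up counts per department: gender -> (applicants, admitted).
applications = {
    "Dept X": {"male": (100, 80), "female": (20, 18)},
    "Dept Y": {"male": (20, 4), "female": (100, 30)},
}


def rate(applicants, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants


# Department level: the admission rate of each gender within each department.
dept_rates = {
    dept: {gender: rate(*counts) for gender, counts in by_gender.items()}
    for dept, by_gender in applications.items()
}


def overall_rate(gender):
    """University level: pool applicants and admissions across departments."""
    applicants = sum(d[gender][0] for d in applications.values())
    admitted = sum(d[gender][1] for d in applications.values())
    return rate(applicants, admitted)


for dept, rates in dept_rates.items():
    print(dept, rates)
print("overall male:", overall_rate("male"))
print("overall female:", overall_rate("female"))
```

In this toy data women have the higher admission rate in every department, yet the lower rate overall, because most women applied to the department that admits few applicants.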

Data literacy lab 101 dataset: download here.

"5 data literacy cases" (Chinese)

Same dataset: two different stories

  • Gender bias against male?
  • Gender bias against female?

Anscombe's quartet (Correlation and visual signs)


  • Calculate basic stats:
    • =AVERAGE()
    • =VAR()
  • Use Data Analysis Pack to calculate the correlation
  • Visualise by scatter plot
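For readers without a spreadsheet at hand, the same basic statistics can be computed in Python. The sketch below uses the published values of Anscombe's quartet; `statistics.variance` is the sample variance, which corresponds to `=VAR()`. The helper `pearson` is written out by hand here rather than taken from a library.

```python
from math import sqrt
from statistics import mean, variance

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


summary = {
    name: {
        "mean_x": mean(xs),
        "var_x": variance(xs),
        "mean_y": mean(ys),
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
    }
    for name, (xs, ys) in quartet.items()
}

for name, stats in summary.items():
    print(name, {k: round(v, 2) for k, v in stats.items()})
```

The four summaries come out nearly identical, which is the point of the exercise: only the scatter plots reveal how different the four datasets are.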

Motivation for visualisation

  • Although seeing is not necessarily believing, it is still better to see more
    • More data points
    • More details
    • More facets
    • More angles
  • Statistical argument is hard in general
    • Even if you can make one, the causal relationship is still uncertain
  • Common approach: visualise the data and leverage empirical / logical reasoning to reach (actionable) insights.

The accuracy of visual elements

J. D. Mackinlay, 1986, “Automating the Design of Graphical Presentations of Relational Information” http://www2.parc.com/istl/groups/uir/publications/items/UIR-1986-02-Mackinlay-TOG-Automating.pdf

Pangolin data - status quo

Status update: 20190601

  • Data Catalog: https://github.com/Roytangrb/pangolin
  • Summarised 6 datasets: worldwide, Hong Kong, China, Nepal
    • Welcome more!
  • 2G CITES data loaded into a SQL DB (may put on cloud later)
  • Exploratory analysis of CITES

Catalogue overview

Please feel free to send anything to @Roy Tang. He will keep things in order.

Datasets summary


Overview/ Nature



2G, 20M+ records, structured

Data analyst

EIA Seizures

Raw map data exists but is not open

Request data

China Judgements DB

Can batch download


HKU Seizures

Google news (manual scrape; coding)

Scraper / Coder

HK Customs




Ad hoc information

Data Researcher

Today's objective: Identify data opportunities


  • Exploratory analysis/ visualisation of single dataset
    • Stats/ charts brainstorm in the latter part of the workshop
  • Cross-DB opportunities
  • Enrichment of existing dataset
    • Coverage (add data points): Geography, time, etc
    • Attribute (add variables): may need research, transformation, etc


  • Visualise numbers/ stats in existing articles
  • Enrich numbers/ stats in existing articles

A multilateral treaty to protect endangered plants and animals




  • Full DB
  • 2.4G in CSV
  • 20 million+ records

Column overview

  • Column name:
    • Year
    • Appendix
      • All eight pangolin species were uplisted to CITES Appendix I in 2016
    • taxon, class, order, family
    • term, quantity, unit
    • importer, exporter, origin, purpose, source
    • reporter type (Importer/Exporter)
  • Data format:
    • Year [date]
    • Quantity [number]
    • Term: {scales, leather, skins, meat...}
    • Unit: {kg, g, cm, m, box, carton ...}
    • Others: Code representation

Compare two sources

  • Top: 2G raw data format
    • Reporter Type
  • Bottom: tabulation output from website (TBC)
    • Aggregate by keys
    • Compare importer/ exporter

Manidae subset from CITES

  • Total records in CITES: 20 million+
  • Subset of CITES dataset: Manidae
    • 3631 Manidae records
      • Remark: data downloaded via the CITES query API is not raw and contains only 1600 records
    • View/ Download: https://github.com/Roytangrb/pangolin/blob/master/CITIES%20Analysis/manidae.csv
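Extracting a family-level subset from a multi-gigabyte CSV does not require loading the whole file into memory; rows can be streamed and filtered one at a time. The sketch below illustrates the idea on a tiny in-memory sample. The column names (`Family`, `Taxon`, and so on) and the sample rows are assumptions for illustration and may not match the headers of the actual CITES export.

```python
import csv
import io

# Tiny made-up sample standing in for the full CSV export.
SAMPLE_CSV = """Year,Taxon,Family,Term,Quantity,Unit
2012,Manis javanica,Manidae,scales,10,kg
2012,Loxodonta africana,Elephantidae,ivory carvings,3,
2014,Manis spp.,Manidae,skins,25,
"""


def filter_family(lines, family="Manidae"):
    """Stream rows one at a time and keep those whose Family matches."""
    for row in csv.DictReader(lines):
        if row["Family"].strip().lower() == family.lower():
            yield row


manidae_rows = list(filter_family(io.StringIO(SAMPLE_CSV)))
for row in manidae_rows:
    print(row["Year"], row["Taxon"], row["Term"], row["Quantity"], row["Unit"])
```

For a real file, an open file handle (opened with `newline=""`) can be passed in place of the `io.StringIO` object, since `csv.DictReader` accepts any iterable of lines.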

Difficulty: Vocabulary

  • Class: Mammalia (mammals)
  • Order: Pholidota (pangolins)
  • Family: Manidae (pangolin family)
  • Genus: Manis (pangolin genus)
  • Taxon (Species)
    • 10 species in the dataset

Data: Filtered records for pangolins (with different species names)

Species most reported:

  • Manis javanica (Malayan pangolin)
  • Manis spp. (several species; uncategorized)
  • Manis pentadactyla


Chart: CITES overview

  • Different species status?
  • E.g.:
    • Compare # of records containing ‘ivory’ and ‘manidae’

Chart: Reported # of records of import/export by year

Observations and discussions:

  • Why is reported import always larger?
    • Some import records may have no corresponding export record
    • Reporting issues?
    • Needs confirmation by comparing the total quantities of import and export
      • Record matching and aggregation (to find anomalies)

  • Who are the main importers and exporters?
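The counts behind a chart like this can be produced by tallying records per year and per reporter type. The following is a minimal sketch on invented records; the field names `year` and `reporter_type` and the codes `"I"`/`"E"` are assumptions about how the raw data encodes the importer/exporter distinction.

```python
from collections import Counter

# Made-up records; only the two fields needed for the tally are shown.
records = [
    {"year": 2000, "reporter_type": "I"},
    {"year": 2000, "reporter_type": "I"},
    {"year": 2000, "reporter_type": "E"},
    {"year": 2001, "reporter_type": "I"},
]


def records_per_year(rows):
    """Count importer-reported and exporter-reported records for each year."""
    counts = Counter((r["year"], r["reporter_type"]) for r in rows)
    years = sorted({r["year"] for r in rows})
    return {y: {"I": counts[(y, "I")], "E": counts[(y, "E")]} for y in years}


per_year = records_per_year(records)
# Years in which importers filed more records than exporters.
import_heavy_years = [y for y, c in per_year.items() if c["I"] > c["E"]]
print(per_year)
print(import_heavy_years)
```

Note that this counts records, not quantities, so it cannot by itself settle whether more was imported than exported.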

Chart: Multi-line chart for reported records of import of different countries by year

Top 10:


  • The ‘largest’ importer by number of records is the US
  • China/HK have a rather small # of import records
  • For scales (later in the slides), China and HK are the top importers.
  • In connection with Janet’s report: the US, JP and other countries besides CN could be responsible for pangolin trade as well

Future idea:

  • Import activity/ quantity by term (*quantity)
    • Map visualisation
    • Distribution of different terms to different countries

Chart: Multi-line chart for reported records of export of different countries by year

Top 10

IT: Italy, SG: Singapore, TG: Togo

  • Many countries report imports more often than reporting exports
  • Relatively, Japan reports exports much more than imports

  • US export terms

Chart: Multi-line chart for # of reported records of import of different pangolins by year

  • Reported import records of Manis javanica flatten at some point between 1995 and 2000
  • Manis pentadactyla is most reported between 1980 and 1985
    • What happened at these points in time?
  • Manis spp. (uncategorized) imports increase after 2005
    • What species are the Manis spp. ?
    • Other species may be reported as Manis spp.?

Chart: Multi-line chart for reported records of export of different pangolins by year

  • The growth trend of import records resembles that of export records, except for Manis spp.
    • Which exporting countries report Manis spp.? [TBC]

Chart: import by terms

Top 10

  • Skins import has the largest number of records
  • Shoes import takes the second place
  • The number of scales import reports has been increasing faster since 2010

Chart: export by terms

Top 10

  • Similar trend to import

Difficulty: Multiple Units

  • E.g.: the total amount of skins is hard to calculate
* If no unit is shown, the figure represents the total number of specimens
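One partial remedy is to merge the units that can be converted (for example g into kg) and keep the rest in separate buckets. The sketch below is a simplified illustration on invented values; the conversion table `TO_KG` and the helper `total_by_unit` are hypothetical, and an empty unit is treated as a count of specimens, following the note above.

```python
# Conversion factors for weight units that can be merged into kg.
TO_KG = {"kg": 1.0, "g": 0.001}


def total_by_unit(records):
    """Sum quantities, merging weight units into kg and keeping others apart."""
    totals = {}
    for quantity, unit in records:
        unit = unit.strip().lower()
        if unit in TO_KG:
            key, amount = "kg", quantity * TO_KG[unit]
        else:
            # An empty unit means the figure is a number of specimens.
            key, amount = (unit or "specimens"), quantity
        totals[key] = totals.get(key, 0) + amount
    return totals


# Made-up (quantity, unit) pairs for a single term.
sample = [(1200, "g"), (3.5, "kg"), (40, ""), (2, "m"), (10, "")]
totals = total_by_unit(sample)
print(totals)
```

Lengths, boxes and specimen counts still cannot be added to weights, so a single total per term is only possible after choosing one unit family and dropping or estimating the rest.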

Focus on scales

Chart: Import/Export of Scales (kg)

  • # of export reports is larger than the # of import reports until 2015
    • Which countries are these?
  • A large amount of scales imports has been reported since 2015
    • By whom?
    • From where to where? [TBC]

Charts: Scales import quantity by country

  • Singapore, Hong Kong and China are the main importers around 1995
  • China and Hong Kong import most of the scales after 2010

Charts: Scales export quantity by country

  • Hong Kong and China have been importing but not exporting since 2010; they could be consumers
  • Singapore exports but does not import around 2010
  • Malaysia and Singapore are the main exporters around 1995
  • Uganda is a new exporter

Chart: by purpose study

T Commercial

P Personal

S Scientific

E Educational

Z Zoo

L Law enforcement / judicial / forensic

Q Circus or travelling exhibition

M Medical (including biomedical research)

H Hunting trophy

B Breeding in captivity or artificial propagation

Charts: Scales import quantity (kg) by purpose

  • Legend:
    • T - Commercial
    • P - Personal
    • L - Law
    • S - Scientific
  • Questions:
    • Quiet period from the late 90s to 2015?
  • Observation:
    • Mostly for commercial use

Charts: Scales export quantity (kg) by purpose

  • Legend:
    • T - Commercial
    • E - Educational
    • L - Law
    • S - Scientific
  • Questions:
    • Quiet period from the late 90s to 2015?
  • Observation:
    • Mostly for commercial use

Data: match/ mismatch between importer/ exporter report

  • How to define a matching case?
  • 331 completely matching cases (depth=1 search; all columns match except reporter):
    • E.g.:

  • 598 roughly matching cases found (depth=1 search; ignoring origin, purpose, and source)
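One possible definition of a match: two records match when they agree on a chosen set of key columns and one was reported by the importer and the other by the exporter. The sketch below implements that definition on invented records. It is not the procedure that produced the counts above, the "depth=1 search" is not modelled, and the column names, the `"I"`/`"E"` reporter codes and the helper names are all assumptions.

```python
from collections import defaultdict

ALL_COLUMNS = ("year", "taxon", "term", "unit", "quantity",
               "importer", "exporter", "origin", "purpose", "source")
# Rough matching ignores origin, purpose and source.
ROUGH_COLUMNS = tuple(c for c in ALL_COLUMNS
                      if c not in ("origin", "purpose", "source"))


def find_matches(records, key_columns):
    """Return the keys reported by both an importer ("I") and an exporter ("E")."""
    reporters = defaultdict(set)
    for record in records:
        key = tuple(record[c] for c in key_columns)
        reporters[key].add(record["reporter_type"])
    return [key for key, seen in reporters.items() if seen == {"I", "E"}]


def make_record(reporter_type, *values):
    """Build a record dict from positional values (illustration helper)."""
    record = dict(zip(ALL_COLUMNS, values))
    record["reporter_type"] = reporter_type
    return record


# Made-up records: one exact pair, one rough pair, one unmatched import.
records = [
    make_record("I", 2010, "Manis javanica", "scales", "kg", 100, "CN", "MY", "", "T", "W"),
    make_record("E", 2010, "Manis javanica", "scales", "kg", 100, "CN", "MY", "", "T", "W"),
    make_record("I", 2012, "Manis spp.", "skins", "", 50, "US", "SG", "ID", "T", "W"),
    make_record("E", 2012, "Manis spp.", "skins", "", 50, "US", "SG", "", "T", "I"),
    make_record("I", 2014, "Manis spp.", "scales", "kg", 20, "HK", "NG", "", "T", "W"),
]

exact_matches = find_matches(records, ALL_COLUMNS)
rough_matches = find_matches(records, ROUGH_COLUMNS)
print(len(exact_matches), "exact;", len(rough_matches), "rough")
```

Dropping `quantity` from the key columns as well would surface pairs that agree on everything except the reported amount.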























Data conflict between quantities reported by importer & exporter

  • Two reports, one from the importer and one from the exporter, with identical trade info except quantity
    • For example:

Data: Turn data into SQL DB and queries

  • PostgreSQL DB, can query and produce subset of tabulation by:
    • Country
    • Species
    • Term
    • Purpose
    • Importer
    • Exporter
    • etc.
  • E.g.: Taiwan Exportation:
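A query of this kind can be sketched as follows. This is not the workshop's actual query or schema: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that the example runs without a server, and the table name `trade`, its columns, the sample rows and the use of `"TW"` as the exporter code are all assumptions.

```python
import sqlite3

# In-memory SQLite database standing in for the PostgreSQL instance.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trade ("
    "year INTEGER, taxon TEXT, term TEXT, quantity REAL, "
    "unit TEXT, importer TEXT, exporter TEXT)"
)

# Made-up rows for illustration.
rows = [
    (1995, "Manis pentadactyla", "skins", 10, "", "JP", "TW"),
    (1995, "Manis pentadactyla", "skins", 5, "", "US", "TW"),
    (1996, "Manis javanica", "scales", 2.5, "kg", "HK", "TW"),
    (1996, "Manis javanica", "scales", 1, "kg", "CN", "MY"),
]
conn.executemany("INSERT INTO trade VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def exports_by_term(connection, exporter):
    """Tabulate total exported quantity per (year, term, unit) for one exporter."""
    query = (
        "SELECT year, term, unit, SUM(quantity) AS total "
        "FROM trade WHERE exporter = ? "
        "GROUP BY year, term, unit "
        "ORDER BY year, term, unit"
    )
    return connection.execute(query, (exporter,)).fetchall()


taiwan_exports = exports_by_term(conn, "TW")
for row in taiwan_exports:
    print(row)
```

The same `SELECT ... GROUP BY` statement should carry over to PostgreSQL with little change, apart from the parameter placeholder style of the driver in use.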

Idea: Map: trading activities on a globe

Idea: identify source/ transit/ destination by data

  • From import/ export behaviour:
    • Source
    • Transit
    • Manufacturing
    • Destination
  • Different behaviours by terms/ parts


Features: (aggregate by year; unify unit)

  • Term_n - Import
  • Term_n - Export
  • Term_n - (Import - Export)
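The three features above can be computed per country, year and term once quantities share a unit. The sketch below assumes quantities already converted to kg and uses invented trade flows. The labelling rule in `rough_role` is a deliberately crude rule of thumb added for illustration only; it is not part of the proposal above, and a real classification would need thresholds and more context.

```python
from collections import defaultdict


def trade_features(records):
    """Per (country, year, term): total import, total export, and net import."""
    totals = defaultdict(lambda: {"import": 0.0, "export": 0.0})
    for r in records:
        totals[(r["importer"], r["year"], r["term"])]["import"] += r["kg"]
        totals[(r["exporter"], r["year"], r["term"])]["export"] += r["kg"]
    return {
        key: {**t, "net_import": t["import"] - t["export"]}
        for key, t in totals.items()
    }


def rough_role(feature):
    """Crude illustrative labelling; an assumption, not a validated method."""
    if feature["export"] == 0:
        return "destination-like"
    if feature["import"] == 0:
        return "source-like"
    return "transit-like"


# Made-up flows: MY -> SG -> CN.
records = [
    {"year": 2010, "term": "scales", "kg": 100, "importer": "SG", "exporter": "MY"},
    {"year": 2010, "term": "scales", "kg": 80, "importer": "CN", "exporter": "SG"},
]

features = trade_features(records)
roles = {key[0]: rough_role(f) for key, f in features.items()}
print(features)
print(roles)
```

Keeping the features separate per term allows a country to play different roles for different parts, for example exporting skins while importing scales.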

Data: Appendix number II → I

  • Does the change affect the trading volume?

All eight pangolin species were uplisted to CITES Appendix I in 2016

Data: request/ enquire 2018?

Call for help!

  • Updates:

Illegal trade seizures: Pangolins

Pangolins are one of the most illegally traded species on the planet, killed for their meat and scales.

Data: request data from the organisation

Call for help!

Idea: CrossDB: Compare with "China Judgements Online Database"

China Judgements Online Database



  • Can be downloaded in a batch
  • One document includes multiple case verdicts
  • NLP is needed to turn it into structured format

Data: status of data structuring

  • Jiaming

Data: research the structured service

  • 50,000+ court cases unveiling the reasons why people divorce
    • Article link
  • Judgements scraped from China Judgements Online by Fagougou
  • Fagougou is an AI-based company that provides legal service solutions

HKU China seizure study

  • 200 records
  • province/ city level of China
  • Pairwise transport record

Proposal: Data enrichment

HK Customs

Data Overview

  • 2004 - 2018 (max)
  • #: 2000+
  • Date of release
  • Article content

Investigate/ answer questions unveiled by data

Generate questions together

Key takeaways

  • A CITES-like dataset, regarding non-trading behaviours, e.g. seizures, trafficking, poaching, etc.
    • Follow the HKU seizure format as a start
  • Data battle plan: link
    • Group 1: @All, Research
    • Group 2: @Roy, make chart templates/ samples from CITES; prepare for future use
    • Group 3: @All @Roy @Pili, share your link/ article and we can identify data opportunities