Data for Pangolins
Pili Hu (Lecturer, HKBU)
Roy Tang (JOUR+CS grad., HKBU)
2019.06.01 @HKU
Workshop facilitators
Pili Hu
Mr. Hu teaches data journalism and media technology at HKBU. Before joining HKBU, he served as the Chief Technology Officer of Initium Media, responsible for developing technologies to power the fast-growing digital media outlet. He founded Initium Lab, whose data-driven news report on Hong Kong Legislative Council voting patterns won the SOPA "Excellence in Information Graphics" award in 2016. He was co-founder and CEO of HyperLab, which produced a cross-social-network search engine. He holds an MPhil in Information Engineering from CUHK. Before moving to Hong Kong, he worked for Baidu in Beijing as a core algorithm R&D engineer, responsible for user behaviour and link analysis.
Roy Tang
Roy is a recent graduate in Journalism and Computer Science from Hong Kong Baptist University. With a strong interest in data-driven reporting, he has participated in multiple data journalism projects and competitions. Roy studies and practises data analysis, data visualization, and web front-end development intensively. He took part in the School of Communication's research on the Hong Kong Digital Media Report 2018 as a research assistant and web developer.
Agenda
Questions of public interest
Data answers
Journalism answers
GIJC
today
today
today
Crash Course on Data
Hierarchy of detail: numbers / statistics / data
Numbers: obtained by interviewing experts, quoting other articles, … Only useful for writing articles.
Statistics: usually obtained from research reports and surveys. You can analyse trends, extract knowledge and draw infographics.
Data: the plural of "datum". Cannot be consumed by humans before processing. Useful for data mining, data visualisation and generative art.
Common Types of (Table) Dataset
Flattened dataset
* Normalisation is a more stringent concept from database design. A flattened dataset is not necessarily normalised (e.g. it can be a join of multiple tables).
Crosstab
* Unnormalised datasets come in many formats and are not necessarily crosstabs. In practice, crosstabs are the most commonly seen form, especially in releases from government census departments.
Flattened dataset
A fraction of 4300+ records of district council election candidates from 1999 to 2015 (source: Initium)
Crosstab dataset
Median age by (year, gender) × (camp); cells below 40 are highlighted in red, cells above 48 in green.
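To make the difference between the two table shapes concrete, here is a minimal sketch that turns a flattened dataset (one row per candidate) into a crosstab of median age by (year, gender) × (camp). The records, the camp labels "A" and "B", and the helper name `crosstab_median` are all made up for illustration; they are not taken from the Initium dataset.

```python
from collections import defaultdict
from statistics import median

# Hypothetical flattened records: one row per candidate (all values made up).
records = [
    {"year": 2011, "gender": "F", "camp": "A", "age": 38},
    {"year": 2011, "gender": "F", "camp": "A", "age": 44},
    {"year": 2011, "gender": "M", "camp": "A", "age": 52},
    {"year": 2011, "gender": "M", "camp": "B", "age": 47},
    {"year": 2015, "gender": "F", "camp": "B", "age": 35},
    {"year": 2015, "gender": "M", "camp": "B", "age": 41},
    {"year": 2015, "gender": "M", "camp": "B", "age": 49},
]


def crosstab_median(rows, row_keys, col_key, value_key):
    """Group flattened rows and return {row_label: {col_label: median}}."""
    cells = defaultdict(list)
    for r in rows:
        row_label = tuple(r[k] for k in row_keys)
        cells[(row_label, r[col_key])].append(r[value_key])
    table = defaultdict(dict)
    for (row_label, col_label), values in cells.items():
        table[row_label][col_label] = median(values)
    return dict(table)


table = crosstab_median(records, ("year", "gender"), "camp", "age")
for row_label in sorted(table):
    print(row_label, table[row_label])
```

Going in this direction loses detail: the individual ages cannot be recovered from the medians, which is one reason a flattened dataset is usually the better starting point for analysis.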
Where do we find stats?
Tabulation services:
(large and high-dimensional dataset sliced/ aggregated into smaller tables)
https://www.statcompiler.com/en/
http://www.fao.org/faostat/en/#home
Stats/ charts search engine:
(large collection of small and maybe unrelated stats/charts)
Logical layers in telling a data story
Pattern recognition & Anomaly Detection
Web
FB
Traffic distribution on web and on Facebook by channels. Real data from Initium Media, anonymised. (Pili Hu @ DNN, Dec 2015)
Data-driven vs. story-driven
Berkeley admission record
Exercise:
How do we interpret this phenomenon?
Data literacy lab 101 dataset: download here.
"5 data literacy cases" (Chinese)
Same dataset: two different stories
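One way to see how the same dataset can support two stories is to compute admission rates at two levels of aggregation. The sketch below assumes the widely circulated six-department aggregate of the 1973 Berkeley figures (the same counts shipped with R as `UCBAdmissions`); the workshop file may be organised differently, so treat the numbers as an assumption rather than as the exercise's reference answer.

```python
# (admitted, rejected) per department and gender.
# Assumption: six-department aggregate as distributed in R's UCBAdmissions.
admissions = {
    "A": {"male": (512, 313), "female": (89, 19)},
    "B": {"male": (353, 207), "female": (17, 8)},
    "C": {"male": (120, 205), "female": (202, 391)},
    "D": {"male": (138, 279), "female": (131, 244)},
    "E": {"male": (53, 138), "female": (94, 299)},
    "F": {"male": (22, 351), "female": (24, 317)},
}


def rate(admitted, rejected):
    """Share of applicants admitted."""
    return admitted / (admitted + rejected)


def overall_rate(gender):
    """Admission rate for one gender, pooled over all departments."""
    admitted = sum(dept[gender][0] for dept in admissions.values())
    rejected = sum(dept[gender][1] for dept in admissions.values())
    return rate(admitted, rejected)


overall = {g: overall_rate(g) for g in ("male", "female")}
by_dept = {
    name: {g: rate(*counts) for g, counts in dept.items()}
    for name, dept in admissions.items()
}
female_higher = [name for name, r in by_dept.items() if r["female"] > r["male"]]

print("Overall:", {g: round(r, 3) for g, r in overall.items()})
print("Departments where the female rate is higher:", female_higher)
```

Pooled over departments, the male rate is higher; broken down by department, the female rate is higher in most of them. Which level of aggregation is reported decides which story gets told.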
Anscombe's quartet (Correlation and visual signs)
Exercise:
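As a starting point for the Anscombe's quartet exercise, the sketch below computes the mean of y and the Pearson correlation for each of the four sets, using the values from Anscombe's 1973 paper. The helper `pearson` is written out by hand so that the block needs only the standard library.

```python
from math import sqrt
from statistics import mean

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# (mean of y, correlation), both rounded to two decimals.
summary = {
    name: (round(mean(ys), 2), round(pearson(xs, ys), 2))
    for name, (xs, ys) in quartet.items()
}
for name, stats in summary.items():
    print(name, stats)
```

The four summaries come out the same to two decimals, yet the scatter plots look nothing alike, which is the motivation for plotting before trusting a correlation.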
Motivation for visualisation
The accuracy of visual elements
J. D. Mackinlay, 1986, “Automating the Design of Graphical Presentations of Relational Information” http://www2.parc.com/istl/groups/uir/publications/items/UIR-1986-02-Mackinlay-TOG-Automating.pdf
Pangolin data - status quo
Status update: 20190601
Catalogue overview
Please feel free to send anything to @Roy Tang.
He will keep things in order.
Datasets summary
Dataset | Overview/ Nature | Participation |
CITES | 2 GB, 20M+ records, structured | Data analyst |
EIA Seizures | Raw data behind the map exists but has not been opened | Request data |
China Judgements DB | Can be batch downloaded | Coder |
HKU Seizures | Google News (manual scrape; coding) | Scraper / Coder |
HK Customs | Scraped | Coder |
Nepal | Ad hoc information | Data researcher |
Today's objective: Identify data opportunities
Data-driven:
Story-driven:
CITES
A multilateral treaty to protect endangered plants and animals
CITES
Column overview
Dataset Documentation: https://trade.cites.org/cites_trade_guidelines/en-CITES_Trade_Database_Guide.pdf
Compare two sources
Manidae subset from CITES
Exploratory analysis notebook: https://github.com/Roytangrb/pangolin/blob/master/CITIES%20Analysis/manidae.ipynb
Difficulty: Vocabulary
Data: Filtered records for pangolins (with different species names)
Species most reported:
(Several Species; uncategorized)
Chinese pangolin
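The vocabulary problem can be sidestepped by filtering on the family rather than on individual species names. The sketch below is a minimal illustration of that idea: the column names `Taxon` and `Family` and the sample rows are assumptions made for the example, so check them against the dataset documentation before reusing the code.

```python
import csv
import io
from collections import Counter

# Illustrative sample only; column names and rows are assumptions.
SAMPLE_CSV = """Year,Taxon,Family,Term,Quantity
2000,Manis pentadactyla,Manidae,scales,10
2001,Manis spp.,Manidae,skins,5
2001,Loxodonta africana,Elephantidae,tusks,2
2002,Manis javanica,Manidae,scales,7
2003,Manis spp.,Manidae,specimens,40
"""


def pangolin_rows(csv_file):
    """Keep only rows whose family is Manidae, whatever the species name."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row["Family"].strip().lower() == "manidae"]


rows = pangolin_rows(io.StringIO(SAMPLE_CSV))
taxon_counts = Counter(row["Taxon"] for row in rows)
print(taxon_counts.most_common())
```

For the real file, pass an opened file object instead of the `io.StringIO` sample. Counting by `Taxon` afterwards shows how many records fall under a catch-all label rather than a named species.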
Chart: CITES overview
Chart: Reported # of records of import/export by year
Observations and discussions:
Chart: Multi-line chart for reported records of import of different countries by year
Top 10:
Observation:
Future idea:
Chart: Multi-line chart for reported records of export of different countries by year
Top 10
IT: Italy, SG: Singapore, TG: Togo
Chart: Multi-line chart for # of reported records of import of different pangolins by year
Chart: Multi-line chart for reported records of export of different pangolins by year
Chart: import by terms
Top 10
Chart: export by terms
Top 10
Difficulty: Multiple Units
* If no unit is shown, the figure represents the total number of specimens
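Before quantities can be summed or charted, the units have to be reconciled. The sketch below applies the rule stated above (no unit means a count of specimens) and converts weights to kilograms. The unit labels `kg` and `g`, the sample rows, and the function name `normalise` are assumptions for illustration; other units are passed through unchanged rather than guessed at.

```python
from collections import defaultdict

# How many of each weight unit make one kilogram (assumed unit labels).
PER_KG = {"kg": 1, "g": 1000}


def normalise(quantity, unit):
    """Return (value, canonical_unit) for one reported quantity."""
    unit = (unit or "").strip().lower()
    if unit == "":
        # No unit shown: the figure is a number of specimens.
        return float(quantity), "specimens"
    if unit in PER_KG:
        return float(quantity) / PER_KG[unit], "kg"
    # Unknown units are left untouched so they can be reviewed by hand.
    return float(quantity), unit


# Hypothetical rows: (quantity, unit).
sample = [(40, None), (500, "g"), (2.5, "kg"), (3, "m")]

totals = defaultdict(float)
for quantity, unit in sample:
    value, canonical = normalise(quantity, unit)
    totals[canonical] += value

print(dict(totals))
```

Keeping specimen counts and weights in separate totals avoids adding a number of animals to a number of kilograms.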
Focus on scales
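They are also in great demand in southern China and Vietnam because their meat is considered a delicacy and some believe that pangolin scales have medicinal qualities.
(wiki)
Traditional Chinese medicine claims that pangolin scales promote blood circulation and menstruation and stimulate lactation in new mothers, but this remains unproven; as with rhinoceros horn, it is a traditional superstition. The main component of the scales is keratin, similar to that of human fingernails. Moreover, during illegal hunting, pangolins are injected with various sedatives, stimulants and heavy metals for reasons such as ease of transport, which damages the liver and kidney function of people who eat them.
(wiki, translated from Chinese)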
Chart: Import/Export of Scales (kg)
Charts: Scales import quantity by country
Charts: Scales export quantity by country
Chart: by purpose study
T Commercial
P Personal
S Scientific
E Educational
Z Zoo
L Law enforcement / judicial / forensic
Q Circus or travelling exhibition
M Medical (including biomedical research)
H Hunting trophy
B Breeding in captivity or artificial propagation
Charts: Scales import quantity (kg) by purpose
Charts: Scales export quantity (kg) by purpose
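A chart of scales by purpose boils down to a filtered group-and-sum. The sketch below uses the purpose codes listed above to label the totals; the rows and the country placeholders are made up, and the field names are assumptions about how the filtered subset is laid out.

```python
from collections import defaultdict

# Purpose codes as listed above.
PURPOSE = {
    "T": "Commercial",
    "P": "Personal",
    "S": "Scientific",
    "E": "Educational",
    "Z": "Zoo",
    "L": "Law enforcement / judicial / forensic",
    "Q": "Circus or travelling exhibition",
    "M": "Medical (including biomedical research)",
    "H": "Hunting trophy",
    "B": "Breeding in captivity or artificial propagation",
}

# Hypothetical records (all values made up).
rows = [
    {"term": "scales", "unit": "kg", "purpose": "T", "quantity": 120.0},
    {"term": "scales", "unit": "kg", "purpose": "T", "quantity": 30.0},
    {"term": "scales", "unit": "kg", "purpose": "S", "quantity": 1.5},
    {"term": "scales", "unit": "kg", "purpose": None, "quantity": 4.0},
    {"term": "skins", "unit": None, "purpose": "T", "quantity": 10.0},
]

scales_kg_by_purpose = defaultdict(float)
for r in rows:
    if r["term"] == "scales" and r["unit"] == "kg":
        label = PURPOSE.get(r["purpose"], "Unknown")
        scales_kg_by_purpose[label] += r["quantity"]

for label, total in sorted(scales_kg_by_purpose.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {total} kg")
```

Records with a missing or unlisted purpose code are kept under "Unknown" instead of being dropped, so the totals still add up to the filtered quantity.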
Data: match/ mismatch between importer/ exporter report
year | term | quantity | unit | importer | exporter | origin | purpose | source | reporter_type_x | reporter_type_y |
2000 | specimens | 40.0 | NaN | US | CF | NaN | S | W | I | E |
Data conflict between quantities reported by importer & exporter
All cases with conflicting reported quantities: https://github.com/Roytangrb/pangolin/blob/master/CITES%20Analysis/comptab_manidae.ipynb
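The comparison between importer and exporter reports can be sketched as a grouping on the shared shipment fields, followed by a check of the two reported quantities. This is one possible approach under stated assumptions, not necessarily what the linked notebook does: the field names follow the example row above, `reporter_type` is assumed to be "I" or "E", the first pair of records mirrors the example row, and the remaining records (with placeholder country codes XA, XB, XC) are made up.

```python
from collections import defaultdict

KEY_FIELDS = ("year", "term", "unit", "importer", "exporter", "origin", "purpose", "source")


def report(year, term, unit, importer, exporter, purpose, reporter_type, quantity):
    """Build one report record; origin and source are fixed for brevity."""
    return {
        "year": year, "term": term, "unit": unit, "importer": importer,
        "exporter": exporter, "origin": None, "purpose": purpose, "source": "W",
        "reporter_type": reporter_type, "quantity": quantity,
    }


reports = [
    report(2000, "specimens", None, "US", "CF", "S", "I", 40.0),
    report(2000, "specimens", None, "US", "CF", "S", "E", 40.0),
    # Made-up records with placeholder country codes.
    report(2001, "scales", "kg", "XA", "XB", "T", "I", 100.0),
    report(2001, "scales", "kg", "XA", "XB", "T", "E", 80.0),
    report(2002, "skins", None, "XA", "XC", "T", "E", 5.0),
]


def compare_reports(reports):
    """Return {key: (status, importer_quantity, exporter_quantity)}."""
    sides = defaultdict(lambda: {"I": None, "E": None})
    for r in reports:
        key = tuple(r[f] for f in KEY_FIELDS)
        side = sides[key]
        kind = r["reporter_type"]
        side[kind] = (side[kind] or 0.0) + r["quantity"]
    results = {}
    for key, side in sides.items():
        if side["I"] is None:
            status = "exporter only"
        elif side["E"] is None:
            status = "importer only"
        elif side["I"] == side["E"]:
            status = "match"
        else:
            status = "conflict"
        results[key] = (status, side["I"], side["E"])
    return results


results = compare_reports(reports)
for key, outcome in results.items():
    print(key[:5], outcome)
```

Exact equality is a strict test; for weights it may be more realistic to allow a small tolerance before calling two reports a conflict.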
Data: Turn data into SQL DB and queries
Taiwan export subset download: https://github.com/Roytangrb/pangolin/tree/master/CITES%20Analysis/subsets
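A minimal sketch of the SQL idea, using Python's built-in `sqlite3` module and an in-memory database. The table name `trade`, its columns, and the rows (with placeholder importer codes XA, XB, XC) are assumptions made for illustration; the actual subsets in the repository may use a different schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trade (
        year INTEGER,
        term TEXT,
        quantity REAL,
        unit TEXT,
        importer TEXT,
        exporter TEXT
    )
    """
)

# Hypothetical rows (all values made up).
rows = [
    (1999, "skins", 10.0, None, "XA", "TW"),
    (1999, "skins", 20.0, None, "XB", "TW"),
    (2000, "scales", 12.5, "kg", "XA", "TW"),
    (2000, "scales", 3.0, "kg", "XC", "XB"),
]
conn.executemany("INSERT INTO trade VALUES (?, ?, ?, ?, ?, ?)", rows)

# Total quantity exported by one country, per year and term.
query = """
    SELECT year, term, SUM(quantity) AS total
    FROM trade
    WHERE exporter = ?
    GROUP BY year, term
    ORDER BY year, term
"""
tw_exports = conn.execute(query, ("TW",)).fetchall()
print(tw_exports)
conn.close()
```

Passing the country code as a parameter (`?`) rather than formatting it into the query string keeps the same query reusable for any exporter.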
Idea: Map: trading activities on a globe
Idea: identify source/ transit/ destination by data
Clustering.
Features: (aggregate by year; unify unit)
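The feature step could look like the sketch below: quantities already unified to kilograms are aggregated per country and year into imported weight, exported weight, and export share. This covers feature construction only; the choice of clustering algorithm is left open. The rows, the placeholder country codes XA, XB, XC, and the function name `country_year_features` are made up for illustration.

```python
from collections import defaultdict

# Hypothetical shipments, already converted to kg (all values made up).
rows = [
    {"year": 2010, "importer": "XB", "exporter": "XA", "kg": 100.0},
    {"year": 2010, "importer": "XC", "exporter": "XA", "kg": 50.0},
    {"year": 2010, "importer": "XB", "exporter": "XC", "kg": 40.0},
    {"year": 2011, "importer": "XB", "exporter": "XA", "kg": 60.0},
]


def country_year_features(rows):
    """Return {(country, year): feature dict} aggregated from shipments."""
    imported = defaultdict(float)
    exported = defaultdict(float)
    for r in rows:
        imported[(r["importer"], r["year"])] += r["kg"]
        exported[(r["exporter"], r["year"])] += r["kg"]
    features = {}
    for key in sorted(set(imported) | set(exported)):
        total = imported[key] + exported[key]
        features[key] = {
            "imported_kg": imported[key],
            "exported_kg": exported[key],
            # 1.0 = only exports, 0.0 = only imports, in between = both.
            "export_share": exported[key] / total if total else 0.0,
        }
    return features


features = country_year_features(rows)
for key, vector in features.items():
    print(key, vector)
```

In this toy data a country that only exports gets a share of 1.0, one that only imports gets 0.0, and one that does both lands in between, which is the kind of separation a clustering step would then try to pick up.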
Data: Appendix number II → I
All eight pangolin species were uplisted to CITES Appendix I in 2016
Data: request/ enquire 2018?
Call for help!
Illegal trade seizures: Pangolins
https://eia-international.org/wildlife/wildlife-trade-maps/illegal-trade-seizures-pangolins/
Pangolins are one of the most illegally traded species on the planet, killed for their meat and scales.
Data: request data from the organisation
Call for help!
Idea: CrossDB: Compare with "China Judgements Online Database"
Overview
Data: status of data structuring
Data: research the structured service
HKU China seizure study
Proposal: Data enrichment
Browser emulation. Google search.
Resources:
HK Customs
Data Overview
Workshop
Investigate/ answer questions unveiled by data
Generate questions together
Key takeaways
Roadmap