1 of 32

Collaborative Data Science within Gov’t of Canada

Development of R libraries for common tasks with Open Canada data

Jonathan Dench, Research Analyst, Results Division, Treasury Board of Canada Secretariat
Dmitry Gorodnichy, Research Data Scientist, Chief Data Office, Canada Border Services Agency
Patrick Little, Advisor, Open Government, Treasury Board of Canada Secretariat
Joseph Stinziano, Science Analyst, Canadian Food Inspection Agency

Slido: r4gc

Statistics Canada's 2021 International Methodology Symposium
29 October 2021, Ottawa


2 of 32

Outline

  • Raison d'être
  • Vision
  • Why R (for data science collaboration)
  • GC collaborative platforms (for growing technical knowledge)
  • Key outputs (so far)
  • What’s next
  • Appendices: demos and technical details


3 of 32

Raison d'être

  • Across the GoC, we are working on the same data science problems
    • Working with the same data (e.g. Geospatial, StatCan, open.canada.ca)
    • Developing many similar visualizations, analyses and reporting tools
    • Addressing many of the same data engineering and data mining challenges

  • Challenges
    • Data scientists often end up “reinventing the wheel” and cannot keep up with the rapid development of data science tools
    • Lack of collaboration and peer review creates a risk of inefficiency and suboptimal solutions

  • Much more can be achieved if we leverage each other’s work!
    • Discussed at GC Data 2021 Conf., Data Engineering workshop


4 of 32

Vision

To ensure standardized and consistent approaches to data science across the GoC, we need:

  1. To grow and maintain our skills and knowledge base
  2. To build code and tools for common data science problems
    • Contributed, reviewed and maintained by the GoC data science community
    • Open & free - available to any data scientist who needs them

By leveraging the best of what is already available within the GC:

  1. Collaboration platforms: gccode, gccollab, gcwiki, github
  2. Programming environment: R


5 of 32

Why R?

  1. Advanced graphics with ggplot2 and its extensions
  2. Automated generation of reports, tutorials and textbooks with RMarkdown
  3. Streamlined package development with devtools
  4. Streamlined development and deployment of interactive interfaces and dashboards with Shiny
  5. “Best for geo-computation”
  6. Common tidy design shared across packages
  7. Curated, peer-tested repository of packages at CRAN
  8. RStudio IDE (Integrated Development Environment) on desktop and cloud (rstudio.cloud)
  9. Full support for and interoperability with Python from the same IDE
  10. Global RStudio-led movement for R education and advancement (rstudio.com)

https://geocompr.robinlovelace.net/intro.html#why-use-r-for-geocomputation

https://gccollab.ca/discussion/view/7404883/why-r


6 of 32

Collaborative Platforms

GC restricted:

Public facing:

  • https://github.com/open-canada
    • UNCLASSIFIED material for Lunch and Learns
    • Apps (e.g. https://open-canada.github.io/Apps/atip)
  • CRAN Views (ideal for finished packages)


11 of 32

Next steps

  • The work is in progress (and will always be!)
  • Much more lies ahead. We need your help with:
    • curating data problems and public domain solutions (code/papers)
    • curating public domain datasets
    • testing & benchmarking
    • tutorials, use cases
  • Join the community: GCcollab / GCcode groups
  • Contacts:


12 of 32

Appendices: key outputs (so far)

  • GCcode 101 for GC employees: https://gccode.ssc-spc.gc.ca/r4gc/resources/gccode101
  • R packages 101 for GC employees: https://gccode.ssc-spc.gc.ca/r4gc/gc-packages/packages101
  • How To: Interactive tutorials for various problems, built with rmarkdown / learnr
  • Geospatial analysis and visualization: use cases built with markdown
  • Data Engineering: package and App for fuzzy matching, record linking & deduplication: https://rCanada.shinyapps.io/demo
  • Interactive Shiny Apps for ATIP, PSES, COVID-19 and Border Wait Times: https://open-canada.github.io/Apps/atip (~/pses, ~/covid, ~/border)
  • Working with Open Government Portal API within R (using ckanr and adobeanalyticsr)
  • Automating R scripts to run with GitHub Actions


13 of 32

Slides below will not be presented and are for reference only.


14 of 32

R packages 101

  • Key package
    • devtools provides key functions for setting up a package, especially its directory and file structure
  • Testing code
    • Writing tests is a key skill for ensuring robust, reproducible code
      • Goal is to ensure each step of a function works properly with a reproducible example
      • E.g. Is the output of function X a list?
    • The testthat & testthis packages facilitate test writing
  • Key considerations for GoC R packages
    • Licensing
    • What can be submitted to CRAN? What are the legal implications?
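
The "is the output of function X a list?" check above can be sketched with testthat as follows. The function summarise_values() is a hypothetical stand-in for "function X" and is not part of any R4GC package; in a real package the test would normally live under tests/testthat/.

```r
library(testthat)

# Hypothetical "function X": returns a small list of summary values
summarise_values <- function(x) {
  list(n = length(x), mean = mean(x))
}

# One reproducible example, checked step by step
test_that("summarise_values() returns a list with the expected elements", {
  out <- summarise_values(c(1, 2, 3))
  expect_type(out, "list")           # is the output a list?
  expect_named(out, c("n", "mean"))  # does it have the expected names?
  expect_equal(out$mean, 2)          # is the computed value correct?
})
```

Inside a package, devtools::test() runs all such tests in one go.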


15 of 32

Geospatial Analysis in R

  • Guidance & Tutorials

    • Applied Spatial Data Analysis with R (2008) Roger Bivand et al.

    • Geocomputation with R (2021) (https://geocompr.robinlovelace.net/)

    • A series of workshops and guided code is being prepared for the R4GC group.


16 of 32

Working with Open Government Portal API (1)

  • CKAN is a widely used software package for powering open data portal catalogues (data.gov, open.canada.ca, data.gov.uk, etc.)
  • CKAN offers an API that can be used to retrieve datasets and metadata from the system, and also to create, update, and manage datasets.
  • The ckanr package offers a good developer experience for using the CKAN API within R.


17 of 32

Working with Open Government Portal API (2)
Using ckanr

Function                             | API Command                                       | ckanr function
Get information about the system     | action/status_show                                | ckan_info()
List organizations that publish data | action/organization_list                          | organization_list()
Get a list of datasets on the portal | action/package_list                               | package_list()
Retrieve the metadata for a dataset  | action/package_show/{id}                          | package_show()
Search for datasets                  | action/package_search?q={something-to-search-for} | package_search()
Create a new dataset                 | action/package_create                             | package_create()
Update an existing resource          | action/resource_patch                             | resource_patch()

Example use case: What datasets relating to COVID-19 are available on the portal?

What you can do with it
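
The COVID-19 use case might be sketched along these lines. This is an illustration, not the presenters' code: the helper names dataset_titles() and find_datasets() are hypothetical, and the portal base URL "https://open.canada.ca/data" is an assumption. Only ckanr::package_search() comes from the ckanr package itself.

```r
# Hypothetical helper: pull the dataset titles out of a package_search() result,
# which is a list with a `count` and a list of `results`
dataset_titles <- function(search_result) {
  vapply(search_result$results, function(pkg) pkg$title, character(1))
}

# Hypothetical wrapper: search the portal and return the hit count and titles
find_datasets <- function(query, rows = 10,
                          url = "https://open.canada.ca/data") {  # assumed base URL
  res <- ckanr::package_search(q = query, rows = rows, url = url)
  list(count = res$count, titles = dataset_titles(res))
}
```

With ckanr installed and network access, find_datasets("covid-19") would return the number of matching datasets and the titles of the first ten.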


18 of 32

Web Analytics in R with Adobe Analytics

  • The GC uses Adobe Analytics to measure usage on Canada.ca as well as several standalone web applications.
  • The adobeanalyticsr package enables an analyst to pull in data from Adobe Analytics to create web analytics reports within R.
  • This can be used to generate simple data extracts, but also to create Rmd reports, or power Shiny Applications.


19 of 32

adobeanalyticsr: basic usage

  • Authenticate to Adobe Analytics with an OAuth token, using the function aw_token()
  • Use the function aw_freeform_table to create a report based on the parameters you supply
  • Use the functions aw_get_metrics, aw_get_dimensions and aw_get_segments to list the available parameters
  • Analyze or visualize your data within R


20 of 32

Automating R scripts to run in GitHub Actions

  • GitHub Actions is a workflow-driven platform, free for public repositories, designed for automating software development tasks such as CI/CD.
  • GitHub Actions runs jobs on hosted runners (virtual machines that can also run Docker containers), which can be configured with different operating systems and software packages, including R.
  • This allows a user to run an R script on a cron schedule, or in response to other events such as a change to the script.
  • GitHub Actions is very useful for automating reports or other R workloads.


21 of 32

How to run an R script in GitHub Actions


  • The workflow is defined in a YAML file
  • On push: runs every time you change the code in the repository
  • On schedule: runs at the times given by a cron expression
  • r-lib setup-r sets up the Ubuntu runner with the version of R you specify
  • Rscript runs the R script you specify
  • git add ., git commit and git push save any files the script created or modified back to the GitHub repository
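
A workflow file matching the points above could look roughly like this. It is a sketch, not the file shown on the slide: the file path (.github/workflows/run-script.yml), the script name script.R, the cron time, the branch name and the pinned action versions are all assumptions chosen for illustration.

```yaml
# .github/workflows/run-script.yml (assumed path)
name: run-r-script

on:
  push:                      # runs every time code on main changes
    branches: [main]
  schedule:
    - cron: "0 6 * * *"      # runs daily at 06:00 UTC (illustrative time)

jobs:
  run:
    runs-on: ubuntu-latest
    permissions:
      contents: write        # lets the job push its output back to the repository
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
        with:
          r-version: "release"
      - name: Run the R script
        run: Rscript script.R   # script.R is a placeholder name
      - name: Save new or modified files
        run: |
          git config user.name "github-actions"
          git config user.email "github-actions@users.noreply.github.com"
          git add .
          git commit -m "Update outputs" || echo "Nothing to commit"
          git push
```

The `|| echo` guard keeps the job from failing on runs where the script produced no changes.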


22 of 32

Data Engineering
Record cleaning, deduplication and linking

https://rCanada.shinyapps.io/demo

Leverages the work of CBSA, various R packages for data cleaning and linking, and RStudio’s Shiny framework

Included use cases:

  • Web crawling: …/demo/#section-web-crawling
    • Date extraction
    • Finding nicknames and name variants


23 of 32

Record linking challenges

  • Dates : ‘20210820’ vs. ‘dob 20 Aug 2021’
  • Names: ‘Dmitry Gorodnichy’ vs. ‘Dimitri Horodnytchyyi’
  • Business Names: AC, AirCanada, Air Canada Corp.
  • Geographic Names: Ottawa, Orleans, Orléans

  • General Text : “<tag> ca$h 4 u ! Sooo… C O O L! Cant believe it ☹ ”
  • Postal: “klo 0O1” vs “K100o1”
  • Text matching: Phrase matching, topics/keywords detection
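
Two of the challenges above, dates and names, can be illustrated with base R alone. This is a minimal sketch and not the method used in the demo App: clean_date() and name_similarity() are hypothetical helpers, the date handling covers only the two formats shown on this slide (with English month names), and the similarity score is a plain edit distance scaled to the range 0 to 1.

```r
# Normalise a date written either as "YYYYMMDD" or as free text containing
# "<day> <English month name> <year>"; returns NA for anything else
clean_date <- function(x) {
  x <- trimws(x)
  if (grepl("^[0-9]{8}$", x)) {
    return(as.Date(x, format = "%Y%m%d"))
  }
  pattern <- "([0-9]{1,2})[[:space:]]+([A-Za-z]{3})[A-Za-z]*[[:space:]]+([0-9]{4})"
  parts <- regmatches(x, regexec(pattern, x))[[1]]
  if (length(parts) == 4) {
    # Look the month up in base R's month.abb to avoid locale-dependent parsing
    month <- match(tolower(parts[3]), tolower(month.abb))
    if (!is.na(month)) {
      iso <- sprintf("%s-%02d-%s", parts[4], month, parts[2])
      return(as.Date(iso, format = "%Y-%m-%d"))
    }
  }
  as.Date(NA)
}

# Similarity between two names: 1 means identical, values near 0 mean unrelated
name_similarity <- function(a, b) {
  a <- tolower(trimws(a))
  b <- tolower(trimws(b))
  distance <- drop(utils::adist(a, b))  # generalized Levenshtein distance
  1 - distance / max(nchar(a), nchar(b))
}

# Both spellings of the date resolve to the same Date value
clean_date("20210820") == clean_date("dob 20 Aug 2021")

# The two spellings of the name score higher than an unrelated name
name_similarity("Dmitry Gorodnichy", "Dimitri Horodnytchyyi")
name_similarity("Dmitry Gorodnichy", "Patrick Little")
```

A threshold on the similarity score then decides which record pairs are treated as candidate matches; choosing it is a trade-off between missed links and false links.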


24 of 32

Cleaning Dates


25 of 32

Approximate (fuzzy/probabilistic) name matching


26 of 32

Record deduplication


27 of 32

Record linking


28 of 32

NLP topic modeling in TBS ATIP data

https://open-canada.github.io/Apps/atip

Leverages the work of TBS, various R packages for text mining, and RStudio’s Shiny framework


29 of 32

Univariate and bivariate analysis of dataset variables


30 of 32

1-grams (single words)


31 of 32

Topic modeling (30 main topics): wordcloud


32 of 32

Topic modeling: graph / network view
