Collaborative Data Science within Gov’t of Canada
Development of R libraries for common tasks with Open Canada data

Jonathan Dench, Research Analyst, Results Division, Treasury Board of Canada Secretariat
Dmitry Gorodnichy, Research Data Scientist, Chief Data Office, Canada Border Services Agency
Patrick Little, Advisor, Open Government, Treasury Board of Canada Secretariat
Joseph Stinziano, Science Analyst, Canadian Food Inspection Agency

Slido: r4gc

Statistics Canada's 2021 International Methodology Symposium
29 October 2021, Ottawa
Outline
Raison d'être
Vision
To ensure standardized and consistent approaches to data science across the GoC, we need:
By leveraging the best of what is already available within the GC:
Why R?
https://geocompr.robinlovelace.net/intro.html#why-use-r-for-geocomputation
Collaborative Platforms
Next steps
Appendices: key outputs (so far)
Slides below will not be presented and are for reference only.
R packages 101
Geospatial Analysis in R
Working with Open Government Portal API (1)
Working with Open Government Portal API (2): Using ckanr
Task | API command | ckanr function |
Get information about the system | action/status_show | ckan_info() |
List organizations that publish data | action/organization_list | organization_list() |
Get a list of datasets on the portal | action/package_list | package_list() |
Retrieve the metadata for a dataset | action/package_show/{id} | package_show() |
Search for datasets | action/package_search?q={something-to-search-for} | package_search() |
Create a new dataset | action/package_create | package_create() |
Update an existing resource | action/resource_patch | resource_patch() |
Example use case: What datasets relating to COVID-19 are available on the portal?
What you can do with it
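The COVID-19 use case can be sketched with the package_search() function from the table. This is a minimal illustration, not the presenters' own script: the base URL passed to ckanr_setup(), the search term, and the rows value are assumptions, and the code needs network access to the portal.

```r
# Minimal sketch; assumes the ckanr package is installed and the portal is reachable.
library(ckanr)

# Assumed base URL of the Open Government Portal's CKAN instance.
ckanr_setup(url = "https://open.canada.ca/data/en")

# package_search() wraps the action/package_search API command.
# The query string and page size are illustrative choices.
res <- package_search(q = "covid-19", rows = 10)

# Total number of matching datasets reported by the portal.
res$count

# Titles of the datasets on this page of results
# (assumes each result carries a character `title` field).
vapply(res$results, function(pkg) pkg$title, character(1))
```

Each element of res$results holds the same metadata that package_show() returns for a single dataset, so the result list can be used to look up resources for any dataset of interest.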
Web Analytics in R with Adobe Analytics
adobeanalyticsr – basic usage
Automating R scripts to run in GitHub Actions
How to run an R script in GitHub Actions
r-lib/actions/setup-r installs the version of R you specify on the runner (Ubuntu in this example)
Rscript runs the R script you want to execute
git add .
git commit
git push
The workflow is defined in a YAML file
On push: runs every time you change the code in the repository
On schedule: runs at the times given by a cron expression
git add, commit and push save any files the script created or modified back to the GitHub repository
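The pieces named on this slide fit together roughly as in the workflow file below. It is a sketch under assumptions rather than the presenters' actual workflow: the file name script.R, the branch main, the cron time, the commit message, and the bot identity are all hypothetical placeholders.

```yaml
# .github/workflows/run-r-script.yml (hypothetical file name)
name: run-r-script

on:
  push:
    branches: [main]          # assumed branch; runs on every push to it
  schedule:
    - cron: "0 6 * * *"       # assumed schedule: every day at 06:00 UTC

permissions:
  contents: write             # lets the job push commits back to the repository

jobs:
  run-script:
    runs-on: ubuntu-latest    # Ubuntu runner
    steps:
      - uses: actions/checkout@v4
      - uses: r-lib/actions/setup-r@v2
        with:
          r-version: "release"   # the version of R to install
      - name: Run the R script
        run: Rscript script.R    # script.R is a placeholder name
      - name: Commit and push any new or changed files
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"
          git add .
          git commit -m "Update outputs" || echo "Nothing to commit"
          git push
```

The `|| echo` fallback keeps the job from failing when the script produces no changes, since git commit exits with an error when there is nothing to commit.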
Data Engineering: Records cleaning, deduplication and linking
https://rCanada.shinyapps.io/demo
Leverages the work of CBSA, various R packages for data cleaning and linking, and RStudio’s Shiny framework
Included use cases:
Record linking challenges
Cleaning Dates
Approximate (fuzzy/probabilistic) name matching
Record deduplication
Record linking
NLP topic modeling in TBS ATIP data
https://open-canada.github.io/Apps/atip
Leverages the work of TBS, various R packages for text mining, and RStudio’s Shiny framework
Univariate and bivariate analysis of dataset variables
1-grams (single words)
Topic modeling (30 main topics): wordcloud
Topic modelling: Graph / Network view