Metadata
Content: Coding Best Practices
Length: 90 minutes lecture
Purpose: Discusses the best practices for file organization, code and data handling and ways of de-identifying data
Note: this lecture has an exercise found here (Not used in 2021)
Learning goals:
Last edited on/by: 10/28/2021, Jack Cavanagh
Assessment Questions
j-pal | coding best practices
Coding Best Practices
Jack Cavanagh, J-PAL Global
J-PAL/IPA Research Staff Training, 2021
Why care about best practices?
Imagine that you join a project in between the baseline and midline
Why care about best practices?
Outline
Outline
File/Folder Structure
Exercise
In pairs, show each other the folder of the latest project you have been working on.
Let your partner guess where the latest data and code you are working on is.
File and Folder Structure Goals
[Figure: stream of binary data labeled IN and OUT]
Good Folder Structure – Example
Not-So-Good Folder Structure - Example
What if you inherit a messy folder structure?
Key points -- file/folder structure
Resources
IPA recommended folder structure (Box account required)
Pop quiz: Folder map
Which of these is not true about folder maps:
Version Control
Version Control
Version Control that works
Version Control and Archiving
Version control – just the code
De-ID data
Raw data
Clean data
Final data
Import.do
De-ID.do
Back-up Raw data
Cleaning.do
HFC.do
Corrections.do
Output
Analysis.do
Graphs.do
Resources
J-PAL Research Resource on data cleaning and management
Version control – code and (some) data
De-ID data
Raw data
Clean data
Final data
Import.do
De-ID.do
Back-up Raw data
Cleaning.do
HFC.do
Corrections.do
Output
Analysis.do
Graphs.do
Automatically add today’s date when saving data
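One way to do this in Stata is to build a date stamp from the system date and put it in the file name. This is a minimal sketch: the file name `survey_clean_...` is hypothetical, and the built-in `auto` dataset stands in for your own data.

```stata
* Build a YYYYMMDD stamp from the system date, e.g. 20211028
local today : display %tdCCYYNNDD date(c(current_date), "DMY")

sysuse auto, clear                          // stand-in for your own data
* Hypothetical file name; the stamp keeps earlier versions from being overwritten
save "survey_clean_`today'.dta", replace
display "Saved survey_clean_`today'.dta"
```

Putting the year first means the files sort chronologically in a folder listing.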
Version control with multiple collaborators
Resources
DIME Wiki: Getting Started with GitHub
Udacity course: Version control with Git
Gentzkow and Shapiro: Version control
Key points -- version control
Further resources for file structure and version control
File/folder structure:
Version control/GitHub:
Outline
Code Organization: Goals
When creating new do-files
Commenting and pseudocode
Comments and Readability
Type | Stata | R | Purpose |
Full-line comments | * | # | Use to create section dividers |
Line breaks | /// | end the incomplete line with an operator (like +, /, <-, and so on) | To break commands across multiple lines. Note: No longer needed in Stata 16 |
In-line comments | // | #[........] | Use for “markers” (e.g., //decision point) or short comments |
Block comments | /*.......... */ | Select text, Ctrl + Shift + C | Use to explain the purpose of the section of the do-file |
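The Stata column of the table can be seen in one short do-file. This is only an illustrative sketch using Stata's built-in `auto` dataset; the variable `price_per_lb` is made up for the example. Note that `///` works in do-files, not when typing in the Command window.

```stata
* ------------------------------------------------------------
* Section 1: Load data   (full-line comments as a section divider)
* ------------------------------------------------------------

/* Block comment: this section loads the built-in auto dataset
   and builds a price-per-pound variable. */
sysuse auto, clear

generate price_per_lb = price / weight   // decision point: price per pound, not per kg

* A long command broken across lines with ///
summarize price weight price_per_lb ///
    if foreign == 1, ///
    detail
```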
Comments and Readability
Pseudocode
Pseudocode
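As an illustration of the idea (not the example from the slides), the plan can be written first as comments and the commands filled in underneath each step. The task and the built-in `auto` dataset are stand-ins chosen for this sketch.

```stata
* Pseudocode, written before any commands:
*   1. Load the data
*   2. Keep only foreign cars
*   3. Compute the mean price
*   4. Store the result for later use

* 1. Load the data
sysuse auto, clear

* 2. Keep only foreign cars
keep if foreign == 1

* 3. Compute the mean price
summarize price

* 4. Store the result for later use
local mean_price = r(mean)
display "Mean price of foreign cars: `mean_price'"
```

The step comments stay in the finished do-file and double as documentation.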
Pseudocode: exercise
Header Content
Essential Header Contents
Essential Header Contents
Essential Header Contents
Essential Header Contents
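The slide bullets are not reproduced in this text, so the following is only a sketch of what a do-file header commonly contains, not the deck's own list. The bracketed fields are placeholders, `version 16` is an assumed version, and the log file name is hypothetical.

```stata
/*******************************************************************
 Project:  [project name]
 Purpose:  [what this do-file does]
 Author:   [name]          Date created: [date]
 Inputs:   [files read]
 Outputs:  [files written]
*******************************************************************/

version 16              // assumed; set to the version your team uses
clear all               // start from an empty session
set more off            // do not pause output
capture log close       // close any log left open by an earlier run
log using "example_log.smcl", replace   // hypothetical log name

* ... body of the do-file ...

log close               // typical footer: close the log
```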
Portability between Co-Authors–Option 1
Portability between Co-Authors–Option 2
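One widely used approach, which may or may not match either option on the slides, is to set a single root path per user and build every other path from it. The usernames and folder paths below are hypothetical.

```stata
* Hypothetical usernames and paths; replace with your team's own
if "`c(username)'" == "coauthor_a" {
    global root "C:/Users/coauthor_a/Dropbox/project"
}
else if "`c(username)'" == "coauthor_b" {
    global root "/Users/coauthor_b/Dropbox/project"
}
else {
    global root "`c(pwd)'"      // fallback: current working directory
}

* Every other path is defined relative to the root
global data "$root/data"
global code "$root/code"
display "Project root: $root"
```

With this setup, only the block at the top changes when a new collaborator joins; no path inside the do-files needs editing.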
Essential Footer Contents
Make a template do-file!
Master .do File
Master .do file
Master file calls all .do files in order
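A master do-file along these lines might look like the sketch below. The do-file names are taken from the diagram earlier in this deck; their order and the `code` folder are assumptions. The existence check is there so the sketch runs even without the files; in a real project you would probably stop with an error instead of skipping a step.

```stata
global code "code"      // hypothetical folder holding the do-files

* Order assumed from the diagram listing in this deck
local steps Import De-ID Cleaning HFC Corrections Analysis Graphs

local n_missing = 0
foreach step of local steps {
    capture confirm file "$code/`step'.do"
    if _rc {
        display as error "Not found, skipped: $code/`step'.do"
        local ++n_missing
        continue
    }
    do "$code/`step'.do"
}
display "Steps skipped because the file was missing: `n_missing'"
```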
Master .do file
Master .do file
Some comments on interoperability
Key points
Outline
Good Data Handling
Goals:
Guiding principle: preserve the greatest possible amount of information that does not put your subjects at risk.
Data Cleaning
Naming and Labelling Variables
Rule 1: Name variables in self-explanatory ways and label them.
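A small sketch of Rule 1 in Stata, using made-up survey variables (`q1`, `q2` and the new names are hypothetical):

```stata
clear
input q1 q2
34 1
51 0
27 1
end

* Replace questionnaire codes with self-explanatory names
rename q1 resp_age
rename q2 owns_land

* Labels carry the full description, including units
label variable resp_age  "Respondent age (years)"
label variable owns_land "Household owns agricultural land (0/1)"

describe
```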
Coding and Labeling Variable Values
Rule 2: Code and label categorical variables meaningfully.
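Rule 2 could be applied as in the sketch below. The variables, codes, and category names are invented for illustration.

```stata
clear
input owns_land educ
1 2
0 1
1 3
end

* Binary variable: 0 = No, 1 = Yes
label define yesno 0 "No" 1 "Yes"
label values owns_land yesno

* Ordered categories: codes follow the natural order
label define educ_lbl 1 "None" 2 "Primary" 3 "Secondary or higher"
label values educ educ_lbl

tabulate educ owns_land
```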
Missing Values
Rule 3: Preserve information on why data is missing.
“I can’t remember” is not the same as “I don’t know”
Missing Values in Stata
mvdecode variables, mv(-999=.d \ -998=.r)
Encode missing values in R using the dplyr package
Values | Numerical | String |
Missing (no data recorded) | “.” | “not recorded” |
Answer “don’t know” | “.d” | “don’t know” |
Refusal to answer | “.r” | “refusal” |
Conditional questions that were not asked | “.n” | “N/A” |
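The `mvdecode` command and the numerical column of the table can be tried on a toy dataset. The survey codes -999 and -998 come from the command above; the variable `income` and its values are invented for this sketch.

```stata
clear
input id income
1 1200
2 -999
3 -998
4 850
end

* Turn survey codes into extended missing values
mvdecode income, mv(-999=.d \ -998=.r)

* Label the missing codes so the reason stays visible in tables
label define income_miss .d "Don't know" .r "Refused"
label values income income_miss

tabulate income, missing
summarize income        // .d and .r are excluded from the mean
```

Because `.d` and `.r` are treated as missing, they no longer distort means or regressions, while the reason for missingness is preserved.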
Make Code Self-Documenting
Can you see any potential sources of error in this code?
Key points -- data cleaning
Resources
De-identification
De-identification: Goals
Resources
J-PAL Research Resources
Ethical conduct of randomized evaluations
Institutional review board (IRB) proposals
De-identification: Goals
Personally Identifiable Information (PII) -- Direct Identifiers
The 'official' definition of PII can differ across countries.
Be aware of which regulations apply to your project.
Resources
J-PAL Research Resource on data de-identification
Replacing Direct Identifiers
Personally Identifiable Information (PII) -- Indirect Identifiers
Examples:
De-identifying Indirect Identifiers
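The slide's own list of techniques is not reproduced in this text. Two commonly used ones, binning and top-coding, are sketched below as an assumption about what might apply; the variables, band edges, and the cap of 5000 are all hypothetical.

```stata
clear
input age income
23 1200
67 850
41 990
35 250000
end

* Binning: replace exact age with broad bands
recode age (min/29 = 1 "Under 30") (30/49 = 2 "30-49") ///
    (50/max = 3 "50 and over"), generate(age_band)

* Top-coding: cap extreme values that could single out a respondent
generate income_tc = income
replace income_tc = 5000 if income > 5000 & !missing(income)

* Keep only the coarsened versions
drop age income
list
```

The `!missing(income)` condition matters because Stata treats missing values as larger than any number, so without it missing incomes would be recoded to the cap.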
Working with Identifiable Data
→ De-identify as early as possible and consider working with encrypted data off the cloud.
Reasonable Security Practices and Procedures and Sensitive Personal Data or Information Rules in India
PII for health data (US HIPAA Privacy Rule)
De-identifying Indirect Identifiers
Resources
IDEA Handbook chapter on Statistical Privacy
De-identifying GPS Coordinates
GPS data can be de-identified in different ways:
Discuss with collaborators how to protect GPS data in your project.
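The specific methods listed on the slide are not reproduced in this text. As one possible illustration, coordinates can be coarsened by rounding; this is an assumption, not necessarily one of the slide's methods, and the coordinates below are made up. Rounding alone may not be enough where households are sparse.

```stata
clear
input double(lat lon)
12.971599 77.594566
12.975321 77.601234
end

* Coarsen to 2 decimal places: 0.01 degree of latitude is roughly 1.1 km
replace lat = round(lat, 0.01)
replace lon = round(lon, 0.01)
format lat lon %9.2f
list
```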
Key Points -- de-identification
Pop quiz: De-identification
Consider an exchange economy with two commodities and two consumers. Both consumers have homothetic preferences of the constant elasticity variety. Moreover, the elasticity of substitution is the same for both consumers and is small (i.e. goods are close to perfect complements). Now assume a dataset unrelated to any of the above. Which of these variables could be considered an indirect identifier?
Resources for De-Identification
Programs:
Resources:
Data Publication
Data publication: Goals
Resources
Code Book
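Stata can produce a first draft of a code book from the variable names and labels already in the data. A minimal sketch, using the built-in `auto` dataset as a stand-in and a hypothetical output file name:

```stata
sysuse auto, clear

* Quick on-screen overview, one line per variable
codebook, compact

* Turn the variable list into a dataset and export it
describe, replace clear
keep name type vallab varlab
export delimited using "codebook_variables.csv", replace   // hypothetical file name
```

The exported file is only as informative as the variable and value labels, which is one more reason to label everything during cleaning.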
Questionnaires
ReadMe Files
Key points -- data publication
Resources for data publication
Ten Coding Commandments
foreach x in `commandments' {
    di "`x'"
}
Questions?
Credits
The original version of this presentation was developed by IPA, based on
Pollock, H, E. Chuang and S. Wykstra. 2015. Best Practices for Data and Code Management, Innovations for Poverty Action.
The latest updates were made by Karl Rubio, Anja Sautmann, and J-PAL Global Research in January 2018. Thanks go to Aprille Knox, Rose Burnam, Samuel Solomon, and others for useful feedback and comments.
Some of the ideas are from this excellent guide:
Gentzkow, Matthew and Jesse M. Shapiro. 2014. Code and Data for the Social Sciences: A Practitioner’s Guide. University of Chicago, mimeo.
Additional Resources:
Supplemental Material
Outline
Outline
Good Folder Structure - Example 2
Good Folder Structure – Example 3
Good Folder Structure – Example 4
Outline
Version Control using GitHub
Consider using the Open Science Framework (OSF) or GitHub (which saves all versions, branches, etc.)
What does GitHub do?
Outline
Replacing Direct Identifiers
See .do file: SampleDeIdentify.do
Replacing Direct Identifiers
This file needs to be protected just like the original data: always keep it in a password-protected, encrypted volume.
Replacing Direct Identifiers
Replacing Direct Identifiers
This file is de-identified and can be saved in an un-encrypted location.
Note: ID_obs gives no information about the district – would require separate scrambling of District
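The steps on these slides could be sketched roughly as follows. This is not the contents of SampleDeIdentify.do, which is not reproduced here; the toy data, the seed, and the temporary files are assumptions made so the sketch runs on its own. Only the name `ID_obs` comes from the slides.

```stata
clear
input str20 name str10 district income
"Respondent A" "North" 1200
"Respondent B" "North"  850
"Respondent C" "South"  990
end

* Assign IDs in random order so ID_obs carries no information
set seed 20211028                   // hypothetical seed, for reproducibility
generate double shuffle = runiform()
sort shuffle
generate long ID_obs = _n
drop shuffle
isid ID_obs                         // stop if the new ID is not unique

* Crosswalk: links names to ID_obs, so it is as sensitive as the raw data
preserve
keep name ID_obs
tempfile crosswalk                  // in practice: a file on an encrypted volume
save "`crosswalk'"
restore

* De-identified file: drop the direct identifier
drop name
tempfile deidentified               // in practice: a regular project folder
save "`deidentified'"
list
```

As the note above says, the district is still present in the de-identified file and would need its own scrambling if it must be hidden.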