1 of 98

Metadata

Content: Coding Best Practices

Length: 90 minutes lecture

Purpose: Discusses best practices for file organization, code and data handling, and ways of de-identifying data

Note: this lecture has an exercise found here (Not used in 2021)

Learning goals:

Learn the best practices for file organization
Learn the best practices for code and data handling in Stata
Understand different ways of de-identifying data
Understand the various requirements for data sharing and publication

Last edited on/by: 10/28/2021, Jack Cavanagh

2 of 98

Assessment Questions

j-pal | coding best practices

3 of 98

Coding Best Practices

Jack Cavanagh, J-PAL Global

J-PAL/IPA Research Staff Training, 2021

4 of 98

Why care about best practices?

5 of 98

Imagine that you join a project in between the baseline and midline

PI says “the folder is a little messy, would you mind reorganizing it at some point?”
You think, “sure” but then–

Enumerator training gets delayed due to a bus strike
Then you find half your tablets don’t have chargers
Then you have to finish the HFC code
Then you find out you have to add a question to the survey
Etc., etc., etc.,

Highly unlikely you will get a chance to fix a messy folder structure or unclear code. Do it right the first time!

6 of 98

Why care about best practices?

Data Documentation/Organization

Golden rule: “do unto others (and yourself!) as you would have them do unto you”.
Invest early and reap the benefits later.

De-Identification of Identified Data

Researchers are bound to protect human subjects.
Navigate the tension between obscuring identities and losing information.

Data Publication/Sharing

Journals and funders may require data publication.
Reproduce research for the sake of science.
Data should be available for other researchers (incl. students, policy makers).
Make code, data, and documentation publishable from the beginning.

7 of 98

Outline

Good File Organization

Folder structure
Version Control and Archiving

Good Code Organization

File Header
Commenting
Master .do File
Programs and “for” Loops

Good Data Handling

Data Cleaning
De-Identification
Data Publishing/Sharing

8 of 98

Outline

Good File Organization

Folder structure
Version Control and Archiving

Good Code Organization

Commenting
File Header/Footer
Master .do File
Programs and “for” Loops

Good Data Handling

Data Cleaning
De-Identification
Data Publishing/Sharing

9 of 98

File/Folder Structure

10 of 98

Exercise

In pairs, show each other the folder of the latest project you have been working on.

Let your partner guess where the latest data and code you are working on is.

11 of 98

File and Folder Structure Goals

Original data should be preserved (aside from de-identification)

Keep data, code files, and results separate, and make the order and degree of data manipulation clear

Data cleaning, analysis, and producing output should be possible to do separately

Version control and archiving of previous versions

12 of 98

Good Folder Structure – Example

13 of 98

Not-So-Good Folder Structure - Example

14 of 98

What if you inherit a messy folder structure?

Ideally, re-arrange the files/folders logically

Note: Requires an extensive understanding of the current structure/relationship between files
May be too impractical/time consuming

Next best: create a folder map

Outline the key folders and which files are found within them
Outline the relationship between files (e.g., “cleaning.do takes the rawdata.csv and turns it into cleandata.dta”)
Folder maps can be used as the basis for the project’s Readme file
Ideally, the folder map should be consistently updated with a list of all datasets and code (along with the type of data, and the purpose of the dataset/code)

15 of 98

Key points -- file/folder structure

There is no one universal “good” folder structure – and there are plenty of bad ones!
Critical: define your folder structure in advance
Utilize an archive folder for storing old/extra files
Keep file/folder names simple, and avoid spaces
If you have multiple survey rounds, create a template folder structure

Resources

IPA recommended folder structure (Box account required)

IPA template folder structures

16 of 98

Pop quiz: Folder map

Which of these is not true about folder maps:

They can be used as the basis for a readme
They should be updated regularly
They can be used to outline the relationship between files and data in a project
They should always take the place of a good file/folder structure

17 of 98

Version Control

18 of 98

Version Control

19 of 98

Version Control that works

20 of 98

Version Control and Archiving

Keep the current version of each file in your working directory
Archive code files only.
Date each version using format YYYY_MM_DD (or YYYYMMDD)
To rerun: either copy back to working directory, OR use absolute file paths for inputs and outputs
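
The dating convention above can be sketched in Stata as follows. This is a minimal illustration, not the lecture's own code: the file name cleaning.do and the folder name archive are hypothetical, and the copy only runs if the working file exists.

```stata
* Build a YYYYMMDD stamp from today's date, e.g. 20211028
local today : display %tdCCYYNNDD date(c(current_date), "DMY")

* Hypothetical names: cleaning.do is the working copy, archive/ holds old versions
capture mkdir "archive"
capture confirm file "cleaning.do"
if _rc == 0 {
    copy "cleaning.do" "archive/cleaning_`today'.do", replace
}
```

The working directory keeps the undated current version; only the archived copy carries the date.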

21 of 98

Version control – just the code

[Diagram: data files (Raw data, Back-up Raw data, De-ID data, Clean data, Final data, Output) and the code files that act on them (Import.do, De-ID.do, Cleaning.do, HFC.do, Corrections.do, Analysis.do, Graphs.do)]

Resources

J-PAL Research Resource on data cleaning and management

22 of 98

Version control – code and (some) data

[Diagram: data files (Raw data, Back-up Raw data, De-ID data, Clean data, Final data, Output) and the code files that act on them (Import.do, De-ID.do, Cleaning.do, HFC.do, Corrections.do, Analysis.do, Graphs.do)]

Automatically add today’s date when saving data

In this version control system, the code files are kept under version control, as well as the data files which may change based on decisions and edits made to the cleaning, corrections, and analysis code. One reason you might choose this approach rather than the code-only version control shown on the previous slide is if the code that generates clean data, final data, or output takes a very long time to run. In this case rather than wait hours or days to re-generate the intermediate and final datasets, you can choose to archive a version of the processed data to go with archived versions of the code.

Here, you would see multiple versions of the clean data, final data, and output data files in your folder structure. The code for saving each of these data files would need to include a way to label each newly generated dataset with the date.
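
One way to label each saved dataset with the date is sketched below. It is an illustration under stated assumptions: the built-in auto dataset stands in for the cleaned data, and the file name clean_data is hypothetical.

```stata
* YYYYMMDD stamp for today's date
local today : display %tdCCYYNNDD date(c(current_date), "DMY")

sysuse auto, clear                        // stand-in for the cleaned data
save "clean_data_`today'.dta", replace    // e.g. clean_data_20211028.dta
```

Rerunning the do-file on a later day writes a new dated file instead of overwriting the earlier one.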

23 of 98

Version control with multiple collaborators

*Easier* on Github

Github will detect and display differences in files

If not using Github, consider:

Blocking time for each person to work on files
Using “temp” or “working” folders
Labeling systems that will indicate most current file versions
Tracking (major) changes outside of code
Code reviews daily/weekly/as needed to merge the code changes and ensure results remain the same

Resources

DIME Wiki: Getting Started with Github

Udacity course: Version control with Git

Gentzkow and Shapiro: Version control

24 of 98

Key points -- version control

There is no one universal “good” way to do version control–and there are plenty of bad ones!
Critical: determine version control method at the start of the project

Ask questions if you’re confused!

25 of 98

Further resources for file structure and version control

File/folder structure:

IPA recommended folder structure (Box account required)
IPA template folder structures
J-PAL Research Resource on data cleaning and management

Version control/github:

J-PAL Research Resource on data cleaning and management
DIME Wiki: Getting Started with Github
Udacity course: Version control with Git
Gentzkow and Shapiro: Version control

26 of 98

Outline

Good File Organization

Folder structure
Version Control and Archiving

Good Code Organization

Commenting
File Headers
Master .do File
Programs and “for” Loops

Good Data Handling

Data Cleaning
De-Identification
Data Publishing/Sharing

27 of 98

Code Organization: Goals

Reader: understand and push forward analysis

Remember what you did (when PI asks about it 4 weeks later)
Continue analysis easily

Writer: accuracy, efficiency, minimize error

Avoid repetitive work, errors, missed steps, or unnecessary re-testing

28 of 98

When creating new do-files

Remember your folder structure! Put it in a logical location
Name the do-file descriptively (e.g., “import_cleaning.do”)
Add the do-file to your file map

29 of 98

Commenting and pseudocode

30 of 98

Comments and Readability

Type	Stata	R	Purpose
Full-line comments	*	#	Use to create section dividers
Line breaks	///	End the incomplete line with an operator (such as +, /, or <-)	Use to break commands across multiple lines. Note: No longer needed in Stata 16
In-line comments	//	# ...	Use for “markers” (e.g., //decision point) or short comments
Block comments	/* ... */	Select text, Ctrl + Shift + C	Use to explain the purpose of the section of the do-file
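
The Stata comment types in the table can be seen together in a short do-file fragment. This is a minimal sketch using the built-in auto dataset; the section names are made up for illustration. Note that // and /// only work inside do-files, not in the Command window.

```stata
/* Block comment: this section loads the example data
   and summarizes a few variables for domestic cars. */

* ------------------ Section 1: load data ------------------
sysuse auto, clear

* ------------------ Section 2: summarize ------------------
summarize price mpg weight ///
    if foreign == 0          // decision point: domestic cars only
```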

31 of 98

Comments and Readability

32 of 98

Pseudocode

33 of 98

Pseudocode

34 of 98

Pseudocode: exercise

Your PI sends you a folder of 35 datasets named dataset_1.csv, dataset_2.csv, etc.
They know that one of them contains a variable that is just the string “Coding Best Practices Rockz”, and they know this dataset is the one they are looking for
However, they don’t have time to look through all 35 datasets
Write out pseudocode for an easy way to figure out which dataset the PI is looking for
Paste your answer in the chat if comfortable!
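
One possible answer, offered as a sketch rather than the expected solution: the pseudocode is written as comments, followed by a Stata translation. It assumes the files sit in the current working directory and that the target variable contains only the target string; files that cannot be read are skipped.

```stata
* Pseudocode:
*   for each file dataset_1.csv ... dataset_35.csv
*       import the file, reading every column as a string
*       for each variable, count the observations equal to the target string
*       if every observation matches, report the file and the variable
local target "Coding Best Practices Rockz"

forvalues i = 1/35 {
    capture import delimited using "dataset_`i'.csv", clear stringcols(_all)
    if _rc != 0 {
        continue                      // file missing or unreadable: skip it
    }
    foreach v of varlist _all {
        quietly count if `v' == "`target'"
        if r(N) == _N & _N > 0 {
            display "Found in dataset_`i'.csv, variable `v'"
        }
    }
}
```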

35 of 98

Header Content

36 of 98

Essential Header Contents

37 of 98

Essential Header Contents

38 of 98

Essential Header Contents

39 of 98

Essential Header Contents

40 of 98

Portability between Co-Authors–Option 1

41 of 98

Portability between Co-Authors–Option 2

42 of 98

Essential Footer Contents

43 of 98

Make a template do-file!

Many aspects of do-files within the same project are similar

header structure (title-author-date modified, input/output, etc.) is unlikely to change
Your basic file paths
Version of stata
Etc.

Make a template do-file which you can duplicate and edit when you need a new do-file
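
A template along these lines might look like the sketch below. Everything in angle brackets is a placeholder, and the user name, folder paths, Stata version, and seed are hypothetical values to be replaced by the project's own.

```stata
/******************************************************************************
* Title:         <do-file title>
* Author:        <name>
* Date modified: <YYYY_MM_DD>
* Purpose:       <what this do-file does>
* Inputs:        <datasets read>
* Outputs:       <datasets, tables, or figures written>
******************************************************************************/

version 16            // assumption: set to the Stata version your team uses
clear all
set more off
set seed 20211028     // hypothetical seed; choose one per project and keep it

* File paths: one root per collaborator, relative paths everywhere else
if "`c(username)'" == "collaborator1" {
    global root "C:/Users/collaborator1/Dropbox/project"   // hypothetical path
}
else {
    global root "`c(pwd)'"    // fall back to the current working directory
}
global data   "$root/data"
global output "$root/output"
```

Duplicating this file for each new do-file keeps the version, seed, and path definitions consistent across the project.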

44 of 98

Master .do File

45 of 98

Master .do file

Master file calls all .do files in order
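
A master do-file of this kind can be sketched as below. The do-file names and the code folder are hypothetical, and the on/off switches are one common convention, not necessarily the one used in this lecture. With every switch at 0 the sketch runs without touching any file.

```stata
* master.do: sets shared macros and runs each step of the project in order
global root "`c(pwd)'"          // assumption: Stata is started in the project folder
global code "$root/code"

* Switches: set to 1 to run a step, 0 to skip it
local run_import   0
local run_cleaning 0
local run_analysis 0

if `run_import'   do "$code/01_import.do"
if `run_cleaning' do "$code/02_cleaning.do"
if `run_analysis' do "$code/03_analysis.do"
```

Numbering the do-files makes the intended order visible in the folder as well as in the master file.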

46 of 98

Master .do file

47 of 98

Master .do file

48 of 98

Some comments on interoperability

49 of 98

Key points

Well-organized code helps your team work better together, but it also saves you time and energy in the future
Pseudocode can help you transition your logic into code
Template do-files can be used to make sure that your settings and macros are consistent. In particular, do-file headings should contain the following:

Software version
File paths
Randomization seed

Master do-files are useful for setting macros and providing overall program instructions

They reduce the cost of an extra do-file, but not to zero

50 of 98

Outline

Good File Organization

Folder structure
Version Control and Archiving

Good Code Organization

Commenting
File Header/Footer
Master .do File
Programs and “for” Loops

Good Data Handling

Data Cleaning
De-Identification
Data Publishing/Sharing

51 of 98

Good Data Handling

Goals:

Optimized data for analysis
Subject protection

Guiding principle: preserve the greatest possible amount of information that does not put your subjects at risk.

52 of 98

Data Cleaning

53 of 98

Naming and Labelling Variables

Rule 1: Name variables in self-explanatory ways and label them.

Don’t use var150b if you can use HH_Income_annual
Ideal variable names: location in survey and short description.

Common abbreviations:

Household = hh, mother = M, father = F, head of household = HHh
Number = #, dollars = $, percentage = %
Common economic variables, e.g., Y for income

Stata has a 32-character limit on variable names and an 80-character limit on variable labels.

Choose SurveyCTO variable names wisely.
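
Rule 1 can be illustrated with a two-line fix in Stata. The toy data and the label text (including the question number Q150b) are assumptions made for the example.

```stata
* Toy data with an uninformative variable name
clear
input var150b
12000
45000
end

* Self-explanatory name, plus a label giving survey location and description
rename var150b HH_Income_annual
label variable HH_Income_annual "Q150b: annual household income"
describe HH_Income_annual
```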

54 of 98

Coding and Labeling Variable Values

Rule 2: Code and label categorical variables meaningfully.

Categorical variables:

Label all possible categories/values of the variable
For binary variables, (re)code so that 1 “affirms” variable
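
Both points under Rule 2 can be sketched in Stata as follows. The raw coding (1 = male, 2 = female) and the education categories are assumptions invented for the example.

```stata
clear
input female_raw educ
1 1
2 3
2 2
1 1
end

* Binary: recode so that 1 affirms the variable name (here 1 = female)
* Assumption: the survey coded 1 = male, 2 = female
generate female = (female_raw == 2) if !missing(female_raw)
label define yesno 0 "No" 1 "Yes"
label values female yesno

* Categorical: label every possible value, including ones not yet observed
label define educ_lbl 1 "None" 2 "Primary" 3 "Secondary" 4 "Tertiary"
label values educ educ_lbl
tabulate educ female
```

Naming the variable female rather than gender means the reader does not need the value labels to know what 1 stands for.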

55 of 98

Missing Values

Rule 3: Preserve information on why data is missing.

Data may be missing for many reasons:

The subject wasn’t interviewed
The subjects did not know the answer
The subject refused to answer
The surveyor made an error when entering the answer
The question was not asked (e.g. due to skip patterns)

“I can’t remember’s not the same as I don’t know”

Destroyer

56 of 98

Missing Values in Stata

Encode missing values in R using the dplyr package

Stata command to replace survey coding:

mvdecode varlist, mv(-999=.d \ -998=.r)

Values	Numerical	String
Missing (no data recorded)	“.”	“not recorded”
Answer “don’t know”	“.d”	“don’t know”
Refusal to answer	“.r”	“refusal”
Conditional questions that were not asked	“.n”	“N/A”
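
A worked example of the mvdecode command on toy data is sketched below. It assumes the survey tool wrote -999 for "don't know" and -998 for "refused"; check your own survey's codes before reusing it.

```stata
clear
input age income
34 52000
-999 41000
51 -998
27 -999
end

* Assumption: -999 = "don't know", -998 = "refused to answer"
mvdecode age income, mv(-999=.d \ -998=.r)

* Extended missing values are kept distinct and are excluded from statistics
tabulate income, missing
summarize income
```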

57 of 98

Make Code Self-Documenting

Can you see any potential sources of error in this code?

Say we generate a new variable or merge two datasets, and we know that something about the result should be true

(for example, we standardize a variable, so we know what the mean or standard deviation should be, or we merge two datasets and we know ex ante how many entries should match between them).

We can use assert to check our work. We run the command (in this case, merge). We generate a new binary variable equal to 1 for matched observations. Let's say we know ahead of time that at least 95% of our survey sample is in the administrative data. So we assert that the mean of matched should be 95% or higher.

If less than 95% match, an error message appears (along with a beep).

Coding these types of checks reduces the possibility that you will miss noticing things like this later.

Also can code in checks for duplicates, etc.
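
The check described in these notes can be sketched as follows. The two toy datasets and their variable names are hypothetical; only the merge, the matched indicator, and the 95% threshold come from the notes.

```stata
* Toy administrative data: ids 1-97
clear
set obs 97
generate id = _n
generate admin_income = 1000 * id
tempfile admin
save `admin'

* Toy survey data: ids 1-100
clear
set obs 100
generate id = _n

* Duplicates check before merging
isid id

merge 1:1 id using `admin'
generate matched = (_merge == 3)

* We expect at least 95% of the survey sample to be in the admin data
summarize matched if _merge != 2, meanonly
assert r(mean) >= 0.95
```

If fewer than 95% of survey observations match, assert stops the do-file with an error, so the problem cannot pass unnoticed.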

58 of 98

Key points -- data cleaning

Clean, label, and recode in a separate .do file before the analysis begins
Pay special attention to missing values and categorical variables
Check that information is preserved frequently:

Is the number of observations in a variable staying the same?
Are averages and frequencies meaningful and consistent with what you would expect?

Document any judgment calls you make extensively and keep them reversible – possibly create a new variable

e.g. HH_income_annual vs. HH_income_annual_corrected
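
A reversible correction with an observation-count check might look like the sketch below. The data and the correction itself are invented for illustration.

```stata
clear
input id HH_income_annual
1 52000
2 4100000
3 38000
end
local n_before = _N

* Judgment call (hypothetical): household 2's value has two extra zeros.
* The original variable is left untouched so the decision can be reversed.
generate HH_income_annual_corrected = HH_income_annual
replace  HH_income_annual_corrected = 41000 if id == 2
label variable HH_income_annual_corrected "Annual HH income, entry errors corrected"

* The number of observations should not change during cleaning
assert _N == `n_before'
isid id
```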

Resources

Example code:

J-PAL Research Resource on data cleaning and management

IPA data cleaning guide

59 of 98

De-identification

60 of 98

De-identification: Goals

Advantages:

Protect your subjects from risk of harm
Follow your own IRB protocols
Protect yourself from legal consequences
Maintain subject trust

Considerations:

Identifiers needed to merge or back-check the data
Unless data is completely anonymized, it is not really possible to prevent identification
Loss of information needed for analysis and research

Resources

J-PAL Research Resources

Ethical conduct of randomized evaluations

Institutional review board (IRB) proposals

61 of 98

De-identification: Goals

Data you collect…

can be accessed by governments or law enforcement
may be stolen or hacked
may be seen by authorized or unauthorized persons who know the subjects (e.g. surveyors, NGO program officials, relatives/friends/family of the subject)
will eventually be published and accessible to everyone

This access can be harmful or risky:

Identity theft
Political or legal repercussions
Embarrassment or social stigma (e.g. STD infections)
Loss of benefits (e.g. medical history, access to insurance, income, loss of social security benefits)
Personal and family repercussions (e.g. sexual history)

62 of 98

Personally Identifiable Information (PII) -- Direct Identifiers

The 'official' definition of PII can differ across countries.

Be aware of which regulations apply to your project.

Direct identifiers: unique to a person or household

ID/license number,
medical record number,
biometrics
precise geo-location or address,
email handles and phone numbers,
vehicle identifiers, bank accounts.

Resources

J-PAL Research Resource on data de-identification

63 of 98

Replacing Direct Identifiers

Often, direct identifiers are only needed to identify the unit of observation or connections between data points
Examples:

Name identifies main respondent
Address/surname identifies the household or family
Schools, clinics, counties etc. identify observations in the same unit of randomization or cluster

Can be replaced with ID numbers without any loss of information!
Do immediately after receiving the final raw data.

64 of 98

Personally Identifiable Information (PII) -- Indirect Identifiers

Indirect identifiers: combination of variables that could potentially be used to identify research participants

Note that these don’t necessarily have to be at the individual level, and that the combinations don’t need to be unique to be problematic.

Examples:

Physical attributes (e.g. albinism, weight, disabilities like blindness)
Religion, occupation, education, number of children and so on identify (some) individuals within small enough groups of people (village, block, zip code...)

Key point: The tradeoff between security and usability is most relevant for indirect identifiers. Consider the following when de-identifying:

What is the distribution of responses to the variable you’re considering?
Is this situated in a data-rich context? Are there outside datasets that could potentially be linked to the one you’re using?
How sensitive is the data?

65 of 98

De-identifying Indirect Identifiers

Most variables are indirect identifiers
Removal or scrambling means losing information

De-identification usually involves aggregation, i.e. removing detail so that a larger group of people shares the same indirect identifier
Example: birth date → keep only birth year

Never remove information in an irreversible way.

66 of 98

Working with Identifiable Data

Some direct identifiers contain important information.

Key example: GPS location data

Weather, altitude, etc. important in agriculture
Traffic, real estate prices, or crime rates important in urban areas

These direct identifiers cannot (immediately) be removed.
In this case, work directly with encrypted raw data
Keep data encrypted until identifying info can be removed

Example: after linking GPS code in survey with weather information, remove GPS data
Create a new cross-walk file or include GPS codes in IDlink file

Note: working in an encrypted file can

Slow your computer down
Lead to cloud service synching problems.

→ De-identify as early as possible and consider working with encrypted data off the cloud.

67 of 98

Reasonable Security Practices and Procedures and Sensitive Personal Data or Information Rules in India

Passwords
Financial information such as bank account or credit card or debit card or other payment instrument details
Physical, physiological and mental health condition(s)
Sexual orientation
Medical records and history
Biometric information

Under section 72A of the (Indian) Information Technology Act, 2000, disclosure of information, knowingly and intentionally, without the consent of the person concerned and in breach of the lawful contract has been also made punishable with imprisonment for a term extending to three years and fine extending to Rs 5,00,000 (approx. US$ 8,000).

https://www.mondaq.com/india/data-protection/655034/data-protection-laws-in-india--everything-you-must-know

Deals with protection of "Sensitive personal data or information of a person", which includes such personal information which consists of information relating to the above

The rules set out the reasonable security practices and procedures that a body corporate, or any person who collects, receives, possesses, stores, deals with, or handles information on its behalf, is required to follow when dealing with "Personal sensitive data or information".

In case of any breach, the body corporate, or any other person acting on its behalf, may be held liable to pay damages to the person so affected.

68 of 98

PII for health data (US HIPAA Privacy Rule)

Name
All geographic subdivisions smaller than a state*
All elements (except years) of dates related to an individual (including birthdate, admission date, discharge date, date of death, and exact age if over 89)
Telephone numbers
Fax number
Email address
Social Security Number
Medical record number
Health plan beneficiary number
Account number
Certificate or license number
Any vehicle or other device serial number
Web URL
Internet Protocol (IP) Address
Finger or voice print
Photographic image of any identifying feature

69 of 98

De-identifying Indirect Identifiers

Encoding

Mapping the variable’s values 1-1 onto a numeric variable that has no other relation to the underlying variable.
A crosswalk should be saved for the research team

Masking

Replacing whole or part of the variable with characters. For example, to mask zip codes, you can keep the first 2 digits and replace the last 3 digits with a “*” (i.e. 94***).

Aggregation/Categorization

Group variables by category. The identifiers are replaced by an aggregated/descriptive statistic.

Group birth dates by keeping only birth month, quarter, or year
Group income by broad income ranges
Group ages in 5-year intervals.
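
The three techniques above can be sketched in Stata on toy data. All variable names and values are hypothetical. The village codes are assigned in random order so that they carry no information such as alphabetical position.

```stata
clear
input str12 village str5 zip str10 dob age
"Alpha" "94110" "1985-03-14" 36
"Beta"  "94301" "1990-11-02" 31
"Alpha" "02139" "1978-07-30" 43
end
set seed 12345                           // hypothetical seed

* Encoding: map village names 1-1 onto arbitrary numeric codes
bysort village: generate double u = runiform() if _n == 1
bysort village (u): replace u = u[1]
egen village_id = group(u)

* Masking: keep the first 2 digits of the zip code
generate zip_masked = substr(zip, 1, 2) + "***"

* Aggregation: keep only birth year, and group ages in 5-year intervals
generate birth_year = year(date(dob, "YMD"))
generate age_group  = 5 * floor(age / 5)     // 35 means ages 35-39

* Crosswalk for the research team only; in practice, store it encrypted
preserve
keep village village_id
duplicates drop
tempfile crosswalk
save `crosswalk'
restore

drop village zip dob age u
```

Saving the crosswalk before dropping the original variables keeps the step reversible for the research team.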

Resources

IDEA Handbook chapter on Statistical Privacy

70 of 98

De-identifying GPS Coordinates

GPS data can be de-identified in different ways:

Aggregation

Aggregate GPS coordinates to the town, county, or state level

Relative data

Create a distance variable to a reference point, e.g. for health data: distance to nearest hospital; for location data: distance to town center

Geographic masking (jittering)

Randomly offset points within a radius interval
Should often be combined with aggregation (e.g., a remote village in the desert can still be re-identified if GPS coordinates are only jittered)

Discuss with collaborators how to protect GPS data in your project.
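
Geographic masking can be sketched as below. This is a rough illustration only: the 2-5 km radius interval, the seed, and the coordinates are assumptions, and the degree conversion uses the approximation that one degree of latitude is about 111 km. Established survey programs publish their own displacement procedures, which should take precedence over this sketch.

```stata
clear
input id double lat double lon
1 12.9716 77.5946
2 13.0827 80.2707
end
set seed 20211028                 // hypothetical seed, so offsets are reproducible

* Assumption: offset each point by 2-5 km in a random direction
generate double dist_km = 2 + 3 * runiform()
generate double angle   = 2 * _pi * runiform()

* Rough conversion: 1 degree of latitude is about 111 km;
* a degree of longitude shrinks with the cosine of the latitude
generate double lat_jit = lat + (dist_km * sin(angle)) / 111
generate double lon_jit = lon + (dist_km * cos(angle)) / (111 * cos(lat * _pi / 180))

* Keep only the jittered coordinates in the shareable file
drop lat lon dist_km angle
```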

Scrambling is sometimes found in data – not clear what the advantage is over (a) dropping the last few GPS digits (instead of randomizing them) or (b) aggregating by geographically meaningful units (like villages)

You should never jitter HH coordinates, but the advantage of aggregation plus jittering is that the people collecting the data or with access to the PII haven't always thought through all the potential uses of the GPS data, and people reusing it might want to be able to link to it.

Note that village GPS coordinates should not be made public unless they are jittered, as villages are typically small enough that people with enough insider knowledge would be able to reidentify individuals within the village. This is in line with HIPAA's standards and is a standard followed by LSMS, DHS, etc

71 of 98

Key Points -- de-identification

Think of de-identification as a process that reduces the risk of identifying individuals, rather than completely eliminating the potential for re-identification.
To protect human subjects, de-identification should occur as early as possible in the research process.
Data should always be de-identified before being published.
There is a tension between de-identification and data usability.
The possibility of re-identification can almost never be fully ruled out.

72 of 98

Pop quiz: De-identification

Consider an exchange economy with two commodities and two consumers. Both consumers have homothetic preferences of the constant elasticity variety. Moreover, the elasticity of substitution is the same for both consumers and is small (i.e. goods are close to perfect complements). Now assume a dataset unrelated to any of the above. Which of these variables could be considered an indirect identifier?

Village/town name
Individual occupation
Medical ailment in the past month (free text)
Social security number

73 of 98

Resources for De-Identification

Programs:

PII-Scan: Scans data folders and finds variable names that may contain direct identifiers.

https://github.com/J-PAL/PII-Scan (R)
https://github.com/J-PAL/stata_PII_scan (Stata)

split_pii.do: Generates randomized IDs and splits the data into three parts: (1) original data, (2) anonymized data, (3) data with the randomized ID that links the original data with the anonymized data. https://github.com/PovertyAction/split_pii

Resources:

J-PAL Research Resources:

IDEA Handbook chapter on Statistical Privacy

74 of 98

Data Publication

75 of 98

Data publication: Goals

Publishing study data increases transparency in the social sciences and democratizes access to high-quality data
However, it is important that publication is done in a secure and comprehensive manner. This includes:

Thinking carefully about indirect identifiers
Making sure as much documentation is available as possible
Publishing in a secure and accessible trusted repository

Resources

J-PAL Research Resources

Trusted repositories

76 of 98

Code Book

Document that contains information about data: variable name and type, labels, min/max/missing values, etc.

Consider a first time user of the data who may be looking for specific information for a project: how can they quickly find out what’s in the data?

Critical for easy interpretation of the data and in furthering analysis

Have do-file that creates codebook from raw data
Stata codebook command
IPA’s cbook_stat command
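
A do-file that writes a codebook to a text file can be as short as the sketch below, which uses Stata's built-in codebook command. The auto dataset stands in for your own data, and the log file name is hypothetical.

```stata
sysuse auto, clear                       // stand-in for your own dataset

* Write variable names, types, labels, ranges, and missing counts to a text file
log using "codebook_auto.txt", text replace name(cbook)
describe
codebook, compact
log close cbook
```

Because the codebook is generated by code, it can be regenerated whenever the data change.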

77 of 98

Questionnaires

Publish all data collection instruments: questionnaires, consent forms, game sheets etc.

Organize them. Make sure they map to the datasets.

Keep a document format that is searchable (e.g. no image pdf).

Keep all language versions used.

78 of 98

ReadMe Files

Project title or publication
Structure of the working directory
Short description of all files included: dataset, .do files, questionnaires
Instructions for use, e.g. special variables
For more information on what to include in a readme, see the AEA’s template

The readme is the overall organizing document. If you’re opening a project folder for the first time, the readme is probably the first thing you read.

Outlines key information about all published files: data and analysis files, questionnaires, codebooks

E.g., description of the data, date of data collection, system requirements, how to run the code files, list of files with short description, additional notes like contact info, permission to publish the data (DUAs, IRBs), etc.

Describes how data/analysis files interact with one another. e.g. which came first, is one a subset of another?

When: Immediately after each round of data collection. Should be platform independent (e.g., a PDF)

Note that this is not a substitute for labeled data and well-documented code, and sensible naming of files!! Will discuss more on Tuesday

79 of 98

Key points -- data publication

Always publish in trusted repositories
Revisit the indirect identifiers in your dataset to make sure that they can be published
Make sure the data is well-documented by including:

A codebook
Questionnaires
A ReadMe

80 of 98

Resources for data publication

J-PAL Research Resources:

Trusted repositories:

81 of 98

Ten Coding Commandments

foreach x in `commandments' {;

di "`x'";

};

Thou shalt not alter the original data;
Thou shalt not start a project without a version control system/discussion;
Thou shalt not use name and date extensions on the latest version of a file;
Thou shalt not name file versions “final”;
Thou shalt not leave variables unlabeled;
Thou shalt not use absolute file references in the body of your .do file;
Thou shalt not leave missing values unassigned;
Thou shalt not share personally-identifiable information;
Thou shalt not publish non-instructional comments in code;
Thou shalt not work without a project log;

82 of 98

Questions?

83 of 98

Credits

The original version of this presentation was developed by IPA, based on

Pollock, H, E. Chuang and S. Wykstra. 2015. Best Practices for Data and Code Management, Innovations for Poverty Action.

The latest updates were made by Karl Rubio, Anja Sautmann, and J-PAL Global Research in January 2018. Thanks go to Aprille Knox, Rose Burnam, Samuel Solomon, and others for useful feedback and comments.

Some of the ideas are from this excellent guide:

Gentzkow, Matthew and Jesse M. Shapiro. 2014. Code and Data for the Social Sciences: A Practitioner’s Guide. University of Chicago, mimeo.

84 of 98

Additional Resources:

Github:

List of GitHub Guides.
IPA’s Github training.

De-identification:

J-PAL’s GitHub repository on PII detection.
List of identified information and de-identification according to HIPAA standards.

General coding Guidance:

J-PAL’s resource on managing files, data, and documentation for randomized evaluations
J-PAL’s coding resources for randomized evaluations

85 of 98

Supplemental Material

86 of 98

Outline

Folder structure examples
Version control with Github
De-Identification: replacing direct identifiers

87 of 98

Outline

Folder structure examples
Version control with Github
De-Identification: replacing direct identifiers

88 of 98

Good Folder Structure - Example 2

89 of 98

Good Folder Structure – Example 3

90 of 98

Good Folder Structure – Example 4

91 of 98

Outline

Folder structure examples
Version control with Github
De-Identification: replacing direct identifiers

92 of 98

Version Control using GitHub

Consider using the Open Science Framework (OSF) or GitHub (which saves all versions, branches, etc.)

93 of 98

What does Github do?

94 of 98

Outline

Folder structure examples
Version control with Github
De-Identification: replacing direct identifiers

95 of 98

Replacing Direct Identifiers

See .do file: SampleDeIdentify.do

Step 1: Determine which variables contain direct identifiers.

96 of 98

Replacing Direct Identifiers

Step 2: Create a cross-walk file “IDlink.dta” with direct identifiers and a new, random ID number:

This file needs to be protected just like the original data: always keep it in a password-protected, encrypted volume.

97 of 98

Replacing Direct Identifiers

Step 3: merge the new IDlink file into the raw data file:

98 of 98

Replacing Direct Identifiers

Step 4: Drop all the direct identifiers from the raw data and save de-identified data.

This file is de-identified and can be saved in an un-encrypted location.

Note: ID_obs gives no information about the district – would require separate scrambling of District
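
Steps 1 to 4 can be sketched end to end as follows. This is not the content of SampleDeIdentify.do, which is not reproduced in these slides; the toy data and variable names are hypothetical, and temporary files stand in for the raw data and for IDlink.dta.

```stata
* Toy raw data with direct identifiers
clear
input str10 name str12 phone str8 District income
"Asha"  "555-0101" "North" 52000
"Bilal" "555-0102" "South" 41000
"Chen"  "555-0103" "North" 38000
end
tempfile raw idlink
save `raw'

* Step 1: determine which variables contain direct identifiers
local pii name phone

* Step 2: cross-walk with the direct identifiers and a new, random ID number
set seed 20211028                 // hypothetical seed
keep `pii'
generate double u = runiform()
sort u
generate ID_obs = _n
drop u
save `idlink'                     // in practice: IDlink.dta on an encrypted volume

* Step 3: merge the new ID into the raw data
use `raw', clear
merge 1:1 `pii' using `idlink', assert(match) nogenerate

* Step 4: drop the direct identifiers and keep the de-identified data
drop `pii'
order ID_obs
isid ID_obs
```

The assert(match) option stops the do-file if any raw observation fails to receive an ID, and isid confirms that the new ID is unique.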