1 of 98

Metadata

Content: Coding Best Practices

Length: 90 minutes lecture

Purpose: Discusses the best practices for file organization, code and data handling and ways of de-identifying data

Note: this lecture has an exercise found here (Not used in 2021)

Learning goals:

    • Learn the best practices for file organization
    • Learn the best practices for code and data handling in Stata
    • Understand different ways of de-identifying data
    • Understand the various requirements for data sharing and publication

Last edited on/by: 10/28/2021, Jack Cavanagh

2 of 98

Assessment Questions

j-pal | coding best practices

3 of 98

Coding Best Practices

Jack Cavanagh, J-PAL Global

J-PAL/IPA Research Staff Training, 2021

4 of 98

Why care about best practices?

5 of 98

Imagine that you join a project in between the baseline and midline

  • PI says “the folder is a little messy, would you mind reorganizing it at some point?”
  • You think, “sure” but then–
    • Enumerator training gets delayed due to a bus strike
    • Then you find half your tablets don’t have chargers
    • Then you have to finish the HFC code
    • Then you find out you have to add a question to the survey
    • Etc., etc., etc.

  • It is highly unlikely that you will get a chance to fix a messy folder structure or unclear code later. Do it right the first time!

6 of 98

Why care about best practices?

  • Data Documentation/Organization
    • Golden rule: “do unto others (and yourself!) as you would have them do unto you”.
    • Invest early and reap the benefits later.

  • De-Identification of Identified Data
    • Researchers are bound to protect human subjects.
    • Navigate the tension between obscuring identities and losing information.

  • Data Publication/Sharing
    • Journals and funders may require data publication.
    • Reproduce research for the sake of science.
    • Data should be available for other researchers (incl. students, policy makers).
    • Make code, data, and documentation publishable from the beginning.

7 of 98

Outline

  • Good File Organization
    • Folder structure
    • Version Control and Archiving
  • Good Code Organization
    • File Header
    • Commenting
    • Master .do File
    • Programs and “for” Loops
  • Good Data Handling
    • Data Cleaning
    • De-Identification
    • Data Publishing/Sharing

8 of 98

Outline

  • Good File Organization
    • Folder structure
    • Version Control and Archiving
  • Good Code Organization
    • Commenting
    • File Header/Footer
    • Master .do File
    • Programs and “for” Loops
  • Good Data Handling
    • Data Cleaning
    • De-Identification
    • Data Publishing/Sharing

9 of 98

File/Folder Structure

10 of 98

Exercise

In pairs, show each other the folder of the latest project you have been working on.

Let your partner guess where the latest data and code you are working on is.

11 of 98

File and Folder Structure Goals

  • Original data should be preserved (aside from de-identification)

  • Keep data, coding files, and results separate; make the order and degree of data manipulation clear

  • Data cleaning, analysis, and producing output should be possible to do separately

  • Version control and archiving of previous versions

12 of 98

Good Folder Structure – Example

13 of 98

Not-So-Good Folder Structure - Example

14 of 98

What if you inherit a messy folder structure?

  • Ideally, re-arrange the files/folders logically
    • Note: Requires an extensive understanding of the current structure/relationship between files
    • May be too impractical/time consuming
  • Next best: create a folder map
    • Outline the key folders and which files are found within them
    • Outline the relationship between files (e.g., “cleaning.do takes the rawdata.csv and turns it into cleandata.dta”)
    • Folder maps can be used as the basis for the project’s Readme file
    • Ideally, the folder map should be consistently updated with a list of all datasets and code (along with the type of data, and the purpose of the dataset/code)

15 of 98

Key points -- file/folder structure

  • There is no one universal “good” folder structure – and there are plenty of bad ones!
  • Critical: define your folder structure in advance
  • Utilize an archive folder for storing old/extra files
  • Keep file/folder names simple, and avoid spaces
  • If you have multiple survey rounds, create a template folder structure
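
A template folder structure can be generated by a short script, so that every survey round starts from the same skeleton. The sketch below shows one possible layout, not a prescribed one: the root location and all folder names (data, code, output, documentation, archive, and the data subfolders) are assumptions chosen for illustration.

```stata
* Hypothetical project root; point this at the real project folder
local root "`c(tmpdir)'/demo_project"
capture mkdir "`root'"

* Top-level folders keep data, code, and results separate
foreach f in data code output documentation archive {
    capture mkdir "`root'/`f'"
}

* One subfolder per stage of data manipulation
foreach f in raw deidentified clean final {
    capture mkdir "`root'/data/`f'"
}
```

`capture` keeps the script from stopping when a folder already exists, so it can be rerun safely for each new round.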

Resources

16 of 98

Pop quiz: Folder map

Which of these is not true about folder maps:

  • They can be used as the basis for a readme
  • They should be updated regularly
  • They can be used to outline the relationship between files and data in a project
  • They should always take the place of a good file/folder structure

17 of 98

Version Control

18 of 98

Version Control

19 of 98

Version Control that works

20 of 98

Version Control and Archiving

  • Keep the current version of each file in your working directory
  • Archive code files only.
  • Date each version using format YYYY_MM_DD (or YYYYMMDD)
  • To rerun: either copy back to working directory, OR use absolute file paths for inputs and outputs

21 of 98

Version control – just the code

[Diagram labels. Data: Back-up Raw data, Raw data, De-ID data, Clean data, Final data, Output. Code: Import.do, De-ID.do, Cleaning.do, HFC.do, Corrections.do, Analysis.do, Graphs.do]

Resources

J-PAL Research Resource on data cleaning and management

22 of 98

Version control – code and (some) data

[Diagram labels. Data: Back-up Raw data, Raw data, De-ID data, Clean data, Final data, Output. Code: Import.do, De-ID.do, Cleaning.do, HFC.do, Corrections.do, Analysis.do, Graphs.do]

Automatically add today’s date when saving data
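
A date stamp in the YYYYMMDD format can be built from Stata's system date and appended to the file name of an archive copy. This is a minimal sketch under stated assumptions: the example dataset (Stata's built-in auto data), the file name, and the temporary save location are placeholders for illustration.

```stata
sysuse auto, clear

* c(current_date) looks like "28 Oct 2021"; convert it to 20211028
local today : display %tdCCYYNNDD date(c(current_date), "DMY")

* Hypothetical file name and location for the dated copy
save "`c(tmpdir)'/auto_clean_`today'.dta", replace
display "Saved auto_clean_`today'.dta"
```

Because the stamp sorts chronologically as text, dated copies line up in order in the archive folder.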

23 of 98

Version control with multiple collaborators

  • *Easier* on GitHub
    • GitHub will detect and display differences in files

  • If not using GitHub, consider:
    • Blocking time for each person to work on files
    • Using “temp” or “working” folders
    • Labeling systems that will indicate most current file versions
    • Tracking (major) changes outside of code
    • Code reviews daily/weekly/as needed to merge the code changes and ensure results remain the same

Resources

DIME Wiki: Getting Started with Github

Udacity course: Version control with Git

Gentzkow and Shapiro: Version control

24 of 98

Key points -- version control

  • There is no one universal “good” way to do version control–and there are plenty of bad ones!
  • Critical: determine version control method at the start of the project
    • Ask questions if you’re confused!

25 of 98

Further resources for file structure and version control

File/folder structure:

Version control/github:

26 of 98

Outline

  • Good File Organization
    • Folder structure
    • Version Control and Archiving
  • Good Code Organization
    • Commenting
    • File Headers
    • Master .do File
    • Programs and “for” Loops
  • Good Data Handling
    • Data Cleaning
    • De-Identification
    • Data Publishing/Sharing

27 of 98

Code Organization: Goals

  • Reader: understand and push forward analysis
    • Remember what you did (when PI asks about it 4 weeks later)
    • Continue analysis easily

  • Writer: accuracy, efficiency, minimize error
    • Avoid repetitive work, errors, missed steps, or unnecessary re-testing

28 of 98

When creating new do-files

  • Remember your folder structure! Put it in a logical location
  • Name the do-file descriptively (e.g., “import_cleaning.do”)
  • Add the do-file to your file map

29 of 98

Commenting and pseudocode

30 of 98

Comments and Readability

Type | Stata | R | Purpose
Full-line comments | * | # | Use to create section dividers
Line breaks | /// | End the incomplete line with an operator (like +, /, <-, and so on) | To break commands across multiple lines. Note: No longer needed in Stata 16
In-line comments | // | #[........] | Use for “markers” (e.g., //decision point) or short comments
Block comments | /*.......... */ | Select text, Ctrl + Shift + C | Use to explain the purpose of the section of the do-file
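
The four Stata comment styles can be seen together in a short do-file fragment. This is an illustrative sketch only: it uses Stata's built-in auto dataset, and the variable price_per_mpg is a made-up example.

```stata
* ---------- Section 1: load data ----------
sysuse auto, clear

/* Block comment: this section builds a price-per-mpg measure
   and summarizes it for foreign cars only. */
generate price_per_mpg = price / mpg   // decision point: no trimming of outliers

* Three slashes continue a command on the next line
summarize price_per_mpg ///
    if foreign == 1, detail
```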

31 of 98

Comments and Readability

32 of 98

Pseudocode

33 of 98

Pseudocode

34 of 98

Pseudocode: exercise

  • Your PI sends you a folder of 35 datasets named dataset_1.csv, dataset_2.csv, etc.
  • They know that one of them contains a variable that is just the string “Coding Best Practices Rockz”, and they know this dataset is the one they are looking for
  • However, they don’t have time to look through all 35 datasets
  • Write out pseudocode for an easy way to figure out which dataset the PI is looking for
  • Paste your answer in the chat if comfortable!
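
One possible answer is sketched here, with the pseudocode written as comments and then filled in with Stata commands. It is not the only approach. To keep the sketch runnable, it first writes three small demo files to a temporary folder; the folder, the variable names id and note, and the number of demo files are assumptions. For the PI's folder the loop would run over 1/35.

```stata
* Demo setup (hypothetical): write three tiny CSVs so the search can run
local dir "`c(tmpdir)'"
forvalues i = 1/3 {
    clear
    set obs 2
    generate id = _n
    generate str40 note = cond(`i' == 2, "Coding Best Practices Rockz", "nothing here")
    export delimited using "`dir'/dataset_`i'.csv", replace
}

* Pseudocode: for each dataset
*   - open the dataset
*   - for each string variable, count rows equal to the target text
*   - if the count is positive, remember the dataset name
local target "Coding Best Practices Rockz"
local found ""
forvalues i = 1/3 {
    import delimited using "`dir'/dataset_`i'.csv", clear varnames(1)
    ds, has(type string)
    foreach v in `r(varlist)' {
        quietly count if `v' == "`target'"
        if r(N) > 0 local found "`found' dataset_`i'.csv"
    }
}
display "Match found in:`found'"
```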

35 of 98

Header Content

36 of 98

Essential Header Contents

37 of 98

Essential Header Contents

38 of 98

Essential Header Contents

39 of 98

Essential Header Contents

40 of 98

Portability between Co-Authors–Option 1

41 of 98

Portability between Co-Authors–Option 2

42 of 98

Essential Footer Contents

43 of 98

Make a template do-file!

  • Many aspects of do-files within the same project are similar
    • header structure (title-author-date modified, input/output, etc.) is unlikely to change
    • Your basic file paths
    • Version of Stata
    • Etc.
  • Make a template do-file which you can duplicate and edit when you need a new do-file
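
A template do-file might look like the sketch below. The header fields, the Stata version number, the seed value, the user name, and all paths are placeholders and assumptions, not the project's actual settings. Switching the root path on the user name is one way to keep a do-file portable between co-authors.

```stata
/*******************************************************************
  Project:   <project name>
  Purpose:   <what this do-file does>
  Author:    <name>
  Modified:  <YYYY_MM_DD>
  Input:     <datasets read>
  Output:    <datasets or tables written>
*******************************************************************/
version 16          // software version (assumed; use your project's)
clear all
set more off
set seed 20211028   // randomization seed (hypothetical value)

* File paths: one root per user, everything else relative to it
if "`c(username)'" == "jack" {
    global root "C:/Users/jack/project"   // hypothetical path
}
else {
    global root "`c(pwd)'"
}
global data   "$root/data"
global code   "$root/code"
global output "$root/output"
```

With the root set once in the header, the body of the do-file can refer to $data, $code, and $output and never needs an absolute path.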

44 of 98

Master .do File

45 of 98

Master .do file

Master file calls all .do files in order
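
A minimal sketch of such a master file follows. The do-file names and the code folder are hypothetical. So that the sketch runs even where those files do not exist, it reports and skips a missing file; a real master file would more likely stop with an error instead.

```stata
* master.do: runs the whole project in order (do-file names are hypothetical)
global code "`c(pwd)'/code"

local steps import deid cleaning hfc analysis graphs
local n_run = 0
foreach s of local steps {
    capture confirm file "$code/`s'.do"
    if _rc != 0 {
        display as error "Missing: $code/`s'.do"
        continue
    }
    do "$code/`s'.do"
    local ++n_run
}
display "Ran `n_run' of `: word count `steps'' do-files"
```

Keeping the order in a single list makes the sequence of data manipulation explicit, and adding a step means editing only one line.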

46 of 98

Master .do file

47 of 98

Master .do file

48 of 98

Some comments on interoperability

49 of 98

Key points

  • Well-organized code helps your team work better together, but it also saves you time and energy in the future
  • Pseudocode can help you transition your logic into code
  • Template do-files can be used to make sure that your settings and macros are consistent. In particular, do-file headings should contain the following:
    • Software version
    • File paths
    • Randomization seed
  • Master do-files are useful for setting macros and providing overall program instructions
    • They reduce the cost of an extra do-file, but not to zero

50 of 98

Outline

  • Good File Organization
    • Folder structure
    • Version Control and Archiving
  • Good Code Organization
    • Commenting
    • File Header/Footer
    • Master .do File
    • Programs and “for” Loops
  • Good Data Handling
    • Data Cleaning
    • De-Identification
    • Data Publishing/Sharing

51 of 98

Good Data Handling

Goals:

  • Optimized data for analysis
  • Subject protection

Guiding principle: preserve the greatest possible amount of information that does not put your subjects at risk.

52 of 98

Data Cleaning

53 of 98

Naming and Labelling Variables

Rule 1: Name variables in self-explanatory ways and label them.

  • Don’t use var150b if you can use HH_Income_annual
  • Ideal variable names: location in survey and short description.
    • Common abbreviations:
      • Household = hh, Mother = M, Father = F, Head of household = HHh
      • Number = #, dollars = $, percentage = % (in labels only; Stata variable names allow just letters, digits, and underscores)
      • Common economic variables, e.g. Y for income

  • Stata has a 32-character limit on variable names and an 80-character limit on variable labels.

  • Choose SurveyCTO variable names wisely.
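
Rule 1 takes two commands in Stata. The sketch below renames an uninformative variable and attaches a label; the raw variable var150b, its values, and the label text are invented for illustration.

```stata
clear
set obs 3
* Hypothetical raw survey variable with an uninformative name
generate var150b = 1000 * _n

rename var150b hh_income_annual
label variable hh_income_annual "Annual household income (local currency)"
describe hh_income_annual
```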

54 of 98

Coding and Labeling Variable Values

Rule 2: Code and label categorical variables meaningfully.

  • Categorical variables:
    • Label all possible categories/values of the variable
    • For binary variables, (re)code so that 1 “affirms” variable
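
A short sketch of Rule 2, assuming a hypothetical raw variable sex_raw coded 1 = male, 2 = female. The binary variable is recoded so that 1 affirms its name, and every category gets a value label.

```stata
clear
* Hypothetical raw coding: 1 = male, 2 = female
input byte sex_raw
1
2
2
end

* Recode so that 1 "affirms" the variable name
generate byte female = (sex_raw == 2) if !missing(sex_raw)
label variable female "Respondent is female"
label define yesno 0 "No" 1 "Yes"
label values female yesno

* Label every category of the original variable as well
label define sex_lbl 1 "Male" 2 "Female"
label values sex_raw sex_lbl
tabulate female
```

The `if !missing(sex_raw)` condition keeps missing answers missing rather than silently turning them into 0.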

55 of 98

Missing Values

Rule 3: Preserve information on why data is missing.

  • Data may be missing for many reasons:
    • The subject wasn’t interviewed
    • The subject did not know the answer
    • The subject refused to answer
    • The surveyor made an error when entering the answer
    • The question was not asked (e.g. due to skip patterns)

“I can’t remember’s not the same as I don’t know” (Destroyer)

56 of 98

Missing Values in Stata

Encode missing values in R using the dplyr package

  • Stata command to replace survey coding:

mvdecode variables, mv(-999=.d \ -998=.r)

Values | Numerical | String
Missing (no data recorded) | “.” | “not recorded”
Answer “don’t know” | “.d” | “don’t know”
Refusal to answer | “.r” | “refusal”
Conditional questions that were not asked | “.n” | “N/A”
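
The mvdecode command can be tried on a toy variable. In this sketch the income values and the survey codes -999 (don't know) and -998 (refused) are assumptions; the extended missing values are then labeled so that the reason for missingness stays visible in tabulations.

```stata
clear
input income
5000
-999
-998
1200
end

* Convert survey codes into Stata's extended missing values
mvdecode income, mv(-999=.d \ -998=.r)

* Label the extended missing values so the reason is preserved
label define missing_lbl .d "Don't know" .r "Refused"
label values income missing_lbl
tabulate income, missing
```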

57 of 98

Make Code Self-Documenting

Can you see any potential sources of error in this code?

58 of 98

Key points -- data cleaning

  • Clean, label, and recode in a separate .do file before the analysis begins
  • Pay special attention to missing values and categorical variables
  • Frequently check that information is preserved:
    • Is the number of observations in a variable staying the same?
    • Are averages and frequencies meaningful and consistent with what you would expect?
  • Document any judgment calls you make extensively and keep them reversible – possibly create a new variable
    • e.g. HH_income_annual vs. HH_income_annual_corrected
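
These checks can be written directly into the cleaning do-file. The sketch below uses Stata's built-in auto dataset, and the outlier rule (prices above 15,000) is invented purely to illustrate a reversible judgment call.

```stata
sysuse auto, clear
local n_before = _N

* Keep the judgment call reversible: correct a copy, not the original
generate price_corrected = price
replace price_corrected = . if price > 15000   // hypothetical outlier rule

* Checks: no observations lost, and the ID still uniquely identifies rows
assert _N == `n_before'
isid make

* Document how many values the correction changed
count if price != price_corrected
```

`assert` and `isid` stop the do-file when a check fails, so a broken step cannot pass unnoticed.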

Resources

Example code:

J-PAL Research Resource on data cleaning and management

IPA data cleaning guide

59 of 98

De-identification

60 of 98

De-identification: Goals

  • Advantages:
    • Protect your subjects from risk of harm
    • Follow your own IRB protocols
    • Protect yourself from legal consequences
    • Maintain subject trust

  • Considerations:
    • Identifiers needed to merge or back-check the data
    • Unless data is completely anonymized, it is not really possible to prevent identification
    • Loss of information needed for analysis and research

61 of 98

De-identification: Goals

  • Data you collect…
    • can be accessed by governments or law enforcement
    • may be stolen or hacked
    • may be seen by authorized or unauthorized persons who know the subjects (e.g. surveyors, NGO program officials, relatives/friends/family of the subject)
    • will eventually be published and accessible to everyone

  • This access can be harmful or risky:
    • Identity theft
    • Political or legal repercussions
    • Embarrassment or social stigma (e.g. STD infections)
    • Loss of benefits (e.g. medical history, access to insurance, income, loss of social security benefits)
    • Personal and family repercussions (e.g. sexual history)

62 of 98

Personally Identifiable Information (PII) -- Direct Identifiers

The 'official' definition of PII can differ across countries.

Be aware of which regulations apply to your project.

  • Direct identifiers: unique to a person or household
    • ID/license number,
    • medical record number,
    • biometrics
    • precise geo-location or address,
    • email handles and phone numbers,
    • vehicle identifiers, bank accounts.

Resources

J-PAL Research Resource on data de-identification

63 of 98

Replacing Direct Identifiers

  • Often, direct identifiers are only needed to identify the unit of observation or connections between data points
  • Examples:
    • Name identifies main respondent
    • Address/surname identifies the household or family
    • Schools, clinics, counties etc. identify observations in the same unit of randomization or cluster
  • Can be replaced with ID numbers without any loss of information!
  • Do immediately after receiving the final raw data.

64 of 98

Personally Identifiable Information (PII) -- Indirect Identifiers

  • Indirect identifiers: combination of variables that could potentially be used to identify research participants
    • Note that these don’t necessarily have to be at the individual level, and that the combinations don’t need to be unique to be problematic.

Examples:

    • Physical attributes (e.g. albinism, weight, disabilities like blindness)
    • Religion, occupation, education, number of children and so on identify (some) individuals within small enough groups of people (village, block, zip code...)
  • Key point: The tradeoff between security and usability is most relevant for indirect identifiers. Consider the following when de-identifying:
    • What is the distribution of responses to the variable you’re considering?
    • Is this situated in a data-rich context? Are there outside datasets that could potentially be linked to the one you’re using?
    • How sensitive is the data?

65 of 98

De-identifying Indirect Identifiers

  • Most variables are indirect identifiers
  • Removal or scrambling means losing information
    • De-identification usually involves aggregation, i.e. removing detail so that a larger group of people shares the same indirect identifier
    • Example: birth date → keep only birth year

  • Never remove information in an irreversible way.

66 of 98

Working with Identifiable Data

  • Some direct identifiers contain important information.
    • Key example: GPS location data
      • Weather, altitude, etc. important in agriculture
      • Traffic, real estate prices, or crime rates important in urban areas
  • These direct identifiers cannot (immediately) be removed.
  • In this case, work directly with encrypted raw data
  • Keep data encrypted until identifying info can be removed
    • Example: after linking GPS code in survey with weather information, remove GPS data
    • Create a new cross-walk file or include GPS codes in IDlink file

  • Note: working in an encrypted file can
    • Slow your computer down
    • Lead to cloud service synching problems.

→ De-identify as early as possible and consider working with encrypted data off the cloud.

67 of 98

Reasonable Security Practices and Procedures and Sensitive Personal Data or Information Rules in India

  • Passwords
  • Financial information such as bank account or credit card or debit card or other payment instrument details
  • Physical, physiological and mental health condition(s)
  • Sexual orientation
  • Medical records and history
  • Biometric information

68 of 98

PII for health data (US HIPAA Privacy Rule)

  • Name
  • All geographic subdivisions smaller than a state*
  • All elements (except years) of dates related to an individual (including birthdate, admission date, discharge date, date of death, and exact age if over 89)
  • Telephone numbers
  • Fax number
  • Email address
  • Social Security Number
  • Medical record number
  • Health plan beneficiary number
  • Account number
  • Certificate or license number
  • Any vehicle or other device serial number
  • Web URL
  • Internet Protocol (IP) Address
  • Finger or voice print
  • Photographic image of any identifying feature

69 of 98

De-identifying Indirect Identifiers

  • Encoding
    • Mapping the variable’s values 1-1 onto a numeric variable that has no other relation to the underlying variable.
    • A crosswalk should be saved for the research team

  • Masking
    • Replacing all or part of the variable’s value with masking characters. For example, to mask zip codes, you can keep the first 2 digits and replace the last 3 digits with “*” (i.e. 94***).

  • Aggregation/Categorization
    • Group variables by category. The identifiers are replaced by an aggregated/descriptive statistic.
      • Group birth dates by keeping only birth month, quarter, or year
      • Group income by broad income ranges
      • Group ages in 5-year intervals.
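
The three techniques above can be sketched on a toy dataset. All values and variable names here are invented. Note that egen's group() assigns codes in sorted order of the original values, so where that ordering could itself leak information, the codes would need to be randomized and the crosswalk stored securely.

```stata
clear
input str5 zip str10 dob age str10 village
"94110" "1985-03-14" 36 "Rampur"
"94703" "1990-11-02" 31 "Sitapur"
"10027" "1985-07-30" 36 "Rampur"
end

* Encoding: replace village names with arbitrary numeric codes
egen village_code = group(village)

* Masking: keep the first 2 digits of the zip code
generate zip_masked = substr(zip, 1, 2) + "***"

* Aggregation: keep birth year only, group ages in 5-year intervals
generate birth_year = year(date(dob, "YMD"))
generate age_group  = 5 * floor(age / 5)

drop zip dob age village
list
```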

Resources

IDEA Handbook chapter on Statistical Privacy

70 of 98

De-identifying GPS Coordinates

GPS data can be de-identified in different ways:

  • Aggregation
    • Aggregate GPS coordinates to the town, county, or state level
  • Relative data
    • Create a distance variable to a reference point, e.g. for health data: distance to nearest hospital; for location data: distance to town center
  • Geographic masking (jittering)
    • Randomly offset points within a radius interval
    • Should often be combined with aggregation (e.g., a remote village in the desert can still be re-identified if its GPS coordinates are only jittered)

Discuss with collaborators how to protect GPS data in your project.
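
Geographic masking can be sketched as a random offset with a random direction and a distance drawn from a radius interval. This is a simplified illustration, not a vetted masking procedure: the coordinates, the seed, and the radius bounds are made up, and offsets are applied in raw degrees, which ignores that a degree of longitude shrinks away from the equator.

```stata
clear
set seed 12345
input double lat double lon
28.6139 77.2090
28.7041 77.1025
end

* Radius interval in degrees (hypothetical values)
local r_min = 0.002
local r_max = 0.010

* Random direction and random distance within the interval
generate double angle = runiform() * 2 * _pi
generate double dist  = `r_min' + runiform() * (`r_max' - `r_min')
generate double lat_jit = lat + dist * sin(angle)
generate double lon_jit = lon + dist * cos(angle)

* Check the displacement, then drop the true coordinates
generate double shift = sqrt((lat_jit - lat)^2 + (lon_jit - lon)^2)
assert inrange(shift, `r_min' - 1e-9, `r_max' + 1e-9)
drop lat lon angle dist shift
```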

71 of 98

Key Points -- de-identification

  • Think of de-identification as a process that reduces the risk of identifying individuals, rather than completely eliminating the potential for re-identification.
  • To protect human subjects, de-identification should occur as early as possible in the research process.
  • Data should always be de-identified before being published.
  • There is a tension between de-identification and data usability.
  • The possibility of re-identification can almost never be fully ruled out.

72 of 98

Pop quiz: De-identification

Consider an exchange economy with two commodities and two consumers. Both consumers have homothetic preferences of the constant elasticity variety. Moreover, the elasticity of substitution is the same for both consumers and is small (i.e. goods are close to perfect complements). Now assume a dataset unrelated to any of the above. Which of these variables could be considered an indirect identifier?

  • Village/town name
  • Individual occupation
  • Medical ailment in the past month (free text)
  • Social security number

73 of 98

Resources for De-Identification

Programs:

  • PII-Scan: Scans data folders and finds variable names that may contain direct identifiers.
  • split_pii.do: Generates randomized IDs and splits the data into three parts: (1) original data, (2) anonymized data, and (3) data with the randomized ID that links the original data with the anonymized data. https://github.com/PovertyAction/split_pii

Resources:

74 of 98

Data Publication

75 of 98

Data publication: Goals

  • Publishing study data increases transparency in the social sciences and democratizes access to high-quality data
  • However, it is important that publication is done in a secure and comprehensive manner. This includes:
    • Thinking carefully about indirect identifiers
    • Making sure as much documentation is available as possible
    • Publishing in a secure and accessible trusted repository

Resources

J-PAL Research Resources

Data publication

Trusted repositories

J-PAL Dataverse

IPA Dataverse

ICPSR

76 of 98

Code Book

  • Document that contains information about data: variable name and type, labels, min/max/missing values, etc.

  • Consider a first time user of the data who may be looking for specific information for a project: how can they quickly find out what’s in the data?

  • Critical for easy interpretation of the data and in furthering analysis
    • Have a do-file that creates the codebook from the raw data
    • Stata codebook command
    • IPA’s cbook_stat command
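
A codebook do-file can be very short. The sketch below uses Stata's built-in codebook and describe commands on the auto dataset. The output file name and location are placeholders, and IPA's cbook_stat is not shown because its syntax is not given here.

```stata
sysuse auto, clear

* Full detail for selected variables, then one line per variable
codebook price foreign
codebook, compact

* Turn the variable metadata into a dataset and export it next to the data
describe, replace clear
export delimited name type varlab using "`c(tmpdir)'/codebook_auto.csv", replace
```

After `describe, replace clear`, each row describes one variable of the original data, so the exported file can be searched by a first-time user.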

77 of 98

Questionnaires

  • Publish all data collection instruments: questionnaires, consent forms, game sheets etc.

    • Organize them. Make sure they map to the datasets.

    • Keep a document format that is searchable (e.g. no image pdf).

    • Keep all language versions used.

78 of 98

ReadMe Files

  • Project title or publication
  • Structure of the working directory
  • Short description of all files included: dataset, .do files, questionnaires
  • Instructions for use, e.g. special variables
  • For more information on what to include in a readme, see the AEA’s template

79 of 98

Key points -- data publication

  • Always publish in trusted repositories
  • Revisit the indirect identifiers in your dataset to make sure that they can be published
  • Make sure the data is well-documented by including:
    • A codebook
    • Questionnaires
    • A ReadMe

80 of 98

Resources for data publication

  • J-PAL Research Resources:

  • Trusted repositories:

81 of 98

Ten Coding Commandments

foreach x in `commandments' {;
di "`x'";
};

    • Thou shalt not alter the original data;
    • Thou shalt not start a project without a version control system/discussion;
    • Thou shalt not use name and date extensions on the latest version of a file;
    • Thou shalt not name file versions “final”;
    • Thou shalt not leave variables unlabeled;
    • Thou shalt not use absolute file references in the body of your .do file;
    • Thou shalt not leave missing values unassigned;
    • Thou shalt not share personally-identifiable information;
    • Thou shalt not publish non-instructional comments in code;
    • Thou shalt not work without a project log;

82 of 98

Questions?

83 of 98

Credits

The original version of this presentation was developed by IPA, based on

Pollock, H., E. Chuang, and S. Wykstra. 2015. Best Practices for Data and Code Management, Innovations for Poverty Action.

The latest updates were made by Karl Rubio, Anja Sautmann, and J-PAL Global Research in January 2018. Thanks go to Aprille Knox, Rose Burnam, Samuel Solomon, and others for useful feedback and comments.

Some of the ideas are from this excellent guide:

Gentzkow, Matthew and Jesse M. Shapiro. 2014. Code and Data for the Social Sciences: A Practitioner’s Guide. University of Chicago, mimeo.

84 of 98

Additional Resources:

85 of 98

Supplemental Material

86 of 98

Outline

  • Folder structure examples
  • Version control with Github
  • De-Identification: replacing direct identifiers

87 of 98

Outline

  • Folder structure examples
  • Version control with Github
  • De-Identification: replacing direct identifiers

88 of 98

Good Folder Structure - Example 2

89 of 98

Good Folder Structure – Example 3

90 of 98

Good Folder Structure – Example 4

91 of 98

Outline

  • Folder structure examples
  • Version control with Github
  • De-Identification: replacing direct identifiers

92 of 98

Version Control using GitHub

Consider using the Open Science Framework (OSF) or GitHub (which saves all versions, branches, etc.)

93 of 98

What does GitHub do?

94 of 98

Outline

  • Folder structure examples
  • Version control with Github
  • De-Identification: replacing direct identifiers

95 of 98

Replacing Direct Identifiers

See .do file: SampleDeIdentify.do

  • Step 1: Determine which variables contain direct identifiers.

96 of 98

Replacing Direct Identifiers

  • Step 2: Create a cross-walk file “IDlink.dta” with direct identifiers and a new, random ID number:

This file needs to be protected just like the original data: always keep it in a password-protected, encrypted volume.

97 of 98

Replacing Direct Identifiers

  • Step 3: merge the new IDlink file into the raw data file:

98 of 98

Replacing Direct Identifiers

  • Step 4: Drop all the direct identifiers from the raw data and save de-identified data.

This file is de-identified and can be saved in an un-encrypted location.

Note: ID_obs gives no information about the district – would require separate scrambling of District
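
Steps 1 to 4 can be put together in one short sketch. This is not the content of SampleDeIdentify.do, which is not reproduced here. The toy data, the seed, and the use of temporary files in place of IDlink.dta and the raw data file are assumptions for illustration.

```stata
clear
set seed 20211028   // fixed seed so the random IDs are reproducible
input str10 name str8 district income
"Asha"  "North" 1200
"Bilal" "South"  900
"Chen"  "North" 1500
end
tempfile raw idlink
save `raw'

* Step 1: name is the direct identifier in this toy data

* Step 2: crosswalk with the direct identifier and a new, random ID
keep name
generate double u = runiform()
sort u
generate long ID_obs = _n
drop u
save `idlink'       // in practice: IDlink.dta on an encrypted volume

* Step 3: merge the new ID into the raw data
use `raw', clear
merge 1:1 name using `idlink', assert(match) nogenerate

* Step 4: drop the direct identifier and save the de-identified data
drop name
isid ID_obs
list
```

Sorting on a random number before numbering the rows ensures that ID_obs carries no information about the original order of the data.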