1 of 40

ESM 206 - Lecture 2

Part 1: Naming objects & entering data

Part 2: R troubleshooting & resources

2 of 40

First, some terms:

3 of 40

Part 1: Naming things & entering data

4 of 40

Some resources for basics of data science & coding good practices:

5 of 40

“Call him Voldemort, Harry. Always use the proper name for things.”

  • Albus Dumbledore, Harry Potter and the Sorcerer’s Stone by JK Rowling

6 of 40

“There are only two hard things in Computer Science: cache invalidation, and naming things.”

  • Phil Karlton

7 of 40

Naming things

When naming variables, observations, data frames, or files, make them:

  1. Meaningful
  2. Consistent
  3. Concise
  4. Code & coder friendly

8 of 40

Naming things: Meaningful

  • Names of variables, data frames, and files should not be so generic or vague that a user needs a glossary to know what they contain

  • Names should be specific to the data/experiment/project, and the more intuitive their interpretation the better

  • Bad examples: File-1.xlsx, file-2.csv, indicator1, indicator2, ExperimentA.R, ExperimentB.R

  • Better examples: taco_nutrients.csv, ca-demographics, mice_1a_mass, sb_channel_spatial.shp

9 of 40

Naming things: Consistent

  • Keep names perfectly identical for identical entries (e.g. “burrito-32” and “Burrito 32” are completely different things to R)

  • Be consistent across data frames - your life will be easier if you have year called ‘year’ in both sets, instead of ‘year’ in one and ‘YEAR_NEW’ in the other

  • Use logical suffixes (if necessary), consistently formatted. Like: temp_water_surface, temp_water_sub, temp_water_bottom
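
To see why identical entries need identical names, here is a minimal base R sketch. The labels and the helper to_snake() are hypothetical, made up for this illustration; a real project might standardize names differently.

```r
# Two labels a person would read as the same thing (hypothetical values)
label_a <- "burrito-32"
label_b <- "Burrito 32"

# R compares character strings exactly, so these do not match
label_a == label_b

# Hypothetical helper: lowercase everything, then replace each run of
# characters that are not letters or digits with a single underscore
to_snake <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)  # drop leading/trailing underscores
}

# After standardizing, the two labels agree
to_snake(label_a) == to_snake(label_b)
```

Cleaning up after the fact works, but entering the name the same way every time means you never need a helper like this.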

10 of 40

Naming things: Concise

  • Balance meaningfulness w/conciseness

  • Better to be descriptive than not know what a variable is

  • Longer names = tedious coding, but less effort to look through metadata for column/identifier names

  • Bad examples: ‘First dive temp readings Celsius’, greatblueheron_observations_2019_09_20, ‘Allison final figures version 4.xlsx’

  • Better examples: goleta_temp, USTotalPop, PercClay

11 of 40

Naming things: Code & coder friendly

  • Avoid punctuation (%, !, ~, ( ), #) in names - it is harder to type & can mean things in code that you don’t want it to mean (or just break the code)

  • Avoid spaces (makes coding much more difficult)

  • Generally, avoid starting names with numbers: R object names that begin with a digit are not valid unless wrapped in backticks (but leading numbers could be useful for file names in sequence)

  • Pick (and be consistent with) a choice of case, like:
    • lowercase_snake_case (my favorite)
    • camelCase
    • UpperCamelCase
    • kebab-case
    • SCREAMING_SNAKE_CASE
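
A quick way to check whether a name is code friendly is sketched below in base R. The candidate names and the small data frames are hypothetical examples, not course data.

```r
# Hypothetical candidate column names
candidate_names <- c("temp_water_surface", "2019 temp", "percent clay (%)")

# make.names() returns the syntactically valid version of each name;
# a name that comes back unchanged is already safe to type in code
is_syntactic <- make.names(candidate_names) == candidate_names
is_syntactic

# A name with spaces can still be used, but only wrapped in backticks
dive <- data.frame(`First dive temp` = c(14.2, 15.1), check.names = FALSE)
dive$`First dive temp`

# A snake_case name needs no special handling
dive_clean <- data.frame(dive_temp = c(14.2, 15.1))
dive_clean$dive_temp
```

Only the first candidate passes the check. The other two would need backticks every time you type them.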

12 of 40

Other naming considerations:

  • Avoid object names that are common/used function names (e.g., don’t name something ‘filter’)

  • Make a name uniquely searchable (would I be able to find this if I searched for it, e.g. in a GitHub repo?)

  • Consider making object names nouns, & function names verbs

  • It’s never the end of the world if you give something a bad name, but it will save you time & effort to strive for good names

13 of 40

Entering things

We’ll consider three bins for now:

  • Quantitative data: numeric observations, can be continuous (measured) or discrete (usually counts or ordinal data)

  • Nominal data: labels, usually in words (e.g. “purple”, “blue”, or “orange”)

  • Date/times: a time variable with weird formatting that Excel is determined to mess with

14 of 40

Entering things

  • The outcomes for a variable (whether values or descriptions) should exist alone in a column

  • Be very consistent when entering descriptions (e.g. “Purple” v. “purple” v. “purple_”)

  • Avoid formatting and symbols (if it’s hard to type, then it’s hard to type...and might cause issues beyond that)

  • Put any additional information (units, notes, etc.) in columns separate from the value/description

  • If values are missing, enter exactly the same indicator for each missing value (common choices: -9999, NA, -- or -)

  • Don’t leave cells blank (enter your missing indicator)
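
A consistent missing-value indicator pays off when you read the data into R. The sketch below uses a made-up three-row data set typed inline (hypothetical sites and temperatures) and the na.strings argument of read.csv().

```r
# Hypothetical data, typed inline so the example is self-contained
raw_csv <- "site,temp_c
ABUR,14.2
GOLB,-9999
CARP,15.1"

# Without telling R about the indicator, -9999 is read as a real measurement
naive <- read.csv(text = raw_csv)
mean(naive$temp_c)

# Declaring the indicator converts it to NA on the way in
# ("NA" is listed too, because setting na.strings replaces the default)
clean <- read.csv(text = raw_csv, na.strings = c("-9999", "NA"))
clean$temp_c
mean(clean$temp_c, na.rm = TRUE)
```

This only works cleanly if every missing value was entered the same way. A mix of -9999, blanks and "--" would all have to be listed.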

15 of 40

Bad:

Better:

16 of 40

Keep a clear record of your naming system -

and plan on forgetting your naming system

  • The name of the variable in your data, and a longer description of what it means (e.g. surface_temp = surface temperature measurements from NOAA buoy data)

  • What are their units (as relevant)?

  • How are you indicating a missing record, and is it the same regardless of why a record is missing?

  • If levels (records) for nominal data are abbreviated, how are the levels stored (e.g. ABUR = Arroyo Burro Beach)?

17 of 40

Entering dates/times

  • International Organization for Standardization (ISO) 8601

    • Dates: eliminate ambiguity with yyyy-mm-dd format (this is called ‘extended format’ - the ‘basic format’ is just yyyymmdd)
      • 2019-09-30

    • Times as hh:mm:ss.ffff (hours, minutes, seconds, fractions of seconds)
      • 06:30:22.4033

    • Datetimes: yyyy-mm-ddThh:mm:ss.ffff
      • 2001-03-12T14:38:02.8725
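
A short base R sketch of why ISO 8601 is convenient. The dates are the examples from this slide plus a few hypothetical ones, and the UTC time zone is an assumption made for the illustration.

```r
# ISO 8601 dates are parsed by as.Date() with no extra instructions
survey_date <- as.Date("2019-09-30")
class(survey_date)

# Other layouts need an explicit format string, and the reader has to
# know whether 09/30 means month/day or day/month
us_style <- as.Date("09/30/2019", format = "%m/%d/%Y")
format(us_style, "%Y-%m-%d")

# Datetimes: the literal T separates the date from the time
stamp <- as.POSIXct("2001-03-12T14:38:02",
                    format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
format(stamp, "%Y-%m-%dT%H:%M:%S")

# ISO dates sort chronologically even when stored as plain text
sort(c("2019-10-01", "2019-09-30", "2018-12-31"))
```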

18 of 40

Part 2: Help you help yourself in R:

Finding resources & troubleshooting tips

19 of 40

  • Efficiently finding solutions & useful tools (e.g. R packages & functions) is an important skill for a data scientist

  • Troubleshooting is part of every data scientist’s life - there is no programmer in the world who does not have to deal with bugs & errors

Fun fact: “...the very first instance of a computer bug was recorded at 3:45 pm (15:45) on the 9th of September 1947. This "bug" was an actual real-life, well ex-moth, that was extracted from the number 70 relay, Panel F, of the Harvard Mark II Aiken Relay Calculator.” (Christopher McFadden, Interesting Engineering)

20 of 40

If you’re asking: What package or function should I use to do this thing?

  • Google it:
    • Search with the keywords and package/function name if known, and include .R in the search keywords (e.g. “dplyr::mutate add column in .R” instead of “mutate variable”)
    • “R” is generic - so consider using “R software” or “.R”
    • Start learning and using language common in R communities & publications (e.g. R4DS), like data frame instead of spreadsheet
    • For now, for anything related to reading, wrangling, or ggplot2: consider navigating to documentation from tidyverse.org, rdocumentation, or RStudio community.

  • Don’t know what package/function to use for your purpose? Use CRAN Task Views to help you find it (grouped by topic), or crantastic to search for packages by keyword

  • Use R-specific search tools:
    • rdocumentation.org
    • rseek.org

  • For more recommendations on getting help & searching, see Ch. 1 of the R Cookbook 2nd Edition by JD Long and Paul Teetor

21 of 40

TROUBLESHOOTING, A FACT:

22 of 40

How do I know there’s an error, and where to look for it?

Sometimes R tries to give you some hints that things are awry

  • When you save a script, lines of code with some errors (e.g. unmatched parentheses) will have a red circle with an x in it next to the line number.

  • You might also see a red squiggly line under part of your code, indicating a syntax issue. You can hover over the squiggly to see a pop-up hint about what’s going on.

23 of 40

Error messages will show up* in the Console when you try to run the broken code:

*usually/hopefully

24 of 40

There are multiple types of messages that R will print. Read the message to figure out what it’s trying to tell you.

Error: There’s a fatal error in your code that prevented it from being run through successfully. You need to fix it for the code to run.

Warning: Non-fatal errors (don’t stop the code from running, but this is a potential problem that you should know about).

Message: Here’s some helpful information about the code you just ran (you can hide these if you want to)
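
The three message types can be produced on purpose to see how they differ. This is a minimal base R sketch; describe_condition() is a hypothetical helper written for this example, not a built-in function.

```r
# Hypothetical helper: run an expression and report which kind of
# condition (if any) R signalled first
describe_condition <- function(expr) {
  tryCatch(
    {
      expr
      "ran quietly"
    },
    error   = function(e) paste("Error:", conditionMessage(e)),
    warning = function(w) paste("Warning:", conditionMessage(w)),
    message = function(m) paste("Message:", conditionMessage(m))
  )
}

describe_condition(stop("this halts the code"))   # an error
describe_condition(as.numeric("purple"))          # a warning (NA by coercion)
describe_condition(message("just so you know"))   # a message
describe_condition(1 + 1)                         # nothing signalled

# Messages can be hidden if you don't want to see them
suppressMessages(message("this is not printed"))
```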

25 of 40

When you get an error message in R:

  • Read the error message. Did you read the error message? Read the error message. Sometimes it will be infuriatingly vague, but often it will tell you exactly how to fix it (e.g. “do you need ==?”).

26 of 40

Some common errors/issues to keep an eye out for at the beginning (and forever and ever...)

27 of 40

If R...can’t find a function that you know exists:

Symptom: ‘Error in _________: could not find function “_________”’

Likely diagnoses:

  • The package containing the function you’re trying to use hasn’t been attached
  • You’ve misspelled or mistyped the function name

Possible solutions:

  • Make sure you’ve attached the required package with library(package_name) - and remember this line should exist in your script before the code that uses a function from that package
  • Make sure you’ve run the line of code that attaches the necessary package
  • Check the function spelling/formatting very carefully
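
The sketch below reproduces this error on purpose with a misspelled function name (medain is a deliberate typo for median) and catches the message so the script keeps running. The numbers are arbitrary.

```r
# Deliberate typo: medain() does not exist, so R reports that it
# could not find the function
msg <- tryCatch(
  medain(c(1, 5, 9)),
  error = function(e) conditionMessage(e)
)
msg

# Spelled correctly, the same call works
median(c(1, 5, 9))

# package::function() calls a function from an installed package
# without attaching the package first
stats::median(c(1, 5, 9))
```

The stats package is attached by default, so the last line is only there to show the package::function() form.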

28 of 40

If R...can’t find the pipe operator:

Symptom: ‘Error in ____ %>% ____ : could not find function "%>%"’

Likely diagnoses:

  • Haven’t attached the tidyverse (w/ library(tidyverse)) before using the pipe
  • Haven’t run the line of code to attach the tidyverse

Possible solutions:

  • Make sure you’ve attached the tidyverse with library(tidyverse) - and remember this line should exist in your script before the code that uses %>%
  • Make sure you’ve run the line of code that attaches the tidyverse

29 of 40

If R...can’t find an object (e.g. a data frame or variable) that you know you’ve stored:

Symptom: ‘Error in ____ : object ‘_____’ not found’

Likely diagnoses:

  • The object hasn’t been created or stored
  • You’ve mistyped the object name

Possible solutions:

  • Make sure you’ve run the line(s) of code where you read-in or create the object
  • Make sure you’ve spelled/typed the object name exactly as it exists in the Environment
  • Use ls() to check which objects exist in your current workspace (and if it’s not there, then it hasn’t been created/stored yet)
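
A minimal sketch of the ‘object not found’ situation, using a hypothetical object name with a one-letter typo. exists() and ls() are the base R tools for checking what has actually been stored.

```r
# Store an object (hypothetical values)
goleta_temp <- c(14.2, 15.1, 13.8)

# Mistyped name: goleta_tmp was never created, so R can't find it
msg <- tryCatch(
  mean(goleta_tmp),
  error = function(e) conditionMessage(e)
)
msg

# Check which spelling actually exists
exists("goleta_temp")
exists("goleta_tmp")

# List everything stored in the current workspace
ls()
```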

30 of 40

If R…tells you it’s ignoring an argument within a function

Symptom: ‘Warning: Ignoring unknown parameters: ____’

Possible diagnoses:

  • You’ve included an argument that doesn’t exist for that function
  • You’ve mistyped an argument that does exist for that function

Possible solutions:

  • Check to ensure that the argument you’re trying to use for that function (a) exists, and (b) is entered exactly how R expects it to be in your code - especially checking for spelling, abbreviation & capitalization

How to find out what arguments are accepted by which functions:

  • View the R documentation with ?function_name, and look in the ‘Arguments’ section (or ‘Aesthetics’ section for geoms in ggplot2)
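
Besides the help page, base R can list a function's arguments directly. The sketch below uses sd() purely as a familiar example; ‘narm’ is a deliberately mistyped argument name.

```r
# Print the function's arguments and their defaults
args(sd)

# Get just the argument names
names(formals(sd))

# Check whether the argument you typed is one the function knows
"na.rm" %in% names(formals(sd))   # spelled as R expects
"narm"  %in% names(formals(sd))   # missing the dot
```

Some functions pass extra arguments on through ..., so a name missing from this list is a hint to go read the documentation, not proof of a typo.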

31 of 40

If you…are trying to make a basic ggplot2 graph and you accidentally use %>% between layers instead of a +

Symptom: ‘Error: `mapping` must be created by `aes()`

Did you use %>% instead of +?’

Diagnosis:

  • Used the pipe operator %>% instead of + to add ggplot2 layers?

Possible solutions:

  • Switch to + for ggplot2!

32 of 40

If you…think your ggplot code looks perfect and you’re not getting an error message, but only an empty graph is showing up:

Symptom:

Possible diagnoses:

  • Did you check what the data you’re trying to plot looks like? For example, did you accidentally filter out all observations in a previous step?
  • Did you forget a plus sign to add the geom_* layer?

Possible solutions:

  • Make sure there is a plus sign (+) between all ggplot layers
  • Look at the data you’re trying to plot to ensure it exists

(dang)
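
One way to end up with nothing to plot and no error message is a filter that matches no rows. The sketch below uses a small made-up data frame and base R subsetting; the same check applies whichever wrangling functions you use.

```r
# Hypothetical observations
bird_counts <- data.frame(
  species = c("heron", "egret", "heron"),
  count   = c(3, 5, 2)
)

# Capitalization mismatch: no rows match, and R does not complain
herons <- bird_counts[bird_counts$species == "Heron", ]
nrow(herons)

# Look at the values that actually exist before filtering
unique(bird_counts$species)

# Matching the stored spelling keeps the expected rows
herons <- bird_counts[bird_counts$species == "heron", ]
nrow(herons)
```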

33 of 40

If you…are trying to change some aesthetic in a ggplot graph, but you’re getting an error:

Symptom: Error in rep(value[[k]], length.out = n) :

attempt to replicate an object of type 'closure'

Possible diagnoses:

  • Did you forget that when you’re referencing a variable in ggplot, it needs to be within an aes() function?

Possible solutions:

  • Make sure that when you’re updating a graph aesthetic based on a variable in the data frame, you have that argument within aes().

34 of 40

If you...are trying to find a summary value for a variable that you know contains numbers, but you’re getting an NA result and/or a warning message:

Symptom(s):

  • NA returned when summary statistic value (e.g. mean) expected
  • ‘Warning message: In ____ : argument is not numeric or logical: returning NA’

Possible diagnoses:

  • The variable contains NA values and the function’s default is na.rm = FALSE
  • The class of non-NA values is not numeric (e.g., there are words in the column or R otherwise doesn’t know the class should be “numeric”)

Possible solutions:

  • If the variable is numeric (check with class()), update the argument to na.rm = TRUE
  • Coerce variable class to “numeric” if appropriate/possible
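
Both diagnoses can be reproduced with a few made-up values. This is a base R sketch; the temperatures and the "n/a" entry are hypothetical.

```r
# Case 1: numeric values with a missing entry
temp <- c(14.2, NA, 15.1)
mean(temp)                 # NA, because na.rm = FALSE by default
mean(temp, na.rm = TRUE)   # NA values are dropped before averaging

# Case 2: numbers stored as text because of a stray word
temp_chr <- c("14.2", "15.1", "n/a")
class(temp_chr)
suppressWarnings(mean(temp_chr))   # NA, and would print the warning above

# Coerce to numeric: "n/a" cannot be converted, so it becomes NA
temp_num <- suppressWarnings(as.numeric(temp_chr))
class(temp_num)
mean(temp_num, na.rm = TRUE)
```

suppressWarnings() is used here only to keep the example output tidy. When working with real data, read the warning first.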

38 of 40

Can’t figure out what’s going on from the error message directly? My process:

  • Look over code very carefully - character-by-character and space-by-space. Run line-by-line to see where it breaks. Some things to pay close attention to at this point:
    • Are all parentheses matching pairs?
    • Have you typed all functions, objects, and conditions exactly correctly?
    • If you run something and it doesn’t show up, did you call it to have it show up, or have you just asked R to store it?
    • Have you looked at all intermediate data frames during wrangling to make sure data are being subset & transformed as expected?
  • Google the copied & pasted error message. Someone else has encountered and solved it before. Find them (often on Stack Overflow). Beware rabbit holes and grumps.
  • Take a break and come back to it. (Another reason to start assignments early…)
  • Make a small, reproducible example (see reprex!) and see if I can recreate the error. I realize and resolve many errors by trying to make something work in a simpler, self-contained example.
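
What a small, self-contained example might look like is sketched below in base R. The data frame is hypothetical and stands in for whatever piece of your real data triggers the problem; the reprex package mentioned above can render an example like this for sharing, but it is not needed to write one.

```r
# 1. The smallest data that still shows the problem, typed inline
small <- data.frame(
  site   = c("ABUR", "GOLB"),
  temp_c = c(14.2, NA)
)

# 2. dput() prints code that recreates the object exactly, so someone
#    else can paste it and get the same data
dput(small)

# 3. The one line that misbehaves (here: an unexpected NA)
result <- mean(small$temp_c)
result

# 4. The R version, useful context when asking for help
R.version.string
```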

39 of 40

Don’t forget the flip-side!

Just because you don’t get an error message doesn’t mean that you did things correctly - it just means that the code is running.

So LOOK AT YOUR RAW DATA, INTERMEDIATE DATA AND RESULTS - especially just after reading it in and after wrangling steps - to ensure that what you *think* your code is supposed to be doing with/to your data is *actually* what your code is doing with/to your data.
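
A minimal sketch of what looking at your data can mean in practice, using base R and a made-up data frame. The checks at the end are hypothetical examples of expectations you might write down for your own data.

```r
# Hypothetical raw data
raw <- data.frame(
  site   = c("ABUR", "GOLB", "CARP", "ABUR"),
  temp_c = c(14.2, 15.1, NA, 13.8)
)

# Look right after reading in: structure, first rows, summary
str(raw)
head(raw)
summary(raw$temp_c)

# A wrangling step: drop rows with a missing temperature
wrangled <- raw[!is.na(raw$temp_c), ]

# Look again: did the row count change by the amount you expected?
nrow(raw)
nrow(wrangled)

# Turn expectations into checks that fail loudly if they are not met
stopifnot(
  nrow(wrangled) == 3,
  !anyNA(wrangled$temp_c),
  is.numeric(wrangled$temp_c)
)
```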

40 of 40

  • Especially for reading/wrangling/viz, look for tidyverse-style solutions first

  • Possibly even include “tidyverse” when you search for examples

  • If the syntax looks nightmarish, be skeptical

  • If you can’t understand what the code is doing, reconsider using it

There are often many solutions that work - try to focus on and use solutions that work and are clear, well-organized, and that use consistent/familiar syntax