Basic R
Bioinformatic analysis for cancer genomics
Why do we use R?
Contents
What is R?
● A Programming/Statistical Language
● A powerful language for statistical computing and data analysis.
● Widely used in bioinformatics and life sciences
Why do we use R?
1. Comprehensive Statistical Tools
2. Rich Ecosystem of Bioinformatics Packages
3. Data Visualization
4. Handling High-Dimensional Data
5. Open Source and Active Community
6. Reproducibility
7. Free
https://roelverbelen.netlify.app/resources/r/packages/
Adapted from Mr. Duy's slides
Install R
https://cran.rstudio.com/
R and RStudio
https://posit.co/download/rstudio-desktop/
Other platforms for R
Visual Studio Code
https://code.visualstudio.com/download
Google Colab
https://colab.research.google.com
2. Basic R
Working directory
# First, have a look at the current working directory
getwd()
# Change to your desired directory (replace the placeholder path with your own)
setwd("path/to/your/directory")
# List the files in the directory
dir()
https://www.r-bloggers.com/2020/01/rstudio-projects-and-working-directories-a-beginners-guide/
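The three working-directory functions can be tried safely in a throwaway folder. The sketch below is only an illustration: it assumes a temporary directory from tempdir() as a stand-in for a real project folder, and the file name example.txt is made up.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# Switch to a temporary directory (a stand-in for a project folder)
setwd(tempdir())

# Create an empty file and check that dir() lists it
file.create("example.txt")
files <- dir()
print("example.txt" %in% files)

# Restore the original working directory
setwd(old_wd)
```

Restoring the old directory at the end keeps the rest of the session unaffected.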
Install and load packages
# Get the list of installed packages
installed.packages()
# Install a package
install.packages("ggplot2")
# Load a package
library(ggplot2)
# Get all packages currently attached in the R session
search()
# Check where installed packages are located
.libPaths()
# Update packages
update.packages()
https://www.javatpoint.com/r-packages
Other ways to install packages
Bioconductor
GitHub
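A minimal sketch of both routes, assuming the commonly used helper packages BiocManager (for Bioconductor) and remotes (for GitHub). The package names limma and tidyverse/ggplot2 are only examples chosen for illustration, not packages required by this course. The calls are wrapped in a function so that nothing is downloaded until it is called.

```r
# Hypothetical helper: wraps the install commands so they only run on request
install_examples <- function() {
  # Bioconductor packages are installed through the BiocManager package
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install("limma")

  # Packages hosted on GitHub can be installed with the remotes package
  if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
  }
  remotes::install_github("tidyverse/ggplot2")
}

# install_examples()  # uncomment to run; needs an internet connection
```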
Install and load packages
Search for and install these packages:
● tidyverse
● readr
● ggplot2
Help and manual
# Access the help file
?mean
# If unsure of the precise name,
# search the documentation across all installed packages
??mean
Package tutorial
AI
Document
Workflow in R tutorial
Loading and Saving CSV Files in R
# Standard use, small files (base R)
# Load a CSV file
data <- read.csv("gene_expression.csv", header = TRUE)
head(data)  # View first few rows
# Save a CSV file
write.csv(data, "output.csv", row.names = FALSE)
header = TRUE: Treats the first row as column names.
sep = ",": (Default) Assumes comma-separated values.
Loading and Saving CSV Files in R
# Using read.table() (more control)
# Load a CSV file with an explicit delimiter
data <- read.table("gene_expression.csv", sep = ",", header = TRUE)
# Save a CSV file with write.table()
write.table(data, "output.csv", sep = ",", row.names = FALSE, quote = FALSE)
header = TRUE: Treats the first row as column names (the default for read.table() is FALSE).
sep = ",": Sets a comma as the delimiter. Unlike read.csv(), read.table() separates on whitespace by default.
Loading and Saving CSV Files in R
# Standard use, small files (base R)
read.csv() / write.csv()
# Custom delimiters (e.g., tab-separated)
read.table() / write.table()
# Tidyverse compatibility, easy use (readr package)
read_csv() / write_csv()
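To see the base R functions above working end to end, here is a small round-trip sketch. The data frame and its values are invented for illustration, and the files are written to temporary paths created with tempfile() rather than to a real project folder.

```r
# A small, made-up expression table (values are illustrative only)
expr <- data.frame(gene = c("TP53", "BRCA1", "EGFR"),
                   sample_1 = c(5.2, 3.1, 7.8),
                   sample_2 = c(4.9, 2.7, 8.3))

# Comma-separated: write.csv() / read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(expr, csv_path, row.names = FALSE)
expr_back <- read.csv(csv_path, header = TRUE)

# Tab-separated: write.table() / read.table() with sep = "\t"
tsv_path <- tempfile(fileext = ".tsv")
write.table(expr, tsv_path, sep = "\t", row.names = FALSE, quote = FALSE)
expr_tsv <- read.table(tsv_path, sep = "\t", header = TRUE)

# The table read back from disk should match the original
print(all.equal(expr, expr_back))
print(dim(expr_tsv))
```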
Save and quit
An R workspace image contains all the objects held in the R session at the time of exit and is saved as a .RData file.
# Save the current workspace
save.image(file = "mysession.RData")
# Exit R
q()
# Load the workspace
load("mysession.RData")
R-base overview
https://www.geeksforgeeks.org/r-tutorial/
3. Operators
Operators
https://www.tutorialkart.com/r-tutorial/r-operators/#gsc.tab=0
Arithmetic Operator
# R Arithmetic Operators Example for numbers
a <- 7.5
b <- 2
print(a + b)    #1 Addition
print(a - b)    #2 Subtraction
print(a * b)    #3 Multiplication
print(a / b)    #4 Division
print(a %% b)   #5 Remainder
print(a %/% b)  #6 Quotient (integer division)
print(a ^ b)    #7 Power of
$ Rscript r_op_arithmetic.R
[1] 9.5
[1] 5.5
[1] 15
[1] 3.75
[1] 1.5
[1] 3
[1] 56.25
Arithmetic Operator
# R Operators - R Arithmetic Operators Example for vectors
a <- c(8, 9, 6)
b <- c(2, 4, 5)
print(a + b)    #1 Addition
print(a - b)    #2 Subtraction
print(a * b)    #3 Multiplication
print(a / b)    #4 Division
print(a %% b)   #5 Remainder
print(a %/% b)  #6 Quotient (integer division)
print(a ^ b)    #7 Power of
$ Rscript r_op_arithmetic.R
[1] 10 13 11
[1] 6 5 1
[1] 16 36 30
[1] 4.00 2.25 1.20
[1] 0 1 1
[1] 4 2 1
[1]   64 6561 7776
Arithmetic Operator
Classwork
Relational Operator
# R Operators - R Relational Operators Example for Numbers
a <- 7.5
b <- 2
print(a > b)   #1 greater than
print(a < b)   #2 less than
print(a == b)  #3 equal to
print(a <= b)  #4 less than or equal to
print(a >= b)  #5 greater than or equal to
print(a != b)  #6 not equal to
$ Rscript r_op_relational.R
[1] TRUE
[1] FALSE
[1] FALSE
[1] FALSE
[1] TRUE
[1] TRUE
Logical Operator
# R Operators - R Logical Operators Example for basic logical elements
# In a logical context, 0 is treated as FALSE and any non-zero number as TRUE
a <- 0  # (FALSE)
b <- 2  # (TRUE)
print(a & b)   #1 logical AND, element-wise
print(a | b)   #2 logical OR, element-wise
print(!a)      #3 logical NOT, element-wise
print(a && b)  #4 logical AND for single values (short-circuit)
print(a || b)  #5 logical OR for single values (short-circuit)
$ Rscript r_op_logical.R
[1] FALSE
[1] TRUE
[1] TRUE
[1] FALSE
[1] TRUE
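The difference between the element-wise and single-value forms is easier to see with vectors longer than one. The sketch below uses made-up logical vectors x and y. It keeps && to single values, since recent R versions raise an error when && or || receive longer vectors.

```r
x <- c(TRUE, FALSE, TRUE)
y <- c(TRUE, TRUE, FALSE)

# Element-wise operators return one result per position
and_result <- x & y   # TRUE FALSE FALSE
or_result  <- x | y   # TRUE TRUE TRUE
not_result <- !x      # FALSE TRUE FALSE

# any() and all() collapse a logical vector into a single value
any_true <- any(x & y)  # TRUE: at least one position is TRUE in both
all_true <- all(x | y)  # TRUE: every position is TRUE in x or y

# && and || are meant for single values, e.g. inside if () conditions
scalar_and <- (length(x) == 3) && all_true

print(and_result)
print(scalar_and)
```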
Assignment Operator
An R variable can be assigned a value using one of the following three operators: =, <- and ->.
# Assign with =
x = 'hello'
print(x)
[1] "hello"
# Assign with <-
x <- 'learn r'
print(x)
[1] "learn r"
# Assign with -> (the value is on the left)
'r programming language' -> x; print(x)
[1] "r programming language"
Miscellaneous Operator
a = 23:31
print ( a )
[1] 23 24 25 26 27 28 29 30 31
a = c(25, 27, 76)
b = 27
print ( b %in% a )
[1] TRUE
Operator | Description | Usage
: | Creates a series of numbers from the left operand to the right operand | a:b
%in% | Tests whether an element (a) belongs to a vector (b) | a %in% b
%*% | Matrix multiplication (here, a matrix with its transpose) | A %*% t(A)
Miscellaneous Operator
mat = matrix(c(1,2,3,4,5,6),nrow=2,ncol=3)
print (mat)
print( t(mat))
pro = mat %*% t(mat)
print(pro)
Output:
     [,1] [,2] [,3]    # original matrix of order 2x3
[1,]    1    3    5
[2,]    2    4    6
     [,1] [,2]         # transposed matrix of order 3x2
[1,]    1    2
[2,]    3    4
[3,]    5    6
     [,1] [,2]         # product matrix of order 2x2
[1,]   35   44
[2,]   44   56
Special values (Inf, NaN, NA, NULL)
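A short sketch of how these four special values arise and how to test for them. The scores vector is made up for illustration.

```r
# Inf: result of dividing a non-zero number by zero
pos_inf <- 1 / 0
# NaN ("not a number"): result of an undefined operation
not_num <- 0 / 0

# NA marks a missing value; most statistics return NA unless told to skip it
scores <- c(85, NA, 92)
mean_with_na <- mean(scores)                # NA
mean_no_na   <- mean(scores, na.rm = TRUE)  # 88.5

# NaN counts as missing for is.na(), but NA is not NaN
nan_is_na <- is.na(not_num)  # TRUE
na_is_nan <- is.nan(NA)      # FALSE

# NULL is an empty object with length 0
null_length <- length(NULL)  # 0

print(c(pos_inf, not_num, mean_with_na, mean_no_na))
```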
Classwork: Replicate all the operator code examples above.
3. Data Type and Data Structure
Data types
> x <- TRUE
> print(class(x))
[1] "logical"
> x <- 67.54
> print(class(x))
[1] "numeric"
> x <- 63L
> print(class(x))
[1] "integer"
> x <- 6 + 4i
> print(class(x))
[1] "complex"
> x <- "hello"
> print(class(x))
[1] "character"
> x <- charToRaw("hello")
> print(class(x))
[1] "raw"
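Because a vector can hold only one data type, R silently converts mixed values to the most general type. The sketch below illustrates that coercion and explicit conversion with the as.*() functions. All values are made up.

```r
# Mixing types in c() coerces everything to the most general type
mixed_num <- c(1L, 2.5)        # integer + double  -> numeric
mixed_chr <- c(1, "a", TRUE)   # anything + string -> character
mixed_int <- c(TRUE, 2L)       # logical + integer -> integer

print(class(mixed_num))
print(class(mixed_chr))
print(class(mixed_int))

# Explicit conversion with as.*() functions
as_int <- as.integer("42")                    # 42L
as_bad <- suppressWarnings(as.numeric("abc")) # NA: cannot be converted
print(as_int)
print(as_bad)
```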
Data structure
Vector and index
# we can use the c function to combine the values as a vector.
# By default the type will be double
X<- c(61, 4, 21, 67, 89, 2)
X
[1] 61 4 21 67 89 2
# seq() function for creating
# a sequence of continuous values.
# length.out defines the length of vector.
Y<- seq(1, 10, length.out = 5)
Y
[1] 1.00 3.25 5.50 7.75 10.00
# use':' to create a vector
# of continuous values.
Z<- 2:7
Z
[1] 2 3 4 5 6 7
vector<-seq(10,100,by=10)
vector[1]
[1] 10
vector[c(1,3)]
[1] 10 30
vector[7:10]
[1] 70 80 90 100
Operators on vectors
# Numeric vector
numbers <- c(1, 2, 3, 4, 5)
# Character vector
names <- c("Alice", "Bob", "Charlie")
# Addition
result <- numbers + 2
print(result)
[1] 3 4 5 6 7
# Multiplication
result <- numbers *2
print(result)
[1] 2 4 6 8 10
# Adding two vectors
vector1 <-c(1,2,3)
vector2 <-c(4,5,6)
result <- vector1 + vector2
print(result)
[1] 5 7 9
Vector
# Accessing elements
numbers <- c(1, 2, 3, 4, 5, 6, 7, 8)
# Access the second element
second_element <- numbers[2]
print(second_element)
[1] 2
# Replace the third element
numbers[3] <- -5
numbers
[1]  1  2 -5  4  5  6  7  8
# Remove the third element
numbers <- numbers[-3]
numbers
[1] 1 2 4 5 6 7 8
# Logical subset: get elements greater than 3
greater_than_three <- numbers[numbers > 3]
print(greater_than_three)
[1] 4 5 6 7 8
# Subset: get the first three elements
subset_vector <- numbers[1:3]
print(subset_vector)
[1] 1 2 4
Vector
# Vector naming
numbers <- c(1, 2, 3, 4, 5)
# Name the elements of the vector
names(numbers) <- c("First", "Second", "Third", "Fourth", "Fifth")
print(numbers)
 First Second  Third Fourth  Fifth
     1      2      3      4      5
# Get the type of the vector
vector_type <- typeof(numbers)
print(vector_type)
[1] "double"
# Combine vectors
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6)
combined_vector <- c(vector1, vector2)
print(combined_vector)
[1] 1 2 3 4 5 6
Vector factor
# Creating a factor from a character vector
colors <-c("red","green","blue","red","green")
color_factor <- factor(colors)
print(color_factor)
[1] red green blue red green
Levels: blue green red
# Specifying the order of levels
ordered_factor <- factor(colors, levels =c("red","green","blue"))
print(ordered_factor)
[1] red green blue red green
Levels: red green blue
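Factors are most useful for categorical variables such as sample groups. As a sketch, the example below uses a made-up vector of tumor stages to show how the level order affects counting and the underlying integer codes.

```r
# Made-up tumor stages for six samples
stage <- c("II", "I", "III", "I", "II", "I")

# Set the level order explicitly (the default would be alphabetical)
stage_factor <- factor(stage, levels = c("I", "II", "III"))
print(levels(stage_factor))

# table() counts observations per level, in level order
counts <- table(stage_factor)
print(counts)

# Internally, each value is stored as the integer position of its level
codes <- as.integer(stage_factor)
print(codes)
```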
Recycling rule
# The shorter vector is recycled to match the length of the longer vector
short_vector <-c(1,2)
long_vector <-c(10,20,30,40)
result <- long_vector + short_vector
print(result)
[1] 11 22 31 42
The Recycling Rule
→ How R handles operations between vectors of unequal lengths.
→ R will "recycle" the shorter vector by repeating its elements until it matches the length of the longer vector.
https://www.gastonsanchez.com/R-coding-basics/vectors4.html#recycling
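The recycling rule can be made explicit with rep(). The sketch below shows that the automatic result equals adding a manually repeated vector, and that R issues a warning when the longer length is not a multiple of the shorter one. The variable names manual, automatic and warning_text are only illustrative.

```r
long_vector  <- c(10, 20, 30, 40)
short_vector <- c(1, 2)

# What R does behind the scenes: repeat the short vector to the needed length
manual    <- long_vector + rep(short_vector, length.out = length(long_vector))
automatic <- long_vector + short_vector
print(identical(manual, automatic))

# Lengths 4 and 3 do not divide evenly, so R recycles but also warns;
# tryCatch() captures the warning message instead of printing it
warning_text <- tryCatch(long_vector + c(1, 2, 3),
                         warning = function(w) conditionMessage(w))
print(warning_text)
```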
Vector functions
# Sequences with seq()
> seq(from=3, to=27, by=3)
[1] 3 6 9 12 15 18 21 24 27
# Repetition with rep()
> rep(x=1,times=4)
[1] 1 1 1 1
> rep(x=c(3,62,8.3),times=3)
[1] 3.0 62.0 8.3 3.0 62.0 8.3 3.0 62.0 8.3
# Sorting with sort()
> sort(x=c(2.5,-1,-10,3.44),decreasing=FALSE)
[1] -10.00 -1.00 2.50 3.44
> sort(x=c(2.5,-1,-10,3.44),decreasing=TRUE)
[1] 3.44 2.50 -1.00 -10.00
# Finding a Vector length with length()
> length(x=c(3,2,8,1))
[1] 4
Data frame
Definition:
A data frame is a table, a 2-dimensional structure in R in which each column holds a single data type, but different columns can hold different types (numeric, character, factor, etc.).
Structure:
Similar to a spreadsheet or SQL table, with rows representing observations and columns representing variables.
Creating Data frame
# Create a data frame with three columns
df <- data.frame(ID = 1:4,
                 Name = c("Alice", "Bob", "Charlie", "Diana"),
                 Score = c(85, 92, 88, 76))
print(df)
  ID    Name Score
1  1   Alice    85
2  2     Bob    92
3  3 Charlie    88
4  4   Diana    76
Accessing data in Data frame
# Using $ to access columns: access the 'Name' column
names <- df$Name
print(names)
[1] "Alice"   "Bob"     "Charlie" "Diana"
# Using indexing: access the element in the 2nd row, 3rd column
element <- df[2, 3]
print(element)
[1] 92
Data frame
# Adding a new column: add a column 'Passed'
df$Passed <- df$Score > 80
print(df)
  ID    Name Score Passed
1  1   Alice    85   TRUE
2  2     Bob    92   TRUE
3  3 Charlie    88   TRUE
4  4   Diana    76  FALSE
# Subsetting a data frame with a condition
high_scores <- df[df$Score > 80, ]
print(high_scores)
  ID    Name Score Passed
1  1   Alice    85   TRUE
2  2     Bob    92   TRUE
3  3 Charlie    88   TRUE
Data frame
# Row binding: combine data frames by adding rows
# (the new row needs the same columns as df, including 'Passed')
df_new <- data.frame(ID = 5, Name = "Eve", Score = 90, Passed = TRUE)
combined_df <- rbind(df, df_new)
print(combined_df)
  ID    Name Score Passed
1  1   Alice    85   TRUE
2  2     Bob    92   TRUE
3  3 Charlie    88   TRUE
4  4   Diana    76  FALSE
5  5     Eve    90   TRUE
# Column binding: combine data frames by adding columns
extra_info <- data.frame(Age = c(23, 25, 22, 21, 24))
full_df <- cbind(combined_df, extra_info)
print(full_df)
  ID    Name Score Passed Age
1  1   Alice    85   TRUE  23
2  2     Bob    92   TRUE  25
3  3 Charlie    88   TRUE  22
4  4   Diana    76  FALSE  21
5  5     Eve    90   TRUE  24
Viewing and Inspecting Data Frames
# Viewing data
View(df)
# Explore the structure of the data
str(df)
'data.frame': 4 obs. of 4 variables:
 $ ID    : int 1 2 3 4
 $ Name  : chr "Alice" "Bob" "Charlie" "Diana"
 $ Score : num 85 92 88 76
 $ Passed: logi TRUE TRUE TRUE FALSE
Accessing data in Data frame
# Summary statistics: get a summary of each column
summary(df)
       ID           Name               Score         Passed
 Min.   :1.00   Length:4           Min.   :76.00   Mode :logical
 1st Qu.:1.75   Class :character   1st Qu.:82.75   FALSE:1
 Median :2.50   Mode  :character   Median :86.50   TRUE :3
 Mean   :2.50                      Mean   :85.25
 3rd Qu.:3.25                      3rd Qu.:89.00
 Max.   :4.00                      Max.   :92.00
Data frame
# Subset rows based on conditions: get rows where Score is greater than 80
high_scores <- df[df$Score > 80, ]
print(high_scores)
  ID    Name Score Passed
1  1   Alice    85   TRUE
2  2     Bob    92   TRUE
3  3 Charlie    88   TRUE
# Select specific columns: select only the 'Name' and 'Score' columns
name_score <- df[, c("Name", "Score")]
print(name_score)
     Name Score
1   Alice    85
2     Bob    92
3 Charlie    88
4   Diana    76
Adding and Modifying Columns
# Add a new column: indicate whether the score is above average
df$Above_Average <- df$Score > mean(df$Score)
print(df)
  ID    Name Score Passed Above_Average
1  1   Alice    85   TRUE         FALSE
2  2     Bob    92   TRUE          TRUE
3  3 Charlie    88   TRUE          TRUE
4  4   Diana    76  FALSE         FALSE
# Modify an existing column: add 5 points to each student's score
df$Score <- df$Score + 5
print(df)
  ID    Name Score Passed Above_Average
1  1   Alice    90   TRUE         FALSE
2  2     Bob    97   TRUE          TRUE
3  3 Charlie    93   TRUE          TRUE
4  4   Diana    81  FALSE         FALSE
Accessing data in Data frame
# Sort by a single column: sort the data frame by 'Score' in descending order
df_sorted <- df[order(-df$Score), ]
print(df_sorted)
  ID    Name Score Passed Above_Average
2  2     Bob    97   TRUE          TRUE
3  3 Charlie    93   TRUE          TRUE
1  1   Alice    90   TRUE         FALSE
4  4   Diana    81  FALSE         FALSE
# Sort by multiple columns: 'Passed' (descending), then 'Score' (ascending)
df_sorted_multi <- df[order(-df$Passed, df$Score), ]
print(df_sorted_multi)
  ID    Name Score Passed Above_Average
1  1   Alice    90   TRUE         FALSE
3  3 Charlie    93   TRUE          TRUE
2  2     Bob    97   TRUE          TRUE
4  4   Diana    81  FALSE         FALSE
Accessing data in Data frame
# Row binding: bind new data frame rows to an existing one
new_students <- data.frame(ID = 5,
                           Name = "Eve",
                           Score = 89,
                           Passed = TRUE,
                           Above_Average = FALSE)
df_combined <- rbind(df, new_students)
print(df_combined)
  ID    Name Score Passed Above_Average
1  1   Alice    90   TRUE         FALSE
2  2     Bob    97   TRUE          TRUE
3  3 Charlie    93   TRUE          TRUE
4  4   Diana    81  FALSE         FALSE
5  5     Eve    89   TRUE         FALSE
Data frame
# Column binding: add a new column for student age
ages <- data.frame(Age = c(23, 25, 22, 21, 24))
df_with_age <- cbind(df_combined, ages)
print(df_with_age)
  ID    Name Score Passed Above_Average Age
1  1   Alice    90   TRUE         FALSE  23
2  2     Bob    97   TRUE          TRUE  25
3  3 Charlie    93   TRUE          TRUE  22
4  4   Diana    81  FALSE         FALSE  21
5  5     Eve    89   TRUE         FALSE  24
Data frame
# Remove a column: remove the 'Passed' column
df_no_passed <- df[, !(names(df) %in% "Passed")]
print(df_no_passed)
  ID    Name Score Above_Average
1  1   Alice    90         FALSE
2  2     Bob    97          TRUE
3  3 Charlie    93          TRUE
4  4   Diana    81         FALSE
# Rename a column: rename 'Score' to 'Final_Score'
names(df)[names(df) == "Score"] <- "Final_Score"
print(df)
  ID    Name Final_Score Passed Above_Average
1  1   Alice          90   TRUE         FALSE
2  2     Bob          97   TRUE          TRUE
3  3 Charlie          93   TRUE          TRUE
4  4   Diana          81  FALSE         FALSE
Data frame
Accessing data in Data frame
# Merging data frames: merge two data frames by the 'ID' column
# ('Score' was renamed to 'Final_Score' in the previous step)
df_info <- data.frame(ID = 1:4, Gender = c("F", "M", "M", "F"))
df_merged <- merge(df, df_info, by = "ID")
print(df_merged)
  ID    Name Final_Score Passed Above_Average Gender
1  1   Alice          90   TRUE         FALSE      F
2  2     Bob          97   TRUE          TRUE      M
3  3 Charlie          93   TRUE          TRUE      M
4  4   Diana          81  FALSE         FALSE      F
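In the merge shown here every ID appears in both tables. When that is not the case, the all, all.x and all.y arguments of merge() control which rows are kept. The sketch below uses two small made-up tables, students and info, to compare the results.

```r
students <- data.frame(ID = 1:4,
                       Name = c("Alice", "Bob", "Charlie", "Diana"))
# Made-up lookup table: IDs 1 and 4 are missing, ID 5 has no student
info <- data.frame(ID = c(2L, 3L, 5L),
                   Gender = c("M", "M", "F"))

# Default (inner join): only IDs present in both tables
inner <- merge(students, info, by = "ID")

# all.x = TRUE (left join): keep every student, fill gaps with NA
left <- merge(students, info, by = "ID", all.x = TRUE)

# all = TRUE (full join): keep every ID from either table
full <- merge(students, info, by = "ID", all = TRUE)

print(inner)
print(left)
print(full)
```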
Key Functions in Data frame
Other Key Functions
Matrix and array
Matrix
Definition: A matrix is a two-dimensional (2D) data structure in R where all elements are of the same data type (numeric, character, or logical).
Structure: Consists of rows and columns.
Array
Definition: An array is a multi-dimensional data structure in R that can have more than two dimensions. All elements must be of the same type.
Structure: Arrays can be thought of as matrices extended to more dimensions.
Matrix and array
# Matrix: create a 3x3 numeric matrix
mat <- matrix(1:9, nrow = 3, ncol = 3)
print(mat)
# Array: create a 3x3x2 array
arr <- array(1:18, dim = c(3, 3, 2))
print(arr)
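A short sketch of indexing the matrix and array created above. It also shows that matrix() fills column by column unless byrow = TRUE is given. The name by_row is only illustrative.

```r
# matrix() fills column by column by default
mat <- matrix(1:9, nrow = 3, ncol = 3)
elem      <- mat[2, 3]  # row 2, column 3
first_row <- mat[1, ]   # whole first row
print(elem)
print(first_row)

# byrow = TRUE fills row by row instead
by_row <- matrix(1:9, nrow = 3, byrow = TRUE)
print(by_row[1, ])

# Arrays take one index per dimension: [row, column, slice]
arr <- array(1:18, dim = c(3, 3, 2))
arr_elem <- arr[1, 2, 2]
print(arr_elem)
print(dim(arr))
```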
4. Functions
R function
https://www.statmethods.net/management/functions.html
https://iqss.github.io/dss-workshops/R/Rintro/base-r-cheat-sheet.pdf
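As a minimal sketch of a user-defined function, the example below defines a hypothetical fold_change() helper with two required arguments and one optional argument with a default value. The function name and its arguments are made up for illustration and are not part of base R.

```r
# Hypothetical helper: ratio of treated to control expression values,
# returned on the log2 scale unless log2_scale = FALSE
fold_change <- function(treated, control, log2_scale = TRUE) {
  ratio <- treated / control
  if (log2_scale) {
    return(log2(ratio))
  }
  ratio
}

fc_log <- fold_change(8, 2)                      # log2(4)
fc_raw <- fold_change(8, 2, log2_scale = FALSE)  # plain ratio
fc_vec <- fold_change(c(4, 16), c(1, 4))         # works on vectors too

print(fc_log)
print(fc_raw)
print(fc_vec)
```

Because the arithmetic inside is vectorised, the same function works for a single pair of values or for whole columns of a data frame.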
Useful built-in functions
Data Manipulation
● subset(): Extract subsets of data.
● merge(): Combine data frames by common columns or row names.
● apply(): Apply a function over the margins of an array or matrix.
● tapply(): Apply a function over subsets of a vector.
● reshape(): Reshape data between wide and long formats.
● cut(): Divide continuous variables into intervals.
● aggregate(): Compute summary statistics over subsets of data.
Statistical Analysis
● summary(): Provide a summary of an object.
● cor(): Calculate correlation between variables.
● lm(): Fit linear models.
● table(): Create a contingency table of counts.
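A few of the functions listed above applied to a small made-up data frame. This is a sketch rather than a full tour. The group labels, scores and cut points are invented for illustration.

```r
df <- data.frame(Group = c("A", "A", "B", "B"),
                 Score = c(85, 92, 88, 76))

# subset(): keep rows that satisfy a condition
high <- subset(df, Score > 80)

# tapply() and aggregate(): mean score per group
group_means <- tapply(df$Score, df$Group, mean)
agg <- aggregate(Score ~ Group, data = df, FUN = mean)

# cut(): turn scores into intervals (0,80], (80,90], (90,100]
grade <- cut(df$Score, breaks = c(0, 80, 90, 100),
             labels = c("low", "mid", "high"))

# table(): count observations per category
grade_counts <- table(grade)

# apply(): sum over the rows (margin 1) of a matrix
m <- matrix(1:6, nrow = 2)
row_sums <- apply(m, 1, sum)

print(group_means)
print(agg)
print(grade_counts)
print(row_sums)
```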
Data Cleaning
● na.omit(): Remove missing values from an object.
● is.na(): Identify missing values.
● duplicated(): Identify duplicate elements.
Data Visualization
● plot(): Generic X-Y plotting.
● hist(): Create a histogram.
● boxplot(): Create a boxplot.
● pairs(): Create a matrix of scatterplots.
Utility Functions
● str(): Display the structure of an R object.
● paste(): Concatenate strings.
● seq(): Generate a sequence of numbers.
● rep(): Repeat elements of a vector.
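The data cleaning and utility functions above can be sketched on a small made-up vector with missing and repeated values.

```r
values <- c(3, NA, 7, 3, NA)

# is.na(): TRUE where a value is missing
missing <- is.na(values)

# na.omit(): drop missing values (as.numeric() strips the extra attributes)
clean <- as.numeric(na.omit(values))

# duplicated(): TRUE for every repeat of an earlier element (NA included)
dups <- duplicated(values)

# paste(), seq() and rep(): build labels for samples
sample_ids <- paste("sample", seq(1, 3), sep = "_")
groups <- rep(c("tumor", "normal"), times = 2)

print(missing)
print(clean)
print(dups)
print(sample_ids)
print(groups)
```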
5. Decision Making
Decision making
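A minimal sketch of decision making in R, assuming the usual if / else if / else form for a single value and the vectorised ifelse() for a whole vector. The score thresholds are made up for illustration.

```r
score <- 76

# if / else if / else: tests one condition at a time on a single value
if (score > 80) {
  result <- "passed"
} else if (score >= 70) {
  result <- "borderline"
} else {
  result <- "failed"
}
print(result)

# ifelse(): applies the same test to every element of a vector
scores <- c(85, 92, 88, 76)
status <- ifelse(scores > 80, "passed", "failed")
print(status)
```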
6. Control Flow
Control flow
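A minimal sketch of the two most common loops in R, for and while, together with next to skip an iteration. The scores vector is made up for illustration.

```r
scores <- c(85, 92, 88, 76)

# for loop: visit each element in turn and accumulate a total
total <- 0
for (s in scores) {
  total <- total + s
}
print(total)

# while loop: find the position of the first score that is not above 80
i <- 1
while (i <= length(scores) && scores[i] > 80) {
  i <- i + 1
}
print(i)

# next skips the rest of the current iteration
n_high <- 0
for (s in scores) {
  if (s <= 80) next
  n_high <- n_high + 1
}
print(n_high)
```

For simple totals like this, vectorised functions such as sum(scores) are usually shorter and faster than an explicit loop.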
R cheat sheet
https://iqss.github.io/dss-workshops/R/Rintro/base-r-cheat-sheet.pdf
Summary in R tutorial
Thank you