Data Wrangling
with DataFrames.jl
Cheat Sheet
(for version 1.x)
Create DataFrame
Reshape Data - changing layout
Tidy Data - the foundation of data wrangling
Select Observations (rows)
Select Variables (columns)
Each variable is saved�in its own column.
In a tidy data set:
Each observation is
saved in its own row.
&
Tidy data makes data analysis easier and more intuitive. DataFrames.jl can help you tidy up your data.
Sort Data
stack(df, [:sibsp, :parch])
Stack columns data as rows
with new variable and value columns�
unstack(df, :variable, :value)
Unstack rows into columns� using variable and value columns
sort(df, :age)
Sort by age
sort(df, :age, rev = true)� Sort by age in reverse order
sort(df, [:age, order(:sibsp, rev = true)])� Sort by in ascending age and descending sibsp order
DataFrame(x = [1,2,3], y = 4:6, z = 9)
Create data frame with column data� from vector, range, or constant.
�DataFrame([(x=1, y=2), (x=3, y=4)])
Create data frame from a vector of
named tuples.�
DataFrame("x" => [1,2], "y" => [3,4])
Create data frame from pairs of� column name and data.�
DataFrame(rand(5, 3), [:x, :y, :z])
DataFrame(rand(5, 3), :auto)� Create data frame from a matrix.�
DataFrame()� Create an empty data frame without� any columns.
DataFrame(x = Int[], y = Float64[])
Create an empty data frame with
typed columns.
DataFrame(mytable)
Create data frame from any data� source that supports Tables.jl interface.
Describe DataFrame
describe(df)
Summary stats for all columns.
describe(df, :mean, :std)
Specific stats for all columns.
describe(df, extrema => :extrema)
Apply custom function to all� columns.
Function syntax
first(df, 5) or last(df, 5)
First 5 rows or last 5 rows
unique(df)
unique(df, [:pclass, :survived])
Return data frame with unique rows.
filter(:sex => ==("male"), df)�filter(row -> row.sex == "male", df)
Return rows having sex equals “male”.
Note: the first syntax performs better.
subset(df, :survived)
subset(df, :sex => x -> x .== "male")
Return rows for which value is true.
Note: the “survived” column is Bool type
�
Function syntax
select(df, :sex)
select(df, "sex")
select(df, [:sex, :age])
Select desired column(s).
select(df, 2:5)
Select columns by index.
select(df, r"^s")
Select columns by regex.
select(df, Not(:age))
Select all columns except the� age column.
select(df, Between(:name, :age))
Select all columns between
name and age columns.
�
View Metadata
names(df)
propertynames(df)
Column names.
columnindex(df, "sex")
Index number of a
column.
�
Handle Missing Data
dropmissing(df)
dropmissing(df, [:age, :sex])
Return rows without any missing data.
allowmissing(df)
allowmissing(df, :sibsp)
Allow missing data in column(s).
disallowmissing(df)
disallowmissing(df, :sibsp)
Do not allow missing data in column(s).
completecases(df)
completecases(df, [:age, :sex])
Return Bool array with true entries� for rows without any missing data.
�
This cheat sheet is inspired by the data wrangling cheat sheets from RStudio and pandas. Examples are based on the Kaggle Titanic data set. Created by Tom Kwong, May 2021. v1.x rev1 Page 1 / 2
Mutation: use sort!
Mutation: use select!
Mutation: use dropmissing!, allowmissing!, or disallowmissing!
nrow(df)
ncol(df)
Number of
rows and
columns.
Mutation: use unique!, filter!, or subset!
Indexing syntax
df[6:10, :]
Return rows 6 to 10
df[df.sex .== "male", :]
Return rows having sex equals “male”.
df[findfirst(==(30), df.age), :]
Return first row having age equals 30.
df[findall(==(1), df.pclass), :]
Return all rows having pclass equals 1.
Indexing syntax
df[:, [:sex, :age]]
Select a copy of columns.
df[!, [:sex, :age]]
Select original column vectors.
P.S. Indexing syntax can select observations and variables at the same time!
Moving Stats (a.k.a Rolling Stats)
select(df, :x => (v -> runmean(v, n)))
select(df, :x => (v -> runmedian(v, n)))
select(df, :x => (v -> runmin(v, n)))
select(df, :x => (v -> runmax(v, n)))
Moving mean, medium, minimu, and maximum
for column x with window size n
The run* functions (and more) are available from RollingFunctions.jl package.
@pipe df |>
filter(:sex => ==("male"), _) |>
groupby(_, :pclass) |>
combine(_, :age => mean)
Cumulative and Moving Stats
Summarize Data
Aggregating variables
combine(df, :survived => sum)
combine(df, :survived => sum => :survived)
Apply a function to a column; optionally assign colum name.
combine(df, :age => (x -> mean(skipmissing(x))))
Apply an anonymous function to a column.
combine(df, [:parch, :sibsp] .=> maximum)
Apply a function to multiple columns using broadcasting syntax.
Combine Data Sets
innerjoin(df1, df2, on = :id)
This cheat sheet is inspired by the data wrangling cheat sheets from RStudio and pandas. Examples are based on the Kaggle Titanic data set. Created by Tom Kwong, May 2021. v1.x rev1 Page 2 / 2
gdf = groupby(df, :pclass)
gdf = groupby(df, [:pclass, :sex])
Group data frame by one or more columns.
keys(gdf)
Get the keys for looking up SubDataFrame’s in the group.
gdf[(1,)]
Look up a specific group using a tuple of key values.
combine(gdf, :survived => sum)
Apply a function over a column for every group. Returns a single data frame.
combine(gdf) do sdf
DataFrame(survived = sum(sdf.survived))
end
Apply a function to each SubDataFrame in the group and combine results.
combine(gdf, AsTable(:) => t -> sum(t.parch .+ t.sibsp))
Apply a function to each SubDataFrame in the group and combine results.
Group Data Sets
leftjoin(df1, df2, on = :id)
rightjoin(df1, df2, on = :id)
outerjoin(df1, df2, on = :id)
semijoin(df1, df2, on = :id)
antijoin(df1, df2, on = :id)
vcat(df1, df2)
hcat(df1, df2)
Build Data Pipeline
Cumulative Stats
select(df, :x => cumsum)
select(df, :x => cumprod)
Cumulative sum and product of column x.
select(df, :x => v -> accumulate(min, v))
select(df, :x => v -> accumulate(max, v))
Cumulative minimum/maximum of column x.
select(df, :x => v -> cumsum(v) ./ (1:length(v)))
Cumulative mean of column x.
The @pipe macro comes from Pipe.jl package. Underscores are automatically replaced by return value from the previous operation before the |> operator.
select(df, :x => ordinalrank) # 1234
select(df, :x => competerank) # 1224
select(df, :x => denserank) # 1223
select(df, :x => tiedrank) # 1 2.5 2.5 4
The *rank functions come from StatsBase.jl package.
select(df, :x => lead) # shift up
select(df, :x => lag) # shift down
The lead and lag functions come from ShiftedArrays.jl package.
Ranking and Lead/Lag Functions
Data frames can be combined vertically or horizontally.
Tips:�You can also use
these functions to add summarized data to all rows:
Tips: Use skipmissing function to remove missing values.
Adding variables with aggregation results
transform(df, :fare => mean => :average_fare)
Add a new column that is populated with the aggregated value.
select(df, :name, :fare, :fare => mean => :average_fare)
Select any columns and add new ones with the aggregated value.
Adding variables by row
transform(df, [:parch, :sibsp] => ByRow(+) => :relatives)
Add new column by applying a function over existing column(s).
transform(df, :name => ByRow(x -> split(x, ",")) => [:lname, :fname])
Add new columns by applying a function that returns multiple values.