The third part of this series is dedicated to data creation and wrangling.
Wrangling is the entry point to most data analysis workflows. While the {tidyverse}
ecosystem offers a bunch of features to filter, mutate, etc., particularly via the {dplyr}
package, you can combine a few base R functions to achieve the same results.
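To make that claim concrete, here is a minimal sketch of filtering, mutating and sorting with base R only. The `scores` data frame and its column names are invented for illustration, and the comments name the {dplyr} verbs each line roughly corresponds to.

```r
# Hypothetical data, invented for this illustration
scores <- data.frame(
  name = c("Ana", "Ben", "Chloe", "Dev"),
  mark = c(12, 8, 15, 10)
)

# Roughly dplyr::filter(): keep the rows that meet a condition
passed <- scores[scores$mark >= 10, ]

# Roughly dplyr::mutate(): add a column derived from an existing one
passed$mark_pct <- passed$mark / 20 * 100

# Roughly dplyr::arrange(desc(mark)): reorder the rows
passed <- passed[order(passed$mark, decreasing = TRUE), ]

passed
```

The functions used here (`[`, `$` and `order()`) are all described in the tables of this post.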
Data creation
Functions | Tasks / Examples |
---|---|
`c()` | the generic function that combines its arguments into an atomic vector. Ex: `x <- c(1, 2, 3)`, `x <- c("Paris", "Bordeaux", "Le Mans")`, `x <- c(TRUE, TRUE, FALSE)`. |
`from:to` | generates a sequence of integers. Ex: `1:3` produces the vector `c(1,2,3)`; `0:-2` produces the vector `c(0,-1,-2)`. |
`seq(from, to, by = 1)` | generates a sequence from `from` to `to` with the step passed to `by`. Ex: `seq(1,8,by=2)` produces the vector `c(1,3,5,7)`. The function has other arguments that are sometimes used instead of `by`. Ex: `seq(1,10,length.out=3)` produces the vector `c(1.0, 5.5, 10.0)`. The `length.out` argument asks for a sequence of `length.out` equally spaced values from `from` to `to`. |
`seq_along(x)` | produces, where `x` is a vector, the integer sequence `1, 2, ..., length(x)`. Ex: `seq_along(c(4,5,6,7))` produces the vector `c(1,2,3,4)`. The function is very helpful in `for` loops. |
`seq_len(length.out)` | generates the integer sequence `1, 2, ..., length.out`, except when `length.out = 0`, in which case it generates `integer(0)`. Ex: `seq_len(4)` produces the vector `c(1,2,3,4)`. |
`rep(x, times)` | repeats the vector `x` `times` times. Ex: `rep(c(2,3), 2)` produces `c(2,3,2,3)`. You can use the argument `each` instead of `times` to specify how many times each element of `x` should be repeated. Ex: `rep(c(2,3), each=3)` produces `c(2,2,2,3,3,3)`. |
`factor(x, levels)` | transforms a vector `x` into a factor. Ex: `factor(c("Apple", "Strawberry", "Raspberry", "Apple"))` creates a factor with 3 categories: `Apple`, `Strawberry` and `Raspberry`. |
`list(...)` | creates a list from arguments, named or not, which can have different lengths and types. Ex: `list(x=1:2, b="Myriam", c=TRUE)`. |
`data.frame()` | creates a data frame from vectors, named or not. Ex: `data.frame(student = c("Adam Kennington", "Pamela Ritchie"), mark = c("A-", "A+"))`. Shorter vectors are recycled to match the length of the longest, provided that length is an exact multiple of theirs. |
`rbind()` | combines arguments (vectors, matrices or data frames) by row. |
`cbind()` | combines arguments (vectors, matrices or data frames) by column. |
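The creation functions in the table above can be tried together in a short session. This is only a sketch: the `newcomer` row and the `year` column are hypothetical additions used to show `rbind()` and `cbind()`, which have no example in the table.

```r
# Sequences
s1    <- seq(1, 8, by = 2)           # 1 3 5 7
s2    <- seq(1, 10, length.out = 3)  # 1.0 5.5 10.0
idx   <- seq_along(c(4, 5, 6, 7))    # 1 2 3 4
empty <- seq_len(0)                  # integer(0): a for loop over it runs zero times

# Repetition: whole vector vs. element by element
r1 <- rep(c(2, 3), times = 2)  # 2 3 2 3
r2 <- rep(c(2, 3), each = 3)   # 2 2 2 3 3 3

# Factor: by default the levels are the sorted unique values
fruits <- factor(c("Apple", "Strawberry", "Raspberry", "Apple"))
levels(fruits)

# Data frame, then one more row (rbind) and one more column (cbind)
students <- data.frame(
  student = c("Adam Kennington", "Pamela Ritchie"),
  mark    = c("A-", "A+")
)
newcomer     <- data.frame(student = "Sam Doe", mark = "B+")  # hypothetical row
all_students <- rbind(students, newcomer)
all_students <- cbind(all_students, year = 2023)  # single value recycled to every row
all_students
```

Note that `rbind()` on data frames expects matching column names, which is why `newcomer` reuses `student` and `mark`.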
Data wrangling
Functions | Tasks / Examples |
---|---|
`x[i]` | returns the ith element of a vector. If `x` is a list, `i` can also be a name, and `x[i]` returns a list containing the selected element(s). Ex: with `x <- list(a = "")`, `x["a"]` returns `list(a = "")`. |
`x[[n]]` | returns the nth element of a list. |
`x[-n]` | returns all the elements of the vector `x` except the nth. |
`x[1:n]` | returns the first n elements of the vector `x`. |
`x[-(1:n)]` | returns all elements of the vector `x` except the first n. |
`x[c(1,3)]` | returns the 1st and the 3rd elements of the vector `x`. |
`x[-c(1,3)]` | returns all elements of the vector `x` except the 1st and the 3rd. |
`x[["name"]]` / `x$name` | returns the column named `name` when `x` is a data frame. |
`as.data.frame(x)`, `as.numeric(x)`, `as.logical(x)`, `as.character(x)`, etc. | convert `x` respectively to type `data.frame`, `numeric`, `logical` or `character`. You can view all conversion methods with `methods(as)`. |
`is.na(x)`, `is.null(x)`, `is.array(x)`, `is.data.frame(x)`, `is.numeric(x)`, `is.logical(x)`, `is.character(x)`, etc. | return `TRUE` if `x` passes the check named by the function (missing value, `NULL`, a given type) and `FALSE` otherwise. Ex: `is.logical(4)` returns `FALSE`, `is.double(4.4)` returns `TRUE`. Here too, you can view all type checking methods with `methods(is)`. |
`nchar(x)` | takes a character vector and returns a vector with the number of characters of each element. Ex: `nchar(c("banana", "strawberry", 27))` returns `c(6,10,2)`. |
`length(x)` | gets or sets the length of vectors (including lists). Ex: `length(1:4)` returns `4`. `length(list(1:5,1:4))` returns `2`. |
`lengths(x)` | gets the length of each element of a list or atomic vector. Ex: `lengths(list(1:3,1:5))` returns `c(3,5)`. |
`append(x, y, after = length(x))` | adds the elements of the vector `y` to the vector `x` after the position `after`. Ex: `append(1:8, 0:1, after = 2)`, which reads as "add the vector `c(0,1)` after the second element of the vector `c(1,2,3,4,5,6,7,8)`", returns `c(1,2,0,1,3,4,5,6,7,8)`. |
`nrow(x)` / `NROW(x)` | returns the number of rows of `x` when `x` is a data frame or a matrix. When `x` is a vector, `NROW` treats it as a one-column matrix and returns the number of elements. |
`ncol(x)` / `NCOL(x)` | same as `nrow` / `NROW` but for columns. |
`which.min(x)` | returns the index of the (first) minimum of a numeric (or logical) vector. Ex: `which.min(c(1,0,4,5,0))` returns `2`, which corresponds to the index of the first `0` (the minimum of the numeric vector). |
`which.max(x)` | same as `which.min` but for the maximum. Ex: `which.max(c(5,3,-1,2,5))` returns `1`. |
`which(x == a)` | returns the indices of `x` for which the result of the logical operation `x == a` is `TRUE`. Ex: with `x <- c(0,1,3,4,2,4,5)`, `which(x %% 2 == 0)` returns `c(1,4,5,6)`, which corresponds to the indices of the even elements of the numeric vector. |
`rev(x)` | returns the reversed version of a vector or a reversible object `x`. Ex: `rev(c("Roger","Rafa","Novak"))` returns `c("Novak","Rafa","Roger")`. |
`sort(x, decreasing = FALSE)` | sorts a vector or factor into ascending (`decreasing = FALSE`) or descending order (`decreasing = TRUE`). Ex: `sort(c(1,4,3,7,-1,8), decreasing = TRUE)` returns `c(8,7,4,3,1,-1)`. |
`order(..., decreasing = FALSE)` | returns a permutation which rearranges its first argument into ascending or descending order, breaking ties by further arguments. Ex: with `my_df <- data.frame(name = c("Oliver","Frank","Mohsen"), age = c(15, 25, 21))`, `my_df[order(my_df$age),]` returns the rows sorted by age, with the same content as `data.frame(name = c("Oliver","Mohsen","Frank"), age = c(15,21,25))`. |
`rank(x)` | returns the sample ranks of the values in a vector. Ex: `rank(c(8,4,2,5))` returns `c(4,2,1,3)`. Equal and missing values can be handled via the `ties.method` and `na.last` arguments. Feel free to read the documentation to learn about the different options. |
`cut(x, breaks)` | converts the numeric vector `x` to a factor by dividing the range of `x` into intervals and coding each value by the interval it falls into. Ex: `cut(1:5, 3)` returns a factor with values `(0.996,2.33]`, `(0.996,2.33]`, `(2.33,3.67]`, `(3.67,5]`, `(3.67,5]` and levels `(0.996,2.33]`, `(2.33,3.67]`, `(3.67,5]`. `cut(1:5, breaks = c(0,2,3,5))` returns a factor with values `(0,2]`, `(0,2]`, `(2,3]`, `(3,5]`, `(3,5]` and levels `(0,2]`, `(2,3]`, `(3,5]`. There are many other arguments, to set the result labels, to include the lowest value, etc. |
`unique(x)` | returns a vector/data frame where the duplicate elements/rows of `x` are removed. |
`table(...)` | returns a contingency table of the counts at each combination of factor levels. |
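As a quick way to see several of the functions above working together, here is a hedged sketch. The vector `x`, the list `l` and the extended `my_df` (one extra hypothetical row, "Ines") are made up for the illustration.

```r
x <- c(10, 20, 30, 40, 50)

x[2]         # 20
x[-2]        # 10 30 40 50
x[1:3]       # 10 20 30
x[-(1:3)]    # 40 50
x[c(1, 3)]   # 10 30
x[-c(1, 3)]  # 20 40 50

# On a list, [ keeps the list wrapper while [[ extracts the element itself
l        <- list(a = 1:3, b = "text")
sub_list <- l["a"]    # a list of length 1
element  <- l[["a"]]  # the integer vector 1:3

# Hypothetical data frame, extended from the order() example
my_df <- data.frame(
  name = c("Oliver", "Frank", "Mohsen", "Ines"),
  age  = c(15, 25, 21, 25)
)

# Sort by decreasing age, breaking ties by name
by_age <- my_df[order(-my_df$age, my_df$name), ]

# Positions of the oldest people: which() returns all of them, which.max() only the first
oldest_all   <- which(my_df$age == max(my_df$age))
oldest_first <- which.max(my_df$age)

# Bin the ages, then count how many fall in each bin
age_group    <- cut(my_df$age, breaks = c(0, 18, 30))
group_counts <- table(age_group)

distinct_ages <- unique(my_df$age)
```

Negating a numeric column inside `order()` is one common way to mix descending and ascending keys; `decreasing = TRUE` would instead reverse every key at once.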
This brings us to the end of the third and penultimate part of this series on data structures and wrangling with base R. Once again, I do not claim to cover every existing function; that is not the goal. I have listed the ones I consider essential for basic data wrangling with R. You may have noticed the absence of functions for mathematical operations: they will be covered in the last part of the series.
Citation
@online{issabida2023,
author = {Abdoul ISSA BIDA},
title = {Base {R} {Essentials} - {Part} 3},
date = {2023-02-18},
url = {https://www.abdoulblog.com},
langid = {en}
}