library(knitr)
opts_chunk$set(echo=TRUE, warning=TRUE, fig.width=7, fig.height=7)
library(dplyr)
dplyr is a great package for processing data smoothly: the output of one
function can be piped into the next function with the %>% operator.
Today, I would like to share my experience of processing grouped data when the
function applied to each group returns a data.frame with multiple columns or rows.
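As a quick illustration of the pipe, here is a minimal sketch on a made-up toy vector (the vector and variable names are assumptions for illustration only). It shows that a piped chain gives the same result as the equivalent nested call.

```r
library(dplyr)

# Toy vector, made up for illustration
v <- c(4, 9, 16)

# Nested call: has to be read inside-out
nested <- round(mean(sqrt(v)), 1)

# Piped call: read left to right; each result becomes the
# first argument of the next function
piped <- v %>% sqrt() %>% mean() %>% round(1)

# Both forms compute the same value
stopifnot(identical(nested, piped))
```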
Data
We will generate some fake data for this test.
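# g1: three groups of 10 rows each
g1 <- rep(c("a", "b", "c"), each = 10)
# g2: alternating blocks of five "A" and five "B", 30 values in total
g2 <- rep(c("A", "B"), each = 5, length.out = 30)
# x: 30 random values from a normal distribution (mean 0, sd 10)
x <- rnorm(30, 0, 10)
dat <- data.frame(g1, g2, x)
kable(head(dat))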
g1 | g2 | x |
---|---|---|
a | A | 0.9489863 |
a | A | -0.5877112 |
a | A | 11.6968621 |
a | A | 7.0672042 |
a | A | 12.6570049 |
a | B | 8.1088123 |
As you can see, this dataset includes 3 columns:
g1: a categorical variable with 3 levels, “a”, “b”, and “c”
g2: another categorical variable with 2 levels, “A” and “B”, which are evenly balanced within each value of g1
x: the value column, which will be statistically tested
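The balance of g2 within g1 can be checked with a cross-tabulation. This is a small sketch that rebuilds only the two grouping vectors from above; the `counts` name is just for illustration.

```r
# Rebuild the two grouping variables exactly as in the data step
g1 <- rep(c("a", "b", "c"), each = 10)
g2 <- rep(c("A", "B"), each = 5, length.out = 30)

# Cross-tabulate: rows are levels of g1, columns are levels of g2
counts <- table(g1, g2)
print(counts)

# Every g1 x g2 cell should hold the same number of rows
stopifnot(all(counts == 5))
```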
Tests
We will run a statistical test on the value x between “A” and “B” (column g2) within each category of g1.
For this, we will use dplyr’s group_by() function to split the data, and then run the test on each subset, with each subset returning a data.frame.
This data.frame can be single-row or multi-row, and we will see how dplyr handles the returned results in each case.
First, let’s use the summarize() function to collapse the results.
# this function returns a single-row data.frame
my_test <- function(v, g) {
  # two-sample t-test of v between the two levels of g
  res <- t.test(v ~ g)
  df <- data.frame(mean1 = res$estimate[1], mean2 = res$estimate[2], P = res$p.value)
  return(df)
}
dat %>%
  group_by(g1) %>%
  summarize(with(.data, my_test(x, g2))) %>%
  kable()
g1 | mean1 | mean2 | P |
---|---|---|---|
a | 6.356469 | 3.681912 | 0.5029446 |
b | -1.226555 | -4.937334 | 0.4937346 |
c | 6.793084 | 2.280872 | 0.3598557 |
As you can see, this works well: summarize() returns a new data.frame with the group name and the test result for each group.
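As far as I know, the with(.data, ...) wrapper is optional here: since dplyr 1.0.0, an unnamed data.frame returned inside summarize() is unpacked into columns directly. The sketch below assumes a fixed seed (my addition, so the numbers will not match the table in this post) and checks one group mean by hand.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, an assumption for reproducibility
dat <- data.frame(
  g1 = rep(c("a", "b", "c"), each = 10),
  g2 = rep(c("A", "B"), each = 5, length.out = 30),
  x  = rnorm(30, 0, 10)
)

# Same test as in the post; unname() drops the "mean in group ..." labels
my_test <- function(v, g) {
  res <- t.test(v ~ g)
  data.frame(
    mean1 = unname(res$estimate[1]),
    mean2 = unname(res$estimate[2]),
    P     = res$p.value
  )
}

# Column names x and g2 are resolved within each group automatically
res_direct <- dat %>%
  group_by(g1) %>%
  summarize(my_test(x, g2))

# mean1 for group "a" should equal the plain mean of its "A" rows
mean_a_A <- mean(dat$x[dat$g1 == "a" & dat$g2 == "A"])
stopifnot(isTRUE(all.equal(res_direct$mean1[res_direct$g1 == "a"], mean_a_A)))
```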
Let’s also try a function returning multiple rows.
# this function returns a multi-row data.frame
my_quantiles <- function(v) {
  probs <- seq(0, 1, 0.25)
  # use the argument v here; a bare x would silently pick up the global vector x
  qt <- quantile(v, probs = probs)
  data.frame(quant = qt, prob = probs)
}
dat %>%
  group_by(g1) %>%
  summarize(with(.data, my_quantiles(x))) %>%
  kable()
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
## always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'g1'. You can override using the `.groups`
## argument.
g1 | quant | prob |
---|---|---|
a | -13.250311 | 0.00 |
a | -2.206370 | 0.25 |
a | 1.124772 | 0.50 |
a | 9.544496 | 0.75 |
a | 16.158428 | 1.00 |
b | -13.250311 | 0.00 |
b | -2.206370 | 0.25 |
b | 1.124772 | 0.50 |
b | 9.544496 | 0.75 |
b | 16.158428 | 1.00 |
c | -13.250311 | 0.00 |
c | -2.206370 | 0.25 |
c | 1.124772 | 0.50 |
c | 9.544496 | 0.75 |
c | 16.158428 | 1.00 |
The call still returns a result, but with a deprecation warning that asks us to replace summarize() with reframe(), so let’s try this new function.
dat %>%
  group_by(g1) %>%
  reframe(with(.data, my_quantiles(x))) %>%
  kable()
g1 | quant | prob |
---|---|---|
a | -13.250311 | 0.00 |
a | -2.206370 | 0.25 |
a | 1.124772 | 0.50 |
a | 9.544496 | 0.75 |
a | 16.158428 | 1.00 |
b | -13.250311 | 0.00 |
b | -2.206370 | 0.25 |
b | 1.124772 | 0.50 |
b | 9.544496 | 0.75 |
b | 16.158428 | 1.00 |
c | -13.250311 | 0.00 |
c | -2.206370 | 0.25 |
c | 1.124772 | 0.50 |
c | 9.544496 | 0.75 |
c | 16.158428 | 1.00 |
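Great, the warning is gone. Note, however, that the two quantile tables shown here list the same five values for every group. That is the output you get when my_quantiles() calls quantile(x, ...), which picks up the global vector x (all 30 values) instead of the group’s values passed in as v. With quantile(v, probs = probs), each group gets its own quantiles.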
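To check that each group really gets its own quantiles, here is a minimal sketch with a fixed seed (the seed and the regenerated data are my assumptions, so the numbers will not match the tables in this post). The helper reads its argument v, and the result for group “a” is compared with quantile() computed by hand.

```r
library(dplyr)  # reframe() needs dplyr >= 1.1.0

set.seed(1)  # arbitrary seed, an assumption for reproducibility
dat <- data.frame(
  g1 = rep(c("a", "b", "c"), each = 10),
  g2 = rep(c("A", "B"), each = 5, length.out = 30),
  x  = rnorm(30, 0, 10)
)

# Multi-row helper that uses its own argument, not a global variable
my_quantiles <- function(v) {
  probs <- seq(0, 1, 0.25)
  data.frame(quant = unname(quantile(v, probs = probs)), prob = probs)
}

res_q <- dat %>%
  group_by(g1) %>%
  reframe(my_quantiles(x))

# Quantiles of group "a" computed by hand, for comparison
by_hand_a <- unname(quantile(dat$x[dat$g1 == "a"], probs = seq(0, 1, 0.25)))
stopifnot(isTRUE(all.equal(res_q$quant[res_q$g1 == "a"], by_hand_a)))
```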
Conclusions
We can combine group_by(), summarize(), and a function returning a data.frame to easily analyze data by group with dplyr.
When the returned data.frame is multi-row, summarize() should be replaced with reframe().
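One practical consequence, mentioned in the deprecation warning, is that reframe() always returns an ungrouped data frame. The sketch below checks this and also shows the `.by` argument of reframe() as an alternative to a separate group_by() call. The helper `quartile_bounds()`, the seed, and the regenerated data are hypothetical, added only for illustration.

```r
library(dplyr)  # reframe() and .by need dplyr >= 1.1.0

set.seed(1)  # arbitrary seed, an assumption for reproducibility
dat <- data.frame(
  g1 = rep(c("a", "b", "c"), each = 10),
  x  = rnorm(30, 0, 10)
)

# Hypothetical two-row helper: lower and upper quartile of v
quartile_bounds <- function(v) {
  probs <- c(0.25, 0.75)
  data.frame(quant = unname(quantile(v, probs = probs)), prob = probs)
}

# Grouping with group_by(), then reframe()
out <- dat %>%
  group_by(g1) %>%
  reframe(quartile_bounds(x))

# Same computation with per-call grouping via .by
out_by <- dat %>%
  reframe(quartile_bounds(x), .by = g1)

# The result is ungrouped either way, so regroup before
# any further per-group step
stopifnot(!is_grouped_df(out), !is_grouped_df(out_by))
```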
References
reframe(): https://dplyr.tidyverse.org/reference/reframe.html
mutate() with multi-row results: https://stackoverflow.com/questions/73398676/dplyrmutate-when-custom-function-return-a-vector
Last modified on 2023-05-27