[R] Compute stats on grouped data using dplyr

library(knitr)
opts_chunk$set(echo=TRUE, warning=TRUE, fig.width=7, fig.height=7)
library(dplyr)

dplyr makes data processing smoother: the output of one function can be passed directly into the next with the pipe operator %>%. Today, I would like to share my experience of processing grouped data with functions that return several columns or several rows per group.
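As a quick illustration of the pipe, the sketch below computes the same value twice, once with nested calls and once with %>%. The input numbers are made up for this example only.

```r
library(dplyr)

# Nested call: has to be read from the inside out
nested <- sum(sqrt(c(1, 4, 9)))

# Piped call: read left to right; each result becomes the
# first argument of the next function
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

identical(nested, piped)
```

Both forms give the same number; the piped form simply keeps the steps in the order they are applied.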

Data

We will generate some fake data for this test.

g1<-rep(c("a","b","c"), each=10)
g2<-rep(c("A","B"), each=5, length.out=30)
x<-rnorm(30,0,10)
dat<-data.frame(g1,g2,x)
kable(head(dat))

g1	g2	x
a	A	0.9489863
a	A	-0.5877112
a	A	11.6968621
a	A	7.0672042
a	A	12.6570049
a	B	8.1088123

As you can see, this dataset includes 3 columns:

g1: a categorical variable with 3 levels, “a”, “b”, and “c”
g2: another categorical variable with 2 levels, “A” and “B”, evenly balanced within each level of g1
x: the numeric value column, which will be tested statistically
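The balance between g1 and g2 can be checked with a cross-tabulation. This small sketch rebuilds only the two grouping columns as defined above; the name counts is introduced here for illustration.

```r
# Rebuild the grouping columns exactly as in the data step
g1 <- rep(c("a", "b", "c"), each = 10)
g2 <- rep(c("A", "B"), each = 5, length.out = 30)

# Cross-tabulate: rows are levels of g1, columns are levels of g2
counts <- table(g1, g2)
counts
```

Every cell of the table holds 5 observations, so each t-test further down compares 5 values against 5 values.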

Tests

We will run a t-test on the value x between “A” and “B” (column g2) within each category of g1. To do this, we use dplyr’s group_by() function to split the data, then run the test on each subset with a function that returns a data.frame. That data.frame can have a single row or multiple rows, and we will see how dplyr handles each case.
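To make the per-group computation concrete, here is a sketch of the same test run by hand on a single group, without dplyr. The seed and the names sub_a and res_a are assumptions made for this example, so the numbers will not match the tables in this post.

```r
set.seed(1)  # assumption: seed chosen only to make this sketch repeatable
dat <- data.frame(
  g1 = rep(c("a", "b", "c"), each = 10),
  g2 = rep(c("A", "B"), each = 5, length.out = 30),
  x  = rnorm(30, 0, 10)
)

# Keep only the rows of group "a" and compare x between "A" and "B"
sub_a <- subset(dat, g1 == "a")
res_a <- t.test(x ~ g2, data = sub_a)

res_a$estimate  # the two group means
res_a$p.value   # p-value of the Welch two-sample t-test
```

The grouped pipeline repeats this computation once for each level of g1.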

First, let’s use the summarize() function to collapse the results.

# this function returns a single-row data.frame
my_test<-function(v, g) {
  res<-t.test(v ~ g)
  df<-data.frame(mean1=res$estimate[1], mean2=res$estimate[2], P=res$p.value)
  return(df)
}

dat %>%
  group_by(g1) %>%
  summarize(with(.data, my_test(x, g2) )) %>%
  kable()

g1	mean1	mean2	P
a	6.356469	3.681912	0.5029446
b	-1.226555	-4.937334	0.4937346
c	6.793084	2.280872	0.3598557

As you can see, this works well: summarize() returns a new data.frame with one row per group, holding the group name followed by the columns returned by my_test().
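As far as I know, column names can also be passed to the helper directly inside summarize(), without the with(.data, ...) wrapper, because summarize() already evaluates its arguments within each group. The sketch below assumes dplyr 1.0.0 or later (where an unnamed data.frame result is unpacked into columns), uses a seed chosen for this example, and strips the names from the estimates with unname(); by_group is a name introduced here.

```r
library(dplyr)

set.seed(1)  # assumption: seed chosen only to make this sketch repeatable
dat <- data.frame(
  g1 = rep(c("a", "b", "c"), each = 10),
  g2 = rep(c("A", "B"), each = 5, length.out = 30),
  x  = rnorm(30, 0, 10)
)

# Returns a single-row data.frame: the two group means and the p-value
my_test <- function(v, g) {
  res <- t.test(v ~ g)
  data.frame(mean1 = unname(res$estimate[1]),
             mean2 = unname(res$estimate[2]),
             P     = res$p.value)
}

# x and g2 are sliced per group before being handed to my_test()
by_group <- dat %>%
  group_by(g1) %>%
  summarize(my_test(x, g2))

by_group
```

The result has one row per level of g1, and mean1 for group “a” equals the plain mean of x over the rows with g1 == "a" and g2 == "A".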

Let’s also try a function returning multiple rows.

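# this function returns a multiple-row data.frame
my_quantiles<-function(v) {
  probs<-seq(0,1,0.25)
  # use the argument v; a bare x here would resolve to the global 30-value
  # vector and give every group the same quantiles
  qt<-quantile(v, probs = probs)
  data.frame(quant=qt, prob=probs)
}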
dat %>%
  group_by(g1) %>%
  summarize(with(.data, my_quantiles(x) )) %>%
  kable()

## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `summarise()` has grouped output by 'g1'. You can override using the `.groups`
## argument.

g1	quant	prob
a	-13.250311	0.00
a	-2.206370	0.25
a	1.124772	0.50
a	9.544496	0.75
a	16.158428	1.00
b	-13.250311	0.00
b	-2.206370	0.25
b	1.124772	0.50
b	9.544496	0.75
b	16.158428	1.00
c	-13.250311	0.00
c	-2.206370	0.25
c	1.124772	0.50
c	9.544496	0.75
c	16.158428	1.00

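summarize() still returns a result, but with a deprecation warning that asks us to replace summarize() with reframe(). Note also that the table shows the same five quantiles for every group, which is what happens when my_quantiles() reads the whole x column instead of its argument v; computed per group, the quantiles differ between groups. Let’s try the new function.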

dat %>%
  group_by(g1) %>%
  reframe(with(.data, my_quantiles(x) )) %>%
  kable()

g1	quant	prob
a	-13.250311	0.00
a	-2.206370	0.25
a	1.124772	0.50
a	9.544496	0.75
a	16.158428	1.00
b	-13.250311	0.00
b	-2.206370	0.25
b	1.124772	0.50
b	9.544496	0.75
b	16.158428	1.00
c	-13.250311	0.00
c	-2.206370	0.25
c	1.124772	0.50
c	9.544496	0.75
c	16.158428	1.00

reframe() returns the same table as summarize(), this time without the deprecation warning.
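The following sketch checks two properties of reframe() on freshly generated data: the result is ungrouped, as the warning message above states, and the quantiles are computed per group. It assumes dplyr 1.1.0 or later, uses a seed chosen for this example, passes x directly rather than through with(.data, ...), and defines my_quantiles() so that it reads its argument v; res is a name introduced here.

```r
library(dplyr)

set.seed(1)  # assumption: seed chosen only to make this sketch repeatable
dat <- data.frame(
  g1 = rep(c("a", "b", "c"), each = 10),
  g2 = rep(c("A", "B"), each = 5, length.out = 30),
  x  = rnorm(30, 0, 10)
)

# Returns a five-row data.frame: one row per requested probability
my_quantiles <- function(v) {
  probs <- seq(0, 1, 0.25)
  data.frame(quant = unname(quantile(v, probs = probs)), prob = probs)
}

res <- dat %>%
  group_by(g1) %>%
  reframe(my_quantiles(x))

nrow(res)           # 3 groups x 5 quantiles
is_grouped_df(res)  # FALSE: reframe() drops the grouping

# The 0.5 quantile of group "a" should equal the median of that group
res$quant[res$g1 == "a" & res$prob == 0.5]
median(dat$x[dat$g1 == "a"])
```

Because the result is ungrouped, any later step that should work per group needs its own group_by().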

Conclusions

We can combine group_by(), summarize(), and a function that returns a data.frame to analyze data by group with dplyr. When the returned data.frame has multiple rows, summarize() should be replaced with reframe(), which always returns an ungrouped data frame.

References

reframe(): https://dplyr.tidyverse.org/reference/reframe.html
mutate() with multi-row results: https://stackoverflow.com/questions/73398676/dplyrmutate-when-custom-function-return-a-vector

Last modified on 2023-05-27