Zhenguo Zhang's Blog
Sharing makes life better
[R] Compute stats on grouped data using dplyr
library(knitr)
opts_chunk$set(echo=TRUE, warning=TRUE, fig.width=7, fig.height=7)
library(dplyr)

dplyr is a great package for processing data smoothly: the output of one function can be passed to the next with the %>% operator. Today, I would like to share my experience of processing grouped data with functions that return multiple columns or rows.

Data

We will generate some fake data for this test.

g1<-rep(c("a","b","c"), each=10)
g2<-rep(c("A","B"), each=5, length.out=30)
x<-rnorm(30,0,10)
dat<-data.frame(g1,g2,x)
kable(head(dat))
g1 g2 x
a A 0.9489863
a A -0.5877112
a A 11.6968621
a A 7.0672042
a A 12.6570049
a B 8.1088123

As you can see, this dataset includes 3 columns:

  • g1: a categorical variable with 3 levels, “a”, “b”, and “c”

  • g2: another categorical variable with 2 levels, “A” and “B”, which are evenly balanced within each value of g1

  • x: the value column, which will be statistically tested
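
Because rnorm() draws new numbers on every run, the values shown above will not be reproduced exactly. A minimal sketch of a repeatable version follows, together with a quick check that g2 is balanced within g1. The seed value 42 and the name cell_counts are arbitrary choices made for illustration, not part of the original code.

```r
library(dplyr)

set.seed(42)  # arbitrary seed (an assumption), only to make the draw repeatable
g1 <- rep(c("a", "b", "c"), each = 10)
g2 <- rep(c("A", "B"), each = 5, length.out = 30)
x  <- rnorm(30, 0, 10)
dat <- data.frame(g1, g2, x)

# count rows per g1/g2 combination; a balanced design has 5 rows in every cell
cell_counts <- count(dat, g1, g2)
print(cell_counts)
```

With this layout every one of the six g1/g2 cells should hold 5 observations, so each t-test below compares 5 values against 5.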

Tests

We will run a statistical test on the value x between “A” and “B” (column g2) within each category of g1. For this, we will use dplyr’s group_by() function to split the data, then run the test on each subset, with each subset returning a data.frame. This data.frame can be single-row or multi-row, and we will see how dplyr handles each case.

First, let’s use the summarize() function to collapse the results.

# this function returns a single-row data.frame
my_test<-function(v, g) {
  res<-t.test(v ~ g)
  df<-data.frame(mean1=res$estimate[1], mean2=res$estimate[2], P=res$p.value)
  return(df)
}

dat %>%
  group_by(g1) %>%
  summarize(with(.data, my_test(x, g2) )) %>%
  kable()
g1 mean1 mean2 P
a 6.356469 3.681912 0.5029446
b -1.226555 -4.937334 0.4937346
c 6.793084 2.280872 0.3598557

As you can see, this works well: summarize() returns a new data.frame with the group name and the test result for each group.
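
As a side note, the with(.data, ...) wrapper does not seem to be required: inside summarize(), column names such as x and g2 already refer to the current group's values, and an unnamed data.frame result is unpacked into columns. The sketch below assumes dplyr 1.0.0 or later. The seed, the regenerated data, the unname() calls, and the name res_direct are illustrative additions.

```r
library(dplyr)

set.seed(1)  # arbitrary seed (an assumption), only for repeatability
dat <- data.frame(
  g1 = rep(c("a", "b", "c"), each = 10),
  g2 = rep(c("A", "B"), each = 5, length.out = 30),
  x  = rnorm(30, 0, 10)
)

# single-row result per group: the two group means and the t-test p-value
my_test <- function(v, g) {
  res <- t.test(v ~ g)
  data.frame(
    mean1 = unname(res$estimate[1]),  # mean of the first level of g ("A")
    mean2 = unname(res$estimate[2]),  # mean of the second level of g ("B")
    P     = res$p.value
  )
}

# call the helper directly on the columns; no with(.data, ...) needed
res_direct <- dat %>%
  group_by(g1) %>%
  summarize(my_test(x, g2))
print(res_direct)
```

The result should have one row per level of g1 and the columns g1, mean1, mean2, and P, the same shape as the table above.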

Let’s also try a function returning multiple rows.

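# this function returns a multi-row data.frame
my_quantiles<-function(v) {
  probs<-seq(0,1,0.25)
  # use the argument v: a bare x here would resolve to the global vector x,
  # so every group would get the same quantiles of all 30 values
  qt<-quantile(v, probs = probs)
  data.frame(quant=qt, prob=probs)
}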
dat %>%
  group_by(g1) %>%
  summarize(with(.data, my_quantiles(x) )) %>%
  kable()
## Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
## dplyr 1.1.0.
## ℹ Please use `reframe()` instead.
## ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
##   always returns an ungrouped data frame and adjust accordingly.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `summarise()` has grouped output by 'g1'. You can override using the `.groups`
## argument.
g1 quant prob
a -13.250311 0.00
a -2.206370 0.25
a 1.124772 0.50
a 9.544496 0.75
a 16.158428 1.00
b -13.250311 0.00
b -2.206370 0.25
b 1.124772 0.50
b 9.544496 0.75
b 16.158428 1.00
c -13.250311 0.00
c -2.206370 0.25
c 1.124772 0.50
c 9.544496 0.75
c 16.158428 1.00

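The output is returned, but with a deprecation warning that asks me to replace summarize() with reframe(), so let’s try this new function. Note that the quantiles in the table above are identical across the three groups: they were computed from the global vector x (all 30 values), not from each group's values, which is what happens when my_quantiles() calls quantile() on x instead of on its argument v.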

dat %>%
  group_by(g1) %>%
  reframe(with(.data, my_quantiles(x) )) %>%
  kable()
g1 quant prob
a -13.250311 0.00
a -2.206370 0.25
a 1.124772 0.50
a 9.544496 0.75
a 16.158428 1.00
b -13.250311 0.00
b -2.206370 0.25
b 1.124772 0.50
b 9.544496 0.75
b 16.158428 1.00
c -13.250311 0.00
c -2.206370 0.25
c 1.124772 0.50
c 9.544496 0.75
c 16.158428 1.00

Great, reframe() returns the same table as before, this time without the deprecation warning.
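
To check that the quantiles are really computed per group, here is a minimal sketch in which the helper calls quantile() on its argument v, so each group gets its own values. It assumes dplyr 1.1.0 or later, where reframe() and its .by argument are available. The seed, the regenerated data, the unname() call, and the name res_q are illustrative additions.

```r
library(dplyr)

set.seed(1)  # arbitrary seed (an assumption), only for repeatability
dat <- data.frame(
  g1 = rep(c("a", "b", "c"), each = 10),
  g2 = rep(c("A", "B"), each = 5, length.out = 30),
  x  = rnorm(30, 0, 10)
)

# multi-row result per group: five quantiles of the values passed in as v
my_quantiles <- function(v) {
  probs <- seq(0, 1, 0.25)
  data.frame(quant = unname(quantile(v, probs = probs)), prob = probs)
}

# .by groups for this one call only, so no separate group_by() is needed
res_q <- dat %>%
  reframe(my_quantiles(x), .by = g1)
print(res_q)
```

Using .by instead of group_by() is only a design choice here: reframe() always returns an ungrouped data frame, so per-call grouping keeps the pipeline one step shorter.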

Conclusions

We can combine group_by(), summarize(), and a function returning a data.frame to easily analyze data by group with dplyr. When the returned data.frame has more than one row per group, summarize() should be replaced with reframe().


Last modified on 2023-05-27
