Measuring user retention using cohort analysis with R

Cohort analysis is essential if you want to know whether your service is in fact a leaky bucket despite nice growth in absolute numbers. There’s a good write-up on the subject, “Cohorts, Retention, Churn, ARPU” by Matt Johnson.

So how do you do it in R, and how do you visualize it? Inspired by the examples described in “Retention, Cohorts, and Visualizations”, I came up with the following solution.

First, get the data in a suitable format, like this:

cohort  signed_up  active_m0  active_m1  active_m2
2011-10 12345      10432      8765       6754
2011-11 12345      10432      8765       6754
2011-12 12345      10432      8765       6754

Cohort here is in “YYYY-MM” format, signed_up is the number of users who created accounts in the given month, active_m0 is the number of users who were active in the same month they registered, active_m1 is the number of users who were active in the following month, and so forth. For the newest cohorts you’ll get zeroes in some of the active_mN columns, since there’s no data on them yet. The processing scripts take this into account.
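As a rough sketch of how a table in this layout turns into retention shares, here is a small base R example. The counts reuse the placeholder values from the sample table, the zeroes are made up to stand for months with no data yet, and the helper name retention_matrix is hypothetical (it is not part of the script below).

```r
# Sample data in the layout described above (values are illustrative).
df <- data.frame(
  cohort    = c("2011-10", "2011-11", "2011-12"),
  signed_up = c(12345, 12345, 12345),
  active_m0 = c(10432, 10432, 10432),
  active_m1 = c(8765, 8765, 0),   # 0 = no data yet for this cohort
  active_m2 = c(6754, 0, 0),
  stringsAsFactors = FALSE
)

# Hypothetical helper: share of each cohort still active in month N.
retention_matrix <- function(df) {
  active <- as.matrix(df[, grep("^active_m", names(df)), drop = FALSE])
  ret <- active / df$signed_up   # each row is divided by its cohort size
  ret[active == 0] <- NA         # treat zero counts as "no data yet"
  rownames(ret) <- df$cohort
  colnames(ret) <- paste("Month", seq_len(ncol(ret)) - 1)
  ret
}

ret <- retention_matrix(df)
print(round(100 * ret, 1))
```

Treating every zero as missing is a simplification: a cohort that truly had no active users in a month would be masked as well.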


# Load SystematicInvestor's plot.table
con = gzcon(url('', 'rb'))
source(con)
close(con)

# Read the data (the file name is a placeholder for your own file)
df = read.table('cohorts.txt', header = TRUE, stringsAsFactors = FALSE)

# Let's convert absolute values to percentages (% of the registered users remaining active)
cohort_p = as.data.frame(cbind(df$cohort, df$signed_up,
as.numeric(df$active_m0/df$signed_up), as.numeric(df$active_m1/df$signed_up), as.numeric(df$active_m2/df$signed_up),
as.numeric(df$active_m3/df$signed_up), as.numeric(df$active_m4/df$signed_up), as.numeric(df$active_m5/df$signed_up),
as.numeric(df$active_m6/df$signed_up), as.numeric(df$active_m7/df$signed_up), as.numeric(df$active_m8/df$signed_up) ))

# Create a matrix
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), sep=' ')
rownames(temp) = as.vector(cohort_p$V1)

# Drop 0 values and format data
temp[] = plota.format(100 * as.numeric(temp), 0, '', '%')
temp[temp == " 0%"] = NA

# Plot cohort analysis table
plot.table(temp, smain='Cohort(users)', highlight = TRUE, colorbar = TRUE)

This code produces nice visualizations of the cohort analysis as a table:

I used the articles “Visualizing Tables with plot.table” and “Response to Flowingdata Challenge: Graphing obesity trends” as inspiration for this R code.

If you want to get nice colours as in the example above, you’ll need to adjust the rainbow interval used by plot.table. I did it by editing the function code directly from the R environment:

plot.table.helper.color <- edit(plot.table.helper.color)

function
(
 temp # matrix to plot
)
{
 # convert temp to numerical matrix
 temp = matrix(as.double(gsub('[%,$]', '', temp)), nrow(temp), ncol(temp))

 highlight = as.vector(temp)
 cols = rep(NA, len(highlight))
 ncols = len(highlight[!is.na(highlight)])
 cols[1:ncols] = rainbow(ncols, start = 0, end = 0.3)

 # rank the values so that each cell gets the colour matching its rank
 o = sort.list(highlight, na.last = TRUE, decreasing = FALSE)
 o1 = sort.list(o, na.last = TRUE, decreasing = FALSE)
 highlight = matrix(cols[o1], nrow = nrow(temp))
 highlight[is.na(temp)] = NA
 return(highlight)
}

Adjust the start and end arguments of the rainbow() call to 0.5 and 0.6 to get shades of blue.

plot.table.helper.colorbar <- edit(plot.table.helper.colorbar)

function
(
 plot.matrix # matrix to plot
)
{
 nr = nrow(plot.matrix) + 1
 nc = ncol(plot.matrix) + 1

 c = nc
 r1 = 1
 r2 = nr

 rect((2*(c - 1) + .5), -(r1 - .5), (2*c + .5), -(r2 + .5), col='white', border='white')
 rect((2*(c - 1) + .5), -(r1 - .5), (2*(c - 1) + .5), -(r2 + .5), col='black', border='black')

 y1= c( -(r2) : -(r1) )

 graphics::image(x = c( (2*(c - 1) + 1.5) : (2*c + 0.5) ),
  y = y1,
  z = t(matrix( y1 , ncol = 1)),
  col = t(matrix( rainbow(len( y1 ), start = 0.5, end = 0.6) , ncol = 1)),
  add = TRUE)
}

Set the interval in the rainbow() call to 0.5, 0.6, as shown above, to get shades of blue.
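To see what the start and end arguments do, it may help to look at the hues that rainbow() returns for a given interval. This is a small base R sketch that does not depend on plot.table; the helper name hue_range is made up for illustration.

```r
# Hypothetical helper: range of hues (on a 0..1 scale) that rainbow()
# produces for a given start/end interval.
hue_range <- function(n, start, end) {
  cols <- rainbow(n, start = start, end = end)
  hues <- rgb2hsv(col2rgb(cols))["h", ]
  range(hues)
}

hue_range(10, 0, 0.3)     # red through yellow to green
hue_range(10, 0.5, 0.6)   # cyan towards blue
```

A narrower interval gives colours that are closer together, which is why 0.5 to 0.6 reads as shades of one colour rather than a full rainbow.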

Now, if you want to draw the cycle-like graph:

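# make matrix shorter for the graph (limit to 0-6 months)
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
temp = temp[, 1:min(7, ncol(temp)), drop = FALSE]  # reconstructed step: keep Month 0 to Month 6
mode(temp) = 'numeric'
temp[temp == 0] = NA
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), 'retention', sep=' ')

# one colour per month (the palette is an assumption; pick any you like)
pal = rainbow(ncol(temp))
plot(temp[,1],pch=19,xaxt="n",col=pal[1],type="o",ylim=c(0,max(temp,na.rm=TRUE)),xlab="Cohort by Month",ylab="Retention",main="Retention by Cohort")
axis(1, at=1:nrow(temp), labels=cohort_p$V1)  # cohort labels on the x axis (assumed)

# overlay the remaining months (loop body is a reconstruction)
for(i in 2:length(colnames(temp))) {
  points(temp[,i], pch=19, type="o", col=pal[i])
}

abline(h=(seq(0,1,0.1)), col="lightgray", lty="dotted")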

This code produces a nice visualization of the cohort analysis as a multicolour cycle graph:
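The same kind of chart can also be sketched with matplot() from base R, which draws one line per matrix column without an explicit loop. The data, the palette and the helper name plot_retention below are illustrative assumptions, not the original script.

```r
# Illustrative retention shares: rows are cohorts, columns are months since signup.
ret <- matrix(
  c(0.85, 0.71, 0.55,
    0.84, 0.70, NA,
    0.86, NA,   NA),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("2011-10", "2011-11", "2011-12"),
                  paste("Month", 0:2, "retention"))
)

# Hypothetical helper: one line per month, cohorts along the x axis.
plot_retention <- function(ret, pal = rainbow(ncol(ret))) {
  matplot(ret, type = "o", pch = 19, lty = 1, col = pal, xaxt = "n",
          ylim = c(0, max(ret, na.rm = TRUE)),
          xlab = "Cohort by Month", ylab = "Retention",
          main = "Retention by Cohort")
  axis(1, at = seq_len(nrow(ret)), labels = rownames(ret))
  abline(h = seq(0, 1, 0.1), col = "lightgray", lty = "dotted")
  legend("bottomleft", legend = colnames(ret), col = pal, pch = 19, lty = 1)
  invisible(pal)
}

pdf(NULL)   # null device so the sketch runs anywhere; remove to plot on screen
pal <- plot_retention(ret)
dev.off()
```

Passing the whole matrix at once keeps the colours, the legend and the lines in the same column order, which is easy to get wrong when each series is added by hand.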



  1. David Linder wrote:

    This is great stuff. I’ve been asked to produce subscription-based cohort models, i.e. how many people signed up vs. how many people subscribed in the following days. I’ve been doing most of my work in Excel, but I’m finding it hard to automate. Your code looks like something I’ll be able to learn a lot from. Aside from R, do you have any recommendations for how you could build something in Microsoft Office after pulling data from a SQL query? Thanks for sharing!

    Saturday, October 20, 2012 at 0:15 | Permalink
  2. Great piece, Ivan! Thanks for this :)

    Sunday, August 16, 2015 at 12:40 | Permalink
  3. max wrote:

    Thanks for this article. When I try to run your script, it tells me that cohorts is not found:
    > con = gzcon(url('', 'rb'))
    > source(con)
    > close(con)
    > cohorts
    Error: object 'cohorts' not found
    Thanks in advance.

    Tuesday, October 13, 2015 at 19:06 | Permalink
