“High Performance MySQL” by Baron Schwartz, Peter Zaitsev, and Vadim Tkachenko; O’Reilly Media

Peter and Vadim are long-term contributors to the MySQL open source code base and the founders of Percona, a MySQL consultancy. This really is THE book about MySQL, written by engineers who have spent more than a decade working on the MySQL code.

The book is equally valuable to devops and DB admins, as well as software developers. It assumes that you have a good working knowledge of MySQL, as it dives straight into the intricate details of the system architecture and performance fine-tuning.

I have first-hand experience with the second edition of this book, and I can say that it paid for itself many times over when I was debugging critical performance issues in HeiaHeia.com operations. This third edition (published in 2012) also covers the new features in MySQL 5.5.

In this new edition the authors make a real effort to explain not only how MySQL works, but why certain features in MySQL work the way they do.

The case studies peppered throughout the book are particularly interesting, as they give insight into how to deal with MySQL in large-scale setups with high load and demands for high availability.

This book covers a lot of very complex topics, from replication, clustering and sharding to high availability. The practical advice on profiling your server and on finding and fixing bottlenecks, both in the server setup and on the application side in SQL queries, is invaluable.

I very much liked that the book is organized in such a way that it can easily be used as a reference.

This is a must-read for anyone who runs a significant load on their MySQL DB server or is considering moving from a single-server setup to a cluster.

Get your copy of the book at O’Reilly.


Measuring user retention using cohort analysis with R

Cohort analysis is super important if you want to know whether your service is in fact a leaky bucket despite nice growth in absolute numbers. There’s a good write-up on the subject, “Cohorts, Retention, Churn, ARPU” by Matt Johnson.

So how do you do it using R, and how do you visualize it? Inspired by the examples described in “Retention, Cohorts, and Visualizations”, I came up with the following solution.

First, get the data in a suitable format, like this:

cohort  signed_up  active_m0  active_m1  active_m2
2011-10 12345      10432      8765       6754
2011-11 12345      10432      8765       6754
2011-12 12345      10432      8765       6754

The cohort here is in “YYYY-MM” format; signed_up is the number of users who created accounts in the given month; active_m0 is the number of users who were active in the same month they registered; active_m1 is the number of users who were active in the following month, and so forth (the values in the sample above are just placeholders). For the newest cohorts you’ll get zeroes in some of the active_mN columns, since there’s no data for them yet. This is taken into account in the processing scripts.


# Load SystematicInvestor's plot.table (https://github.com/systematicinvestor/SIT)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
source(con)
close(con)

# Read the data (the file name is just an example; the format is shown above)
df = read.table('cohorts.txt', header = TRUE)

# Let's convert absolute values to percentages (% of the registered users remaining active)
cohort_p = as.data.frame(cbind(as.vector(df$cohort), as.numeric(df$signed_up),
    as.numeric(df$active_m0/df$signed_up), as.numeric(df$active_m1/df$signed_up), as.numeric(df$active_m2/df$signed_up),
    as.numeric(df$active_m3/df$signed_up), as.numeric(df$active_m4/df$signed_up), as.numeric(df$active_m5/df$signed_up),
    as.numeric(df$active_m6/df$signed_up), as.numeric(df$active_m7/df$signed_up), as.numeric(df$active_m8/df$signed_up) ))

# Create a matrix
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), sep=' ')
rownames(temp) = as.vector(cohort_p$V1)

# Drop 0 values and format data
temp[] = plota.format(100 * as.numeric(temp), 0, '', '%')
temp[temp == " 0%"] = NA

# Plot cohort analysis table
plot.table(temp, smain='Cohort(users)', highlight = TRUE, colorbar = TRUE)

This code produces a nice visualization of the cohort analysis as a table.

I used articles “Visualizing Tables with plot.table” and “Response to Flowingdata Challenge: Graphing obesity trends” as an inspiration for this R code.

If you want to get nice colours as in the example above, you’ll need to adjust the rainbow interval for plot.table. I managed to do it by editing the functions’ code directly from the R environment:

plot.table.helper.color <- edit(plot.table.helper.color)

function
(
	temp		# matrix to plot
)
{
	# convert temp to numerical matrix
	temp = matrix(as.double(gsub('[%,$]', '', temp)), nrow(temp), ncol(temp))

	highlight = as.vector(temp)
	cols = rep(NA, len(highlight))
	ncols = len(highlight[!is.na(highlight)])
	cols[1:ncols] = rainbow(ncols, start = 0, end = 0.3)

	o = sort.list(highlight, na.last = TRUE, decreasing = FALSE)
	o1 = sort.list(o, na.last = TRUE, decreasing = FALSE)
	highlight = matrix(cols[o1], nrow = nrow(temp))
	highlight[is.na(temp)] = NA
	return(highlight)
}
Adjust the rainbow interval in the cols[1:ncols] = rainbow(ncols, start = 0, end = 0.3) line to start = 0.5, end = 0.6 to get shades of blue.
plot.table.helper.colorbar <- edit(plot.table.helper.colorbar)

function
(
	plot.matrix		# matrix to plot
)
{
	nr = nrow(plot.matrix) + 1
	nc = ncol(plot.matrix) + 1

	c = nc
	r1 = 1
	r2 = nr

	rect((2*(c - 1) + .5), -(r1 - .5), (2*c + .5), -(r2 + .5), col='white', border='white')
	rect((2*(c - 1) + .5), -(r1 - .5), (2*(c - 1) + .5), -(r2 + .5), col='black', border='black')

	y1 = c( -(r2) : -(r1) )

	graphics::image(x = c( (2*(c - 1) + 1.5) : (2*c + 0.5) ),
		y = y1,
		z = t(matrix( y1, ncol = 1)),
		col = t(matrix( rainbow(len( y1 ), start = 0.5, end = 0.6), ncol = 1)),
		add = T)
}
Here the rainbow interval in the graphics::image call has already been adjusted to start = 0.5, end = 0.6, which produces the shades of blue.

Now if you want to draw the cycle-like graph:

# make matrix shorter for the graph (limit to 0-6 months)
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
temp[temp == "0"] = NA
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), 'retention', sep=' ')
rownames(temp) = as.vector(cohort_p$V1)

# one line per month column, one point per cohort
pal = rainbow(length(colnames(temp)))
plot(as.numeric(temp[,1]), pch=19, xaxt="n", col=pal[1], type="o",
    ylim=c(0, max(as.numeric(temp[,-2]), na.rm=T)),
    xlab="Cohort by Month", ylab="Retention", main="Retention by Cohort")
axis(1, at=1:nrow(temp), labels=rownames(temp))

for(i in 2:length(colnames(temp))) {
	lines(as.numeric(temp[,i]), pch=19, col=pal[i], type="o")
}

abline(h=(seq(0,1,0.1)), col="lightgray", lty="dotted")

This code produces a nice visualization of the cohort analysis as a multicolour cycle graph.


Book review: Programming Amazon EC2

Should you buy a book on a new technology, or just read the technology provider’s guidelines, instructions and recommendations? This book was released over a year ago, so naturally it doesn’t cover all the latest developments on the Amazon AWS platform. For example, Simple Email Service (SES) and DynamoDB are not mentioned at all.

After a short historical introduction to the philosophy of the Amazon platform, the authors proceed to the basics of EC2 instance management (using Ubuntu as an example) and describe how several real-life web applications benefited from migrating to Amazon infrastructure. High-level architecture descriptions help one understand how all the pieces of the Amazon platform come together: ELB, EC2, RDS, S3, SQS, SNS, etc. Examples are provided in PHP, Ruby and Java.

Don’t expect any secret knowledge to be revealed in this book. A lot of intricate details, and answers to the questions you will face when planning a migration to AWS or designing the architecture for a new web application, are left out. At the same time, the book gives a fairly good overview of how to make your application scalable and highly available on AWS, and it can serve as a good starting point on your AWS journey.

Some of the recommendations and descriptions in the book are outright sloppy. One example: the book claims that “You can choose an availability zone, but that doesn’t matter very much because network latency between zones is comparable to that within a zone”. But tests indicate that cross-availability-zone latency can be six times higher than intra-zone latency. For a network-intensive application, you are better off keeping your instances together in the same zone.

Another example where you might want to run your own tests before making a decision, or look for a second opinion: “A disk is always slower than memory. If you run your own MySQL using local disks, that’s slow as well. But using disk-based operations in RDS is just horrible. Minimizing disk usage means implementing proper indexes, something you always want to do.” That is a very strong yet vague statement on RDS performance; I’d really love to see a performance comparison of a MySQL installation on EC2+EBS vs. RDS, and a list of which kinds of MySQL fine-tuning will and will not work on RDS.

Another annoying bit is that the authors keep promoting their commercial Android client for AWS monitoring throughout all the chapters. As I see it, if there are ads, the book should be free.

Bottom line: I see only two reasons why you might want to buy and read this book: to learn about the history of AWS, and to learn how five selected web services designed their architecture when migrating to AWS.


Setting up and securing web server (+MySQL +Rails/PHP) on Ubuntu 10.04 LTS

After repeating these operations many times in various setups, I decided to turn them into a public set of instructions and share them with the world. They should be suitable for most simple web sites built with Ruby on Rails or PHP.

The setup targets Ubuntu 10.04 Server LTS (scheduled end of life: April 2015). The other components are Nginx as the web server and Phusion Passenger as the application server.

I use this setup most often on a Linode VPS; however, none of the instructions are Linode-specific.

Continue reading “Setting up and securing web server (+MySQL +Rails/PHP) on Ubuntu 10.04 LTS”


Informal notes from Strata 2012 conference on Big Data and Data Science

It’s been almost a month since I came back from California, and I just got around to sorting my notes from the O’Reilly Strata conference. Spending time in the Valley is always inspiring: lots of interesting people, old friends, new contacts, new start-ups. It is the center of the IT universe.

Spending 3 days with people working at the bleeding edge of data science was an unforgettable experience. I got my dose of inspiration and a lot of new ideas on how to apply data science at HeiaHeia. It’s difficult to overestimate the importance data analysis will have in the coming years. Companies that don’t grasp the importance of understanding their data, and of basing decisions on analysis instead of the gut feeling of board members or operative management, will simply fade away.

Unfortunately, HeiaHeia was the only company from Finland attending the conference. But I’m really happy to see more and more signals recently that companies in Finland are starting to realize the importance of data, and that new Finnish start-ups are dealing with data analysis. I believe Finland has an excellent opportunity to host not only a cluster of game development companies, but also one of big data companies and start-ups. So far, the Valley, London and Australia seem to be leading in this field.

By the way, Trulia (co-founded by Sami Inkinen) had an excellent demo in the halls of the conference venue; check it out on their blog.

Below are my notes from the conference. I have added links to the presentations and videos that I found, but otherwise the notes are quite unstructured. There were multiple tracks, and it was often very difficult to choose between them. The highlights of the conference were the talks by Avinash Kaushik, Jeremy Howard, Matt Biddulph, Ben Goldacre, and Alasdair Allan, and the Oxford-style debate on the proposition “In data science, domain expertise is more important than machine learning skill.” (see videos below).

Continue reading “Informal notes from Strata 2012 conference on Big Data and Data Science”


Notes from Gothenburg – Nordic Ruby 2011 conference

Here are my notes from Nordic Ruby conference in Göteborg, Sweden.

I’d like to say a big thanks to the organisers of the conference (especially CJ @cjkihlbom): everything went really smoothly, even though 150 people attended this year, compared to 90 last year.
Some points I’d really like to highlight:

  • plenty of time to meet people and talk: 30-minute talks followed by 30-minute breaks, with no Q&A sessions; those who had questions could talk to the speakers during the breaks
  • the venue was great (the boat, of course 🙂): there was enough space for everyone to move around, yet it was compact enough not to get lost, and everyone had the opportunity to have lunch and dinner together
  • the “job board”, a huge whiteboard where anyone could post information about open positions at their company; it got filled within the first few hours, so the job market is really hot
  • lightning talks that any participant could give: 5-minute talks at the end of the day; these were really great
  • real coffee 🙂 espresso, latte, cappuccino, americano, you name it; professional baristas were at your service
  • a 5K Nordic Ruby run organised on the morning of the second day

Continue reading “Notes from Gothenburg – Nordic Ruby 2011 conference”


vim is the best editor, also for RoR development

vim is a natural choice when you’re starting a new programming project (if you’re an emacs or TextMate adept, you can stop reading now 🙂). If you’re starting a Ruby on Rails project, there are a couple of scripts and configurations you might want to install to make development with vim an even more pleasant experience.

1. Rails.vim by Tim Pope

To install, just copy autoload/rails.vim, plugin/rails.vim, and doc/rails.txt to the corresponding directories in ~/.vim.

The Vim scripts section has a full description of the functionality. Some highlights:

  • Easy navigation between models, controllers and views with :Rmodel, :Rview, :Rcontroller commands
  • Syntax highlighting
  • CTRL-X CTRL-U for autocompletion
  • :Rtree for project tree (see item 2)

2. NERD Tree by Marty Grenfell

Another must-have. It provides an easy way to navigate your project tree, and Rails.vim integrates nicely with it.

3. Colour schemes

If you want to save your eyes, use a dark background when developing. Personally, I prefer one of the standard vim themes, Desert, but Ocean Deep is also very good.

Copy theme file to ~/.vim/colors and then use

:colorscheme oceandeep

command to apply it.
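If you want the scheme applied automatically on every start, you can persist it in your vimrc. A minimal sketch (assuming the oceandeep theme file has already been copied into ~/.vim/colors as above):

```shell
# Append the colour settings to ~/.vimrc so they are applied on every start
cat >> ~/.vimrc <<'EOF'
set background=dark
colorscheme oceandeep
EOF
```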


Increasing Ruby interpreter performance by adjusting garbage collector settings

According to Evan Weaver of Twitter, a typical production Rails app on Ruby 1.8 can recover 20% to 40% of user CPU simply by adjusting the Ruby garbage collector settings. In August I set out on a quest to verify that statement on the HeiaHeia servers, and the results really exceeded my expectations: the time to run the application tests locally decreased by 46%, and CPU utilisation on the production servers decreased by almost 40%.
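For reference, on a GC-patched Ruby 1.8 (e.g. Ruby Enterprise Edition; stock MRI ignores these) the tuning is done through environment variables read at interpreter start-up. A sketch with the values Evan Weaver published for Twitter; treat them as a starting point, not as the optimum for your app:

```shell
# GC tuning variables read by REE / GC-patched MRI 1.8
export RUBY_HEAP_MIN_SLOTS=500000        # allocate a large heap up front instead of growing into it
export RUBY_HEAP_SLOTS_INCREMENT=250000  # grow the heap in large steps
export RUBY_HEAP_SLOTS_GROWTH_FACTOR=1   # linear growth instead of exponential
export RUBY_GC_MALLOC_LIMIT=50000000     # trigger GC after ~50 MB of mallocs instead of the 8 MB default
```

Set these in the environment of the process that starts your app servers (e.g. the Passenger or Mongrel init script), then measure; the optimum depends heavily on your app’s allocation profile.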


Call to undefined function: imagecreatefromjpeg()

While installing new Joomla modules I came across this PHP error (yep, I still have to deal with PHP occasionally). I had PHP compiled from source on Ubuntu 10.04 as per my earlier instructions. A quick check of phpinfo() indicated that while the gd module was compiled in, it didn’t have JPEG support:

GD Support         enabled
GD Version         bundled (2.0.34 compatible)
GIF Read Support   enabled
GIF Create Support enabled
PNG Support        enabled
WBMP Support       enabled
XBM Support        enabled

Making sure that the JPEG libraries are installed:

sudo aptitude install libjpeg libjpeg-dev

and reconfiguring PHP with the --with-jpeg-dir flag (the rest of the compilation process remains the same as here)

./configure --enable-fastcgi --enable-fpm --with-mcrypt --with-zlib \
--enable-mbstring --with-openssl --with-mysql --with-mysql-sock \
--with-gd --without-sqlite --disable-pdo --with-jpeg-dir=/usr/lib

and then restarting nginx

sudo /etc/init.d/nginx restart

helped to solve the problem.


Running Rails applications using Nginx with Passenger on Ubuntu Server

If you’re planning to run Rails applications on Nginx using Phusion Passenger, and do it on Ubuntu Linux, here’s what needs to be done.

Even though there’s an Ubuntu nginx package available (which works perfectly when you’re running PHP apps using FCGI), if you want to use Phusion Passenger you’ll need to recompile Nginx from source.

Instructions below were verified on Ubuntu 10.04 (Lucid Lynx) Server Edition.

Continue reading “Running Rails applications using Nginx with Passenger on Ubuntu Server”