“High Performance MySQL” by Baron Schwartz, Peter Zaitsev, and Vadim Tkachenko; O’Reilly Media

Peter and Vadim are long-term contributors to the MySQL open source code base and the founders of Percona, a MySQL consultancy. This is really THE book about MySQL, written by engineers who have spent more than a decade working on the MySQL code.

The book is equally valuable to devops engineers and DB admins as well as software developers. It assumes you have a good working knowledge of MySQL, as it dives straight into the intricate details of system architecture and performance fine-tuning.

I have first-hand experience with the second edition of this book, and I can say it paid for itself many times over when I was debugging critical performance issues in HeiaHeia.com operations. This third edition (published in 2012) also covers the new features in MySQL 5.5.

In this new edition the authors make a real effort to explain not only how MySQL works, but why certain features in MySQL work the way they do.

The case studies peppered throughout the book are particularly interesting, as they give insight into how to deal with MySQL in large-scale setups with high load and a demand for high availability.

This book covers a lot of very complex topics, from replication, clustering and sharding to high availability. The practical advice on how to profile your server and find and fix the bottlenecks, both in the server setup and on the application side in SQL queries, is invaluable.

I very much liked that the book is organized in such a way that it can easily be used as a reference.

This is a must-read for anyone who has a somewhat significant load on their MySQL DB server or is considering moving from a single-server setup to a cluster.

Get your copy of the book at O’Reilly.


Measuring user retention using cohort analysis with R

Cohort analysis is super important if you want to know whether your service is in fact a leaky bucket despite nice growth in absolute numbers. There’s a good write-up on the subject, “Cohorts, Retention, Churn, ARPU” by Matt Johnson.

So how do you do it with R, and how do you visualize the results? Inspired by the examples described in “Retention, Cohorts, and Visualizations”, I came up with the following solution.

First, get the data in a suitable format, like this:

cohort  signed_up  active_m0  active_m1  active_m2
2011-10 12345      10432      8765       6754
2011-11 12345      10432      8765       6754
2011-12 12345      10432      8765       6754

Cohort here is in “YYYY-MM” format, signed_up is the number of users who created accounts in the given month, active_m0 is the number of users who were active in the same month they registered, active_m1 the number who were active in the following month, and so forth. For the newest cohorts you’ll get zeroes in some of the active_mN columns, since there’s no data for them yet; the processing script takes this into account.
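
If you just want to try the code below without an export of your own, here’s a toy cohorts data frame in the same shape (all numbers are made up; with it you can skip the read.table() call further down):

cohorts <- data.frame(
  cohort    = c("2011-10", "2011-11", "2011-12"),
  signed_up = c(12345, 13456, 14567),
  active_m0 = c(10432, 11210, 12045),
  active_m1 = c(8765, 9012, 0),   # zeroes = no data for that month yet
  active_m2 = c(6754, 0, 0),
  active_m3 = 0, active_m4 = 0, active_m5 = 0,
  active_m6 = 0, active_m7 = 0, active_m8 = 0
)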

require(plyr)

# Load SystematicInvestor's plot.table (https://github.com/systematicinvestor/SIT)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
source(con)
close(con)

# Read the data
cohorts <- read.table("cohorts.txt", header = TRUE) # the read call was lost in formatting; file path is assumed
# Let's convert absolute values to percentages (% of the registered users remaining active)
cohort_p <- ddply(cohorts, .(cohort, signed_up), function(df) c( # assignment reconstructed; the "<-" was eaten by the blog formatting
 as.numeric(df$active_m0/df$signed_up), as.numeric(df$active_m1/df$signed_up), as.numeric(df$active_m2/df$signed_up),
 as.numeric(df$active_m3/df$signed_up), as.numeric(df$active_m4/df$signed_up), as.numeric(df$active_m5/df$signed_up),
 as.numeric(df$active_m6/df$signed_up), as.numeric(df$active_m7/df$signed_up), as.numeric(df$active_m8/df$signed_up) ))

# Create a matrix
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), sep=' ')
rownames(temp) = as.vector(cohort_p$cohort)

# Drop 0 values and format data
temp[] = plota.format(100 * as.numeric(temp), 0, '', '%')
temp[temp == " 0%"] = ''

# Plot cohort analysis table
plot.table(temp, smain='Cohort(users)', highlight = TRUE, colorbar = TRUE)

This code produces a nice visualization of the cohort analysis as a table:

I used the articles “Visualizing Tables with plot.table” and “Response to Flowingdata Challenge: Graphing obesity trends” as inspiration for this R code.

If you want to get nice colours as in the example above, you’ll need to adjust the rainbow interval for plot.table. I managed to do it by editing the functions’ code directly from the R environment:

plot.table.helper.color <- edit(plot.table.helper.color)
function
(
 temp # matrix to plot
){
 # convert temp to numerical matrix
 temp = matrix(as.double(gsub('[%,$]', '', temp)), nrow(temp), ncol(temp))

highlight = as.vector(temp)
 cols = rep(NA, len(highlight))
 ncols = len(highlight[!is.na(highlight)])
 cols[1:ncols] = rainbow(ncols, start = 0, end = 0.3)

o = sort.list(highlight, na.last = TRUE, decreasing = FALSE)
 o1 = sort.list(o, na.last = TRUE, decreasing = FALSE)
 highlight = matrix(cols[o1], nrow = nrow(temp))
 highlight[is.na(temp)] = NA
 return(highlight)
}

To get shades of blue, adjust the interval in the rainbow() call from start = 0, end = 0.3 to start = 0.5, end = 0.6:
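cols[1:ncols] = rainbow(ncols, start = 0.5, end = 0.6)
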
plot.table.helper.colorbar <- edit(plot.table.helper.colorbar)

function
(
 plot.matrix # matrix to plot
)
{
 nr = nrow(plot.matrix) + 1
 nc = ncol(plot.matrix) + 1

c = nc
 r1 = 1
 r2 = nr

rect((2*(c - 1) + .5), -(r1 - .5), (2*c + .5), -(r2 + .5), col='white', border='white')
 rect((2*(c - 1) + .5), -(r1 - .5), (2*(c - 1) + .5), -(r2 + .5), col='black', border='black')

y1= c( -(r2) : -(r1) )

graphics::image(x = c( (2*(c - 1) + 1.5) : (2*c + 0.5) ),
 y = y1,
 z = t(matrix( y1 , ncol = 1)),
 col = t(matrix( rainbow(len( y1 ), start = 0.5, end = 0.6) , ncol = 1)),
 add = T)
}

Similarly, the rainbow() call in this function uses the interval start = 0.5, end = 0.6, which gives the shades of blue; adjust it to match the palette you chose above.

Now if you want to draw the cycle-like graph:

# make matrix shorter for the graph (limit to 0-6 months)
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
temp = temp[, 1:7]
temp[temp == "0"] = NA

library(RColorBrewer)
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), 'retention', sep=' ')

# the palette definition was lost in the original formatting;
# any qualitative palette with one colour per column works, e.g.:
pal = brewer.pal(length(colnames(temp)), "Set1")

plot(temp[,1], pch=19, xaxt="n", col=pal[1], type="o",
 ylim=c(0, as.numeric(max(temp[,-2], na.rm=T))),
 xlab="Cohort by Month", ylab="Retention", main="Retention by Cohort")

for(i in 2:length(colnames(temp))) {
 points(temp[,i],pch=19,xaxt="n",col=pal[i])
 lines(temp[,i],pch=19,xaxt="n",col=pal[i])
}

axis(1,at=1:length(cohort_p$cohort),labels=as.vector(cohort_p$cohort),cex.axis=0.75)
legend("bottomleft",legend=colnames(temp),col=pal,lty=1,pch=19,bty="n")
abline(h=(seq(0,1,0.1)), col="lightgray", lty="dotted")

This code produces a nice visualization of the cohort analysis as a multicolour cycle graph:


Book review: Programming Amazon EC2

Should you buy a book on a new technology, or just read the technology provider’s guidelines, instructions and recommendations? This book was released over a year ago, so naturally it doesn’t cover all the latest developments on the Amazon AWS platform. For example, Simple Email Service (SES) and DynamoDB are not mentioned at all.

After a short historical introduction to the philosophy of the Amazon platform, the authors proceed to the basics of EC2 instance management (using Ubuntu as an example) and describe how several real-life web applications benefited from migrating to Amazon infrastructure. High-level architecture descriptions help to understand how all the pieces of the Amazon platform come together – ELB, EC2, RDS, S3, SQS, SNS, etc. Examples are provided in PHP, Ruby and Java.

Don’t expect any secret knowledge to be revealed in this book. A lot of intricate details, and answers to the questions you will face when planning a migration to AWS or designing the architecture for a new web application, are left out. At the same time the book gives a fairly good overview of how to make your application scalable and highly available on AWS, and it can serve as a good starting point in your AWS journey.

Some of the recommendations and descriptions in the book are outright sloppy. One example: the book claims that “You can choose an availability zone, but that doesn’t matter very much because network latency between zones is comparable to that within a zone”. But tests indicate that cross-availability-zone latency can be six times higher than intra-zone latency. For a network-intensive application, you’d better keep your instances crowded in the same zone.

Another example where you might want to run your own tests before making a decision, or look for a second opinion: “A disk is always slower than memory. If you run your own MySQL using local disks, that’s slow as well. But using disk-based operations in RDS is just horrible. Minimizing disk usage means implementing proper indexes, something you always want to do.” This is a very strong yet vague statement on RDS performance; I’d really love to see a performance comparison of a MySQL installation on EC2+EBS vs. RDS, and a list of what kinds of MySQL fine-tuning will and will not work on RDS.

Another annoying bit is that the authors keep promoting their commercial Android client for AWS monitoring throughout the chapters. As I see it, if there are ads, the book should be free.

Bottom line: I see only two reasons to buy and read this book – learning about the history of AWS, and learning how five selected web services designed their architectures when migrating to AWS.


Heat map visualization of sick day trends in Finland with R, ggplot2 and Google Correlate

Inspired by Margintale’s post “ggplot2 Time Series Heatmaps” and Google Flu Trends, I decided to use a heat map to visualize sick days logged by HeiaHeia.com’s Finnish users.

I got the data from our database, filtered by country (Finnish users only), in tab-separated form with the first line as a header. The three columns contain the date, the count of sick days logged on that date, and the count of Finnish users in the service on that date:

date count(*) user_cnt
2011-01-01 123 12345
2011-01-02 456 67890
...
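
If you don’t have comparable data at hand, a synthetic stand-in with the same three columns is enough to drive the plotting code (made-up numbers; if you use it, skip the read.csv()/colnames() lines below):

set.seed(1)
dat <- data.frame(date = seq(as.Date("2011-01-01"), as.Date("2012-12-31"), by = "day"))
dat$count    <- rpois(nrow(dat), lambda = 300)    # fake daily sick-day counts
dat$user_cnt <- 10000 + 10 * seq_len(nrow(dat))   # fake, slowly growing user base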

Below is R source code for plotting the heat map. I made some small changes to the original code:

  • data normalization: dividing the daily count by the number of users on that date is specific to the data used in this example
  • days of the week have to be 1..7, not 0..6 as returned by $wday: dat$weekday = as.numeric(format(as.POSIXlt(dat$date),"%u"))
  • date format: the week-of-year calculation requires converting the date to POSIX: dat$week <- as.numeric(format(as.POSIXlt(dat$date),"%W"))
  • custom header for the legend: adding + labs(fill="per user per day") lets you customize the legend title

require(zoo)
require(ggplot2)
require(plyr)

dat<-read.csv("~/data/sick_days_per_day.txt",header=TRUE,sep="\t")
colnames(dat) <- c("date", "count", "user_cnt")

# normalize data by number of users on each date
dat$norm_count <- dat$count / dat$user_cnt

# facet by year ~ month, and each subgraph will show week-of-month versus weekday
# the year is simple
dat$year<-as.numeric(as.POSIXlt(dat$date)$year+1900)
dat$month<-as.numeric(as.POSIXlt(dat$date)$mon+1)

# turn months into ordered factors to control the appearance/ordering in the presentation
dat$monthf<-factor(dat$month,levels=as.character(1:12),labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),ordered=TRUE)

# the day of week is again easily found
dat$weekday = as.numeric(format(as.POSIXlt(dat$date),"%u"))

# again turn into factors to control appearance/abbreviation and ordering
# I use the reverse function rev here to order the week top down in the graph
# you can cut it out to reverse week order
dat$weekdayf<-factor(dat$weekday,levels=rev(1:7),labels=rev(c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")),ordered=TRUE)

# the monthweek part is a bit trickier - first a factor which cuts the data into month chunks
dat$yearmonth<-as.yearmon(dat$date)
dat$yearmonthf<-factor(dat$yearmonth)

# then find the "week of year" for each day
dat$week <- as.numeric(format(as.POSIXlt(dat$date),"%W"))

# and now for each monthblock we normalize the week to start at 1
dat<-ddply(dat,.(yearmonthf),transform,monthweek=1+week-min(week))

# Now for the plot
P<- ggplot(dat, aes(monthweek, weekdayf, fill = norm_count)) +
 geom_tile(colour = "white") + facet_grid(year~monthf) + scale_fill_gradient(low="green", high="red") +
 opts(title = "Time-Series Calendar Heatmap - HeiaHeia.com sick days logged") + # opts() was removed in later ggplot2 versions; use ggtitle() there
 xlab("Week of Month") + ylab("") + labs(fill="per user per day")
P

Here are the results. Green indicates the healthiest days with lowest values of sick days logged per user, red indicates the worst days with highest values of sick days logged per user. It’s quite clear that there are seasonal peaks around February, and 2011 was a lot worse than 2012 (one should note that January-February of 2011 were exceptionally cold in Finland). It matches quite well with the coverage in the national press: Flu season reaching peak (Feb’2012), Employers grapple with sick leaves brought by flu wave (Feb’2012).

It’s interesting that there are fewer sick days logged on weekends than on workdays, and that the traditional holiday month of July is the healthiest month of all.



To get a more formal validation of the data logged by HeiaHeia users, I used the Google Correlate lab tool to check that the heat map results make sense. I uploaded the weekly sick-days-per-user time series and plotted the correlation with Google search queries for “kuumeen hoito” (“treatment of fever” in Finnish).



The Pearson correlation coefficient r between the HeiaHeia sick-days time series and the Google search activity (both normalized so that the mean is 0 and the standard deviation is 1) is 0.8257 – a pretty good match.
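
For reference, the comparison itself is a one-liner in R. A minimal sketch, assuming sick_days and search_activity hold the two weekly series as plain numeric vectors (the names are mine, not Google Correlate’s):

# scale() normalizes to mean 0, sd 1; cor() defaults to Pearson's r
sick_days_n       <- as.numeric(scale(sick_days))
search_activity_n <- as.numeric(scale(search_activity))
cor(sick_days_n, search_activity_n)

Note that Pearson’s r is invariant under this normalization, so cor() on the raw series gives the same value; the scaling only matters when overlaying the two curves on one axis.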


Setting up and securing web server (+MySQL +Rails/PHP) on Ubuntu 10.04 LTS

After repeating these operations many times in various setups, I decided to write up a public set of instructions and share them with the world. This should be suitable for most simple web sites built with Ruby on Rails or PHP.

The setup is based on Ubuntu 10.04 Server LTS (scheduled end of life: April 2015). The other components are Nginx as the web server and Phusion Passenger as the application server.

I use this setup most often on a Linode VPS, but none of the instructions are Linode-specific.

(Continued)


Book review: European Founders At Work

The book by Pedro Santos follows the format of Jessica Livingston’s “Founders at Work”, offering a series of interviews with the founders of European start-ups.

Entrepreneurs such as Ilya Segalovich (co-founder of Yandex), Loic Le Meur (founder of Seesmic and LeWeb), Peter Arvai (co-founder of Prezi) and many others (see the full list on the book’s website: www.europeanfoundersatwork.com) tell how they started, built, pivoted and drove their businesses to success.

The book gives a unique first-hand perspective on how to grow a successful business from Europe, the importance of the US market, the challenges European start-ups are facing, and the competitive advantages of being in Europe.

It is an inspiring book, and it is very relevant to European entrepreneurs. While stories of US start-ups quite often start with “we got $N million in funding and grew from there”, in Europe it’s more about bootstrapping and building a profit-generating machine. I would definitely recommend it to anyone who is thinking of starting a technology company in Europe or is already running one.

(Continued)


Informal notes from Strata 2012 conference on Big Data and Data Science

It’s been almost a month since I came back from California, and I’ve just got around to sorting my notes from the O’Reilly Strata conference. Spending time in the Valley is always inspiring – lots of interesting people, old friends, new contacts, new start-ups – it is the center of the IT universe.

Spending three days with people who are working at the bleeding edge of data science was an unforgettable experience. I got my dose of inspiration and a lot of new ideas on how to apply data science at HeiaHeia. It’s difficult to overestimate the importance data analysis will have in the coming years. Companies that don’t get the importance of understanding data and making their decisions based on data analysis, instead of the gut feeling of board members or operative management, will simply fade away.

Unfortunately HeiaHeia was the only company from Finland attending the conference. But I’m really happy to see more and more signals recently that companies in Finland are starting to realize the importance of data, and there are new Finnish start-ups dealing with data analysis. I believe that Finland has an excellent opportunity to host not only a cluster of game development companies, but also big data companies and start-ups. So far it seems that the Valley, London and Australia are leading in this field.

By the way, Trulia (co-founded by Sami Inkinen) had an excellent demo in the halls of the conference venue – check it out on their blog.

Below are my notes from the conference – I added links to the presentations and videos that I could find, but otherwise the notes are quite unstructured. There were multiple tracks, and it was very difficult to choose between them. The highlights of the conference were the talks by Avinash Kaushik, Jeremy Howard, Matt Biddulph, Ben Goldacre, and Alasdair Allan, and the Oxford-style debate on the proposition “In data science, domain expertise is more important than machine learning skill.” (see videos below).

(Continued)


Book review: The Start-up of You

I bought this book because it was written by Reid Hoffman, co-founder of LinkedIn, and because it’s so damn easy to buy books on Kindle.

It was a quick read, and I should say I’m a bit disappointed. If you want to save time and money, go to the book’s web page and you’ll get all the main ideas of the full version.

Yes, the world is changing very fast. You don’t want to become a Detroit of the modern age. You should not expect lifetime employment in one company. Go learn new stuff, go meet people to find interesting opportunities. I was expecting a more insightful book with less direct promotion of the LinkedIn service, but in the end I got a self-help, very Silicon Valley-centric guide to building a network using LinkedIn.

Unfortunately I cannot recommend this book, unless it’s news to you that you need to invest in continuous self-education and network building to succeed.


Book review: The Information Diet: A Case for Conscious Consumption

I read “The Information Diet” by Clay Johnson last Christmas. The central ideas of the book:
- information is like food – bad consumption habits are bad for your health
- it’s too easy to get yourself into an information bubble: “When we tell ourselves, and listen to, only what we want to hear, we can end up so far from reality that we start making poor decisions”
- fight your addictions: disable all notifications on your computer and check email only once in a while, not every 5 minutes, to improve your attention span
- media serves what people want to consume; if you want to get an objective picture, learn to go down to the facts instead of relying on pre-processed information

I’m conflicted about recommending this book. On one hand, it’s 4 hours of your time that could be spent on something better. On the other hand, it made me reconsider my own information diet, and I can see how I now get more done in a day because of it. So next time you want to open IM, GMail, Twitter or Facebook, consider reading this book instead.

Update: after months of trying to follow the recommendations of this book, I’ve noticed that my productivity has improved significantly. The topic of responsible information consumption has also come up many times in conversations I’ve had with many smart people over the past few months. There’s definitely a tectonic shift going on in information consumption habits.


REE segfaults when Rails application has too many localisation files

We ran into an interesting problem – at some point our Rails application started to fail occasionally with REE segfaults on startup. Even starting the console with ‘script/console production’ occasionally failed with an REE segfault. The application was growing, new features were being added, and the segfaults were happening more and more often. There was no single place where the crashes occurred, so there was no clear understanding of how to tackle the problem.

Examples of crashes we observed:

/vendor/rails/actionpack/lib/action_controller/routing/route.rb:205):2:
   [BUG] Segmentation fault
/opt/ruby-enterprise-1.8.7-2011.03/lib/ruby/1.8/yaml.rb:133: 
   [BUG] Segmentation fault
/vendor/rails/activesupport/lib/active_support/vendor/i18n-0.3.7/i18n/
   backend/base.rb:257: [BUG] Segmentation fault
/vendor/rails/actionpack/lib/action_view/template.rb:226: [BUG] Segmentation fault
/opt/ruby-enterprise-1.8.7-2011.03/lib/ruby/gems/1.8/gems/pauldix-sax-machine-0.0.14/
   lib/sax-machine/sax_document.rb:30: [BUG] Segmentation fault
/vendor/rails/activesupport/lib/active_support/memoizable.rb:32: [BUG] Segmentation fault

After banging my head against the wall for a week, I found a solution (two, even) and what seems to be a likely reason for the segfaults. Two “suspects” – lack of available memory and an incorrect version of libxml – were ruled out. What seems to be the actual trigger is the total size of the localisation files in config/locales loaded on startup:

$ du -shb config/locales
1665858    config/locales
$ cd config/locales
$ find . -type f | wc -l
805

So ~1.6 MB in 805 files caused occasional segfaults. Adding another 200 KB of localisation files made script/console segfault on startup 100% of the time.

Now I’ve found two workarounds for this problem.

1. Recompile REE with the --no-tcmalloc flag

./ruby-enterprise-1.8.7-2011.03/installer --no-tcmalloc

Note that on 64-bit platforms tcmalloc is disabled by default.

2. Enable the large pages feature in tcmalloc

This is described in tcmalloc documentation as: “Internally, tcmalloc divides its memory into “pages.”  The default page size is chosen to minimize memory use by reducing fragmentation. The cost is that keeping track of these pages can cost tcmalloc time. We’ve added a new, experimental flag to tcmalloc that enables a larger page size.  In general, this will increase the memory needs of applications using tcmalloc.  However, in many cases it will speed up the applications as well, particularly if they allocate and free a lot of memory.  We’ve seen average speedups of 3-5% on Google applications.”

There’s a warning that “this feature is still very experimental”, but it does solve the problem with too many localisation files.

To compile REE with tcmalloc’s large pages enabled, I edited ruby-enterprise-1.8.7-2011.03/source/distro/google-perftools-1.7/src/common.h, replacing the default constants (kPageShift = 12, i.e. 4 KB pages) with the large-page variants (kPageShift = 15, i.e. 32 KB pages). That is, I replaced

#if defined(TCMALLOC_LARGE_PAGES)
static const size_t kPageShift  = 15;
static const size_t kNumClasses = 95;
static const size_t kMaxThreadCacheSize = 4 << 20;
#else
static const size_t kPageShift  = 12;
static const size_t kNumClasses = 61;
static const size_t kMaxThreadCacheSize = 2 << 20;
#endif

with

static const size_t kPageShift  = 15;
static const size_t kNumClasses = 95;
static const size_t kMaxThreadCacheSize = 4 << 20;

On the production servers I opted for no tcmalloc for now – but I hope there’ll be a better way to deal with this issue soon.
