Informal notes from Strata 2012 conference on Big Data and Data Science

It’s been almost a month since I came back from California, and I just got around to sorting the notes from the O’Reilly Strata conference. Spending time in the Valley is always inspiring – lots of interesting people, old friends, new contacts, new start-ups – it is the center of the IT universe.

Spending 3 days with people who are working at the bleeding edge of data science was an unforgettable experience. I got my dose of inspiration and a lot of new ideas on how to apply data science in HeiaHeia. It’s difficult to overestimate the importance data analysis will have in the coming years. Companies that do not get the importance of understanding data and of basing their decisions on data analysis instead of the gut feeling of board members/operative management will simply fade away.

Unfortunately HeiaHeia was the only company from Finland attending the conference. But I’m really happy to see that recently there are more and more signals that companies in Finland are starting to realize the importance of data, and there are new Finnish start-ups dealing with data analysis. I believe that Finland has an excellent opportunity to have not only a cluster of game development companies, but also big data companies and start-ups. So far it seems that the Valley, London and Australia are leading in this field.

By the way, Trulia (co-founded by Sami Inkinen) had an excellent demo in the halls of the conference venue – check it out on their blog.

Below are my notes from the conference – I added the presentation links and videos that I have found, but otherwise they are quite unstructured. There were multiple tracks and it was very difficult to choose between them. Highlights of the conference were the talks by Avinash Kaushik, Jeremy Howard, Matt Biddulph, Ben Goldacre, and Alasdair Allan, and the Oxford-style debate on the proposition “In data science, domain expertise is more important than machine learning skill.” (see videos below).



Book review: The start-up of you

I bought this book because it was written by Reid Hoffman, co-founder of LinkedIn, and because it’s so damn easy to buy books on Kindle.

It was a quick read, and I should say I’m a bit disappointed. If you want to save time and money – go to the book’s web page and you’ll get all the main ideas that are described in the full version.

Yes, the world is changing very fast. You don’t want to become a Detroit of the modern age. You should not expect lifetime employment in one company. Go learn new stuff, go meet people to find interesting opportunities. I was expecting a more insightful book with less direct LinkedIn service promotion, but in the end got a self-help, very Silicon Valley-centric guide on building a network using LinkedIn.

Unfortunately I cannot recommend this book, unless it’s news for you that you need to invest in continuous self-education and network building to succeed.


Book review: The Information Diet: A Case for Conscious Consumption

I read “Information Diet” by Clay Johnson last Christmas. Central ideas of the book:
– information is like food – bad consumption habits are bad for your health
– it’s too easy to get yourself into an information bubble: “When we tell ourselves, and listen to, only what we want to hear, we can end up so far from reality that we start making poor decisions”
– fight your addictions: to improve your attention span, disable all notifications on your computer and check email only once in a while, not every 5 minutes
– media serves what people want to consume; if you want to get an objective picture – learn to go down to the facts instead of relying on pre-processed information

I’m conflicted about recommending this book. On one hand – it’s 4 hours of your time that could be spent on something better. On the other hand – it made me reconsider my own information diet and I see how I can now do more in a day because of that. So next time you want to open IM, GMail, Twitter or Facebook, consider reading this book instead.

Update: after  months trying to follow the recommendations of this book – I have noticed that I have significantly improved my productivity. Topic of responsible information consumption also came up many times in conversations I had with many smart people in the past few months. There’s definitely a tectonic shift going on when it comes to information consumption habits.


REE segfaults when Rails application has too many localisation files

We ran into an interesting problem – at some point our Rails application started to fail occasionally because of REE segfaults on startup. Even starting the console with ‘script/console production’ occasionally failed with an REE segfault. As the application grew and new features were added, the segfaults started happening more and more often. There was no single place where the crashes occurred, so it was not clear how to tackle the problem.

Examples of crashes we observed:

   [BUG] Segmentation fault
   [BUG] Segmentation fault
   backend/base.rb:257: [BUG] Segmentation fault
/vendor/rails/actionpack/lib/action_view/template.rb:226: [BUG] Segmentation fault
   lib/sax-machine/sax_document.rb:30: [BUG] Segmentation fault
/vendor/rails/activesupport/lib/active_support/memoizable.rb:32: [BUG] Segmentation fault

After banging my head against the wall for a week I found a solution (even two) and what might seem to be a likely reason for the segfaults. Two “suspects” – lack of available memory and incorrect version of libxml were ruled out. What seems to be the actual reason is the total size of the localisation files in config/locales loaded upon startup:

$ du -shb config/locales
1665858    config/locales
$ cd config/locales
$ find . -type f | wc -l
805

So ~1.6 MB in 805 files gives occasional segfaults. Adding another 200 KB of localisation files made script/console segfault on every startup.
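The same measurement can be taken from Ruby, for example in a startup check that warns when the locale directory grows. This is only a rough sketch: locale_footprint is a hypothetical helper name, and it sums regular files only, so its byte count can differ slightly from what du -b reports for the directory.

```ruby
require "find"

# Hypothetical helper (not part of Rails or REE): counts the regular files
# under a directory and sums their sizes in bytes.
def locale_footprint(dir)
  files = 0
  bytes = 0
  Find.find(dir) do |path|
    next unless File.file?(path) # skip directories and other non-files
    files += 1
    bytes += File.size(path)
  end
  { :files => files, :bytes => bytes }
end
```

Calling locale_footprint("config/locales") from the application root returns a hash with the file count and the total size, which can then be compared against whatever threshold turns out to be safe for your setup.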

Now I’ve found two workarounds for this problem.

1. Recompile REE with the --no-tcmalloc flag

./ruby-enterprise-1.8.7-2011.03/installer --no-tcmalloc

Note that on 64-bit platforms tcmalloc is disabled by default.

2. Enable large pages feature in tcmalloc

This is described in tcmalloc documentation as: “Internally, tcmalloc divides its memory into “pages.”  The default page size is chosen to minimize memory use by reducing fragmentation. The cost is that keeping track of these pages can cost tcmalloc time. We’ve added a new, experimental flag to tcmalloc that enables a larger page size.  In general, this will increase the memory needs of applications using tcmalloc.  However, in many cases it will speed up the applications as well, particularly if they allocate and free a lot of memory.  We’ve seen average speedups of 3-5% on Google applications.”

There’s a warning – “this feature is still very experimental”, but it works to solve the problem with too many localisation files.

To compile REE with tcmalloc large pages enabled I just edited ruby-enterprise-1.8.7-2011.03/source/distro/google-perftools-1.7/src/common.h – replaced

#if defined(TCMALLOC_LARGE_PAGES)
static const size_t kPageShift  = 15;
static const size_t kNumClasses = 95;
static const size_t kMaxThreadCacheSize = 4 << 20;
#else
static const size_t kPageShift  = 12;
static const size_t kNumClasses = 61;
static const size_t kMaxThreadCacheSize = 2 << 20;
#endif

with

static const size_t kPageShift  = 15;
static const size_t kNumClasses = 95;
static const size_t kMaxThreadCacheSize = 4 << 20;

On production servers I opted for no tcmalloc for now – but I hope there’ll be a better way to deal with this issue soon.


Pitfalls of Rails fragment caching with memcached

Fragment caching is a powerful technique for improving the performance of your web application. The Rails site describes in detail how to apply this technique.

Rails provides developers with really excellent abstractions, but it’s always good to know what’s under the hood and how it all works.

There are a few things that might potentially cause bugs in your code, or waste your time (speaking from my own experience). So here goes:

1. Beware of globally keyed fragments

Let’s take an example from the Rails tutorial:

<% cache do %>
  All available products:
  <% Product.all.each do |p| %>
    <%= link_to p.name, product_url(p) %>
  <% end %>
<% end %>

Now if you need to deal with a multi-language site you might want to make the cache fragment language-dependent. What might seem a convenient solution:

<%- cache([user.locale.to_s]) do -%>

will turn into a source of very interesting problems. While calling the cache method without parameters automatically creates a controller/action-specific cache key, calling it with a key makes this fragment a globally keyed fragment. The cache key in the first case is going to look like “views/localhost:3000/controller-name”, and in the other case “views/en” – which is no longer a unique identifier.

While the automatic cache key naming provided by Rails is very convenient, it is very easy to run into a problem with duplicate cache key names used in different places.
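One way to keep a locale-dependent key unique is to put the controller and action into the key yourself. The sketch below is only an illustration: scoped_fragment_key is a hypothetical helper, not something Rails provides, and the exact key layout is an assumption.

```ruby
# Hypothetical helper (not part of Rails): builds a fragment key that is
# scoped to the controller and action as well as to the locale, so two
# views using the same locale no longer share one cache entry.
def scoped_fragment_key(controller, action, locale, *extra)
  [controller, action, locale, *extra].map(&:to_s).join("/")
end

# Possible use in a view (assuming controller_name and action_name are
# available there, as they are in the Rails versions I have used):
#   <%- cache(scoped_fragment_key(controller_name, action_name, user.locale)) do -%>
puts scoped_fragment_key("products", "index", :en)
```

With this the products index fragment ends up under a key like “views/products/index/en” instead of the shared “views/en”.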

2. Another pitfall of automatic cache key naming: never assume that a fragment cached under a global key can later be found under that same key, e.g. via the telnet interface to memcached. Example – add

<%- cache('unique_cache_key') do -%>
<%- end -%>

in your view and then try to read directly from memcache:

$ telnet localhost 11211
get unique_cache_key

returns nothing, because memcached has no entry under that key. At the same time

get views/unique_cache_key

will work. It’s easy to make this mistake trying to check or delete cache keys directly from memcache when using Rails cache methods.
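When poking at the cache by hand it helps to build the memcached key the same way Rails does. A tiny sketch, assuming the “views/” prefix observed above and no additional cache namespace configured; memcached_get_command is a hypothetical helper name.

```ruby
# Assumption: fragment names are stored under a "views/" prefix, as seen
# above, and no extra namespace is configured for the cache store.
FRAGMENT_PREFIX = "views/"

# Hypothetical helper: returns the line to type into a telnet session to
# fetch the fragment that was cached under the given name in a view.
def memcached_get_command(fragment_name)
  "get #{FRAGMENT_PREFIX}#{fragment_name}"
end

puts memcached_get_command("unique_cache_key")
```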

3. delete_matched is not supported by memcached (see rails/activesupport/lib/active_support/cache/mem_cache_store.rb)

In practice that means that if you’re using memcached as the Rails cache store and try to delete or expire fragment caches by passing a regexp to the standard Rails methods, the call will fail miserably. The ideal solution is not to use explicit cache expiration at all, but rather to create cache keys in such a way that they don’t require expiration. Alternatively it’s possible to use extensions implementing delete_matched for memcached (haven’t tried it myself though).
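A common way to build keys that never need expiring is to include the record’s last-modified time in the key: once the record changes, the view asks for a new key and the stale entry is simply never read again (memcached evicts it eventually). A minimal sketch, assuming the record responds to id and updated_at as ActiveRecord models normally do; versioned_fragment_key is a hypothetical helper.

```ruby
# Hypothetical helper (not part of Rails): derives a fragment key from the
# record's class name, id and last-modified time plus the locale, so an
# updated record automatically maps to a fresh cache entry.
def versioned_fragment_key(record, locale)
  stamp = record.updated_at.utc.strftime("%Y%m%d%H%M%S")
  [record.class.name.downcase, record.id, stamp, locale].join("/")
end

# Possible use in a view:
#   <%- cache(versioned_fragment_key(product, user.locale)) do -%>
```

The trade-off is that stale fragments stay in memcached until they are evicted, so this works best when the cache has enough room for some dead entries.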

Tip: one very useful tool for checking memcached is peep by Evan Weaver – it allows you to peek into the cache and see what’s really cached and how it is used.


Notes from Gothenburg – Nordic Ruby 2011 conference

Here are my notes from the Nordic Ruby conference in Göteborg, Sweden.

I’d like to say a big thanks to the organisers of the conference (especially CJ @cjkihlbom) – everything went really smoothly, even though there were 150 people attending this year compared to 90 last year.
Some points that I’d really like to highlight are:

  • a lot of time to meet people and discuss: 30-minute talks followed by 30-minute breaks, no Q&A – those who had questions had an opportunity to talk to the speakers during the breaks
  • the venue was great (of course, the boat 🙂) – there was enough space for everyone to move around, but at the same time it was compact enough not to get lost; also, everyone had an opportunity to have lunch and dinner together
  • “job board” – a huge white board where anyone can post information about open positions in their companies – it got filled within the first few hours – the job market is really hot
  • lightning talks that any participant can give – 5-minute talks at the end of the day – it was really great
  • real coffee 🙂 espresso, latte, cappuccino, americano – you name it – professional baristas were at your service
  • a 5K Nordic Ruby run organised on the morning of the second day



vim is the best editor, also for RoR development

vim is a natural choice when you’re starting a new programming project (if you’re an emacs or textmate adept – you can stop reading now 🙂). If you’re starting a Ruby on Rails project there are a couple of scripts/configurations you might want to install to make development with vim an even more pleasant experience.

1. Rails.vim by Tim Pope

To install, just copy autoload/rails.vim, plugin/rails.vim, and doc/rails.txt to the corresponding directories in ~/.vim

Vim scripts section has a full description of the functionality. Some highlights:

  • Easy navigation between models, controllers and views with :Rmodel, :Rview, :Rcontroller commands
  • Syntax highlighting
  • CTRL-X CTRL-U for autocompletion
  • :Rtree for project tree (see item 2)

2. NERD Tree by Marty Grenfell

Another must-have. Provides you with an easy way to navigate your project tree. Rails.vim nicely integrates with this one.

3. Colour schemes

If you want to save your eyes – use a dark background when developing. Personally I prefer one of the standard vim themes – Desert, but Ocean Deep is also very good.

Copy the theme file to ~/.vim/colors and then use

:colorscheme oceandeep

command to apply it.


Installing Flash player browser plugin in 64-bit Ubuntu 11.04

I got a new hard drive for my laptop and decided to make a leap of faith and move to the 64-bit version of Ubuntu, since I had to install a fresh system anyway.

In case you didn’t know – Adobe doesn’t have a stable Flash player version for 64-bit Linux. Adobe Labs offers a preview release codenamed “Square” for 64-bit platforms. It can’t be installed via the standard Ubuntu repositories, so get ready to get your hands dirty in the terminal.

To install 64-bit Flash player plugin do the following:

1. Download the latest version of the plugin from Adobe Labs (currently it is v.10.2 preview 3 from November 30th, 2010)
2. Go to your downloads directory and extract the plugin binary

tar xvzf flashplayer10_2_p3_64bit_linux_111710.tar.gz

3. Create a directory for browser plugins in your user’s home directory

mkdir -p ~/.mozilla/plugins

4. Move the Flash player plugin binary extracted in step 2 to its new location

mv libflashplayer.so ~/.mozilla/plugins

5. Close all browser windows and restart the browser.

6. In Firefox or Chrome go to about:plugins to verify that the Shockwave Flash plugin is available



Increasing Ruby interpreter performance by adjusting garbage collector settings

According to Evan Weaver from Twitter it is possible for a typical production Rails app on Ruby 1.8 to recover 20% to 40% of user CPU by simply adjusting Ruby garbage collector settings. In August I set out on a quest to verify that statement on HeiaHeia servers. Results have really exceeded my expectations. Time to execute application tests locally decreased by 46%. On production servers CPU utilisation decreased by almost 40%.


Call to undefined function: imagecreatefromjpeg()

While installing new Joomla modules I came across this PHP error (yep, still have to deal with PHP occasionally). I had PHP compiled from source on Ubuntu 10.04 as per earlier instructions. Quick check of phpinfo() indicated that while gd module was compiled in, it didn’t have JPEG support:

GD Support         enabled
GD Version         bundled (2.0.34 compatible)
GIF Read Support   enabled
GIF Create Support enabled
PNG Support        enabled
WBMP Support       enabled
XBM Support        enabled

Making sure that JPEG libraries are installed

sudo aptitude install libjpeg libjpeg-dev

and reconfiguring PHP with the --with-jpeg-dir flag (the rest of the compilation process remains the same as here)

./configure --enable-fastcgi --enable-fpm --with-mcrypt --with-zlib \
--enable-mbstring --with-openssl --with-mysql --with-mysql-sock \
--with-gd --without-sqlite --disable-pdo --with-jpeg-dir=/usr/lib

and then restarting nginx

sudo /etc/init.d/nginx restart

helped to solve the problem.