California Lottery – Scratchers – picking the winning tickets using math

I’ve always been fascinated by how the lottery works. From a mathematical point of view spending money on lottery tickets is a complete waste of time and money. There are a few exceptions – when a lottery is poorly designed, it is possible to game the system and actually earn money – here’s a famous …

Monty Hall Problem – How Randomness Rules Our World and Why We Cannot See It

Ever since I read about Monty Hall problem in “The Drunkard’s Walk: How Randomness Rules Our Lives” book by Leonard Mlodinow from of the California Institute of Technology, I always wanted to try and run a simulation to see that the math is correct. It is one of those problems, where the first answer that comes …

Are People in Colder Countries Taller? in Julia

Continuing to play with Julia and data visualizations. This time I decided to replicate a scatterplot created by Matt Stiles examining the relationship between a country’s average temperature and its male residents’ average height. Data comes from WorldBank and NCD-RisC. The size of the bubbles is linearly proportional to country population. Color indicates new World Bank income categories. People seem to …

Life Expectancy by Country

I was inspired by Andrew Collier’s blog post Life Expectancy by Country where he illustrated how to create a bubble chart that compares female and male life expectancies for a number of countries based on the data scraped from Wikipedia using R and charts. I decided to replicate these results using another popular language for technical computing – Julia. Scraping Wikipedia in …

Heat map visualization of sick day trends in Finland with R, ggplot2 and Google Correlate

Inspired by Margintale’s post “ggplot2 Time Series Heatmaps” and Google Flu Trends I decided to use a heat map to visualize sick days logged by Finnish users. I got the data from our database, filtering results by country (Finnish users only) in a tab separated form with the first line as the header. Three columns …

Informal notes from Strata 2012 conference on Big Data and Data Science

It’s been almost a month since I came back from California, and I just got around to sorting the notes from O’Reilly Strata conference. Spending time in the Valley is always inspiring – lots of interesting people, old friends, new contacts, new start-ups – it is the center of IT universe.

Spending 3 days with people who are working at the bleeding edge of data science was an unforgettable experience. I got my doze of inspiration and got a lot of new ideas how to apply data science in HeiaHeia. It’s difficult to underestimate the importance data analysis will have in the nearest years. Companies that do not get the importance of understanding data and making their decisions based on data analysis instead of gut feeling of board members/operative management will simply fade away.

Unfortunately HeiaHeia was the only company from Finland attending the conference. But I’m really happy to see that recently there are more and more signals that companies in Finland are starting to realize the importance of data, and there are new Finnish start-ups dealing with data analysis. I believe that Finland has an excellent opportunity to have not only a cluster of game development companies, but also big data companies and start-ups. So far it seems that the Valley, London and Australia are leading in this field.

By the way, Trulia (co-founded by Sami Inkinen) had an excellent demo in the halls of the conference venue – check it out in their blog.

Below are my notes from the conference – I added presentation links and videos that I have found, but otherwise those are quite unstructured. There were multiple tracks and it was very difficult to choose between them. Highlights of the conference are talks by Avinash Kaushik, Jeremy Howard, Matt Biddulph, Ben Goldacre, and Alasdair Allan and the Oxford-style debate on the proposition “In data science, domain expertise is more important than machine learning skill.” (see videos below).

Continue readingInformal notes from Strata 2012 conference on Big Data and Data Science