California Lottery – Scratchers – picking the winning tickets using math

I’ve always been fascinated by how the lottery works. From a mathematical point of view, buying lottery tickets is a waste of time and money. There are a few exceptions – when a lottery is poorly designed, it is possible to game the system and actually earn money. A famous example is how MIT students won $8 million in the Massachusetts Lottery.

The North American lottery system is a $70 billion-a-year business, an industry bigger than movie tickets, music, and porn combined. The general public seems to ignore probability theory and continues to spend money in pursuit of the elusive multimillion-dollar grand prizes of the numerous lotteries.

I decided to take a closer look at the California State Lottery and came across a very nice analysis done by the Wizard of Odds. I wholeheartedly agree with the advice given by the Wizard: “Don’t play in the first place. Every state lottery offers terrible odds. With few exceptions, it is the worst bet you can make.”

But I was intrigued by the following passage in the article: “The California Lottery is nice enough to indicate how many tickets for each win have already been cashed. If there is a game that is almost sold out, as evidenced by the small wins, with a high ratio of large wins still unclaimed, then it may mean the remaining unsold tickets are rich in big winners. The same principle as card counting in blackjack.”

I decided to take a close look at the odds of winning and the average return for each of the scratch card games the CA Lottery offers, and to see whether the additional data provided by the state lottery substantially changes the probability of winning and the return on investment.

Here is a summary of my findings:

  1. More expensive scratchers have higher odds of winning and higher return:

     Scratch card average returns

     Bet     Average return
     $1      52.49%
     $2      56.06%
     $3      57.57%
     $5      62.37%
     $10     67.75%
     $20     69.02%
     $30     72.63%
  2. There is indeed a potential advantage play in scratch card games: the CA Lottery publishes the number of claimed prizes, which makes it possible to calculate updated winning odds and returns – see the spreadsheet for details.
  3. Over time the odds of winning and the expected return can go up or down by ~20%, so if you gamble, it definitely makes sense to run the numbers first.
  4. As of February 10, 2018, Set For Life scratchers have the highest estimated return: 75.65%.
  5. As of February 10, 2018, $10 Million Dazzler scratchers have the highest estimated chance of winning: 36.5%.
  6. In the unlikely event you win a grand prize, be aware of the fine print – the lottery is not going to pay it to you straight away. You can either take home ~1/2 of it (before taxes) immediately or have it paid out in installments over 25 years or so.
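
The idea behind point 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the logic of the spreadsheet: the names `PrizeTier` and `estimated_return` are made up for the example, the prize counts are invented, and the share of tickets already sold is approximated by the share of the most common (smallest) prize that has been claimed.

```python
from dataclasses import dataclass


@dataclass
class PrizeTier:
    prize: float   # payout in dollars
    total: int     # prizes printed at this tier
    claimed: int   # prizes already cashed


def estimated_return(price, total_tickets, tiers):
    """Estimate the return per dollar spent on the tickets still unsold."""
    # Assumption: the most common prize tier is claimed roughly in step
    # with ticket sales, so it serves as a proxy for the share sold.
    common = max(tiers, key=lambda t: t.total)
    sold_fraction = common.claimed / common.total
    remaining_tickets = total_tickets * (1 - sold_fraction)
    if remaining_tickets <= 0:
        raise ValueError("game appears to be sold out")
    remaining_value = sum(t.prize * (t.total - t.claimed) for t in tiers)
    return remaining_value / (price * remaining_tickets)


# Invented numbers for a hypothetical $5 game with 1,000,000 tickets.
tiers = [
    PrizeTier(prize=5, total=100_000, claimed=80_000),
    PrizeTier(prize=100, total=1_000, claimed=800),
    PrizeTier(prize=100_000, total=2, claimed=0),
]
initial = sum(t.prize * t.total for t in tiers) / (5 * 1_000_000)
current = estimated_return(5, 1_000_000, tiers)
print(f"initial return: {initial:.2%}, estimated current return: {current:.2%}")
```

In this made-up game most small prizes are gone while both top prizes are still unclaimed, so the estimated return on the remaining tickets is higher than the return at launch. Real games have many more tiers, and unclaimed winning tickets that were already sold will make this estimate optimistic.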

My full analysis is in this spreadsheet – it automatically downloads the latest stats on claimed prizes and winning odds, so the results you see there might differ from the examples I provided above.

And as a closing note – my general advice is “do not gamble”. If you do, then the California Lottery is a good choice – its mission is to maximize supplemental funding for the state’s public schools, which is very similar to the mission of Veikkaus, the Finnish national betting agency. At least the money you lose will go to a good cause.


Life 3.0: Being Human in the Age of Artificial Intelligence by Max Tegmark

The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom. – Isaac Asimov

I just finished reading Life 3.0, and I can say it deserves to be on the mandatory reading list for anyone working in the AI field.

This is not a popular science book, but a great summary of the current thinking about AI technology, its implications and potential scenarios for the future of our species. Tegmark does an excellent job of clarifying the terms and dispelling common myths surrounding AI.

While there is a fair bit of futurology in the book, it also lists a number of near-term issues that we need to start thinking about. This is by far the most balanced book on the subject that I have read.

The debate on whether regulation of AI is a) needed and b) timely is still ongoing, and Tegmark’s book provides a solid foundation to build on.

Whether this discussion is timely is a great question. One thing is certain: it’s better to have it early than late. When Henri Becquerel and Marie and Pierre Curie discovered radioactivity at the end of the 19th century, was it premature to discuss a Nuclear Nonproliferation Treaty? Probably yes, as it took another 50 years to develop the first nuclear bomb.

However, with AI we might not have the luxury of time – once the first artificial general intelligence (AGI) is created, it might be too late. And even before that happens, we as a society will need to make many important decisions that will direct the research, regulate technology, adjust legal frameworks, and prepare the economy for the inevitable mass redundancy of human labor. For that we need the governments to understand the AI and

It is very important to start this conversation now.


One of the strangest books I read lately: “Russia Rising” by Seth Chanowitz

“Russia Rising” by Seth Chanowitz is probably one of the strangest books I read in the past few years. I don’t even remember how I came across it – must’ve been the magic of Amazon’s recommendation algorithms. It caught my eye as the description said that “It takes places in Finland, Estonia, Russia, Belarus, and the United States and is written by a former intelligence professional.”

The book is filled with mistakes – both spelling (like Koviposti instead of Kaivopuisto, Alvar Alto instead of Alvar Aalto, Esplandi instead of Esplanadi, Mannerheim Katu instead of Mannerheimintie, etc.) and factual (like ‘coastal town of Lohja’, while in fact Lohja is about 30km from the coast, Swedish imperial design of government buildings in Helsinki, while in fact Helsinki became a capital only in 1812 and government buildings there were commissioned by Russia, mixups of the main cathedrals in Helsinki – the red Orthodox and white Lutheran ones, etc.).

The author claims to be a US intelligence analyst who lived in Finland for a long time and speaks Finnish, which makes the mistakes even more embarrassing.

The writing itself is extraordinarily bad, from the flat characters to the very dry and mechanical description of events. At times I even thought that the book had been synthesized by a well-trained generative neural network.

Unfortunately, the very poor execution completely ruins an otherwise entertaining plot.


Monty Hall Problem – How Randomness Rules Our World and Why We Cannot See It

Ever since I read about the Monty Hall problem in the book “The Drunkard’s Walk: How Randomness Rules Our Lives” by Leonard Mlodinow of the California Institute of Technology, I have wanted to run a simulation to see that the math is correct.

It is one of those problems where the first answer that comes to mind is usually wrong and the correct answer is counterintuitive, but it comes through clearly after doing the math or running the code.

As Wikipedia describes the problem, which was originally posed in a letter by Steve Selvin to the American Statistician in 1975 and became famous through Marilyn vos Savant’s “Ask Marilyn” column in Parade magazine in 1990:

“Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?”

I won’t explain the mathematical solution here – there are plenty of explanations on the internet, using either the intuitive solution/tree diagram or conditional probability/Bayes’ theorem.

However here’s the code to run the simulation. The output is as expected – switching the door does indeed increase the winning probability.

Hold strategy wins: 3287, win probability: 0.33
Switch strategy wins: 6713, win probability: 0.67
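
A minimal sketch of such a simulation in Python, which is not necessarily the language or the code that produced the numbers above. The function names `play_round` and `simulate` are illustrative. Each round is scored for both strategies at once, so the two win counts always add up to the number of rounds.

```python
import random


def play_round(rng):
    """Play one game; return (hold_wins, switch_wins) for that game."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host knows where the car is and opens a goat door
    # that is not the contestant's pick.
    opened = rng.choice([d for d in doors if d not in (pick, car)])
    # Switching means taking the one door that is neither picked nor opened.
    switched = next(d for d in doors if d not in (pick, opened))
    return pick == car, switched == car


def simulate(trials=10_000, seed=None):
    """Count wins for the hold and switch strategies over many games."""
    rng = random.Random(seed)
    hold_wins = switch_wins = 0
    for _ in range(trials):
        hold, switch = play_round(rng)
        hold_wins += hold
        switch_wins += switch
    return hold_wins, switch_wins


trials = 10_000
hold_wins, switch_wins = simulate(trials)
print(f"Hold strategy wins: {hold_wins}, win probability: {hold_wins / trials:.2f}")
print(f"Switch strategy wins: {switch_wins}, win probability: {switch_wins / trials:.2f}")
```

The exact counts vary from run to run unless a seed is passed to `simulate`, but the hold strategy should win about one third of the games and the switch strategy about two thirds.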

Are People in Colder Countries Taller? in Julia



Continuing to play with Julia and data visualizations. This time I decided to replicate a scatterplot created by Matt Stiles examining the relationship between a country’s average temperature and its male residents’ average height.

Data comes from the World Bank and NCD-RisC. The size of the bubbles is linearly proportional to country population. Color indicates the new World Bank income categories.

People do indeed seem to be taller in colder countries. It would be interesting to explore the relationship between wealth and height more deeply, especially by adding a time dimension – are people growing taller as countries become wealthier?


An interactive plot is also available, so you can play with the data yourself.



Life Expectancy by Country

I was inspired by Andrew Collier’s blog post Life Expectancy by Country, where he illustrated how to create a bubble chart that compares female and male life expectancies for a number of countries, based on data scraped from Wikipedia, using R and charts.

I decided to replicate these results using another popular language for technical computing – Julia.

Scraping Wikipedia in Julia proved to be less elegant, as Julia is missing a convenient package for ingesting tabular data from web pages into data frames, but otherwise it was a relatively simple task.

The dotted line in the chart corresponds to equal female and male life expectancy. The size of the bubbles is linearly proportional to country population.

An interactive plot is also available.



“The Nordic Theory of Everything: In Search of a Better Life” by Anu Partanen

Anu Partanen is a Finnish journalist now living and working in the United States.

In her new book “The Nordic Theory of Everything: In Search of a Better Life” she compares how Nordic/Finnish and American societies address key issues such as healthcare, education, parental leave, and unemployment.

This book hits close to home. I’m a naturalized Finnish citizen and spent most of my adult life in Finland and Norway before relocating with my family to California.

tl;dr: The book is a nuanced, well-argued critical analysis of the strengths and weaknesses of American and Nordic societies. It debunks the myth that affordable universal healthcare, free quality education, and greater participation of women in economic life through affordable daycare and paid parental leave can only be achieved in “nanny states” that discourage individuality.

As Anu writes:

The Nordic countries demonstrate that building strong public services can create economic growth, and that pooling the risks everyone faces in life – sickness, unemployment, old age, the need to be educated to secure a decent living – into one system funded by everyone is more efficient, and more effective, than each person saving individually to ensure security and survive misfortune, especially in today’s age of global economic uncertainty and competition.



“High Performance MySQL” by Baron Schwartz, Peter Zaitsev, and Vadim Tkachenko; O’Reilly Media

Peter and Vadim are long-term contributors to the MySQL open source code base and the founders of Percona, a MySQL consultancy. This is really THE book about MySQL, written by engineers who have spent more than a decade working on MySQL code.

The book is equally valuable to devops and DB admins, as well as software developers. It assumes that you have a good working knowledge of MySQL, as it dives straight into the intricate details of the system architecture and performance fine-tuning.

I have first-hand experience using the second edition of this book, and I can say that it paid for itself many times over when I was debugging critical performance issues I came across in operations. This third edition (published in 2012) also covers new features in MySQL version 5.5.

In this new edition the authors make a real effort to explain not only how MySQL works, but why certain features in MySQL work the way they do.

Case studies peppered throughout the book are particularly interesting, as they give insight into how to deal with MySQL in large scale setups with high load and demand for high availability.

This book covers a lot of very complex topics, from replication, clustering and sharding to high availability. The practical advice on how to profile your server and how to find and fix bottlenecks, both in the server setup and in the SQL queries on the application side, is invaluable.

I very much liked that the book is organized in such a way that it can easily be used as a reference.

This is a must-read for anyone who has a significant load on their MySQL DB server or is considering moving from a single-server setup to a cluster.

Get your copy of the book at O’Reilly.


Measuring user retention using cohort analysis with R

Cohort analysis is super important if you want to know whether your service is in fact a leaky bucket despite nice growth in absolute numbers. There’s a good write-up on the subject, “Cohorts, Retention, Churn, ARPU” by Matt Johnson.

So how do you do it in R, and how do you visualize it? Inspired by the examples described in “Retention, Cohorts, and Visualizations”, I came up with the following solution.

First, get the data in a suitable format, like this:

cohort  signed_up  active_m0  active_m1  active_m2
2011-10 12345      10432      8765       6754
2011-11 12345      10432      8765       6754
2011-12 12345      10432      8765       6754

Cohort here is in “YYYY-MM” format, signed_up is the number of users who have created accounts in the given month, active_m0 – number of users who have been active in the same month as they registered, active_m1 – number of users who have been active in the following month, and so forth. For newest cohorts you’ll be getting zeroes in some of active_mN columns, since there’s no data on them yet. This is taken into account in processing scripts.
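
The conversion from absolute numbers to retention shares can be sketched independently of the plotting code. This is a minimal Python illustration, not the R script that follows. The function name `retention` and the sample text are made up for the example: the first row uses the numbers from the table above, and the zeroes in the later rows are invented to show how cohorts with no data yet are treated as missing.

```python
SAMPLE = """\
cohort  signed_up  active_m0  active_m1  active_m2
2011-10 12345      10432      8765       6754
2011-11 12345      10432      8765       0
2011-12 12345      10432      0          0
"""


def retention(text):
    """Return (month column names, {cohort: [share or None, ...]})."""
    header, *lines = text.strip().splitlines()
    months = header.split()[2:]
    table = {}
    for line in lines:
        cohort, signed_up, *active = line.split()
        signed_up = int(signed_up)
        # A zero means "no data yet" for a young cohort, so store None
        # rather than a 0% retention figure.
        table[cohort] = [
            int(a) / signed_up if int(a) > 0 else None for a in active
        ]
    return months, table


months, table = retention(SAMPLE)
print("cohort ", *months)
for cohort, shares in table.items():
    cells = ["  n/a" if s is None else f"{s:5.0%}" for s in shares]
    print(cohort, *cells)
```

Treating zero as missing mirrors the "drop 0 values" step in the R code below. A cohort that genuinely had no active users in a month would be hidden by this, which is an accepted trade-off in this sketch.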


# Load SystematicInvestor's plot.table
con = gzcon(url('', 'rb'))
source(con)
close(con)

# Read the data into a data frame with the columns described above
# ('cohorts.txt' is a hypothetical file name)
df = read.table('cohorts.txt', header = TRUE)

# Let's convert absolute values to percentages (% of the registered users remaining active)
# Columns: V1 = cohort, V2 = signed_up, V3 onwards = share active in months 0..8
cohort_p = as.data.frame(cbind(df$cohort, df$signed_up,
as.numeric(df$active_m0/df$signed_up), as.numeric(df$active_m1/df$signed_up), as.numeric(df$active_m2/df$signed_up),
as.numeric(df$active_m3/df$signed_up), as.numeric(df$active_m4/df$signed_up), as.numeric(df$active_m5/df$signed_up),
as.numeric(df$active_m6/df$signed_up), as.numeric(df$active_m7/df$signed_up), as.numeric(df$active_m8/df$signed_up) ))

# Create a matrix
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), sep=' ')
rownames(temp) = as.vector(cohort_p$V1)

# Drop 0 values and format data
temp[] = plota.format(100 * as.numeric(temp), 0, '', '%')
temp[temp == " 0%"] = NA

# Plot cohort analysis table
plot.table(temp, smain='Cohort(users)', highlight = TRUE, colorbar = TRUE)

This code produces nice visualizations of the cohort analysis as a table:

I used articles “Visualizing Tables with plot.table” and “Response to Flowingdata Challenge: Graphing obesity trends” as an inspiration for this R code.

If you want to get nice colours as in the example above, you’ll need to adjust the rainbow() interval used by plot.table. I managed to do it by editing the functions’ code directly from the R environment:

plot.table.helper.color <- edit(plot.table.helper.color)

function
(
 temp # matrix to plot
){
 # convert temp to numerical matrix
 temp = matrix(as.double(gsub('[%,$]', '', temp)), nrow(temp), ncol(temp))

 highlight = as.vector(temp)
 cols = rep(NA, len(highlight))
 ncols = len(highlight[!is.na(highlight)])
 cols[1:ncols] = rainbow(ncols, start = 0, end = 0.3)

 # rank the values so that each cell gets the colour matching its rank
 o = sort.list(highlight, na.last = TRUE, decreasing = FALSE)
 o1 = sort.list(o, na.last = TRUE, decreasing = FALSE)
 highlight = matrix(cols[o1], nrow = nrow(temp))
 highlight[is.na(temp)] = NA
 return(highlight)
}

Adjust the interval in line 11 of the function (the rainbow() call) to 0.5, 0.6 to get shades of blue.

plot.table.helper.colorbar <- edit(plot.table.helper.colorbar)

function
(
 plot.matrix # matrix to plot
)
{
 nr = nrow(plot.matrix) + 1
 nc = ncol(plot.matrix) + 1

 c = nc
 r1 = 1
 r2 = nr

 rect((2*(c - 1) + .5), -(r1 - .5), (2*c + .5), -(r2 + .5), col='white', border='white')
 rect((2*(c - 1) + .5), -(r1 - .5), (2*(c - 1) + .5), -(r2 + .5), col='black', border='black')

 y1= c( -(r2) : -(r1) )

 graphics::image(x = c( (2*(c - 1) + 1.5) : (2*c + 0.5) ),
 y = y1,
 z = t(matrix( y1 , ncol = 1)),
 col = t(matrix( rainbow(len( y1 ), start = 0.5, end = 0.6) , ncol = 1)),
 add = T)
}

Adjust the interval in line 21 of the function (the rainbow() call, shown here already set to 0.5, 0.6) to get shades of blue.

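Now if you want to draw the cycle-like graph:

# make matrix shorter for the graph (limit to 0-6 months)
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
temp = temp[, 1:7]
mode(temp) = 'numeric' # cbind() above may have stored the values as strings
temp[temp == 0] = NA
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), 'retention', sep=' ')

# One colour per month column; the original palette is not shown, rainbow() is a stand-in
pal = rainbow(ncol(temp))
plot(temp[,1],pch=19,xaxt="n",col=pal[1],type="o",ylim=c(0,as.numeric(max(temp[,-2],na.rm=T))),xlab="Cohort by Month",ylab="Retention",main="Retention by Cohort")

# Add one line per remaining month column (loop body reconstructed)
for(i in 2:length(colnames(temp))) {
 points(temp[,i],pch=19,type="o",col=pal[i])
}

abline(h=(seq(0,1,0.1)), col="lightgray", lty="dotted")

This code produces a nice visualization of the cohort analysis as a multicolour cycle graph: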


Book review: Programming Amazon EC2

Should you buy a book on a new technology or just read the technology provider’s guidelines, instructions and recommendations? This book was released over a year ago, so naturally it doesn’t cover the latest developments on the Amazon AWS platform. For example, Simple Email Service (SES) and DynamoDB are not mentioned at all.

After a short historical introduction to the philosophy of the Amazon platform, the authors proceed to the basics of EC2 instance management (using Ubuntu as an example) and describe how several real-life web applications benefited from migration to Amazon infrastructure. High-level architecture descriptions help to understand how all the pieces of the Amazon platform come together – ELB, EC2, RDS, S3, SQS, SNS, etc. Examples are provided in PHP, Ruby and Java.

Don’t expect any secret knowledge to be revealed in this book. A lot of the intricate details and answers to the questions you will face when planning a migration to AWS or designing the architecture for a new web application are left out. At the same time the book gives a fairly good overview of how to make your application scalable and highly available on AWS, and it can serve as a good starting point in your AWS journey.

Some of the recommendations and descriptions given in the book are outright sloppy. A couple of examples: the book states that “You can choose an availability zone, but that doesn’t matter very much because network latency between zones is comparable to that within a zone”. But tests indicate that latency across availability zones can be 6 times higher than latency within a zone. For a network-intensive application, you had better keep your instances in the same zone.

Another example where you might want to run your own tests or look for another opinion before making a decision: “A disk is always slower than memory. If you run your own MySQL using local disks, that’s slow as well. But using disk-based operations in RDS is just horrible. Minimizing disk usage means implementing proper indexes, something you always want to do.” This is a very strong yet vague statement on RDS performance. I’d really love to see a performance comparison of a MySQL installation on EC2+EBS vs. RDS, and a list of which kinds of MySQL fine-tuning will and will not work on RDS.

Another annoying bit about the book is that the authors keep promoting their commercial Android client for AWS monitoring throughout all chapters. As I see it, if there are ads, the book should be free.

Bottom line: I see only two reasons why you might want to buy and read this book – learning about the history of AWS and learning how five selected web services designed their architecture when migrating to AWS.