Life 3.0: Being Human in the Age of Artificial Intelligence by Max Tegmark

The saddest aspect of life right now is that science gathers knowledge faster than society gathers wisdom. – Isaac Asimov

Just finished reading Life 3.0, and I can say it deserves a spot on the mandatory reading list for anyone working in the AI field.

This is not a popular science book, but a great summary of current thinking about AI technology, its implications, and potential scenarios for the future of our species. Tegmark does an excellent job of clarifying the terms and dispelling common myths surrounding AI.

While there is a fair bit of futurology in the book, it also lists a number of near-term issues that we need to start thinking about. This is by far the most balanced book on the subject that I have read.

The debate on whether regulation of AI is a) needed and b) timely is still ongoing, and Tegmark’s book provides a solid foundation to build on.

Whether this discussion is timely is a great question. One thing is certain: it’s better to have it early than late. When Henri Becquerel and Marie and Pierre Curie discovered radioactivity at the end of the 19th century, would it have been premature to discuss a Nuclear Non-Proliferation Treaty? Probably yes, as it took another 50 years to develop the first nuclear bomb.

However, with AI we might not have the luxury of time – once the first artificial general intelligence (AGI) is created, it might be too late. And even before that happens, we as a society will need to make many important decisions that direct the research, regulate the technology, adjust legal frameworks, and prepare the economy for the inevitable mass redundancy of human labor. For that we need governments to understand AI and its implications.

It is very important to start this conversation now.


One of the strangest books I’ve read lately: “Russia Rising” by Seth Chanowitz

“Russia Rising” by Seth Chanowitz is probably one of the strangest books I’ve read in the past few years. I don’t even remember how I came across it – must’ve been the magic of Amazon’s recommendation algorithms. It caught my eye because the description said it “takes place in Finland, Estonia, Russia, Belarus, and the United States and is written by a former intelligence professional.”

The book is filled with mistakes, both in spelling (Koviposti instead of Kaivopuisto, Alvar Alto instead of Alvar Aalto, Esplandi instead of Esplanadi, Mannerheim Katu instead of Mannerheimintie, etc.) and in facts (the ‘coastal town of Lohja’, while in fact Lohja is about 30 km from the coast; the Swedish imperial design of government buildings in Helsinki, while in fact Helsinki became the capital only in 1812 and its government buildings were commissioned under Russian rule; mix-ups of Helsinki’s main cathedrals, the red Orthodox one and the white Lutheran one; etc.).

The author claims to be a US intelligence analyst who lived in Finland for a long time and speaks Finnish, which makes the mistakes even more embarrassing.

The writing itself is extraordinarily bad, from the flat characters to the very dry and mechanical description of events. At times I even wondered whether the book had been synthesized by a well-trained generative neural network.

Unfortunately, the very poor execution completely ruins an otherwise entertaining plot.


Monty Hall Problem – How Randomness Rules Our World and Why We Cannot See It

Ever since I read about the Monty Hall problem in “The Drunkard’s Walk: How Randomness Rules Our Lives” by Leonard Mlodinow of the California Institute of Technology, I have wanted to run a simulation to see that the math is correct.

It is one of those problems where the first answer that comes to mind is usually wrong and the correct answer is counterintuitive, but it comes through clearly after doing the math or running the code.

The problem was originally posed by Steve Selvin in a letter to the American Statistician in 1975; the best-known statement, as quoted by Wikipedia from a reader’s letter to Marilyn vos Savant’s Parade column in 1990, reads:

“Suppose you’re on a game show, and you’re given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what’s behind the doors, opens another door, say No. 3, which has a goat. He then says to you, “Do you want to pick door No. 2?” Is it to your advantage to switch your choice?”

I won’t explain the mathematical solution here – there are plenty of explanations on the internet, using either an intuitive solution/tree diagram or conditional probability/Bayes’ theorem.

However, here’s the code to run the simulation. The output is as expected – switching the door does indeed increase the winning probability.
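Below is a minimal sketch of such a simulation in R (assuming 10,000 games; the hold strategy keeps the first pick, the switch strategy takes the remaining closed door):

set.seed(42)                        # for reproducibility
n_games <- 10000
doors   <- 1:3
hold_wins   <- 0
switch_wins <- 0

for (i in 1:n_games) {
  car    <- sample(doors, 1)        # door hiding the car
  choice <- sample(doors, 1)        # player's initial pick

  # the host opens a goat door that is neither the car nor the player's pick
  goat_doors <- setdiff(doors, c(car, choice))
  opened     <- goat_doors[sample.int(length(goat_doors), 1)]

  # the switch strategy takes the one remaining closed door
  switched <- setdiff(doors, c(choice, opened))

  if (choice == car)   hold_wins   <- hold_wins + 1
  if (switched == car) switch_wins <- switch_wins + 1
}

cat(sprintf("Hold strategy wins: %d, win probability: %.2f\n", hold_wins, hold_wins / n_games))
cat(sprintf("Switch strategy wins: %d, win probability: %.2f\n", switch_wins, switch_wins / n_games))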

Hold strategy wins: 3287, win probability: 0.33
Switch strategy wins: 6713, win probability: 0.67

Are People in Colder Countries Taller? in Julia

 

Are people in colder countries taller?

Continuing to play with Julia and data visualizations. This time I decided to replicate a scatterplot created by Matt Stiles examining the relationship between a country’s average temperature and its male residents’ average height.

Data comes from the World Bank and NCD-RisC. The size of the bubbles is linearly proportional to country population. Color indicates the new World Bank income categories.

People do indeed seem to be taller in colder countries. It would be interesting to explore the relationship between wealth and height more deeply, especially by adding a time dimension – are people growing taller as their countries become wealthier?

 

An interactive plot is available on Plot.ly, so you can play with the data yourself.

 


Life Expectancy by Country

I was inspired by Andrew Collier’s blog post Life Expectancy by Country, where he illustrated how to create a bubble chart comparing female and male life expectancies for a number of countries, based on data scraped from Wikipedia using R and Plot.ly charts.

I decided to replicate these results using another popular language for technical computing – Julia.

Scraping Wikipedia in Julia proved to be less elegant, as Julia lacks a convenient package for ingesting tabular data from web pages into data frames, but otherwise it was a relatively simple task.

The dotted line in the chart corresponds to equal female and male life expectancy. The size of the bubbles is linearly proportional to country population.

An interactive plot is available on Plot.ly.

 


“The Nordic Theory of Everything: In Search of a Better Life” by Anu Partanen

Anu Partanen is a Finnish journalist now living and working in the United States.

In her new book “The Nordic Theory of Everything: In Search of a Better Life” she compares how Nordic/Finnish and American societies address key issues such as healthcare, education, parental leave, and unemployment.

This book hits close to home. I’m a naturalized Finnish citizen, and I spent most of my adult life in Finland and Norway before relocating with my family to California.

tl;dr: The book is a nuanced, well-argued critical analysis of the strengths and weaknesses of American and Nordic societies. It debunks the myth that affordable universal healthcare, free quality education, and greater participation of women in economic life (enabled by affordable daycare and paid parental leave) can only be achieved in “nanny states” that discourage individuality.

As Anu writes:

The Nordic countries demonstrate that building strong public services can create economic growth, and that pooling the risks everyone faces in life – sickness, unemployment, old age, the need to be educated to secure a decent living – into one system funded by everyone is more efficient, and more effective, than each person saving individually to ensure security and survive misfortune, especially in today’s age of global economic uncertainty and competition.



“High Performance MySQL” by Baron Schwartz, Peter Zaitsev, and Vadim Tkachenko; O’Reilly Media

Peter and Vadim are long-time contributors to the MySQL open source code base and the founders of Percona, a MySQL consultancy. This is really THE book about MySQL, written by engineers who have spent more than a decade working on the MySQL code.

The book is equally valuable to devops engineers and DB admins as well as software developers. It assumes you have a good working knowledge of MySQL, as it dives straight into the intricate details of the system architecture and performance fine-tuning.

I have first-hand experience using the second edition of this book, and I can say that it paid for itself many times over when I was debugging critical performance issues in HeiaHeia.com operations. This third edition (published in 2012) also covers the new features in MySQL 5.5.

In this new edition the authors make a real effort to explain not only how MySQL works, but also why certain features work the way they do.

The case studies peppered throughout the book are particularly interesting, as they give insight into how to deal with MySQL in large-scale setups with high load and high-availability requirements.

The book covers a lot of very complex topics, from replication, clustering, and sharding to high availability. The practical advice on how to profile your server and find and fix bottlenecks, both in the server setup and on the application side in SQL queries, is invaluable.

I very much liked that the book is organized in such a way that it can easily be used as a reference.

This is a must-read for anyone who has a significant load on their MySQL DB server or is considering moving from a single-server setup to a cluster.

Get your copy of the book at O’Reilly.


Measuring user retention using cohort analysis with R

Cohort analysis is super important if you want to know whether your service is in fact a leaky bucket despite nice growth in absolute numbers. There’s a good write-up on the subject, “Cohorts, Retention, Churn, ARPU” by Matt Johnson.

So how do you do it in R, and how do you visualize it? Inspired by the examples described in “Retention, Cohorts, and Visualizations”, I came up with the following solution.

First, get the data in a suitable format, like this:

cohort  signed_up  active_m0  active_m1  active_m2
2011-10 12345      10432      8765       6754
2011-11 12345      10432      8765       6754
2011-12 12345      10432      8765       6754

cohort here is in “YYYY-MM” format, signed_up is the number of users who created accounts in the given month, active_m0 is the number of users who were active in the same month they registered, active_m1 the number active in the following month, and so forth. For the newest cohorts you’ll get zeroes in some of the active_mN columns, since there’s no data on them yet. This is taken into account in the processing scripts.

require(plyr)

# Load SystematicInvestor's plot.table (https://github.com/systematicinvestor/SIT)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
source(con)
close(con)

# Read the data (tab-separated file in the format shown above; adjust the path to your export)
cohorts <- read.table("cohorts.txt", header = TRUE, sep = "\t")

# Let's convert absolute values to percentages (% of the registered users remaining active)
cohort_p <- ddply(cohorts, .(cohort), function(df) c(as.character(df$cohort),
    as.numeric(df$active_m0/df$signed_up), as.numeric(df$active_m1/df$signed_up), as.numeric(df$active_m2/df$signed_up),
    as.numeric(df$active_m3/df$signed_up), as.numeric(df$active_m4/df$signed_up), as.numeric(df$active_m5/df$signed_up),
    as.numeric(df$active_m6/df$signed_up), as.numeric(df$active_m7/df$signed_up), as.numeric(df$active_m8/df$signed_up)))

# Create a matrix
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), sep=' ')
rownames(temp) = as.vector(cohort_p$V1)

# Drop 0 values and format data
temp[] = plota.format(100 * as.numeric(temp), 0, '', '%')
temp[temp == " 0%"] <- NA

# Plot cohort analysis table
plot.table(temp, smain='Cohort(users)', highlight = TRUE, colorbar = TRUE)

This code produces nice visualizations of the cohort analysis as a table:

I used articles “Visualizing Tables with plot.table” and “Response to Flowingdata Challenge: Graphing obesity trends” as an inspiration for this R code.

If you want to get nice colours as in the example above, you’ll need to adjust the rainbow interval used by plot.table. I managed to do it by editing the helper functions’ code directly from the R environment:

plot.table.helper.color <- edit(plot.table.helper.color)
function
(
 temp # matrix to plot
){
 # convert temp to numerical matrix
 temp = matrix(as.double(gsub('[%,$]', '', temp)), nrow(temp), ncol(temp))

highlight = as.vector(temp)
 cols = rep(NA, len(highlight))
 ncols = len(highlight[!is.na(highlight)])
 cols[1:ncols] = rainbow(ncols, start = 0, end = 0.3)

o = sort.list(highlight, na.last = TRUE, decreasing = FALSE)
 o1 = sort.list(o, na.last = TRUE, decreasing = FALSE)
 highlight = matrix(cols[o1], nrow = nrow(temp))
 highlight[is.na(temp)] = NA
 return(highlight)
}

Adjust the interval in line 11 of the function (the rainbow(ncols, start = 0, end = 0.3) call) to start = 0.5, end = 0.6 to get shades of blue. The colorbar helper can be edited the same way:
plot.table.helper.colorbar <- edit(plot.table.helper.colorbar)

function
(
 plot.matrix # matrix to plot
)
{
 nr = nrow(plot.matrix) + 1
 nc = ncol(plot.matrix) + 1

c = nc
 r1 = 1
 r2 = nr

rect((2*(c - 1) + .5), -(r1 - .5), (2*c + .5), -(r2 + .5), col='white', border='white')
 rect((2*(c - 1) + .5), -(r1 - .5), (2*(c - 1) + .5), -(r2 + .5), col='black', border='black')

y1= c( -(r2) : -(r1) )

graphics::image(x = c( (2*(c - 1) + 1.5) : (2*c + 0.5) ),
 y = y1,
 z = t(matrix( y1 , ncol = 1)),
 col = t(matrix( rainbow(len( y1 ), start = 0.5, end = 0.6) , ncol = 1)),
 add = T)
}

Adjust the interval in line 21 (the rainbow call) to start = 0.5, end = 0.6 to get shades of blue; the listing above already shows the adjusted values.

Now if you want to draw the cycle-like graph:

# make matrix shorter for the graph (limit to 0-6 months)
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
temp <- temp[,1:7]        # keep months 0-6 only
temp[temp == "0"] <- NA   # drop zero values (months with no data yet)
class(temp) <- "numeric"  # make sure the retention shares are numeric for plotting
library(RColorBrewer)
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), 'retention', sep=' ')
pal <- brewer.pal(length(colnames(temp)), "Set1")  # colour palette (Set1 assumed; any palette works)
plot(temp[,1], pch=19, xaxt="n", col=pal[1], type="o",
     ylim=c(0, as.numeric(max(temp[,-2], na.rm=T))),
     xlab="Cohort by Month", ylab="Retention", main="Retention by Cohort")

for(i in 2:length(colnames(temp))) {
 points(temp[,i],pch=19,xaxt="n",col=pal[i])
 lines(temp[,i],pch=19,xaxt="n",col=pal[i])
}

axis(1,at=1:length(cohort_p$cohort),labels=as.vector(cohort_p$cohort),cex.axis=0.75)
legend("bottomleft",legend=colnames(temp),col=pal,lty=1,pch=19,bty="n")
abline(h=(seq(0,1,0.1)), col="lightgray", lty="dotted")

This code produces a nice visualization of the cohort analysis as a multicolour cycle graph:


Book review: Programming Amazon EC2

Should you buy a book on a new technology, or just read the technology provider’s guidelines, instructions, and recommendations? This book was released over a year ago, so naturally it doesn’t cover all the latest developments on the Amazon AWS platform. For example, Simple Email Service (SES) and DynamoDB are not mentioned at all.

After a short historical introduction to the philosophy of the Amazon platform, the authors proceed to the basics of EC2 instance management (using Ubuntu as an example) and describe how several real-life web applications benefited from migration to Amazon infrastructure. High-level architecture descriptions help to understand how all the pieces of the Amazon platform come together – ELB, EC2, RDS, S3, SQS, SNS, etc. Examples are provided in PHP, Ruby, and Java.

Don’t expect any secret knowledge to be revealed in this book. Many of the intricate details and answers to the questions you will face when planning a migration to AWS or designing the architecture for a new web application are left out. At the same time, the book gives a fairly good overview of how to make your application scalable and highly available on AWS and can serve as a good starting point in your AWS journey.

Some of the recommendations and descriptions given in the book are outright sloppy. A couple of examples: the book claims that “You can choose an availability zone, but that doesn’t matter very much because network latency between zones is comparable to that within a zone”. But tests indicate that cross-availability-zone latency can be six times higher than intra-zone latency. For a network-intensive application, you’d better keep your instances together in the same zone.

Another example where you might want to run your own tests or look for a second opinion before making a decision: “A disk is always slower than memory. If you run your own MySQL using local disks, that’s slow as well. But using disk-based operations in RDS is just horrible. Minimizing disk usage means implementing proper indexes, something you always want to do.” This is a very strong yet vague statement on RDS performance; I’d really love to see a performance comparison of a MySQL installation on EC2+EBS vs. RDS and a list of which kinds of MySQL fine-tuning will and will not work on RDS.

Another annoying bit about the book is that the authors keep promoting their commercial Android client for AWS monitoring throughout the chapters. As I see it, if there are ads, the book should be free.

Bottom line: I see only two reasons why you might want to buy and read this book – learning about the history of AWS and learning how five selected web services designed their architecture when migrating to AWS.


Heat map visualization of sick day trends in Finland with R, ggplot2 and Google Correlate

Inspired by Margintale’s post “ggplot2 Time Series Heatmaps” and Google Flu Trends, I decided to use a heat map to visualize sick days logged by HeiaHeia.com’s Finnish users.

I got the data from our database, filtered by country (Finnish users only), in tab-separated form with the first line as the header. The three columns contain the date, the count of sick days logged on that date, and the count of Finnish users in the service on that date.

date count(*) user_cnt
2011-01-01 123 12345
2011-01-02 456 67890
...

Below is the R source code for plotting the heat map. I made some small changes to the original code:

  • data normalization (line 9): this is specific to the data used in this example
  • days of the week have to be 1..7, not 0..6 as returned by $wday (line 19): dat$weekday = as.numeric(format(as.POSIXlt(dat$date),"%u"))
  • date format (line 31): the week-of-year calculation required date conversion to POSIX: dat$week <- as.numeric(format(as.POSIXlt(dat$date),"%W"))
  • custom header for the legend (line 39): adding + labs(fill="per user per day") lets you customize the legend header
require(zoo)
require(ggplot2)
require(plyr)

dat<-read.csv("~/data/sick_days_per_day.txt",header=TRUE,sep="\t")
colnames(dat) <- c("date", "count", "user_cnt")

# normalize data by number of users on each date
dat$norm_count <- dat$count / dat$user_cnt

# facet by year ~ month, and each subgraph will show week-of-month versus weekday; the year is simple
dat$year<-as.numeric(as.POSIXlt(dat$date)$year+1900)
dat$month<-as.numeric(as.POSIXlt(dat$date)$mon+1)

# turn months into ordered factors to control the appearance/ordering in the presentation
dat$monthf<-factor(dat$month,levels=as.character(1:12),labels=c("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"),ordered=TRUE)

# the day of week is again easily found
dat$weekday = as.numeric(format(as.POSIXlt(dat$date),"%u"))

# again turn into factors to control appearance/abbreviation and ordering
# I use the reverse function rev here to order the week top down in the graph
# you can cut it out to reverse week order
dat$weekdayf<-factor(dat$weekday,levels=rev(1:7),labels=rev(c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")),ordered=TRUE)

# the monthweek part is a bit trickier - first a factor which cuts the data into month chunks
dat$yearmonth<-as.yearmon(dat$date)
dat$yearmonthf<-factor(dat$yearmonth)

# then find the "week of year" for each day
dat$week <- as.numeric(format(as.POSIXlt(dat$date),"%W"))

# and now for each monthblock we normalize the week to start at 1
dat<-ddply(dat,.(yearmonthf),transform,monthweek=1+week-min(week))

# Now for the plot
P<- ggplot(dat, aes(monthweek, weekdayf, fill = dat$norm_count)) +
 geom_tile(colour = "white") + facet_grid(year~monthf) + scale_fill_gradient(low="green", high="red") +
 opts(title = "Time-Series Calendar Heatmap - HeiaHeia.com sick days logged") + xlab("Week of Month") + ylab("") + labs(fill="per user per day")
P

Here are the results. Green indicates the healthiest days, with the lowest number of sick days logged per user; red indicates the worst days, with the highest. It’s quite clear that there are seasonal peaks around February, and that 2011 was a lot worse than 2012 (one should note that January–February 2011 were exceptionally cold in Finland). This matches quite well with the coverage in the national press: Flu season reaching peak (Feb 2012), Employers grapple with sick leaves brought by flu wave (Feb 2012).

It’s interesting that there are fewer sick days logged on weekends than on work days, and that the traditional holiday month of July is the healthiest month of all.



To get a more formal validation of the data logged by HeiaHeia users, I used the Google Correlate lab tool to check that the heat map results make sense. I uploaded the weekly sick-days-per-user time series and plotted its correlation with Google search queries for “kuumeen hoito” (“treatment of fever” in Finnish).



The Pearson correlation coefficient r between the HeiaHeia sick days time series and the Google search activity (both normalized to mean 0 and standard deviation 1) is 0.8257 – a pretty good match.
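For reference, computing that correlation in R is a one-liner once both weekly series are loaded (the file and column names below are hypothetical placeholders for the HeiaHeia export and the Google Correlate download):

# both series exported as weekly CSVs; file and column names are placeholders
sick     <- read.csv("sick_days_weekly.csv")          # columns: week, sick_per_user
searches <- read.csv("google_correlate_export.csv")   # columns: week, activity

# z-score normalization: mean 0, standard deviation 1
z <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Pearson correlation between the two normalized weekly series
cor(z(sick$sick_per_user), z(searches$activity), method = "pearson", use = "complete.obs")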
