<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Ivan Kuznetsov</title>
	<atom:link href="http://www.ivankuznetsov.com/feed" rel="self" type="application/rss+xml" />
	<link>http://www.ivankuznetsov.com</link>
	<description>Entrepreneur, Ruby on Rails and Ubuntu fanatic, Data junkie</description>
	<lastBuildDate>Fri, 04 Jan 2013 13:40:28 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Measuring user retention using cohort analysis with R</title>
		<link>http://www.ivankuznetsov.com/2012/04/measuring-user-retention-using-cohort-analysis-with-r.html</link>
		<comments>http://www.ivankuznetsov.com/2012/04/measuring-user-retention-using-cohort-analysis-with-r.html#comments</comments>
		<pubDate>Fri, 27 Apr 2012 13:14:49 +0000</pubDate>
		<dc:creator>Ivan Kuznetsov</dc:creator>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[Startups]]></category>
		<category><![CDATA[Web/Tech]]></category>

		<guid isPermaLink="false">http://www.ivankuznetsov.com/?p=471</guid>
		<description><![CDATA[Cohort analysis is super important if you want to know if your service is in fact a leaky bucket despite nice growth of absolute numbers. There&#8217;s a good write up on that subject &#8220;Cohorts, Retention, Churn, ARPU&#8221; by Matt Johnson. So how to do it using R and how to visualize it. Inspired by examples [...]]]></description>
				<content:encoded><![CDATA[<p>Cohort analysis is super important if you want to know if your service is in fact a <a href="http://andrewchenblog.com/2007/12/20/is-your-website-a-leaky-bucket-4-scenarios-for-user-retention/" target="_blank">leaky bucket</a> despite nice growth of absolute numbers. There&#8217;s a good write up on that subject &#8220;<a href="http://www.socrata.com/uncategorized/cohorts-retention-churn-arpu/" target="_blank">Cohorts, Retention, Churn, ARPU</a>&#8221; by Matt Johnson.</p>
<p>So how do you do it in R, and how do you visualize the result? Inspired by the examples described in &#8220;<a href="http://blog.intercom.io/retention-cohorts-and-visualisations/" target="_blank">Retention, Cohorts, and Visualizations</a>&#8221;, I came up with the following solution.</p>
<p>First, get the data in a suitable format, like this:</p>
<pre>cohort  signed_up  active_m0  active_m1  active_m2
2011-10 12345      10432      8765       6754
2011-11 12345      10432      8765       6754
2011-12 12345      10432      8765       6754</pre>
<p>Cohort here is in &#8220;YYYY-MM&#8221; format, signed_up is the number of users who have created accounts in the given month, active_m0 &#8211; number of users who have been active in the same month as they registered, active_m1 &#8211; number of users who have been active in the following month, and so forth. For newest cohorts you&#8217;ll be getting zeroes in some of active_mN columns, since there&#8217;s no data on them yet. This is taken into account in processing scripts.</p>
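<p>Outside R, the same percentage conversion can be sanity-checked from a shell. This is only a sketch under a few assumptions: the table above is saved as a whitespace-separated text file, awk is available, and retention_table is a hypothetical helper name. Zeroes are printed as NA, which mirrors how the newest cohorts are treated.</p>

```shell
#!/bin/sh
# Sketch: turn absolute activity counts into "% of signed-up users".
# retention_table is a hypothetical helper; $1 is the path to a
# whitespace-separated file laid out like the table above.
retention_table() {
    awk 'NR == 1 { next }                  # skip the header row
         {
             printf "%s", $1               # cohort label, e.g. 2011-10
             for (i = 3; i <= NF; i++) {   # active_m0, active_m1, ...
                 if ($i == 0) {
                     printf " NA"          # no data yet for this month
                 } else {
                     printf " %.1f%%", 100 * $i / $2
                 }
             }
             printf "\n"
         }' "$1"
}

# Demo with one row from the sample table plus a cohort with a missing month.
sample=$(mktemp)
printf '%s\n' \
    'cohort signed_up active_m0 active_m1 active_m2' \
    '2011-10 12345 10432 8765 6754' \
    '2011-12 12345 10432 8765 0' > "$sample"
retention_table "$sample"
rm -f "$sample"
```

<p>For the sample row this gives 84.5%, 71.0% and 54.7%, which is a quick way to confirm that the exported numbers look plausible before plotting them.</p>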
<pre class="brush: r; title: ; notranslate">
require(plyr)

# Load SystematicInvestor's plot.table (https://github.com/systematicinvestor/SIT)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
source(con)
close(con)

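# Read the data (the file name is an assumption; use the path to your own export)
cohorts &lt;- read.table('cohorts.txt', header = TRUE)
# Let's convert absolute values to percentages (% of the registered users remaining active)
# Assumption: one row per cohort, with a 'cohort (users)' label as the first element (V1)
cohort_p &lt;- ddply(cohorts, .(cohort), function(df) c(paste(df$cohort, ' (', df$signed_up, ')', sep=''),
as.numeric(df$active_m0/df$signed_up), as.numeric(df$active_m1/df$signed_up), as.numeric(df$active_m2/df$signed_up),
as.numeric(df$active_m3/df$signed_up), as.numeric(df$active_m4/df$signed_up), as.numeric(df$active_m5/df$signed_up),
as.numeric(df$active_m6/df$signed_up), as.numeric(df$active_m7/df$signed_up), as.numeric(df$active_m8/df$signed_up) ))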

# Create a matrix
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), sep=' ')
rownames(temp) = as.vector(cohort_p$V1)

# Drop 0 values and format data
temp[] = plota.format(100 * as.numeric(temp), 0, '', '%')
temp[temp == &quot; 0%&quot;] &lt;- NA

# Plot cohort analysis table
plot.table(temp, smain='Cohort(users)', highlight = TRUE, colorbar = TRUE)
</pre>
<p>This code produces nice visualizations of the cohort analysis as a table:</p>
<p><a href="http://www.ivankuznetsov.com/wp-content/uploads/cohort-analysis.png"><img class="wp-image-476 aligncenter" title="cohort analysis" src="http://www.ivankuznetsov.com/wp-content/uploads/cohort-analysis.png" alt="" width="720" /></a></p>
<p>I used the articles &#8220;<a href="http://systematicinvestor.wordpress.com/2011/10/07/visualizing-tables-with-plot-table/" target="_blank">Visualizing Tables with plot.table</a>&#8221; and &#8220;<a href="http://www.prettygraph.com/blog/response-to-flowingdata-challenge-graphing-obesity-trends/" target="_blank">Response to Flowingdata Challenge: Graphing obesity trends</a>&#8221; as inspiration for this R code.</p>
<p>If you want to get nice colours as in the example above, you&#8217;ll need to adjust the rainbow interval for plot.table. I managed to do it by editing the functions&#8217; code directly from the R environment:</p>
<pre>plot.table.helper.color &lt;- edit(plot.table.helper.color)</pre>
<pre class="brush: r; title: ; notranslate">
function
(
 temp # matrix to plot
){
 # convert temp to numerical matrix
 temp = matrix(as.double(gsub('[%,$]', '', temp)), nrow(temp), ncol(temp))

highlight = as.vector(temp)
 cols = rep(NA, len(highlight))
 ncols = len(highlight[!is.na(highlight)])
 cols[1:ncols] = rainbow(ncols, start = 0, end = 0.3)

o = sort.list(highlight, na.last = TRUE, decreasing = FALSE)
 o1 = sort.list(o, na.last = TRUE, decreasing = FALSE)
 highlight = matrix(cols[o1], nrow = nrow(temp))
 highlight[is.na(temp)] = NA
 return(highlight)
}
</pre>
<p>Adjust the interval in line 11 to 0.5, 0.6 to get shades of blue.</p>
<pre>plot.table.helper.colorbar &lt;- edit(plot.table.helper.colorbar)</pre>
<pre class="brush: r; title: ; notranslate">
function
(
 plot.matrix # matrix to plot
)
{
 nr = nrow(plot.matrix) + 1
 nc = ncol(plot.matrix) + 1

c = nc
 r1 = 1
 r2 = nr

rect((2*(c - 1) + .5), -(r1 - .5), (2*c + .5), -(r2 + .5), col='white', border='white')
 rect((2*(c - 1) + .5), -(r1 - .5), (2*(c - 1) + .5), -(r2 + .5), col='black', border='black')

y1= c( -(r2) : -(r1) )

graphics::image(x = c( (2*(c - 1) + 1.5) : (2*c + 0.5) ),
 y = y1,
 z = t(matrix( y1 , ncol = 1)),
 col = t(matrix( rainbow(len( y1 ), start = 0.5, end = 0.6) , ncol = 1)),
 add = T)
}
</pre>
<p>Adjust interval in line 21 to 0.5, 0.6 to get shades of blue.</p>
<p>Now if you want to draw the cycle-like graph:</p>
<pre class="brush: r; title: ; notranslate">
# make matrix shorter for the graph (limit to 0-6 months)
temp = as.matrix(cohort_p[,3:(length(cohort_p[1,])-1)])
temp &lt;- temp[,1:7]
temp[temp == &quot;0&quot;] &lt;- NA
library(RColorBrewer)
colnames(temp) = paste('Month', 0:(length(temp[1,])-1), 'retention', sep=' ')
# One colour per month column (the palette name is an assumption)
pal &lt;- brewer.pal(ncol(temp), &quot;Set1&quot;)
plot(temp[,1],pch=19,xaxt=&quot;n&quot;,col=pal[1],type=&quot;o&quot;,ylim=c(0,as.numeric(max(temp[,-2],na.rm=T))),xlab=&quot;Cohort by Month&quot;,ylab=&quot;Retention&quot;,main=&quot;Retention by Cohort&quot;)

for(i in 2:length(colnames(temp))) {
 points(temp[,i],pch=19,xaxt=&quot;n&quot;,col=pal[i])
 lines(temp[,i],pch=19,xaxt=&quot;n&quot;,col=pal[i])
}

axis(1,at=1:length(cohort_p$cohort),labels=as.vector(cohort_p$cohort),cex.axis=0.75)
legend(&quot;bottomleft&quot;,legend=colnames(temp),col=pal,lty=1,pch=19,bty=&quot;n&quot;)
abline(h=(seq(0,1,0.1)), col=&quot;lightgray&quot;, lty=&quot;dotted&quot;)
</pre>
<p>This code produces a nice visualization of the cohort analysis as a multicolour cycle graph:<br />
<a href="http://www.ivankuznetsov.com/wp-content/uploads/cycle-graph.png"><img class="wp-image-483 aligncenter" title="cycle graph" src="http://www.ivankuznetsov.com/wp-content/uploads/cycle-graph.png" alt="" width="720" /></a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivankuznetsov.com/2012/04/measuring-user-retention-using-cohort-analysis-with-r.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Book review: Programming Amazon EC2</title>
		<link>http://www.ivankuznetsov.com/2012/04/book-review-programming-amazon-ec2.html</link>
		<comments>http://www.ivankuznetsov.com/2012/04/book-review-programming-amazon-ec2.html#comments</comments>
		<pubDate>Tue, 24 Apr 2012 20:08:30 +0000</pubDate>
		<dc:creator>Ivan Kuznetsov</dc:creator>
				<category><![CDATA[Books]]></category>
		<category><![CDATA[Web/Tech]]></category>

		<guid isPermaLink="false">http://www.ivankuznetsov.com/?p=464</guid>
		<description><![CDATA[Should you buy a book on a new technology or just read technology provider&#8217;s guidelines, instructions and recommendations? This book was released over a year ago, so naturally it doesn&#8217;t cover all the latest developments that happened on Amazon AWS platform. For example Simple Email Service (SES) and Dynamo DB are not mentioned at all. After [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.ivankuznetsov.com/wp-content/uploads/amazonec2.jpg"><img class="size-medium wp-image-465 alignleft" title="Programming Amazon EC2" src="http://www.ivankuznetsov.com/wp-content/uploads/amazonec2-228x300.jpg" alt="" width="228" height="300" /></a>Should you buy a book on a new technology or just read technology provider&#8217;s guidelines, instructions and recommendations? This book was released over a year ago, so naturally it doesn&#8217;t cover all the latest developments that happened on Amazon AWS platform. For example Simple Email Service (SES) and Dynamo DB are not mentioned at all.</p>
<p>After a short historical introduction to the philosophy of the Amazon platform, the authors proceed to the basics of EC2 instance management (using Ubuntu as an example) and describe how several real-life web applications benefited from migrating to Amazon infrastructure. High-level architecture descriptions help you understand how all the pieces of the Amazon platform come together &#8211; ELB, EC2, RDS, S3, SQS, SNS, etc. Examples are provided in PHP, Ruby and Java.</p>
<p>Don&#8217;t expect any secret knowledge to be revealed in this book. Many of the intricate details and answers to the questions that you will face when planning a migration to AWS or designing the architecture for a new web application are left out. At the same time, the book gives a fairly good overview of how to make your application scalable and highly available on AWS, and it can serve as a good starting point in your AWS journey.</p>
<p>Some of the recommendations and descriptions given in the book are outright sloppy. A couple of examples: the book recommends &#8220;<em>You can choose an availability zone, but that doesn’t matter very much because network latency between zones is comparable to that within a zone</em>&#8221;. But tests indicate that <em><a href="http://orensol.com/2009/05/24/network-latency-inside-and-across-amazon-ec2-availability-zones/" target="_blank">Cross Availability Zones latency</a> can be 6 times higher than inner zone latency. For a network intensive application, better keep your instances crowded in the same zone.</em></p>
<p>Another example where you might want to run your own tests or look for a second opinion before making a decision: &#8220;<em>A disk is always slower than memory. If you run your own MySQL using local disks, that’s slow as well. But using disk-based operations in RDS is just horrible. Minimizing disk usage means implementing proper indexes, something you always want to do.</em>&#8221; This is a very strong yet vague statement on RDS performance. I&#8217;d really love to see a performance comparison of a MySQL installation on EC2+EBS vs. RDS, and a list of which kinds of MySQL fine-tuning will and will not work on RDS.</p>
<p>Another annoying bit about the book is that the authors keep promoting their commercial Android client for AWS monitoring throughout the chapters. As I see it, if there are ads, the book should be free.</p>
<p>Bottom line: I see only two reasons why you might want to buy and read this book &#8211; learning about the history of AWS and learning how five selected web services designed their architecture when migrating to AWS.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivankuznetsov.com/2012/04/book-review-programming-amazon-ec2.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Heat map visualization of sick day trends in Finland with R, ggplot2 and Google Correlate</title>
		<link>http://www.ivankuznetsov.com/2012/04/heatmap-visualization-of-sick-day-trends-in-finland-with-r-ggplot2-and-google-correlate.html</link>
		<comments>http://www.ivankuznetsov.com/2012/04/heatmap-visualization-of-sick-day-trends-in-finland-with-r-ggplot2-and-google-correlate.html#comments</comments>
		<pubDate>Tue, 24 Apr 2012 15:20:02 +0000</pubDate>
		<dc:creator>Ivan Kuznetsov</dc:creator>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[HeiaHeia]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.ivankuznetsov.com/?p=432</guid>
		<description><![CDATA[Inspired by Margintale&#8217;s post &#8220;ggplot2 Time Series Heatmaps&#8221; and Google Flu Trends I decided to use a heat map to visualize sick days logged by HeiaHeia.com Finnish users. I got the data from our database, filtering results by country (Finnish users only) in a tab separated form with the first line as the header. Three columns [...]]]></description>
				<content:encoded><![CDATA[<p>Inspired by Margintale&#8217;s post &#8220;<a href="http://margintale.blogspot.com/2012/04/ggplot2-time-series-heatmaps.html" target="_blank">ggplot2 Time Series Heatmaps</a>&#8221; and <a href="http://www.google.org/flutrends/" target="_blank">Google Flu Trends</a> I decided to use a heat map to visualize sick days logged by <a href="http://info.heiaheia.com" target="_blank">HeiaHeia.com</a> Finnish users.</p>
<p>I got the data from our database, filtering results by country (Finnish users only) in a tab separated form with the first line as the header. Three columns contained date, count of sick days logged on that date and count of Finnish users in the service on that date.</p>
<pre>date count(*) user_cnt
2011-01-01 123 12345
2011-01-02 456 67890
...</pre>
<p>Below is R source code for plotting the heat map. I made some small changes to the original code:</p>
<ul>
<li>data normalization (line 9): this is specific to the data used in this example</li>
<li>days of the week have to be 1..7, not 0..6 as returned by $wday (line 19): <em>dat$weekday = as.numeric(format(as.POSIXlt(dat$date),&quot;%u&quot;))</em></li>
<li>date format (line 31): the week-of-year calculation required converting the date to POSIX: <em>dat$week &lt;- as.numeric(format(as.POSIXlt(dat$date),&quot;%W&quot;))</em></li>
<li>custom header for the legend (line 39): adding <em>+ labs(fill=&quot;per user per day&quot;)</em> allows you to customize the legend header</li>
</ul>
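<p>The %u and %W codes mentioned above are standard strftime codes, so they can also be tried outside R. A small sketch, assuming GNU date (its -d option is not available in every date implementation); show_codes is a hypothetical helper name:</p>

```shell
#!/bin/sh
# Sketch, assuming GNU date: print weekday and week-of-year codes for a day.
#   %w = weekday 0..6 (Sunday is 0), the same numbering as $wday in R
#   %u = weekday 1..7 (Monday is 1)
#   %W = week of the year, with Monday as the first day of the week
show_codes() {
    date -d "$1" '+%Y-%m-%d w=%w u=%u W=%W'
}

show_codes 2012-04-22   # a Sunday
show_codes 2012-04-23   # the Monday after it
```

<p>The Sunday comes out as 0 with %w but as 7 with %u, which is why the 1..7 numbering keeps Monday to Sunday in order on the heat map.</p>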
<pre class="brush: r; title: ; notranslate">
require(zoo)
require(ggplot2)
require(plyr)

dat&lt;-read.csv(&quot;~/data/sick_days_per_day.txt&quot;,header=TRUE,sep=&quot;\t&quot;)
colnames(dat) &lt;- c(&quot;date&quot;, &quot;count&quot;, &quot;user_cnt&quot;)

# normalize data by number of users on each date
dat$norm_count &lt;- dat$count / dat$user_cnt

# facet by year ~ month, and each subgraph will show week-of-month versus weekday; the year is simple to extract
dat$year&lt;-as.numeric(as.POSIXlt(dat$date)$year+1900)
dat$month&lt;-as.numeric(as.POSIXlt(dat$date)$mon+1)

# turn months into ordered factors to control the appearance/ordering in the presentation
dat$monthf&lt;-factor(dat$month,levels=as.character(1:12),labels=c(&quot;Jan&quot;,&quot;Feb&quot;,&quot;Mar&quot;,&quot;Apr&quot;,&quot;May&quot;,&quot;Jun&quot;,&quot;Jul&quot;,&quot;Aug&quot;,&quot;Sep&quot;,&quot;Oct&quot;,&quot;Nov&quot;,&quot;Dec&quot;),ordered=TRUE)

# the day of week is again easily found
dat$weekday = as.numeric(format(as.POSIXlt(dat$date),&quot;%u&quot;))

# again turn into factors to control appearance/abbreviation and ordering
# I use the reverse function rev here to order the week top down in the graph
# you can cut it out to reverse week order
dat$weekdayf&lt;-factor(dat$weekday,levels=rev(1:7),labels=rev(c(&quot;Mon&quot;,&quot;Tue&quot;,&quot;Wed&quot;,&quot;Thu&quot;,&quot;Fri&quot;,&quot;Sat&quot;,&quot;Sun&quot;)),ordered=TRUE)

# the monthweek part is a bit trickier - first a factor which cuts the data into month chunks
dat$yearmonth&lt;-as.yearmon(dat$date)
dat$yearmonthf&lt;-factor(dat$yearmonth)

# then find the &quot;week of year&quot; for each day
dat$week &lt;- as.numeric(format(as.POSIXlt(dat$date),&quot;%W&quot;))

# and now for each monthblock we normalize the week to start at 1
dat&lt;-ddply(dat,.(yearmonthf),transform,monthweek=1+week-min(week))

# Now for the plot
P&lt;- ggplot(dat, aes(monthweek, weekdayf, fill = norm_count)) +
 geom_tile(colour = &quot;white&quot;) + facet_grid(year~monthf) + scale_fill_gradient(low=&quot;green&quot;, high=&quot;red&quot;) +
 ggtitle(&quot;Time-Series Calendar Heatmap - HeiaHeia.com sick days logged&quot;) + xlab(&quot;Week of Month&quot;) + ylab(&quot;&quot;) + labs(fill=&quot;per user per day&quot;)
P
</pre>
<p>Here are the results. Green indicates the healthiest days with the lowest values of sick days logged per user, red indicates the worst days with the highest values of sick days logged per user. It&#8217;s quite clear that there are seasonal peaks around February, and 2011 was a lot worse than 2012 (one should note that January-February of 2011 were exceptionally cold in Finland). It matches quite well with the coverage in the national press: &#8220;Flu season reaching peak&#8221; (Feb&#8217;2012), <a href="http://yle.fi/uutiset/employers_grapple_with_sick_leaves_brought_by_flu_wave/3288795" target="_blank">Employers grapple with sick leaves brought by flu wave</a> (Feb&#8217;2012).</p>
<p>It&#8217;s interesting that there are fewer sick days logged on weekends than on work days, and that the traditional holiday month of July is the healthiest month of all.</p>
<p style="text-align: center;"><a href="http://www.ivankuznetsov.com/wp-content/uploads/sick-days-fi.png" target="_blank"><img class="alignnone  wp-image-458" title="HeiaHeia.com: sick days logged in Finland" src="http://www.ivankuznetsov.com/wp-content/uploads/sick-days-fi.png" alt="" width="709" /></a><br />
(click to see full-sized image)</p>
<p>To get a more formal validation of the data logged by HeiaHeia users, I used <a href="http://www.google.com/trends/correlate/whitepaper.pdf" target="_blank">Google Correlate</a> lab tool to check that heat map results make sense. I uploaded sick days per user weekly time series and plotted a correlation with Google search queries for &#8220;kuumeen hoito&#8221; (treatment of fever in Finnish).</p>
<p style="text-align: center;"><a href="http://www.ivankuznetsov.com/wp-content/uploads/gcorrelate.png" target="_blank"><img class="size-full wp-image-435 alignnone" title="HeiaHeia.com: sick days logged correlation with Google search for 'kuumeen hoito'" src="http://www.ivankuznetsov.com/wp-content/uploads/gcorrelate.png" alt="" width="709" height="438" /></a><br />
(click to see full-sized image)</p>
<p>The Pearson correlation coefficient r between the HeiaHeia sick days time series and Google search activity (both normalized so that the mean is 0 and the standard deviation is 1, i.e. expressed in units of σ) is 0.8257 &#8211; this is a pretty good match.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivankuznetsov.com/2012/04/heatmap-visualization-of-sick-day-trends-in-finland-with-r-ggplot2-and-google-correlate.html/feed</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Setting up and securing web server (+MySQL +Rails/PHP) on Ubuntu 10.04 LTS</title>
		<link>http://www.ivankuznetsov.com/2012/04/setting-up-and-securing-web-server-mysql-railsphp-on-ubuntu-10-04-lts.html</link>
		<comments>http://www.ivankuznetsov.com/2012/04/setting-up-and-securing-web-server-mysql-railsphp-on-ubuntu-10-04-lts.html#comments</comments>
		<pubDate>Mon, 23 Apr 2012 13:55:26 +0000</pubDate>
		<dc:creator>Ivan Kuznetsov</dc:creator>
				<category><![CDATA[Linode]]></category>
		<category><![CDATA[Linux]]></category>
		<category><![CDATA[Web/Tech]]></category>

		<guid isPermaLink="false">http://www.ivankuznetsov.com/?p=411</guid>
		<description><![CDATA[After repeating these operations many times in various setups, I decided to create a public set of instructions and share them with the world. This should be suitable for most of the simple web sites, utilizing Ruby on Rails or PHP. The setup works on Ubuntu 10.04 Server LTS (scheduled end of life April 2015). Other [...]]]></description>
				<content:encoded><![CDATA[<p>After repeating these operations many times in various setups, I decided to create a public set of instructions and share them with the world. This should be suitable for most of the simple web sites, utilizing Ruby on Rails or PHP.</p>
<p>The setup works on Ubuntu 10.04 Server LTS (scheduled end of life April 2015). The other components of the setup are <a href="http://nginx.org/" target="_blank">Nginx</a> as the web server and <a href="http://www.modrails.com/" target="_blank">Phusion Passenger</a> as the application server.</p>
<p>I&#8217;m using this setup most often on <a href="http://www.linode.com/" target="_blank">Linode</a> VPS, however none of the instructions are Linode specific.</p>
<p><span id="more-411"></span></p>
<h2>Create non-root user with sudo rights</h2>
<p>(Log in as root via ssh)</p>
<p>Add a new user. This user will be the administrator of the server, able to log in via ssh and use sudo; it is different from the user account used for the web application:</p>
<pre><code>adduser username</code></pre>
<p>Add this user to the sudoers list:</p>
<pre><code>visudo</code></pre>
<p>In the editor, add:</p>
<pre><code>username ALL=(ALL) ALL</code></pre>
<p><a name="Securing-sshd"></a></p>
<h2>Securing sshd</h2>
<p>Edit /etc/ssh/sshd_config:</p>
<p>Change the following lines (note that you&#8217;re changing the default port for ssh connections, so you&#8217;ll need to take that into account when connecting to the server later):</p>
<pre>Port 6668
PermitRootLogin no
X11Forwarding no
Protocol 2
PasswordAuthentication no
UsePAM no</pre>
<p>Add the following lines (here username is the name of the administrator user that was created earlier):</p>
<pre>UseDNS no 
AllowUsers username</pre>
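<p>Before restarting the daemon, it can be worth confirming that the edits actually landed, since a typo here can lock you out. The sketch below reads a keyword back from a config file; get_sshd_option is a hypothetical helper (not part of OpenSSH), and the demo runs against a throwaway file rather than the real /etc/ssh/sshd_config.</p>

```shell
#!/bin/sh
# Sketch: report the effective value of an sshd_config keyword.
# sshd uses the first value it finds for a keyword, so only the first
# uncommented match is printed. get_sshd_option is a hypothetical helper:
# $1 = config file, $2 = keyword (keywords are case-insensitive).
get_sshd_option() {
    awk -v key="$2" 'tolower($1) == tolower(key) { print $2; exit }' "$1"
}

# Demo against a throwaway file so nothing on the server is touched.
cfg=$(mktemp)
printf '%s\n' \
    'Port 6668' \
    'PermitRootLogin no' \
    '#PasswordAuthentication yes' \
    'PasswordAuthentication no' \
    'AllowUsers username' > "$cfg"

for key in Port PermitRootLogin PasswordAuthentication AllowUsers; do
    echo "$key = $(get_sshd_option "$cfg" "$key")"
done
rm -f "$cfg"
```

<p>On the server itself, running sshd in test mode (sudo /usr/sbin/sshd -t) additionally checks the real file for syntax errors.</p>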
<p>Restart ssh daemon:</p>
<pre><code>/etc/init.d/ssh restart</code></pre>
<p><a name="Add-authentication-keys-for-new-sudo-user"></a></p>
<h2>Add authentication keys for new administrator user</h2>
<p>Note that &#8216;user&#8217; here is the name of the administrator user that was created earlier.</p>
<pre>mkdir /home/user/.ssh 
vim /home/user/.ssh/authorized_keys</pre>
<p>When editing the authorized_keys file, insert the public key of that user (e.g. your personal public key, if you want to make yourself an administrator).</p>
<pre>chown -R user:user /home/user/.ssh 
chmod 600 /home/user/.ssh/authorized_keys</pre>
<p>Now you can log out as root and log in as the sudo user you just created. NOTE! You won&#8217;t be able to log in as root via ssh anymore.</p>
<p><a name="Set-up-iptables-firewall"></a></p>
<h2>Set up iptables firewall</h2>
<p>Save existing rules</p>
<pre><code>sudo sh -c 'iptables-save &gt; /etc/iptables.up.rules'</code></pre>
<p>Create test rules</p>
<pre><code>sudo vim /etc/iptables.test.rules</code></pre>
<p>Here&#8217;s an example of the firewall setup:</p>
<pre>*filter

#  Allows all loopback (lo0) traffic and drop all traffic to 127/8 that doesn't use lo0
-A INPUT -i lo -j ACCEPT
-A INPUT ! -i lo -d 127.0.0.0/8 -j REJECT

#  Accepts all established inbound connections
-A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT

#  Allows all outbound traffic
#  You can modify this to only allow certain traffic
-A OUTPUT -j ACCEPT

# Allows HTTP and HTTPS connections from anywhere (the normal ports for websites)
-A INPUT -p tcp --dport 80 -j ACCEPT
-A INPUT -p tcp --dport 443 -j ACCEPT

# (alternative) Uncomment to allow HTTP connections only from frontend server
# -A INPUT -p tcp -m state --state NEW -s frontend.server.ip --dport 80 -j ACCEPT

# Allow SSH connections
# THE --dport NUMBER IS THE SAME ONE YOU SET UP IN THE SSHD_CONFIG FILE
#
-A INPUT -p tcp -m state --state NEW --dport 6668 -j ACCEPT

# Uncomment to allow MySQL connections from defined servers
#-A INPUT -p tcp -m state --state NEW -s app.server.1.ip --dport 3306 -j ACCEPT

# Uncomment to allow memcached connections from app servers
#-A INPUT -p tcp -m state --state NEW -s app.server.1.ip --dport 11211 -j ACCEPT

# Uncomment to allow ping
#-A INPUT -p icmp -m icmp --icmp-type 8 -j ACCEPT

# log iptables denied calls
-A INPUT -m limit --limit 5/min -j LOG --log-prefix "iptables denied: " --log-level 7

# Reject all other inbound - default deny unless explicitly allowed policy
-A INPUT -j REJECT
-A FORWARD -j REJECT

COMMIT</pre>
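<p>Because a mismatch between the sshd port and the firewall rule would lock you out, a quick cross-check before applying the rules may help. This is a sketch only: ssh_port_allowed is a hypothetical helper, and the demo uses temporary files that mirror the settings in this post rather than the real /etc/iptables.test.rules and /etc/ssh/sshd_config.</p>

```shell
#!/bin/sh
# Sketch: check that the firewall rules accept the port sshd listens on.
# ssh_port_allowed is a hypothetical helper: $1 = iptables rules file,
# $2 = sshd_config file.
ssh_port_allowed() {
    port=$(awk 'tolower($1) == "port" { print $2; exit }' "$2")
    port=${port:-22}   # sshd falls back to port 22 when no Port line is set
    # Look for an uncommented INPUT rule that accepts this destination port.
    if grep -Eq "^-A INPUT .*--dport $port .*-j ACCEPT" "$1"; then
        echo "port $port is accepted"
    else
        echo "port $port is NOT accepted"
        return 1
    fi
}

# Demo with throwaway files that mirror the settings used in this post.
rules=$(mktemp)
conf=$(mktemp)
printf '%s\n' '*filter' \
    '-A INPUT -p tcp -m state --state NEW --dport 6668 -j ACCEPT' \
    '-A INPUT -j REJECT' 'COMMIT' > "$rules"
printf '%s\n' 'Port 6668' 'PermitRootLogin no' > "$conf"
ssh_port_allowed "$rules" "$conf"
rm -f "$rules" "$conf"
```

<p>The check is purely textual, so it only catches the obvious mistake of changing the port in one file and not the other.</p>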
<p>Apply firewall rules:</p>
<pre><code>sudo iptables-restore &lt; /etc/iptables.test.rules</code></pre>
<p>Check if everything is correct</p>
<pre><code>sudo iptables -L</code></pre>
<p>If everything is fine, save the rules</p>
<pre><code>sudo sh -c 'iptables-save &gt; /etc/iptables.up.rules'</code></pre>
<p>Make sure firewall rules are applied as soon as network interface comes up:</p>
<pre><code>sudo vim /etc/network/interfaces</code></pre>
<p>In the editor, add the pre-up line directly after the iface lo inet loopback line, so that the section reads:</p>
<pre>iface lo inet loopback
pre-up iptables-restore &lt; /etc/iptables.up.rules</pre>
<p><a name="Fix-locales"></a></p>
<h2>Fix locales</h2>
<pre>sudo locale-gen en_GB.UTF-8 
sudo /usr/sbin/update-locale LANG=en_GB.UTF-8</pre>
<p><a name="Set-timezone-and-hostname"></a></p>
<h2>Set timezone and hostname</h2>
<pre><code>sudo dpkg-reconfigure tzdata</code></pre>
<p>Select your timezone</p>
<pre>sudo sh -c 'echo "name" &gt; /etc/hostname' 
sudo hostname -F /etc/hostname</pre>
<p>Edit /etc/hosts (insert your VPS IP address and your server hostname here)</p>
<pre><code>IP.ad.dr.ess hostname</code></pre>
<p><a name="Update-the-system"></a></p>
<h2>Update the system</h2>
<p>Uncomment universe repositories</p>
<pre><code>sudo vim /etc/apt/sources.list</code></pre>
<p>And run safe and full upgrades</p>
<pre>sudo aptitude update 
sudo aptitude safe-upgrade 
sudo aptitude full-upgrade</pre>
<p><a name="Install-build-essentials"></a></p>
<h2>Install build essentials</h2>
<pre>sudo aptitude install build-essential 
sudo apt-get install zlibc zlib1g-dev libcurl4-openssl-dev libreadline5-dev</pre>
<p><a name="Install-setlock-for-crontab"></a></p>
<h2>Install setlock for crontab</h2>
<pre><code>sudo apt-get install daemontools</code></pre>
<p><a name="Set-up-Ruby-Enterprise-Edition-201202"></a></p>
<h2>Set up Ruby Enterprise Edition 2012.02</h2>
<p>Install ruby-ee (ruby 1.8.7 (2012-02-08 MBARI 8/0x6770 on patchlevel 358) [x86_64-linux], MBARI 0x6770, Ruby Enterprise Edition 2012.02):</p>
<pre>mkdir ~/tmp 
cd ~/tmp 
wget http://rubyenterpriseedition.googlecode.com/files/ruby-enterprise-1.8.7-2012.02.tar.gz
tar xvzf ruby-enterprise-1.8.7-2012.02.tar.gz</pre>
<p>Now enable tc_malloc large pages feature (32bit systems only): <a href="http://www.ivankuznetsov.com/2011/07/ree-segfaults-when-rails-application-has-too-many-localisation-files.html">http://www.ivankuznetsov.com/2011/07/ree-segfaults-when-rails-application-has-too-many-localisation-files.html</a></p>
<pre><code>sudo ./ruby-enterprise-1.8.7-2012.02/installer</code></pre>
<p>Create soft links to ruby tools</p>
<pre>sudo ln -s /opt/ruby-enterprise-1.8.7-2012.02/bin/ruby /usr/local/bin/ruby
sudo ln -s /opt/ruby-enterprise-1.8.7-2012.02/bin/gem /usr/local/bin/gem
sudo ln -s /opt/ruby-enterprise-1.8.7-2012.02/bin/irb /usr/local/bin/irb
sudo ln -s /opt/ruby-enterprise-1.8.7-2012.02/bin/rake /usr/local/bin/rake
sudo ln -s /opt/ruby-enterprise-1.8.7-2012.02/bin/bundle /usr/local/bin/bundle</pre>
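<p>The five links above follow one pattern, so they can also be created in a loop. A minimal sketch, assuming a POSIX shell; link_ruby_tools is a hypothetical helper, and it uses ln -sf (rather than the plain ln -s above) so that it can be re-run after an upgrade. The demo works on temporary directories instead of /opt and /usr/local/bin.</p>

```shell
#!/bin/sh
# Sketch: create the same symlinks in a loop.
# link_ruby_tools is a hypothetical helper: $1 = REE bin directory,
# $2 = directory that should receive the links.
link_ruby_tools() {
    for tool in ruby gem irb rake bundle; do
        # -f replaces an existing link, unlike the plain "ln -s" used above
        ln -sf "$1/$tool" "$2/$tool"
    done
}

# Dry run against temporary directories instead of /opt and /usr/local/bin.
src=$(mktemp -d)
dst=$(mktemp -d)
for tool in ruby gem irb rake bundle; do
    touch "$src/$tool"
done
link_ruby_tools "$src" "$dst"
ls -l "$dst"
rm -rf "$src" "$dst"
```

<p>On the server, the equivalent call would be link_ruby_tools /opt/ruby-enterprise-1.8.7-2012.02/bin /usr/local/bin, run from a root shell or from a script started with sudo.</p>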
<p>Create parametrised ruby launcher</p>
<pre><code>sudo vim /usr/local/bin/ruby-with-env</code></pre>
<p>With the following content:</p>
<pre>#!/bin/bash 
export RUBY_HEAP_MIN_SLOTS=1500000 
export RUBY_HEAP_SLOTS_INCREMENT=500000 
export RUBY_HEAP_SLOTS_GROWTH_FACTOR=1 
export RUBY_GC_MALLOC_LIMIT=50000000 
exec "/usr/local/bin/ruby" "$@"</pre>
<p>and make this file executable:</p>
<pre><code>sudo chmod +x /usr/local/bin/ruby-with-env</code></pre>
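<p>The launcher works because variables exported before exec are inherited by the program that replaces the shell. The sketch below shows that mechanism in isolation. As assumptions made for the demo, a temporary script stands in for /usr/local/bin/ruby-with-env, it runs whatever command it is given instead of a fixed ruby path, and sh -c stands in for ruby.</p>

```shell
#!/bin/sh
# Sketch: an exec wrapper passes exported variables on to the program it launches.
# A throwaway script stands in for /usr/local/bin/ruby-with-env, and
# "sh -c" stands in for ruby, so this runs on a machine without REE.
wrapper=$(mktemp)
printf '%s\n' \
    '#!/bin/sh' \
    'export RUBY_HEAP_MIN_SLOTS=1500000' \
    'exec "$@"' > "$wrapper"

# Run the wrapper through sh so the demo does not need an executable temp file.
result=$(sh "$wrapper" sh -c 'echo "RUBY_HEAP_MIN_SLOTS=$RUBY_HEAP_MIN_SLOTS"')
echo "$result"
rm -f "$wrapper"
```

<p>On the server, a similar check against the real launcher would be /usr/local/bin/ruby-with-env -e 'puts ENV["RUBY_HEAP_MIN_SLOTS"]'.</p>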
<p><a name="Set-up-necessary-components"></a></p>
<h2>Set up necessary components</h2>
<p>Install readline wrap</p>
<pre><code>sudo apt-get install rlwrap</code></pre>
<p>Install git</p>
<pre><code>sudo apt-get install git-core</code></pre>
<p>Install native dependencies for gems</p>
<pre><code>sudo apt-get install libcurl4-gnutls-dev libxslt-dev libssl-dev</code></pre>
<p><a name="Install-MySQL"></a></p>
<h2>Install MySQL</h2>
<pre>sudo apt-get install mysql-server libmysqlclient-dev</pre>
<h2>Install Passenger 3.0.11 and let it compile nginx</h2>
<pre>cd /tmp
wget http://nginx.org/download/nginx-1.0.14.tar.gz
tar xvzf nginx-1.0.14.tar.gz
sudo /opt/ruby-enterprise-1.8.7-2012.02/bin/passenger-install-nginx-module</pre>
<p>- choose installation option 2<br />
- provide source path /tmp/nginx-1.0.14<br />
- provide additional compilation options --with-http_realip_module --with-http_gzip_static_module --without-mail_pop3_module --without-mail_smtp_module --without-mail_imap_module</p>
<p>Create nginx init script:</p>
<pre><code>sudo vim /etc/init.d/nginx</code></pre>
<p>Enter the following content:</p>
<pre>#! /bin/sh

### BEGIN INIT INFO
# Provides:          nginx
# Required-Start:    $all
# Required-Stop:     $all
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: starts the nginx web server
# Description:       starts nginx using start-stop-daemon
### END INIT INFO

PATH=/opt/nginx/sbin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
DAEMON=/opt/nginx/sbin/nginx
NAME=nginx
DESC=nginx

test -x $DAEMON || exit 0

# Include nginx defaults if available
if [ -f /etc/default/nginx ] ; then
    . /etc/default/nginx
fi

set -e

. /lib/lsb/init-functions

test_nginx_config() {
  if nginx -t
  then
    return 0
  else
    return $?
  fi
}

case "$1" in
  start)
    echo -n "Starting $DESC: "
        test_nginx_config
    start-stop-daemon --start --quiet --pidfile /var/run/$NAME.pid \
        --exec $DAEMON -- $DAEMON_OPTS || true
    echo "$NAME."
    ;;
  stop)
    echo -n "Stopping $DESC: "
    start-stop-daemon --stop --quiet --pidfile /var/run/$NAME.pid \
        --exec $DAEMON || true
    echo "$NAME."
    ;;
  restart|force-reload)
    echo -n "Restarting $DESC: "
    start-stop-daemon --stop --quiet --pidfile \
        /var/run/$NAME.pid --exec $DAEMON || true
    sleep 1
        test_nginx_config
    start-stop-daemon --start --quiet --pidfile \
        /var/run/$NAME.pid --exec $DAEMON -- $DAEMON_OPTS || true
    echo "$NAME."
    ;;
  reload)
        echo -n "Reloading $DESC configuration: "
        test_nginx_config
        start-stop-daemon --stop --signal HUP --quiet --pidfile /var/run/$NAME.pid \
            --exec $DAEMON || true
        echo "$NAME."
        ;;
  configtest)
        echo -n "Testing $DESC configuration: "
        if test_nginx_config
        then
          echo "$NAME."
        else
          exit $?
        fi
        ;;
  status)
    status_of_proc -p /var/run/$NAME.pid "$DAEMON" nginx &amp;&amp; exit 0 || exit $?
    ;;
  *)
    echo "Usage: $NAME {start|stop|restart|reload|force-reload|status|configtest}" &gt;&amp;2
    exit 1
    ;;
esac

exit 0</pre>
<p>Make it executable:</p>
<pre><code>sudo chmod +x /etc/init.d/nginx</code></pre>
<p>Add nginx to autostartup list:</p>
<pre><code>sudo /usr/sbin/update-rc.d -f nginx defaults</code></pre>
<p><a name="Edit-main-nginx-config"></a></p>
<h2>Edit main nginx config</h2>
<pre><code>sudo vim /opt/nginx/conf/nginx.conf</code></pre>
<p>Enter the following content:</p>
<pre>user  www-data;
worker_processes  4;

error_log  /var/log/nginx/error.log;
pid        /var/run/nginx.pid;

events {
    worker_connections  8192;
    use epoll;
}

http {
    passenger_root /opt/ruby-enterprise-1.8.7-2012.02/lib/ruby/gems/1.8/gems/passenger-3.0.11;
    passenger_ruby /usr/local/bin/ruby-with-env;

    include       mime.types;
    default_type  application/octet-stream;

    log_format  main  '$remote_addr - $remote_user [$time_local] "$request" '
                      '$status $body_bytes_sent "$http_referer" '
                      '"$http_user_agent" "$http_x_forwarded_for"';

    access_log  /var/log/nginx/access.log;

    sendfile       on;
    tcp_nopush     on;
    tcp_nodelay        on;
    keepalive_timeout  65;

    # Passenger never sleeps!
    passenger_pool_idle_time 0;

    # Use more instances, because memory is enough
    passenger_max_pool_size 15;

    # Start application instantly
    passenger_pre_start http://127.0.0.1/;

    client_max_body_size 4m;

    include /opt/nginx/conf/sites-enabled/*;
}</pre>
<p>Create site configuration and log directories:</p>
<pre>sudo mkdir /opt/nginx/conf/sites-enabled 
sudo mkdir /opt/nginx/conf/sites-available 
sudo mkdir /var/log/nginx</pre>
<p><a name="Set-up-virtual-hosts"></a></p>
<h2>Set up virtual hosts</h2>
<p>Create a separate user for each virtual host, create a home directory, set a password, create folders for logs and the web application, and set permissions (replace <code>username</code> and <code>hostname</code> with your own values):</p>
<pre>sudo useradd username -d /home/username 
sudo mkdir /home/username 
sudo passwd username
sudo mkdir /home/username/hostname 
sudo mkdir /home/username/logs 
sudo mkdir /home/username/logs/hostname 
sudo chown -R username:www-data /home/username/ 
sudo chmod 750 /home/username/</pre>
<p>Configure hosts:</p>
<pre><code>sudo vim /opt/nginx/conf/sites-available/mysite.com</code></pre>
<pre>server {
        listen   80;
        server_name  .mysite.com;

        access_log  /home/username/logs/mysite.com/access.log;
        error_log  /home/username/logs/mysite.com/error.log;
        root   /home/username/mysite.com/current/public;

        # uncomment if traffic is coming from a frontend server/loadbalancer
        #set_real_ip_from   IP.ad.dre.ss;
        #real_ip_header     X-Real-IP;

        passenger_enabled on;
        passenger_min_instances 5;
}
</pre>
<p>Enable site:</p>
<pre><code>sudo ln -s /opt/nginx/conf/sites-available/mysite.com /opt/nginx/conf/sites-enabled/mysite.com</code></pre>
<p>Set up log rotation:</p>
<pre><code>sudo ln -s /home/username/mysite.com/current/config/logrotate.conf /etc/logrotate.d/mysite</code></pre>
<h2>Install fastcgi (for PHP setups)</h2>
<p>Install dependencies</p>
<pre><code>sudo apt-get install libmcrypt-dev libxml2-dev libpng-dev autoconf2.13 libevent-dev libltdl-dev</code></pre>
<p>Download latest stable PHP 5.2.13, Suhosin patch, PHP-FPM patch</p>
<pre>cd ~/tmp
wget -O php-5.2.13.tar.gz http://pl2.php.net/get/php-5.2.13.tar.gz/from/pl.php.net/mirror
wget http://download.suhosin.org/suhosin-patch-5.2.13-0.9.7.patch.gz
wget http://php-fpm.org/downloads/php-5.2.13-fpm-0.5.13.diff.gz
tar xvzf php-5.2.13.tar.gz
gunzip suhosin-patch-5.2.13-0.9.7.patch.gz 
gunzip php-5.2.13-fpm-0.5.13.diff.gz 
cd php-5.2.13 
patch -p 1 -i ../php-5.2.13-fpm-0.5.13.diff 
patch -p 1 -i ../suhosin-patch-5.2.13-0.9.7.patch 
./buildconf --force 
./configure --enable-fastcgi --enable-fpm --with-mcrypt --with-zlib --enable-mbstring --with-openssl --with-mysql --with-mysql-sock --with-gd --without-sqlite --disable-pdo 
make 
make test 
sudo make install</pre>
<p>Alternatively, download the latest stable PHP 5.3.2 and the Suhosin patch, and check out PHP-FPM from SVN:</p>
<pre>cd ~/tmp
wget -O php-5.3.2.tar.gz http://fi.php.net/get/php-5.3.2.tar.gz/from/this/mirror
wget http://download.suhosin.org/suhosin-patch-5.3.2-0.9.9.1.patch.gz
tar xvzf php-5.3.2.tar.gz
gunzip suhosin-patch-5.3.2-0.9.9.1.patch.gz 
cd php-5.3.2 
patch -p 1 -i ../suhosin-patch-5.3.2-0.9.9.1.patch 
svn co http://svn.php.net/repository/php/php-src/trunk/sapi/fpm sapi/fpm 
./buildconf --force 
./configure --enable-fastcgi --enable-fpm --with-mcrypt --with-zlib --enable-mbstring --with-openssl --with-mysql --with-mysql-sock --with-gd --without-sqlite --disable-pdo --disable-reflection 
make 
make test 
sudo make install</pre>
<p>Uninstall autoconf2.13 after compilation, since it is an old version and only required for PHP compilation</p>
<pre><code>sudo apt-get remove autoconf2.13</code></pre>
<p>Change user and group of php-fpm processes to your user dedicated to this application and www-data &#8211; lines 63 and 66 respectively</p>
<pre><code>sudo vim /usr/local/etc/php-fpm.conf</code></pre>
<p>Edit PHP settings</p>
<pre><code>sudo vim /etc/php5/cgi/php.ini</code></pre>
<p>Set:</p>
<pre>max_execution_time = 30 
memory_limit = 64M 
error_reporting = E_COMPILE_ERROR|E_RECOVERABLE_ERROR|E_ERROR|E_CORE_ERROR 
display_errors = Off 
log_errors = On 
error_log = /var/log/php.log 
register_globals = Off</pre>
<div>Now restart nginx and you&#8217;re all set</div>
<div>
<pre><code>sudo /etc/init.d/nginx restart</code></pre>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.ivankuznetsov.com/2012/04/setting-up-and-securing-web-server-mysql-railsphp-on-ubuntu-10-04-lts.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Book review: European Founders At Work</title>
		<link>http://www.ivankuznetsov.com/2012/04/book-review-european-founders-at-work.html</link>
		<comments>http://www.ivankuznetsov.com/2012/04/book-review-european-founders-at-work.html#comments</comments>
		<pubDate>Wed, 11 Apr 2012 11:02:59 +0000</pubDate>
		<dc:creator>Ivan Kuznetsov</dc:creator>
				<category><![CDATA[Books]]></category>
		<category><![CDATA[Business]]></category>
		<category><![CDATA[Startups]]></category>

		<guid isPermaLink="false">http://www.ivankuznetsov.com/?p=403</guid>
		<description><![CDATA[A book by Pedro Santos follows the format of Jessica Livingston&#8217;s &#8220;Founders at Work&#8221;, offering a series of interviews with the founders of European start-ups. Entrepreneurs, such as Ilya Segalovich (co-founder of Yandex), Loic Le Meur (founder of Seesmic and LeWeb), Peter Arvai (co-founder of Prezi) and many others (see full list on the book&#8217;s website: www.europeanfoundersatwork.com) tell about how they [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.ivankuznetsov.com/wp-content/uploads/1329389172.jpg"><img class="alignleft size-full wp-image-404" title="1329389172" src="http://www.ivankuznetsov.com/wp-content/uploads/1329389172.jpg" alt="" width="165" height="248" /></a>This book by <a href="http://www.twitter.com/gairifo" target="_blank">Pedro Santos</a> follows the format of Jessica Livingston&#8217;s &#8220;Founders at Work&#8221;, offering a series of interviews with the founders of European start-ups.</p>
<p>Entrepreneurs such as Ilya Segalovich (co-founder of Yandex), Loic Le Meur (founder of Seesmic and LeWeb), Peter Arvai (co-founder of Prezi) and many others (see the full list on the book&#8217;s website: <a href="http://www.europeanfoundersatwork.com/interviews.html" target="_blank">www.europeanfoundersatwork.com</a>) talk about how they started, built, pivoted and drove their businesses to success.</p>
<p>The book gives a unique first-hand perspective on how to grow a successful business from Europe: the importance of the US market, the challenges European start-ups face, and the competitive advantages of being in Europe.</p>
<p>It is an inspiring book, and it is very relevant to European entrepreneurs. While stories of US start-ups quite often start with &#8220;we got $N mln in funding and started growing from there&#8221;, in Europe it&#8217;s more about bootstrapping and building a profit-generating machine. I would definitely recommend it to anyone who is thinking of starting a technology company in Europe or is already running one.</p>
<p><span id="more-403"></span> A couple of quotes from the book that I particularly liked:</p>
<p>From the interview with Peter Arvai, co-founder of Prezi (the company makes SaaS presentation software):</p>
<p><em>&#8220;In my case, and I think I alluded to this if I&#8217;ve not said it directly, I&#8217;ve been very concerned with not building a company where there&#8217;s too much of a feeling that there&#8217;s an A and a B team. And naturally you&#8217;d think that the A team is in San Francisco. That is the cool place where everyone wants to be. And the B team is in Budapest, this former communist, gray country—and that&#8217;s where you outsource people. This is how most people tend to look at the world, and it&#8217;s a view that bugs the hell out of me.</em></p>
<p><em>And so what can you do as an entrepreneur is to try to work on those preconceptions. So one thing that we do, for example, is a fellowship program. The fellowship program allows Prezi employees to work in the other office and Prezi sponsors the travel and the living costs for the person if this is within, of course, the three-month visa restrictions that are allowed. So this means that people can experience the different places with all of their strengths and weaknesses.</em></p>
<p><em>Another thing is, so far we&#8217;ve had a semi-annual power week. I don&#8217;t know if we will be able to keep this up, but during this week we collect the entire company in one place and so far it&#8217;s always been Budapest because that&#8217;s always where we&#8217;ve had more people. This was probably a very important moment in Prezi history as well.&#8221;</em></p>
<p>With the head office of <a href="http://info.heiaheia.com">HeiaHeia</a> in Helsinki, Finland and R&amp;D in Krasnodar, Russia, it takes a certain amount of effort not to split into A and B teams, so it&#8217;s very interesting to see how other companies with a similar setup are addressing the issue.</p>
<p>And a couple of valuable notes on how to enter the US market as a European company:</p>
<p>From the interview with Jos White, founder of MessageLabs (a UK SaaS start-up that made a $700M exit to Symantec in 2008), now a partner at Notion Capital, a VC firm specializing in B2B SaaS businesses:</p>
<p><em>&#8220;You usually cannot charge as much for your service in the US, so you have to make it up on scale. Initially, US can be a loss leader until you really build up the scale. So we did have more users in the US than anywhere else, but we still had Europe as our largest market in terms of revenue and profit.</em></p>
<p><em>[...]</em></p>
<p><em>&#8230;initially we entered the US with a fairly small budget and we just thought we would have a sales team out there and see if we can do it that way. And it just doesn&#8217;t work. I think that in the US you really have to make a full investment into the market. You need to have a fairly autonomous management team. You need to be prepared to look at spending money for some time before delivering any profitability.&#8221;</em></p>
<p>From the interview with Peldi Guilizzoni, founder of Balsamiq:</p>
<p><em>&#8220;I wanted to keep an American company because half of our customers are American. And they like to call an American phone number if they have a problem, and they like to mail a check, in dollars, in the mail, to an American address. It’s just the way software business is done in the US.&#8221;</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivankuznetsov.com/2012/04/book-review-european-founders-at-work.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Informal notes from Strata 2012 conference on Big Data and Data Science</title>
		<link>http://www.ivankuznetsov.com/2012/03/informal-notes-from-strata-2012-conference-on-big-data-and-data-science.html</link>
		<comments>http://www.ivankuznetsov.com/2012/03/informal-notes-from-strata-2012-conference-on-big-data-and-data-science.html#comments</comments>
		<pubDate>Sat, 31 Mar 2012 13:47:27 +0000</pubDate>
		<dc:creator>Ivan Kuznetsov</dc:creator>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Travel]]></category>
		<category><![CDATA[Web/Tech]]></category>

		<guid isPermaLink="false">http://www.ivankuznetsov.com/?p=381</guid>
		<description><![CDATA[It&#8217;s been almost a month since I came back from California, and I just got around to sorting the notes from O&#8217;Reilly Strata conference. Spending time in the Valley is always inspiring &#8211; lots of interesting people, old friends, new contacts, new start-ups &#8211; it is the center of IT universe. Spending 3 days with [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.ivankuznetsov.com/wp-content/uploads/strata2012_logo_date.png"><img class="alignleft size-full wp-image-382" title="strata2012_logo_date" src="http://www.ivankuznetsov.com/wp-content/uploads/strata2012_logo_date.png" alt="" width="302" height="177" /></a>It&#8217;s been almost a month since I came back from California, and I just got around to sorting the notes from O&#8217;Reilly Strata conference. Spending time in the Valley is always inspiring &#8211; lots of interesting people, old friends, new contacts, new start-ups &#8211; it is the center of the IT universe.</p>
<p>Spending 3 days with people who are working at the bleeding edge of data science was an unforgettable experience. I got my dose of inspiration and a lot of new ideas on how to apply data science at <a href="http://www.heiaheia.com">HeiaHeia</a>. It&#8217;s difficult to overestimate the importance data analysis will have in the coming years. Companies that do not understand their data and keep making decisions based on the gut feeling of board members or operational management instead of data analysis will simply fade away.</p>
<p>Unfortunately <a href="http://info.heiaheia.com">HeiaHeia</a> was the only company from Finland attending the conference. But I&#8217;m really happy to see that recently there are more and more signals that companies in Finland are starting to realize the importance of data, and there are new Finnish start-ups dealing with data analysis. I believe that Finland has an excellent opportunity to have not only a cluster of game development companies, but also big data companies and start-ups. So far it seems that the Valley, London and Australia are leading in this field.</p>
<p>By the way, Trulia (co-founded by <a href="https://twitter.com/#!/samiinkinen">Sami Inkinen</a>) had an excellent demo in the halls of the conference venue &#8211; <a href="http://www.truliablog.com/2012/03/07/trulia-showcases-data-visualizations-at-strata-2012/">check it out in their blog</a>.</p>
<p>Below are my notes from the conference &#8211; I added the presentation links and videos that I could find, but otherwise the notes are quite unstructured. There were multiple tracks and it was very difficult to choose between them. The highlights of the conference were the talks by Avinash Kaushik, Jeremy Howard, Matt Biddulph, Ben Goldacre, and Alasdair Allan, and the Oxford-style debate on the proposition &#8220;In data science, domain expertise is more important than machine learning skill.&#8221; (see videos below).</p>
<p><span id="more-381"></span></p>
<h1>== Day 1 ==</h1>
<h2>Michael Rys / Microsoft &#8211; SQL and NoSQL &#8211; two sides of the same coin</h2>
<p><a href="http://www.slideshare.net/MichaelRys/sql-and-nosql-are-two-sides-of-the-same-coin-strata-2012">http://www.slideshare.net/MichaelRys/sql-and-nosql-are-two-sides-of-the-same-coin-strata-2012</a></p>
<p>Web 2.0 &#8211; money is made by &#8220;monetizing the social&#8221;<br />
- improving individual experience<br />
- re-selling aggregate data</p>
<p>MySpace: 500 database servers (using stored procedures)<br />
Facebook runs on 1800 MySQL databases (+ Cassandra, HBase, etc.)</p>
<p>Shard/partition data. Global transactions do not scale.<br />
Read cache might help.</p>
<p>Social networking &#8211; user status update. Update user&#8217;s own db, then service dispatcher asynchronously updates servers for user&#8217;s friends.</p>
<p>Eventual consistency &#8211; changes take time to propagate.<br />
Propagation needs to be able to deal with failures &#8211; retries.</p>
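<p>The write path described above (a synchronous write to the author&#8217;s own store, then asynchronous fan-out with retries) could be sketched roughly as follows. This is an illustration, not code from the talk: the function names, the in-memory <code>stores</code> dictionary and the retry limit are all assumptions, and a real dispatcher would drain the queue in a separate worker rather than inline.</p>
<pre>
```python
from collections import deque


def post_status(author, text, friends, stores, deliver, max_attempts=3):
    """Store an update for its author, then fan it out to friends with retries."""
    update = (author, text)
    stores.setdefault(author, []).append(update)  # the author's own db is written first

    # Everything below stands in for the asynchronous dispatcher.
    queue = deque((friend, 1) for friend in friends.get(author, []))
    undelivered = []
    while queue:
        friend, attempt = queue.popleft()
        try:
            deliver(stores, friend, update)
        except ConnectionError:
            if attempt == max_attempts:
                undelivered.append(friend)  # give up; a real system would alert here
            else:
                queue.append((friend, attempt + 1))  # retry later
    return undelivered


# Hypothetical delivery function: the server holding bob's feed fails once.
calls = {}


def flaky_deliver(stores, friend, update):
    calls[friend] = calls.get(friend, 0) + 1
    if friend == "bob" and calls[friend] == 1:
        raise ConnectionError("server for bob is temporarily unavailable")
    stores.setdefault(friend, []).append(update)


stores = {}
friends = {"alice": ["bob", "carol"]}
undelivered = post_status("alice", "went for a run", friends, stores, flaky_deliver)
print(undelivered)    # []
print(calls["bob"])   # 2: the first attempt failed, the retry succeeded
```
</pre>
<p>In this run carol receives the update before bob does, which is the eventual consistency trade-off in miniature: every friend ends up with the update, but not at the same moment.</p>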
<p>Scaling requires:<br />
- elasticity<br />
- multiversion schema support<br />
- flexible schema support</p>
<p>Most customers end up developing their own solutions on top of existing db servers.</p>
<p>NoSQL:<br />
- low cost<br />
- HA, scale out, performance<br />
- data first: flexible schema</p>
<p>MS Azure with federation: split/drop of shards without downtime (no merge supported yet)<br />
Running aggregate queries on federated data using map-reduce approach in SQL:<br />
select sum(scores)/sum(count) from (select scores, count from federated_table &#8230;)</p>
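<p>The same map-reduce idea, written out as a small sketch: each shard reduces its own rows to a (sum, count) pair and only those pairs are combined centrally. The shard names and score values are made up for illustration.</p>
<pre>
```python
def map_shard(rows):
    """Map step, run on each shard: collapse local scores to (sum, count)."""
    return sum(rows), len(rows)


def reduce_partials(partials):
    """Reduce step: combine the per-shard pairs into one global average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    if count == 0:
        raise ValueError("no rows on any shard")
    return total / count


# Hypothetical score values held by three shards of different sizes.
shards = {
    "shard_a": [10, 20, 30],
    "shard_b": [40],
    "shard_c": [50, 60],
}

partials = [map_shard(rows) for rows in shards.values()]
global_average = reduce_partials(partials)
print(global_average)  # 35.0
```
</pre>
<p>Shipping sums and counts rather than per-shard averages matters because shards differ in size: averaging the three shard averages here (20, 40 and 55) would give about 38.3 instead of 35.</p>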
<p>NoSQL issues:<br />
- no secondary indices<br />
- eventual consistency is not always an optimal choice</p>
<p>Architecture:<br />
- multiple shards, each one has read-only replicas for OLAP processing</p>
<h2>Claudia Perlich / chief scientist, Media6Degrees &#8211; From Knowing ‘What’ To Understanding ‘Why’</h2>
<p>Does showing an online ad to a person increase the probability that the person goes and buys the product?<br />
The full set of events is collected: browsing patterns are used to identify the target, a bid is placed to show the ad to that target,<br />
and the person is then monitored to see whether they visit the site.</p>
<p>Clicks are worth nothing &#8211; conversion is everything. But raw conversion rates are very deceiving.<br />
Does it make sense to show an ad to people who will buy the product anyway?<br />
The idea is to track how advertising changes the rate of conversion.</p>
<p>E.g. user visited the web site within 5 days after seeing the ad.</p>
<p>Understand the causal process. A/B testing is the way to do it &#8211; but it costs a lot of money.<br />
Non-invasive causal estimation.<br />
TMLE &#8211; targeted maximum likelihood estimation</p>
<p>Have null tests: showing unrelated ads should have no effect on conversion!</p>
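<p>A rough sketch of the A/B-style comparison and the null test mentioned above. This is the plain randomized comparison, not the TMLE approach from the talk, and all counts are made up for illustration.</p>
<pre>
```python
def conversion_rate(conversions, users):
    if users == 0:
        raise ValueError("group has no users")
    return conversions / users


def lift(exposed, control):
    """Difference in conversion rate between an exposed and a control group.

    Each group is a (conversions, users) tuple.
    """
    return conversion_rate(*exposed) - conversion_rate(*control)


# Made-up counts for three groups of 10,000 comparable users each.
control = (250, 10000)        # saw no ad
campaign_ad = (300, 10000)    # saw the ad for the product
unrelated_ad = (252, 10000)   # null test: saw an ad for something else

campaign_lift = lift(campaign_ad, control)
null_lift = lift(unrelated_ad, control)
print(round(campaign_lift, 4))  # 0.005
print(round(null_lift, 4))      # 0.0002
```
</pre>
<p>The raw 3% conversion rate of the exposed group overstates the effect, since 2.5% of comparable users convert without seeing the ad. A null-test lift that is far from zero would suggest the groups are not comparable.</p>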
<p>A lot of people don&#8217;t want to be told that what they are doing is suboptimal &#8211; that hinders the process.<br />
Marketer, ad agency, creative designers &#8211; all follow their own best practices.</p>
<p>Tools:<br />
1. always pull the data yourself &#8211; it takes more time to figure out how the data was pulled than to pull it yourself<br />
2. hadoop + hive, perl/python to clean up the data, R</p>
<p>Time-wise: 0.5-1 year to build a model that is correct and believable (iterative process)</p>
<h2>Monica Rogati / senior data scientist @ LinkedIn &#8211; The Model and the Train Wreck: A Training Data How-to</h2>
<p>Peter Norvig, &#8220;The Unreasonable Effectiveness of Data&#8221; talk &#8211; more data beats clever algorithms.<br />
But what beats more data is better data.</p>
<p>Cold start problem: what to do if we don&#8217;t know enough about users yet?<br />
Random recommendations are not good &#8211; too easy to offend people.</p>
<h2>Jacob Perkins / Weotta &#8211; Corpus Bootstrapping with NLTK</h2>
<p>translating an existing corpus with the Yahoo Babelfish API: English -&gt; Spanish<br />
then using Naive Bayes to classify Spanish movie reviews</p>
<p>bootstrapping phrase selection</p>
<p>A Yelp study indicated that a Naive Bayes classifier produces more accurate classification than Mechanical Turk</p>
<h2>Ben Gimpert / Altos Research &#8211; The Importance of Importance: An Introduction to Feature Selection</h2>
<p>collect real estate listings</p>
<p>importance of selecting proper features for data mining/predictions<br />
Gartner&#8217;s hype cycle 2011: www.gartner.com/it/page.jps?id=1763814 &#8211; data mining is on the rise</p>
<p>Data dimensionality doubles every year, Moore&#8217;s law predicts computational power to double only every 2 years.</p>
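<p>To put numbers on that gap, here is a small worked calculation. It takes the two doubling periods from the note above at face value and assumes both trends simply continue.</p>
<pre>
```python
def growth_gap(years, data_doubling=1.0, compute_doubling=2.0):
    """How many times faster data has grown than compute after `years`."""
    data_growth = 2 ** (years / data_doubling)
    compute_growth = 2 ** (years / compute_doubling)
    return data_growth / compute_growth


for years in (2, 6, 10):
    print(years, growth_gap(years))
# 2 2.0
# 6 8.0
# 10 32.0
```
</pre>
<p>After ten years the data has grown 1024-fold and compute only 32-fold, which is the argument for throwing features away before modelling.</p>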
<p>CART model, OLS, Ridge &#8211; time goes down, accuracy goes up</p>
<p>Feature selection:<br />
- manual (apply domain expertise)<br />
- automatic &#8211; look for efficiency of all subsets (but this is only a thought experiment)<br />
- forward stepwise feature selection<br />
- information gain feature selection (if taking out a feature increases entropy, it must be pretty important)<br />
- Ridge regression regularization<br />
- least angle regression (LARS)<br />
- principal component analysis (PCA) &#8211; does badly against non-linear data shapes (it&#8217;s also difficult to explain)<br />
- ensemble of decision trees (CART or random forests)</p>
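<p>As an illustration of the information gain item in the list above, a minimal sketch that ranks categorical features by how much knowing them reduces the entropy of the label. The listing features and labels are invented, not taken from the talk.</p>
<pre>
```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy of the labels minus their weighted entropy after splitting on feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical listings: does the home sell fast or slow?
rows = [
    {"has_garage": "yes", "odd_house_number": "yes"},
    {"has_garage": "yes", "odd_house_number": "no"},
    {"has_garage": "no", "odd_house_number": "yes"},
    {"has_garage": "no", "odd_house_number": "no"},
]
labels = ["fast", "fast", "slow", "slow"]

gains = {f: information_gain(rows, labels, f) for f in rows[0]}
ranking = sorted(gains, key=gains.get, reverse=True)
print(ranking)  # ['has_garage', 'odd_house_number']
```
</pre>
<p>In this toy data <code>has_garage</code> removes all uncertainty about the label (a gain of 1 bit) while <code>odd_house_number</code> removes none, so only the first would be kept.</p>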
<h2>Matt Biddulph (@mattb) / Dopplr co-founder, Hackdiary.com &#8211; Social Network Analysis Isn’t Just For People</h2>
<p>- triadic closure as an implementation of &#8220;people you might know&#8221;<br />
- community detection in SNA via modularity: Q = 1/(2m) &#183; &#931;<sub>ij</sub> [A<sub>ij</sub> &#8722; k<sub>i</sub>k<sub>j</sub>/(2m)] &#183; &#948;(c<sub>i</sub>, c<sub>j</sub>)<br />
- Belgian phone call network &#8211; split between French- and Flemish-speaking communities<br />
- Nokia map app &#8211; all route calculation requests are logged, which can be used for identifying social links between cities<br />
- gephi.org &#8211; the Photoshop of graphs<br />
- betweenness centrality algorithm to size and colour a network of topics tagged in del.icio.us<br />
- links between the most active pages on Wikipedia as a social network<br />
- map of affinity between journalists and topics &#8211; using the Guardian API<br />
- map of developers using the GitHub API<br />
- links between programming languages and music (last.fm API): JS, Node.js &#8211; alternative and rock, Ruby &#8211; singer-songwriter</p>
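<p>The community detection formula in the list above is the modularity score Q: A is the adjacency matrix, k the node degrees, m the number of edges, and the delta term is 1 when two nodes share a community. A direct, unoptimized translation for a small undirected graph might look like this; the example graph is made up.</p>
<pre>
```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each edge listed once, no self-loops
    community: dict mapping every node to its community label
    """
    m = len(edges)
    degree = {}
    adjacency = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        adjacency[(u, v)] = adjacency.get((u, v), 0) + 1
        adjacency[(v, u)] = adjacency.get((v, u), 0) + 1

    total = 0.0
    for i in degree:
        for j in degree:
            if community[i] == community[j]:  # the delta(ci, cj) term
                total += adjacency.get((i, j), 0) - degree[i] * degree[j] / (2 * m)
    return total / (2 * m)


# Two triangles joined by a single bridge edge c-d.
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]

two_groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
one_group = {node: 0 for node in two_groups}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(round(q_two, 4))  # 0.3571
```
</pre>
<p>Splitting the graph into its two triangles scores 5/14, while putting every node into one community scores approximately zero. Community detection algorithms search for the partition that maximizes Q.</p>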
<h2>Robert Lefkowitz / 1010data &#8211; Array Theory vs. Set Theory in Managing Data</h2>
<p>- array structures are a superset of set structures, so array-based databases are more powerful than set-based databases<br />
- a permutation vector (<code>order</code> in R) gives you a list of indices which can be used to retrieve the original data elements in order<br />
- example: in a database with a &#8220;date, place, low, high&#8221; structure, how do you get the &#8220;range&#8221; (high-low) for each date?<br />
- with permutation vectors it&#8217;s possible to perform all operations defined in set theory (in relational databases), and also grading, proximity joins, fuzzy joins, transaction isolation and sharding<br />
- 1010data.com has an implementation of a database based on array theory; another one is SciDB</p>
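<p>A small sketch of the permutation vector idea using plain Python lists as columns. The temperature rows are invented, and the indices are 0-based, whereas <code>order</code> in R returns 1-based positions.</p>
<pre>
```python
# Hypothetical rows with the "date, place, low, high" structure from the talk.
rows = [
    ("2012-02-03", "Helsinki", -12, -5),
    ("2012-02-01", "Helsinki", -9, -2),
    ("2012-02-02", "Helsinki", -15, -7),
]

# Array-style storage: one column per field, row i is position i in every column.
dates = [r[0] for r in rows]
lows = [r[2] for r in rows]
highs = [r[3] for r in rows]

# Permutation vector: the indices that would sort the dates in ascending order.
perm = sorted(range(len(dates)), key=lambda i: dates[i])

# The same vector reorders any other column, or an expression over columns.
dates_in_order = [dates[i] for i in perm]
ranges_in_order = [highs[i] - lows[i] for i in perm]

print(perm)             # [1, 2, 0]
print(ranges_in_order)  # [7, 8, 7]
```
</pre>
<p>Because the ordering lives in <code>perm</code> rather than in the data, the columns never have to be physically sorted, and one permutation can be reused across every column.</p>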
<h2>Robert Lancaster / Orbitz &#8211; Survival Analysis for Cache Time-to-Live Optimization</h2>
<p>- survival analysis (heart-transplant patients &#8211; time to death, leukemia in remission &#8211; time to relapse)<br />
- survival function (Weibull, exponential, log-logistic)<br />
- Kaplan-Meier estimates<br />
- Orbitz applied this to cache of rates and availability pulled from the partners<br />
- change deployed on Feb 2, 2012 &#8211; sharp decrease in traffic, impact on look-to-book is positive</p>
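<p>For reference, a bare-bones Kaplan-Meier estimator in pure Python. Mapping it onto cache entries (a rate change is the event, an entry evicted while still valid is censored) and the 60% threshold used to pick a TTL are my assumptions, not details from the talk, and the data is made up.</p>
<pre>
```python
def kaplan_meier(durations, observed):
    """Kaplan-Meier estimate of the survival function.

    durations: how long each item was followed
    observed:  True if the event happened at that time, False if censored
    Returns [(t, S(t))] for every time t at which an event was observed.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        ended = [o for d, o in zip(durations, observed) if d == t]
        events = sum(ended)
        if events:
            survival *= 1 - events / at_risk
            curve.append((t, survival))
        at_risk -= len(ended)  # events and censored items both leave the risk set
    return curve


# Made-up data: hours until a cached rate was seen to change (the "death").
# False means the entry was evicted while still valid, so it is censored.
hours = [1, 2, 2, 3, 4, 5]
changed = [True, True, False, True, False, True]

curve = kaplan_meier(hours, changed)
for t, s in curve:
    print(t, round(s, 3))

# Assumed policy: the longest TTL that keeps at least 60% of entries valid.
ttl = max(t for t, s in curve if s >= 0.6)
print(ttl)  # 2
```
</pre>
<p>Censored entries still count towards the number at risk until they drop out, which is what distinguishes this estimate from simply averaging the observed change times.</p>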
<h2>Debate &#8211; who should be the first hire in a data science team &#8211; a domain expert or machine learning expert?</h2>
<p>O&#8217;Reilly has provided a recording of this debate for free &#8211; <a href="http://vplayer.oreilly.com/?chapter=http://atom.oreilly.com/atom/oreilly/videos/1076046&amp;video_product=urn:x-domain:oreilly.com:product:0636920025467.VIDEO#embedded_player">click here to watch it</a>.</p>
<p>Mike Driscoll (<a href="http://www.twitter.com/medriscoll">@medriscoll</a>), moderator of this debate, summarized it nicely <a href="http://medriscoll.com/post/18784448854/the-data-science-debate-domain-expertise-or-machine">in his blog</a>: &#8220;Thus who you decide to hire as your first data scientist — a domain expert or a machine learner — might be as simple as this: could you currently prepare your data for a Kaggle competition?  If so, then hire a machine learner.  If not, hire a data scientist who has the domain expertise and the data hacking skills to get you there.&#8221;</p>
<h1>== Day 2 ==</h1>
<h2>Edd Dumbill / O&#8217;Reilly Media, Alistair Croll / Bitcurrent</h2>
<p>Data science is growing. This year latecomers like MS are joining in. Data visualization is growing in importance.<br />
20 years ago computing became available to all businesses &#8211; we built an exoskeleton for the business.<br />
Now data is becoming the nervous system of businesses.</p>
<h2>Doug Cutting / Cloudera &#8211; The Apache Hadoop Ecosystem</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/Ttu3ZQ58ovo?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- we&#8217;re now at peta-scale (k, M, G, T, P, E, Z, Y)<br />
- big data uses commodity hardware, but scales better<br />
- big data: distributed, raw data (vs. schema), open source (vs. proprietary)<br />
- Hadoop became de-facto industry standard, it is the kernel<br />
- there are a lot of projects around the kernel (like nobody uses Linux kernel alone)<br />
- most of the tools are open source at apache</p>
<h2>Dave Campbell / Microsoft &#8211; Do We Have The Tools We Need To Navigate The New World Of Data?</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/3l2OEJQYWX8?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- Microsoft announced Hadoop support last September<br />
- signal -&gt; data -&gt; information -&gt; knowledge -&gt; insights &amp; actions<br />
- Haiti earthquake &#8211; how ML was used to build a statistical translation engine for Haitian Creole &#8211; English translations<br />
- search &amp; acquire -&gt; explore &amp; analyze -&gt; explain &amp; share<br />
- new agenda for the tooling: data visualisation should be as easy as creating a PowerPoint presentation</p>
<h2>Abhishek Mehta / Tresata &#8211; Decoding the Great American ZIP myth</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/Q-WM32zbb78?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- Levitt (creator of suburbia &#8211; identical houses), Ford &#8211; homogeneity<br />
- we are a data-rich but information-poor society<br />
- tools for solving data problems are finally here &#8211; the financial system and all of the industries that are in catharsis can be rebuilt</p>
<h2>Mike Olson / Cloudera CEO &#8211; Guns, Drugs and Oil: Attacking Big Problems with Big Data</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/SYZcCdz_OXU?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- Drugs: genome analysis using Hadoop<br />
- Guns: Santa Cruz &#8211; predictive policing<br />
- Oil: subsurface structure mapping<br />
- importance is in applying technology to solving big social problems</p>
<h2>Flavio Villanustre / LexisNexis Risk Solutions and HPCC Systems &#8211; Machine Learning and Big Data: Sustainable Value or Hype?</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/O3L59lTgdEs?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- company with 15 years of big data experience<br />
- released HPCC as open source last year<br />
- ML is establishing itself as an important discipline; companies that don&#8217;t leverage the data they have will lose to competitors</p>
<h2>Steve Schoettler / co-founder of Zynga, Junyo &#8211; Learning Analytics: What Could You Do With Five Orders of Magnitude More Data About Learning?</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/eyogpjCEsjc?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- Junyo &#8211; learning analytics<br />
- US government takes two years to collect and process school scores nationwide (now 2009 is the last data point available)<br />
- without analysis there is no way to improve<br />
- the most important factor affecting student achievement is feedback<br />
- it&#8217;s possible to double student achievement &#8211; instruction needs to be tailored to students&#8217; needs<br />
- technology needs to be used for that, because that&#8217;s a lot of work</p>
<h2>Avinash Kaushik / Market Motive &#8211; A Big Data Imperative: Driving Big Action</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/CrSX97elHDA?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- information is powerful. But it is how we use it that will define us.<br />
- bring people in the organisation closer to the data, let them run reports<br />
- known knowns | known unknowns | unknown unknowns &#8211; Donald Rumsfeld <img src='http://www.ivankuznetsov.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /><br />
- tops (top-10, bottom-10) do not give you anything &#8211; the distribution is Gaussian, and the information is in the middle<br />
- GA is data puking; highlighting differences between predictions and real data is important<br />
- Occam&#8217;s Razor blog</p>
<h2>Ben Goldacre / Bad Science &#8211; The Information Architecture of Medicine is Broken</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/AK_EUKJyusg?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- it&#8217;s possible to see if any small negative studies are missing<br />
- Tamiflu story: no data is given by Roche<br />
- Reboxetine: 76% of all the trials were withheld<br />
- research fraud<br />
- it&#8217;s possible to find worst offenders by matching data from various sources<br />
- alltrials.org</p>
<h2>Eddie Satterly / Expedia, Sanjay Mehta / Splunk &#8211; Turning Big Data Into Competitive Advantage</h2>
<p>- Big Data vs. Barack Obama search popularity &#8211; data wins<br />
- this in itself is a big data problem<br />
- Splunk: collects data in real time from various sources reliably, indexes it and makes it available for search, analysis, visualisation and pattern recognition<br />
- Expedia: 4,000 people in R&amp;D &#8211; 8,000 total<br />
- Expedia rolled out Splunk in 3 months for all machines that are logging and everything else &#8211; call logs on voip, tickets, changes, scripts<br />
- Data ~6Tb/day<br />
- Expedia deploy cycle &#8211; twice a week<br />
- Expedia was not prepared to give out all BI data to SaaS solution &#8211; one of the reasons for Splunk</p>
<h2>Peter Skomoroch / LinkedIn &#8211; Street Fighting Data Science</h2>
<p>- act as a street fighter: analyze, improvise, anticipate, adapt<br />
- how to approach a new problem as a new grad: predict sales and recommend price for new shoes model<br />
- go and see academic papers, what other people have done, create a model how to drop the price over time<br />
- result: pricing model decreases sales by 30% &#8211; what has gone wrong<br />
- ran complex black box model<br />
- didn&#8217;t analyze the data first<br />
- didn&#8217;t anticipate elasticity error<br />
- talk at Velocity 2011 by John Rauser from Amazon: &#8220;Look at your data&#8221;: http://www.youtube.com/watch?v=coNDCIMH8bk<br />
- look at your data!<br />
- looking at the top end of anything doesn&#8217;t really tell you much!<br />
- book: street fighting mathematics<br />
- red teaming: plan for worst case scenarios, to avoid situations like the one described in the beginning of the talk<br />
- blog: http://dataist.wordpress.com/</p>
<h2>Sam Shah / LinkedIn &#8211; Collaborative Filtering using MapReduce</h2>
<p>Collaborative filtering techniques:<br />
- latent factor models<br />
- neighbourhood based</p>
<p>algorithmic challenges:<br />
- performance<br />
- desparsification</p>
<p>Themes:<br />
- reliance<br />
- user experience<br />
- productization</p>
<p>Implementation: co-occurrence graph<br />
1. Map-reduce implementation: time-viewer-viewee<br />
2. Index: alice, bob, 1 (co-occurrence)</p>
<p>generating pairs might be expensive &#8211; limit by time<br />
first version performs at 60% of optimal solution</p>
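<p>A minimal single-process sketch of the co-occurrence counting described above, in Ruby. The record layout, the sample names and the method names are my own assumptions for illustration, not LinkedIn&#8217;s actual job; a real version would run these steps as map-reduce jobs over view logs.</p>

```ruby
# Hypothetical input: one [time_bucket, viewer, viewee] record per profile view.
views = [
  [1, "alice", "bob"],
  [1, "alice", "carol"],
  [1, "dave",  "bob"],
  [1, "dave",  "carol"],
  [2, "erin",  "bob"],
]

# "Map" step: group viewees by (time bucket, viewer), so pairs are only
# generated inside one time window. This is the "limit by time" idea.
def viewees_by_session(views)
  sessions = Hash.new { |hash, key| hash[key] = [] }
  views.each { |time, viewer, viewee| sessions[[time, viewer]].push(viewee) }
  sessions
end

# "Reduce" step: count how often each unordered pair of profiles
# was viewed together in the same session.
def co_occurrences(views)
  counts = Hash.new(0)
  viewees_by_session(views).each_value do |viewees|
    viewees.uniq.sort.combination(2) { |pair| counts[pair] += 1 }
  end
  counts
end

counts = co_occurrences(views)
# counts == { ["bob", "carol"] => 2 }
```

<p>Calling uniq before generating pairs is one simple way to discount repeated views of the same profile within a session.</p>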
<p>improvements:<br />
- check your outputs (Obama problem)<br />
- check your inputs (repeated profile views probably should be discounted)<br />
- drill-down: if people you might know shows a lot of &#8220;john smiths&#8221; (because of failed searches) then CTR will be very low</p>
<p>presentation matters:<br />
- adding avatars to people you might know: 50% to CTR</p>
<p>putting it to production:</p>
<p>- pairs-&gt;scoring-&gt;top-n-&gt;push production<br />
- azkaban &#8211; for managing pig jobs<br />
- oozie -<br />
- voldemort &#8211; working with batch-computed data (Serving large scale batch computed data with project voldemort. In Fast 2012) https://www.usenix.org/conference/fast12/serving-large-scale-batch-computed-data-project-voldemort</p>
<h2>Philip Kromer / Infochimps &#8211; Disambiguation: Embrace wrong answers &amp; find truth</h2>
<p>- human entered geolocations<br />
- instead of correcting those, they can be used to identify strong links</p>
<h2>Xavier Amatriain / Netflix &#8211; Netflix recommendations: beyond the 5 stars</h2>
<p>(update 2012-04-07: now also available as blog post <a href="http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html">http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html</a>)</p>
<p>- Netflix prize: 10% improvement in predicted rating<br />
- Netflix went from dvd by mail company to a global streaming company that operates not only on the web, but on many other devices<br />
- make user aware the results provided are personalized<br />
- diversity in the top-10 recommendations (for you, for wife, for kids)<br />
- social support &#8211; not only recommendation based on personal preferences, but how many friends have liked it<br />
- note: in the US there&#8217;s a legal problem with showing friends&#8217; likes/favourites<br />
- ranking = scoring + sorting + filtering<br />
- models used by Netflix: logistic/linear regression, elastic nets, SVD, Boltzmann machines, gradient boosted decision trees, &#8230;<br />
- recommending new movies: even a few more ratings beat more data<br />
- Norvig&#8217;s statement that more data beats better algorithms is valid for language processing, but Netflix indicates that more training examples do not help if there are just a few relevant features</p>
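<p>The &#8220;ranking = scoring + sorting + filtering&#8221; formula from the notes above can be sketched in a few lines of Ruby. The struct fields, the weights and the sample titles are invented for illustration; this is not how Netflix actually scores anything.</p>

```ruby
# Hypothetical candidate record; all field names are assumptions.
Candidate = Struct.new(:title, :predicted_rating, :popularity, :already_seen)

# Scoring: a made-up linear blend. The weights are placeholders.
def score(candidate)
  0.8 * candidate.predicted_rating + 0.2 * candidate.popularity
end

# Filtering drops titles the user has already seen, sorting puts the
# highest score first, and only the top `limit` titles are returned.
def rank(candidates, limit)
  candidates
    .reject { |c| c.already_seen }
    .sort_by { |c| -score(c) }
    .first(limit)
    .map { |c| c.title }
end

catalogue = [
  Candidate.new("Film A", 4.5, 3.0, false),
  Candidate.new("Film B", 4.8, 1.0, false),
  Candidate.new("Film C", 5.0, 5.0, true),
  Candidate.new("Film D", 3.0, 5.0, false),
]

top = rank(catalogue, 2)
# top == ["Film A", "Film B"]
```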
<h1>== Day 3 ==</h1>
<h2>Jonathan Gosier / metaLayer Inc. &#8211; Democratization of Data Platforms</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/yqfqv_LAIYw?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- art used to live in the domain of specialists &#8211; paintings were commissioned by institutions<br />
- until recently math and science were also in the domain of specialists<br />
- data science now is in a similar position &#8211; it is for experts only<br />
- there need to be tools available for general public<br />
- examples: subprime mortgage market &#8211; how to understand who owns which loan, automatic trading systems &#8211; who understands how that works<br />
- ushahidi &#8211; open source distress signal collection<br />
- there should be an equivalent of a calculator for unstructured data<br />
- &#8220;&#8230;take the best and spread it around to everybody&#8230;&#8221; &#8211; Steve Jobs<br />
- metaLayer offers drag and drop insights</p>
<h2>Luke Lonergan / Greenplum, a division of EMC &#8211; 5 Big Questions about Big Data</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/m8Mu_zH7pyw?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<h2>Coco Krumme / MIT Media Lab &#8211; The Trouble with Taste</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/E1dM3IqIQCw?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- data analysis of the wine-tasting evening</p>
<h2>Pete Warden / Jetpac &#8211; Embrace the Chaos</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/tYsZFybelDo?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- building a business on somebody else&#8217;s data is very dangerous<br />
- Google had to deal with a chaos of billions of web pages<br />
- take support emails or some other seemingly chaotic data and try to extract value from it</p>
<h2>Usman Haque / Pachube.com &#8211; Open Data and the Internet of Things</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/oO3m7r_p2sc?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- general infrastructure for collecting data from various sensors<br />
- Internet of things bill of rights<br />
- people own the data they create</p>
<h2>Gary Lang / MarkLogic &#8211; Big Data’s Next Step: Applications</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/FSVeSbheH2E?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<h2>Hal Varian / Google &#8211; Using Google Data for Short-term Economic Forecasting</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/-I8acYHQ0v0?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- Google time-series for searches: vodka searches peak every Saturday and Dec 31st is a spike, hangover searches peak every Sunday and Jan 1st is a spike<br />
- initial claims vs unemployment level using government data<br />
- peaks in the end of every recession<br />
- challenges: simple correlations don&#8217;t work (like with unemployment)<br />
- human judgement doesn&#8217;t scale<br />
- baseball and auto sales both peak in summer, but just because of seasonality, not causality<br />
- Google Research blog<br />
- fat regression<br />
- incremental predictability<br />
- during recession coupon-related queries are growing<br />
- prediction for UM consumer sentiment</p>
<h2>Theo Schlossnagle / OmniTI &#8211; Is this normal? Finding anomalies in real-time data</h2>
<p>- author of &#8220;Scalable Internet Architectures&#8221;<br />
- hard real-time systems: outputs are considered incorrect if the latency of their delivery is above a specified amount<br />
- soft real-time systems: similar, but less useful instead of incorrect<br />
- Big data systems:<br />
- traditional: oracle, postgres, mysql<br />
- the shiny: hadoop, hive, hbase, pig, cassandra<br />
- the real time: sql stream, s4, flumebase, truviso, esper, storm<br />
- 300k datum/sec<br />
- real-time queries done with Esper<br />
- Holt-Winters &#8211; tri-variate seasonal regression<br />
- look at historic data<br />
- use that to predict the immediate future with some quantifiable confidence<br />
- implemented the Snowth for storage of data<br />
- implemented a C/lua distributed system to analyze 4 weeks of data<br />
- to keep the system real-time &#8211; need to ensure that queries return in less than 2ms<br />
- how to transform batch processing, offline analytics to online real-time analytics without crazy brainpower</p>
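<p>Holt-Winters is usually described as triple exponential smoothing: it tracks a level, a trend and a seasonal component. Below is a rough Ruby sketch of the additive variant, used the way the notes suggest: look at history, forecast the next point, and flag values that stray too far from the forecast. The method names, the smoothing parameters and the tolerance are my assumptions, and the series must contain at least two full seasons. This is not the implementation from the talk.</p>

```ruby
# Additive Holt-Winters: returns a one-step-ahead forecast for every
# observation after the first season. Assumes at least two full seasons.
def holt_winters_forecasts(series, season_length, alpha, beta, gamma)
  first  = series.first(season_length)
  second = series[season_length, season_length]
  level  = first.sum / season_length.to_f
  trend  = (second.sum - first.sum) / (season_length.to_f ** 2)
  seasonal = first.map { |value| value - level }

  forecasts = []
  series.each_with_index.drop(season_length).each do |value, index|
    slot = index % season_length
    forecasts.push(level + trend + seasonal[slot])

    # Update the three components with the newly observed value.
    previous_level = level
    level = alpha * (value - seasonal[slot]) + (1 - alpha) * (level + trend)
    trend = beta * (level - previous_level) + (1 - beta) * trend
    seasonal[slot] = gamma * (value - level) + (1 - gamma) * seasonal[slot]
  end
  forecasts
end

# Flag an observation when it strays too far from its forecast.
def anomaly?(observed, forecast, tolerance)
  (observed - forecast).abs > tolerance
end

history   = [10, 20, 30, 10, 20, 30, 10, 20, 30]
forecasts = holt_winters_forecasts(history, 3, 0.5, 0.5, 0.5)
```

<p>On this perfectly periodic history the forecasts match the observed values, so a call like anomaly?(25, forecasts.first, 5) would flag a reading of 25 where 10 was expected.</p>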
<h2>Jeremy Howard / Kaggle &#8211; From Predictive Modelling to Optimization: The Next Frontier</h2>
<p><iframe width="500" height="281" src="http://www.youtube.com/embed/vYrWTDxoeGg?feature=oembed" frameborder="0" allowfullscreen></iframe></p>
<p>- predicting something doesn&#8217;t mean that you&#8217;ve made an impact on human lives<br />
- 1. figure out the objective<br />
- 2. figure out the levers to pull &#8211; what can we change<br />
- 3. figure out what data we have<br />
- 4. <img src='http://www.ivankuznetsov.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  write a PhD on PageRank<br />
- working in insurance &#8211; objective is to make shit loads of money on each individual client<br />
- to get necessary data &#8211; insurance companies had to start charging random prices<br />
- to convince them to do that, a simulation was run showing what happens if that is done vs. not done<br />
- how to apply this to marketing?<br />
- objective: maximise lifetime value of a customer<br />
- levers: recommendations, offers, discounts, care calls, etc.<br />
- what data can we collect?<br />
- Amazon recommends Pratchett books to those who buy a lot of them. that&#8217;s not helpful<br />
- to improve it &#8211; start offering random stuff to collect the data that can later be mined</p>
<h2>Alyona Medelyan / Pingar, Anna Divoli / Pingar &#8211; Mining Unstructured Data: Practical Applications</h2>
<p>- IDC: 17h/wk searching information, 14h/wk writing emails<br />
- 17h/wk = $37,000/year<br />
- pingar provides metadata extraction tools</p>
<h2>Alasdair Allan / University of Exeter &#8211; Migratory data: the distributed data you carry with you</h2>
<p>- calendar, phonebook, camera phone pictures, bookmarks, text messages &#8211; all that is carried with you<br />
- photos are geotagged &#8211; and there&#8217;s a lot of other metadata collected<br />
- iOS spotlight index contains a cache of your smses and contacts, even deleted ones<br />
- web cache, sms draft cache &#8211; are all metadata<br />
- facetime calls cache on iOS<br />
- google search cache is not wiped when you clear your browser cache on iOS<br />
- gps cache on iOS is kept for a year in iOS4, in iOS5 caching time is decreased<br />
- history of all maps searches on iOS is cached<br />
- keyboard cache is stored as well &#8211; text messages can be reconstructed from it<br />
- cache of last visible screen when switching between applications<br />
- mitmproxy &#8211; an SSL-capable man-in-the-middle proxy<br />
- siri proxy &#8211; proxy to listen in to siri traffic<br />
- angry birds and cut the rope are offering &#8220;look at my local community&#8221; &#8211; that uploads your phonebook to the internet<br />
- data exhaust: broadcasting your location, rfid cards like oyster, tesco purchases tracking<br />
- who owns that data? consumer or companies?<br />
- bit.ly/FindingYourFriends &#8211; http://www.cs.rochester.edu/~sadilek/publications/Sadilek-Kautz-Bigham_Finding-Your-Friends-and-Following-Them-to-Where-You-Are_WSDM-12.pdf<br />
- after you stop sharing your location, it&#8217;s possible to pinpoint you to within 100m with 80% accuracy by locations of your friends<br />
- how to prevent CC fraud &#8211; compare postcode of the latest credit card transaction and iPhone location<br />
- mapping receipts coming to email with online transaction</p>
<h2>Robbie Allen / Automated Insights, Inc. &#8211; From Big Data to Big Insights</h2>
<p>- big data by itself is not helpful<br />
- visualization is one possible answer, but one problem here is that it still requires interpretation<br />
- automatic writing by software &#8211; it can process more data and can be continuously improved<br />
- examples: automatic stock market reports, insider trading reports<br />
- parked domains: generating relevant content for those pages from a list of keywords</p>
<h2>Cheryl Phillips / The Seattle Times &#8211; Exploring the Stories Behind the Data</h2>
<p>- philip meyer<br />
- seattle times visualisations of data brought to the users &#8211; illegal forest logging, casualties of war<br />
- The Nutgraph: important and visual information that attracts attention and makes people explore data further<br />
- never put more than one number into a paragraph of text<br />
- don&#8217;t overwhelm the reader with detail that they don&#8217;t need</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivankuznetsov.com/2012/03/informal-notes-from-strata-2012-conference-on-big-data-and-data-science.html/feed</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Book review: The start-up of you</title>
		<link>http://www.ivankuznetsov.com/2012/03/book-review-the-start-up-of-you.html</link>
		<comments>http://www.ivankuznetsov.com/2012/03/book-review-the-start-up-of-you.html#comments</comments>
		<pubDate>Fri, 30 Mar 2012 21:21:42 +0000</pubDate>
		<dc:creator>Ivan Kuznetsov</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.ivankuznetsov.com/?p=376</guid>
		<description><![CDATA[I bought this book because it was written by Reid Hoffman, co-founder of LinkedIn, and because it&#8217;s so damn easy to buy books on Kindle. It was a quick read, and I should say I&#8217;m a bit disappointed. If you want to save time and money &#8211; go to the book&#8217;s web page and you&#8217;ll get [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.ivankuznetsov.com/wp-content/uploads/the_startup_of_you.jpg"><img class="alignleft size-thumbnail wp-image-377" title="the_startup_of_you" src="http://www.ivankuznetsov.com/wp-content/uploads/the_startup_of_you-150x150.jpg" alt="" width="150" height="150" /></a>I bought this book because it was written by Reid Hoffman, co-founder of LinkedIn, and because it&#8217;s so damn easy to buy books on Kindle.</p>
<p>It was a quick read, and I should say I&#8217;m a bit disappointed. If you want to save time and money &#8211; go to the <a href="http://www.thestartupofyou.com/about-the-book/" target="_blank">book&#8217;s web page</a> and you&#8217;ll get all the main ideas that are described in the full version.</p>
<p>Yes, the world is changing very fast. You don&#8217;t want to become a Detroit of the modern age. You should not expect lifetime employment at one company. Go learn new stuff, go meet people to find interesting opportunities. I was expecting a more insightful book with less direct LinkedIn service promotion, but in the end got a self-help, very Silicon Valley-centric guide on building a network using LinkedIn.</p>
<p>Unfortunately I cannot recommend this book, unless it&#8217;s news for you that you need to invest in continuous self-education and network building to succeed.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivankuznetsov.com/2012/03/book-review-the-start-up-of-you.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Book review: The Information Diet: A Case for Conscious Consumption</title>
		<link>http://www.ivankuznetsov.com/2012/03/book-review-the-information-diet-a-case-for-conscious-consumption.html</link>
		<comments>http://www.ivankuznetsov.com/2012/03/book-review-the-information-diet-a-case-for-conscious-consumption.html#comments</comments>
		<pubDate>Fri, 30 Mar 2012 21:02:01 +0000</pubDate>
		<dc:creator>Ivan Kuznetsov</dc:creator>
				<category><![CDATA[Books]]></category>

		<guid isPermaLink="false">http://www.ivankuznetsov.com/?p=372</guid>
		<description><![CDATA[I read &#8220;Information Diet&#8221; by Clay Johnson last Christmas. Central ideas of the book: - information is like food &#8211; bad consumption habits are bad for your health - it&#8217;s too easy to get yourself into information bubble: &#8220;When we tell ourselves, and listen to, only what we want to hear, we can end up [...]]]></description>
				<content:encoded><![CDATA[<p><a href="http://www.ivankuznetsov.com/wp-content/uploads/51fKFOtBR9L._BO2204203200_PIsitb-sticker-arrow-clickTopRight35-76_AA300_SH20_OU02_.jpg"><img class="alignleft size-thumbnail wp-image-373" title="51fKFOtBR9L._BO2,204,203,200_PIsitb-sticker-arrow-click,TopRight,35,-76_AA300_SH20_OU02_" src="http://www.ivankuznetsov.com/wp-content/uploads/51fKFOtBR9L._BO2204203200_PIsitb-sticker-arrow-clickTopRight35-76_AA300_SH20_OU02_-150x150.jpg" alt="" width="150" height="150" /></a>I read &#8220;<a href="http://www.informationdiet.com">Information Diet</a>&#8221; by Clay Johnson last Christmas. Central ideas of the book:<br />
- information is like food &#8211; bad consumption habits are bad for your health<br />
- it&#8217;s too easy to get yourself into an information bubble: &#8220;When we tell ourselves, and listen to, only what we want to hear, we can end up so far from reality that we start making poor decisions&#8221;<br />
- fight your addictions: disable all notifications on your computer and check email only once in a while, not every 5 minutes to improve your attention span<br />
- media serves what people want to consume, if you want to get an objective picture &#8211; learn to go down to the facts instead of relying on pre-processed information</p>
<p>I&#8217;m conflicted about recommending this book. On one hand &#8211; it&#8217;s 4 hours of your time that could be better spent. On the other hand &#8211; it made me reconsider my own information diet and I see how I can now do more in a day because of that. So next time when you want to open IM, GMail, Twitter or Facebook, consider reading this book instead.</p>
<p>Update: after months of trying to follow the recommendations of this book &#8211; I have noticed that I have significantly improved my productivity. The topic of responsible information consumption also came up many times in conversations I had with many smart people in the past few months. There&#8217;s definitely a tectonic shift going on when it comes to information consumption habits.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivankuznetsov.com/2012/03/book-review-the-information-diet-a-case-for-conscious-consumption.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>REE segfaults when Rails application has too many localisation files</title>
		<link>http://www.ivankuznetsov.com/2011/07/ree-segfaults-when-rails-application-has-too-many-localisation-files.html</link>
		<comments>http://www.ivankuznetsov.com/2011/07/ree-segfaults-when-rails-application-has-too-many-localisation-files.html#comments</comments>
		<pubDate>Fri, 01 Jul 2011 22:03:56 +0000</pubDate>
		<dc:creator>Ivan Kuznetsov</dc:creator>
				<category><![CDATA[Ruby on Rails]]></category>
		<category><![CDATA[Software Development]]></category>

		<guid isPermaLink="false">http://www.ivankuznetsov.com/?p=349</guid>
		<description><![CDATA[We ran into an interesting problem &#8211; at some point of time our Rails application started to fail occasionally because of REE segfaults on startup. Even starting the console with &#8216;script/console production&#8217; was occasionally failing with REE segfault. Application was growing, new features were added and segfaults started happening more and more often. There was [...]]]></description>
				<content:encoded><![CDATA[<p>We ran into an interesting problem &#8211; at some point our Rails application started to fail occasionally because of REE segfaults on startup. Even starting the console with &#8216;script/console production&#8217; occasionally failed with an REE segfault. The application was growing, new features were added and segfaults started happening more and more often. There was no single place where crashes occurred, so there was no clear understanding of how to tackle this problem.</p>
<p>Examples of crashes we observed:</p>
<pre>/vendor/rails/actionpack/lib/action_controller/routing/route.rb:205):2:
   [BUG] Segmentation fault
/opt/ruby-enterprise-1.8.7-2011.03/lib/ruby/1.8/yaml.rb:133: 
   [BUG] Segmentation fault
/vendor/rails/activesupport/lib/active_support/vendor/i18n-0.3.7/i18n/
   backend/base.rb:257: [BUG] Segmentation fault
/vendor/rails/actionpack/lib/action_view/template.rb:226: [BUG] Segmentation fault
/opt/ruby-enterprise-1.8.7-2011.03/lib/ruby/gems/1.8/gems/pauldix-sax-machine-0.0.14/
   lib/sax-machine/sax_document.rb:30: [BUG] Segmentation fault
/vendor/rails/activesupport/lib/active_support/memoizable.rb:32: [BUG] Segmentation fault</pre>
<p>After banging my head against the wall for a week I found a solution (even two) and what seems to be the likely cause of the segfaults. Two &#8220;suspects&#8221; &#8211; lack of available memory and an incorrect version of libxml &#8211; were ruled out. What seems to be the actual cause is the total size of the localisation files in config/locales loaded upon startup:</p>
<pre>$ du -shb config/locales
1665858    config/locales</pre>
<pre>$ cd config/locales
$ find . -type f | wc -l
805</pre>
<p>So ~1.6MB in 805 files gives occasional segfaults. Adding another 200KB of localisation files made script/console segfault on every startup.</p>
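<p>If you&#8217;d rather track these two numbers from Ruby (for example in a rake task), something like the sketch below should do. The method name is made up for illustration, and the byte total may differ slightly from du, which also counts directory entries.</p>

```ruby
# Returns [file_count, total_bytes] for all regular files under a directory.
# Hypothetical helper, not part of Rails or REE.
def locale_footprint(dir)
  files = Dir.glob(File.join(dir, "**", "*")).select { |path| File.file?(path) }
  [files.size, files.sum { |path| File.size(path) }]
end

count, bytes = locale_footprint("config/locales")
puts "#{count} files, #{bytes} bytes"
```

<p>Run from the application root, this prints the file count and the total size of config/locales; a missing directory simply reports zero files.</p>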
<p>Now I&#8217;ve found two workarounds for this problem.</p>
<p>1. Recompile REE with the --no-tcmalloc flag</p>
<pre>./ruby-enterprise-1.8.7-2011.03/installer --no-tcmalloc</pre>
<p>Note that on 64-bit platforms tcmalloc is disabled by default.</p>
<p>2. Enable large pages feature in tcmalloc</p>
<p>This is described in <a href="http://google-perftools.googlecode.com/svn/tags/perftools-1.6/INSTALL" target="_blank">tcmalloc documentation</a> as: &#8220;Internally, tcmalloc divides its memory into &#8220;pages.&#8221;  The default page size is chosen to minimize memory use by reducing fragmentation. The cost is that keeping track of these pages can cost tcmalloc time. We&#8217;ve added a new, experimental flag to tcmalloc that enables a larger page size.  In general, this will increase the memory needs of applications using tcmalloc.  However, in many cases it will speed up the applications as well, particularly if they allocate and free a lot of memory.  We&#8217;ve seen average speedups of 3-5% on Google applications.&#8221;</p>
<p>There&#8217;s a warning &#8211; &#8220;this feature is still very experimental&#8221;, but it works to solve the problem with too many localisation files.</p>
<p>To compile REE with tcmalloc large pages enabled I just edited ruby-enterprise-1.8.7-2011.03/source/distro/google-perftools-1.7/src/common.h &#8211; replaced</p>
<pre>#if defined(TCMALLOC_LARGE_PAGES)
static const size_t kPageShift  = 15;
static const size_t kNumClasses = 95;
static const size_t kMaxThreadCacheSize = 4 &lt;&lt; 20;
#else
static const size_t kPageShift  = 12;
static const size_t kNumClasses = 61;
static const size_t kMaxThreadCacheSize = 2 &lt;&lt; 20;
#endif</pre>
<p>with</p>
<pre>static const size_t kPageShift  = 15;
static const size_t kNumClasses = 95;
static const size_t kMaxThreadCacheSize = 4 &lt;&lt; 20;</pre>
<p>On production servers I opted for no tcmalloc for now &#8211; but I hope there&#8217;ll be a <a href="http://code.google.com/p/rubyenterpriseedition/issues/detail?id=73" target="_blank">better way to deal with this issue</a> soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivankuznetsov.com/2011/07/ree-segfaults-when-rails-application-has-too-many-localisation-files.html/feed</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Pitfalls of Rails fragment caching with memcached</title>
		<link>http://www.ivankuznetsov.com/2011/06/pitfalls-of-rails-fragment-caching-with-memcached.html</link>
		<comments>http://www.ivankuznetsov.com/2011/06/pitfalls-of-rails-fragment-caching-with-memcached.html#comments</comments>
		<pubDate>Wed, 29 Jun 2011 15:52:36 +0000</pubDate>
		<dc:creator>Ivan Kuznetsov</dc:creator>
				<category><![CDATA[Ruby on Rails]]></category>
		<category><![CDATA[Software Development]]></category>

		<guid isPermaLink="false">http://www.ivankuznetsov.com/?p=319</guid>
		<description><![CDATA[Fragment caching is a powerful technique for improving performance of your web application. Rails site describes in detail how to apply this technique. Rails are providing developers with really excellent abstractions, but it&#8217;s always good to know what&#8217;s under the hood and how it all works. There are a few things that might potentially cause [...]]]></description>
				<content:encoded><![CDATA[<p><img class="alignleft size-full wp-image-356" title="300px-Ruby_on_Rails_logo" src="http://www.ivankuznetsov.com/wp-content/uploads/300px-Ruby_on_Rails_logo.jpg" alt="" width="150" />Fragment caching is a powerful technique for improving performance of your web application. Rails site <a href="http://guides.rubyonrails.org/caching_with_rails.html#fragment-caching" target="_blank">describes in detail</a> how to apply this technique.</p>
<p>Rails provides developers with really excellent abstractions, but it&#8217;s always good to know what&#8217;s under the hood and how it all works.</p>
<p>There are a few things that might potentially cause bugs in your code, or waste your time (speaking from my own experience). So here goes:</p>
<p>1. Beware of globally keyed fragments</p>
<p>Let&#8217;s take an example from the Rails guide:</p>
<pre>&lt;% cache do %&gt;
  All available products:
  &lt;% Product.all.each do |p| %&gt;
    &lt;%= link_to p.name, product_url(p) %&gt;
  &lt;% end %&gt;
&lt;% end %&gt;</pre>
<p>Now if you need to deal with a multi-language site you might want to make the cache fragment language-dependent. What might seem a convenient solution:</p>
<pre>&lt;%- cache([user.locale.to_s]) do -%&gt;</pre>
<p>will turn into a source of very interesting problems. While calling the cache method without parameters automatically creates a controller/action-specific cache key, calling it with a key makes this fragment a globally keyed fragment. The cache key in the first case is going to look like &#8220;views/localhost:3000/controller-name&#8221;, and in the other case &#8220;views/en&#8221; &#8211; this is no longer a unique identifier.</p>
<p>While automatic cache key naming provided by rails is very convenient, it is very easy to run into a problem with duplicate cache key names used in different places.</p>
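<p>One way around this is to keep the controller and action in the key yourself. Here&#8217;s a minimal sketch; scoped_fragment_key is a hypothetical helper of mine, not part of Rails, and I&#8217;m assuming an array key gets joined with slashes under the &#8220;views/&#8221; prefix in the same way the single-element array above became &#8220;views/en&#8221;.</p>

```ruby
# Hypothetical helper: builds fragment key parts that stay unique per
# controller, action, optional fragment name and locale.
def scoped_fragment_key(controller, action, locale, fragment = nil)
  [controller, action, fragment, locale].compact.map { |part| part.to_s }
end

products_key = scoped_fragment_key("products", "index", :en, "all_products")
orders_key   = scoped_fragment_key("orders", "index", :en)
# products_key == ["products", "index", "all_products", "en"]
# orders_key   == ["orders", "index", "en"]
```

<p>In a view this could be called as cache(scoped_fragment_key(controller.controller_name, controller.action_name, user.locale)), so two different pages rendered in the same language no longer share one fragment.</p>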
<p>2. Another pitfall of automatic cache key naming is that you should never assume that after creating a cache with a global key you can later find it under that exact name using e.g. the <a href="http://lzone.de/articles/memcached.htm" target="_blank">telnet interface</a> to memcached. Example &#8211; add</p>
<pre>&lt;%- cache('unique_cache_key') do -%&gt;
&lt;%- end -%&gt;</pre>
<p>in your view and then try to read directly from memcache:</p>
<pre>$ telnet localhost 11211
get unique_cache_key
END</pre>
<p>At the same time</p>
<pre>get views/unique_cache_key</pre>
<p>will work, because Rails stores fragments under the &#8220;views/&#8221; prefix. It&#8217;s easy to make this mistake when trying to check or delete cache keys directly in memcached while using Rails cache methods.</p>
<p>3. delete_matched is not supported by memcached (see rails/activesupport/lib/active_support/cache/mem_cache_store.rb)</p>
<p>In practice that means that if you&#8217;re using memcached as Rails cache engine and trying to delete or expire fragment cache using standard Rails methods and regexp &#8211; you&#8217;ll fail.</p>
<pre>expire_fragment(/base\/xyz.*/)</pre>
<p>will fail miserably. The ideal solution is not to use explicit cache expiration, but rather to create cache keys in a way that doesn&#8217;t require expiration. Alternatively it&#8217;s possible to use <a href="https://github.com/jkassemi/memcache-store-extensions" target="_blank">extensions</a> implementing delete_matched for memcached (haven&#8217;t tried it myself though).</p>
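<p>To illustrate keys that don&#8217;t need expiration: put something that changes with the data, such as the record&#8217;s updated_at timestamp, into the key. The helper below is a hypothetical sketch of that idea in plain Ruby, not a Rails API (ActiveRecord&#8217;s own cache_key method follows a similar approach).</p>

```ruby
# Hypothetical helper: the key changes whenever the record changes, so a
# stale fragment is simply never read again and memcached evicts it later.
def versioned_fragment_key(name, id, updated_at)
  [name, id, updated_at.utc.strftime("%Y%m%d%H%M%S")].join("/")
end

before_edit = versioned_fragment_key("product", 42, Time.utc(2011, 6, 29, 15, 52, 36))
after_edit  = versioned_fragment_key("product", 42, Time.utc(2011, 6, 29, 16, 0, 0))
# before_edit == "product/42/20110629155236"
# after_edit  == "product/42/20110629160000"
```

<p>In a view that would be something like cache(versioned_fragment_key("product", product.id, product.updated_at)). The trade-off is that old fragments stay in memcached until they get evicted, which is usually fine.</p>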
<p><em>Tip: one very useful tool for checking memcached is <a href="https://github.com/fauna/peep" target="_blank">peep</a> by Evan Weaver &#8211; allows you to peek into the cache and see what&#8217;s really cached and how it is used.</em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.ivankuznetsov.com/2011/06/pitfalls-of-rails-fragment-caching-with-memcached.html/feed</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
