It’s been almost a month since I came back from California, and I just got around to sorting my notes from the O’Reilly Strata conference. Spending time in the Valley is always inspiring – lots of interesting people, old friends, new contacts, new start-ups – it is the center of the IT universe.
Spending 3 days with people working at the bleeding edge of data science was an unforgettable experience. I got my dose of inspiration and plenty of new ideas for applying data science in HeiaHeia. It’s difficult to overestimate the importance data analysis will have in the coming years. Companies that do not grasp the importance of understanding their data, and of making decisions based on data analysis instead of the gut feeling of board members or operative management, will simply fade away.
Unfortunately, HeiaHeia was the only Finnish company attending the conference. But I’m really happy to see more and more signals recently that companies in Finland are starting to realize the importance of data, and that new Finnish start-ups dealing with data analysis are appearing. I believe Finland has an excellent opportunity to host not only a cluster of game development companies, but also a cluster of big data companies and start-ups. So far it seems that the Valley, London and Australia are leading in this field.
Below are my notes from the conference – I added presentation links and videos where I found them, but otherwise the notes are quite unstructured. There were multiple tracks and it was very difficult to choose between them. Highlights of the conference are the talks by Avinash Kaushik, Jeremy Howard, Matt Biddulph, Ben Goldacre, and Alasdair Allan, and the Oxford-style debate on the proposition “In data science, domain expertise is more important than machine learning skill” (see videos below).
== Day 1 ==
Michael Rys / Microsoft – SQL and NoSQL – two sides of the same coin
Web 2.0 – money is made by “monetizing the social”
- improving individual experience
- re-selling aggregate data
MySpace: 500 database servers (using stored procedures)
FB runs on 1800 MySQL databases (plus Cassandra, HBase, etc.)
Shard/partition data. Global transactions do not scale.
Read cache might help.
Social networking – user status update. Update user’s own db, then service dispatcher asynchronously updates servers for user’s friends.
Eventual consistency – changes take time to propagate.
Propagation needs to be able to deal with failures – retries.
- multiversion schema support
- flexible schema support
Most of the customers end up developing own solutions on top of the existing db servers.
- low cost
- HA, scale out, performance
- data first: flexible schema
MS Azure with federation: split/drop of shards without downtime (no merge supported yet)
Running aggregate queries on federated data using map-reduce approach in SQL:
select sum(scores)/sum(count) from (select scores, count from federated_table …)
- no secondary indices
- eventual consistency is not always an optimal choice
- multiple shards, each one has read-only replicas for OLAP processing
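The map-reduce style aggregation over federated shards described above can be sketched in plain Python – each shard computes a partial (sum, count) and the coordinator combines them. The shard contents below are invented for illustration:

```python
# Map-reduce style federated average: each shard returns a partial
# aggregate, and the coordinator combines them. This mirrors the
# select sum(scores)/sum(count) pattern over a federated table.

def shard_partial(scores):
    # "Map" step, run locally on one shard.
    return (sum(scores), len(scores))

def federated_average(shards):
    # "Reduce" step on the coordinator: combine the partial aggregates.
    partials = [shard_partial(s) for s in shards]
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return total / count

# Three hypothetical shards of a federated table:
shards = [[10, 20], [30], [40, 50, 60]]
avg = federated_average(shards)  # 210 / 6 = 35.0
```

The same decomposition works for any algebraic aggregate (sum, count, mean, min/max); holistic aggregates like medians need more machinery.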
Claudia Perlich – chief scientist at Media6Degrees, From Knowing ‘What’ To Understanding ‘Why’
Does showing an online ad to a person increase the probability of that person going and buying the product?
Collecting the full set of events – from web-browsing patterns used to identify the target, to bidding to show the ad to that target, to monitoring whether that person visits the site.
Clicks are worth nothing – conversion is everything. But raw conversion rates are very deceiving.
Does it make sense to show an ad to people who will buy the product anyway?
The idea is to track how advertising changes the rate of conversion.
E.g. user visited the web site within 5 days after seeing the ad.
Understand the causal process. A/B testing is the way to do it – but it costs a lot of money.
Non-invasive causal estimation.
TMLE – targeted maximum likelihood estimate
Have null tests: showing unrelated ads should have no effect on conversion!
A lot of people don’t want to be told that what they are doing is suboptimal – that hinders the process.
Marketer, ad agency, creative designers – all follow their own best practices.
1. always pull the data yourself – it takes more time to figure out how the data was pulled than to pull it yourself
2. hadoop + hive, perl/python to clean up the data, R
Time-wise: 0.5-1 year to build a model that is correct and believable (an iterative process)
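The ad-lift and null-test ideas above can be illustrated with a minimal sketch; all conversion counts below are invented:

```python
# Conversion lift of an ad, plus a null test with an unrelated ad.
# All counts are invented for illustration.

def conversion_rate(converted, shown):
    return converted / shown

def lift(rate_exposed, rate_control):
    # Relative change in conversion rate attributable to exposure.
    return (rate_exposed - rate_control) / rate_control

# Target ad: exposed users appear to convert more often.
target_lift = lift(conversion_rate(30, 1000), conversion_rate(20, 1000))

# Null test: an unrelated ad should show roughly zero lift; a large
# value here would signal selection bias rather than a causal effect.
null_lift = lift(conversion_rate(21, 1000), conversion_rate(20, 1000))
```

If the null lift is as large as the target lift, the “effect” is probably just that the targeted audience was going to convert anyway.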
Monica Rogati / senior data scientist @ LinkedIn – The Model and the Train Wreck: A Training Data How-to
Peter Norvig’s talk “The Unreasonable Effectiveness of Data” – more data beats clever algorithms.
But what beats more data is better data.
Cold start problem: what to do if we don’t know enough about users yet?
Random recommendations are not good – too easy to offend people.
Jacob Perkins / Weotta – Corpus Bootstrapping with NLTK
translating existing corpus with Yahoo Babelfish API: English -> Spanish
then using Naive Bayes to classify Spanish movie reviews
bootstrapping phrase selection
a Yelp study indicated that a naive Bayes classifier produces more accurate classifications than Mechanical Turk
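As a toy illustration of the classification step, here is a hand-rolled naive Bayes text classifier with add-one smoothing (the talk used NLTK’s NaiveBayesClassifier; the tiny Spanish “corpus” below is invented):

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    # labeled_docs: list of (word_list, label) pairs.
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for words, label in labeled_docs:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocab.update(words)
    return word_counts, label_counts, vocab

def classify(words, model):
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_logp = None, float("-inf")
    for label in label_counts:
        # log prior + log likelihoods with add-one smoothing
        logp = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            logp += math.log((word_counts[label][w] + 1) / denom)
        if logp > best_logp:
            best_label, best_logp = label, logp
    return best_label

# Invented mini-corpus standing in for translated reviews:
model = train([
    ("pelicula excelente me encanto".split(), "pos"),
    ("gran actuacion excelente guion".split(), "pos"),
    ("pelicula terrible muy aburrida".split(), "neg"),
    ("guion horrible actuacion aburrida".split(), "neg"),
])
label = classify("una pelicula excelente".split(), model)  # "pos"
```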
Ben Gimpert / Altos research – The Importance of Importance: An Introduction to Feature Selection
collect real estate listings
importance of selecting proper features for data mining/predictions
Gartner’s hype cycle 2011: www.gartner.com/it/page.jps?id=1763814 – data mining is on the rise
Data dimensionality doubles every year, Moore’s law predicts computational power to double only every 2 years.
CART model, OLS, Ridge – time goes down, accuracy goes up
- manual (apply domain expertise)
- automatic – evaluate all possible feature subsets (but exhaustive search is only a thought experiment)
- forward stepwise feature selection
- information gain feature selection (if taking out a feature increases entropy, it must be pretty important)
- Ridge regression regularization
- least angle regression (LARS)
- principal components analysis (PCA) – does badly against non-linear data shapes (it’s also difficult to explain)
- ensembles of decision trees (CART or random forests)
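Forward stepwise selection from the list above can be sketched as a greedy loop – keep adding whichever feature most improves a scoring function. The scorer below is a stand-in for cross-validated model accuracy, and the feature names are invented:

```python
# Greedy forward stepwise feature selection. score() is a stand-in
# for a real model-quality metric (e.g. cross-validated accuracy).

def forward_stepwise(features, score, max_features=None):
    selected, remaining = [], list(features)
    best_score = score(selected)
    while remaining and (max_features is None or len(selected) < max_features):
        new_score, best_f = max((score(selected + [f]), f) for f in remaining)
        if new_score <= best_score:
            break  # no remaining feature improves the model
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = new_score
    return selected

# Invented scorer: pretend only "sqft" and "location" carry signal.
useful = {"sqft": 0.4, "location": 0.3}
def toy_score(feats):
    return sum(useful.get(f, 0.0) for f in feats)

chosen = forward_stepwise(["sqft", "year", "location", "color"], toy_score)
# chosen == ["sqft", "location"]
```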
Matt Biddulph (@mattb) / Dopplr co-founder, Hackdiary.com – Social Network Analysis Isn’t Just For People
- triadic closure for people you might know implementation
- community detection in SNA via modularity: Q = (1/2m) Σij [Aij − ki·kj/2m] δ(ci, cj)
- Belgian phone-call network – split between French- and Flemish-speaking communities
- Nokia map app – all route calculation requests are logged, it can be used for identifying social links between cities
- gephi.org – photoshop of graphs
- betweenness centrality algorithm to size and colour a network of topics tagged on del.icio.us
- links between most active pages on wikipedia as a social network
- map of affinity between journalists and topics – using the Guardian API
- map of the developers using GitHub API
- links between programming languages and music (last.fm API), JS, Node.js – alternative and rock, Ruby – singer-songwriter
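The triadic-closure idea behind “people you might know” can be sketched as ranking friends-of-friends by mutual-friend count; the toy graph below is invented:

```python
from collections import Counter

def people_you_might_know(graph, user):
    # graph: dict mapping user -> set of friends (undirected).
    candidates = Counter()
    for friend in graph[user]:
        for fof in graph[friend]:
            if fof != user and fof not in graph[user]:
                candidates[fof] += 1  # one more mutual friend
    # Candidates with the most mutual friends first.
    return [u for u, _ in candidates.most_common()]

graph = {
    "alice": {"bob", "carol"},
    "bob":   {"alice", "dave"},
    "carol": {"alice", "dave", "eve"},
    "dave":  {"bob", "carol"},
    "eve":   {"carol"},
}
suggestions = people_you_might_know(graph, "alice")  # ["dave", "eve"]
```

Dave ranks first because he shares two mutual friends with Alice, Eve only one – the triangle-closing intuition in one loop.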
Robert Lefkowitz / 1010data – Array Theory vs. Set Theory in Managing Data
- array structures are a superset of set structures, so array-based databases are more powerful than set-based databases
- permutation vector (order in R) gives you a list of indices which can be used to retrieve original data elements in order
- example: in a database with “date, place, low, high” structure how do you get the “range” (high-low) for each date
- it’s possible to implement all operations defined in set theory (i.e. in relational databases) with permutation vectors,
and also grading, proximity joins, fuzzy joins, transaction isolation, sharding
- 1010data.com has an implementation of the database based on array theory, another one is SciDB
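The permutation-vector idea can be sketched in Python: the vector of indices that sorts one column (R’s order()) lets you retrieve the original rows in order, and derived columns like range = high - low fall out of plain array arithmetic. The listing data below is invented:

```python
# Permutation vector over the "date" column of a flat table.

dates = ["2012-03-01", "2012-03-03", "2012-03-02"]
lows  = [50, 55, 48]
highs = [70, 80, 61]

# The permutation vector: indices that sort the dates.
perm = sorted(range(len(dates)), key=lambda i: dates[i])  # [0, 2, 1]

# Rows retrieved in date order, with the derived "range" column:
ranges_by_date = [(dates[i], highs[i] - lows[i]) for i in perm]
# [('2012-03-01', 20), ('2012-03-02', 13), ('2012-03-03', 25)]
```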
Robert Lancaster / Orbitz – Survival Analysis for Cache Time-to-Live Optimization
- survival analysis (heart-transplant patients – time to death, leukemia in remission – time to relapse)
- survival function (Weibull, exponential, log-logistic)
- Kaplan-Meier estimates
- Orbitz applied this to cache of rates and availability pulled from the partners
- change deployed on Feb 2, 2012 – sharp decrease in traffic, impact on look-to-book is positive
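A minimal sketch of the Kaplan-Meier estimator mentioned above, with “death” standing for a cache entry going stale; observations are (time, event) pairs where event = 0 means censored, and the data is invented:

```python
# Kaplan-Meier estimator: S(t) is the product over event times
# t_i <= t of (1 - d_i / n_i), where d_i = events at t_i and
# n_i = subjects still at risk at t_i.

def kaplan_meier(observations):
    # observations: (time, event) pairs; event=1 is a failure
    # (e.g. cached rate went stale), event=0 is censored.
    event_times = sorted({t for t, e in observations if e})
    curve, s = [], 1.0
    for t in event_times:
        at_risk = sum(1 for u, _ in observations if u >= t)
        deaths = sum(1 for u, e in observations if u == t and e)
        s *= 1 - deaths / at_risk
        curve.append((t, s))
    return curve

obs = [(1, 1), (2, 1), (2, 0), (4, 1), (5, 0)]
curve = kaplan_meier(obs)
# survival drops to roughly 0.8, 0.6, 0.3 at times 1, 2, 4
```

A cache TTL can then be read off the curve, e.g. the time at which survival (entry still fresh) drops below a target probability.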
Debate – who should be the first hire in a data science team – a domain expert or machine learning expert?
O’Reilly has provided a recording of this debate for free – click here to watch it.
Mike Driscoll (@medriscoll), moderator of this debate, summarized it nicely in his blog: “Thus who you decide to hire as your first data scientist — a domain expert or a machine learner — might be as simple as this: could you currently prepare your data for a Kaggle competition? If so, then hire a machine learner. If not, hire a data scientist who has the domain expertise and the data hacking skills to get you there.”
== Day 2 ==
Edd Dumbill / O’Reilly Media, Alistair Croll / Bitcurrent
Data science is growing. This year latecomers like MS are joining in. Data visualization is growing in importance.
20 years ago computing became available to all businesses – we built an exoskeleton for the business.
Now data is becoming the nervous system of the business.
Doug Cutting / Cloudera – The Apache Hadoop Ecosystem
- we’re now at peta-scale kMGTPEZY
- big data uses commodity hardware but scales better
- big data: distributed, raw data (vs. schema), open source (vs. proprietary)
- Hadoop became de-facto industry standard, it is the kernel
- there are a lot of projects around the kernel (like nobody uses Linux kernel alone)
- most of the tools are open source at apache
Dave Campbell / Microsoft – Do We Have The Tools We Need To Navigate The New World Of Data?
- Microsoft announced Hadoop support last September
- signal -> data -> information -> knowledge -> insights & actions
- Haiti earthquake – how ML was used to build a statistical translation engine for Haitian Creole – English translations
- search & acquire -> explore & analyze -> explain & share
- new agenda for tooling: data visualisation should be as easy as creating a PowerPoint
Abhishek Mehta / Tresata – Decoding the Great American ZIP myth
- Levitt (creator of suburbia – identical houses), Ford – homogeneity
- we are a data-rich but information-poor society
- tools for solving data problems are finally here – the financial system and all of the industries that are in catharsis can be rebuilt
Mike Olson / Cloudera CEO – Guns, Drugs and Oil: Attacking Big Problems with Big Data
- Drugs: genome analysis using hadoop
- Guns: Santa Cruz – predictive policing
- Oil: subsurface structure mapping
- importance is in applying technology to solving big social problems
Flavio Villanustre / LexisNexis Risk Solutions and HPCC Systems – Machine Learning and Big Data: Sustainable Value or Hype?
- company with 15 years of big data experience
- released HPCC as open source last year
- ML is establishing itself as an important discipline; companies that don’t leverage the data they have will lose to competitors
Steve Schoettler / co-founder of Zynga, Junyo – Learning Analytics: What Could You Do With Five Orders of Magnitude More Data About Learning?
- Junyo – learning analytics
- US government takes two years to collect and process school scores nationwide (now 2009 is the last data point available)
- there’s no analysis – there’s no way to improve
- most important factor affecting student achievement is feedback
- it’s possible to double student achievement – need to tailor instructions to students’ needs
- technology needs to be used for that, because that’s a lot of work
Avinash Kaushik / Market Motive – A Big Data Imperative: Driving Big Action
- information is powerful. But it is how we use it that will define us.
- bring people in the organisation closer to the data, let them run reports
- known knowns | known unknowns | unknown unknowns – Donald Rumsfeld
- tops (top-10, bottom-10) do not give you anything – it’s a Gaussian, the information is in the middle
- GA is data puking, highlighting differences between predictions and real data is important
- Occam’s Razor blog
Ben Goldacre / Bad Science – The Information Architecture of Medicine is Broken
- it’s possible to see if any small negative studies are missing
- Tamiflu story: no data is given by Roche
- Reboxetine: 76% of all the trials were withheld
- research fraud
- it’s possible to find worst offenders by matching data from various sources
Eddie Satterly / Expedia, Sanjay Mehta / Splunk – Turning Big Data Into Competitive Advantage
- Big Data vs. Barack Obama search popularity – data wins
- this in itself is a big data problem
- splunk: collects data in real time from various sources reliably, indexes it and makes available for search, analysis, visualisation and pattern recognition
- Expedia: 4,000 people in R&D – 8,000 total
- Expedia rolled out Splunk within 3 months to all machines that produce logs, and to everything else – VoIP call logs, tickets, changes, scripts
- data: ~6 TB/day
- Expedia deploy cycle – twice a week
- Expedia was not prepared to give out all BI data to SaaS solution – one of the reasons for Splunk
Peter Skomoroch / LinkedIn – Street Fighting Data Science
- act as a street fighter: analyze, improvise, anticipate, adapt
- how to approach a new problem as a new grad: predict sales and recommend price for new shoes model
- go and see academic papers, what other people have done, create a model how to drop the price over time
- result: pricing model decreases sales by 30% – what has gone wrong
- ran complex black box model
- didn’t analyze the data first
- didn’t anticipate the elasticity error
- talk at Velocity 2011 by John Rauser from Amazon: “Look at your data”: http://www.youtube.com/watch?v=coNDCIMH8bk
- look at your data!
- looking at the top end of anything doesn’t really tell you much!
- book: street fighting mathematics
- red teaming: plan for worst-case scenarios, to avoid situations like the one described in the beginning of the talk
- blog: http://dataist.wordpress.com/
Sam Shah / LinkedIn – Collaborative Filtering using MapReduce
Collaborative filtering techniques:
- latent factor models
- neighbourhood-based
- user experience
Implementation: co-occurence graph
1. Map-reduce implementation: time-viewer-viewee
2. Index: alice, bob, 1 (co-occurence)
generating pairs might be expensive – limit by time
the first version achieves ~60% of the optimal solution
- check your outputs (Obama problem)
- check your inputs (repeated profile views probably should be discounted)
- drill-down: if people you might know shows a lot of “john smiths” (because of failed searches) then CTR will be very low
- adding avatars to “people you might know”: +50% to CTR
putting it to production:
- pairs->scoring->top-n->push production
- azkaban – for managing pig jobs
- oozie -
- voldemort – working with batch-computed data (Serving large scale batch computed data with project voldemort. In Fast 2012) https://www.usenix.org/conference/fast12/serving-large-scale-batch-computed-data-project-voldemort
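The co-occurrence computation above can be sketched in plain Python (in production it ran as MapReduce over (time, viewer, viewee) logs; the view log below is invented, and repeated views are discounted as the talk suggests):

```python
from collections import Counter, defaultdict
from itertools import combinations

def co_occurrence(views):
    # views: (viewer, viewee) events from the profile-view log.
    by_viewer = defaultdict(set)
    for viewer, viewee in views:
        by_viewer[viewer].add(viewee)  # repeated views counted once
    pairs = Counter()
    for viewees in by_viewer.values():
        # each pair of profiles viewed by the same person co-occurs
        for a, b in combinations(sorted(viewees), 2):
            pairs[(a, b)] += 1
    return pairs

views = [
    ("alice", "carol"), ("alice", "dave"),
    ("bob", "carol"), ("bob", "dave"), ("bob", "eve"),
    ("bob", "carol"),  # repeated view, discounted
]
pairs = co_occurrence(views)  # pairs[("carol", "dave")] == 2
```

Limiting the pairs by a time window, as the notes mention, simply means grouping by (viewer, window) instead of viewer before emitting pairs.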
Philip Kromer / Infochimps – Disambiguation: Embrace wrong answers & find truth
- human entered geolocations
- instead of correcting those, they can be used to identify strong links
Xavier Amatriain / Netflix – Netflix recommendations: beyond the 5 stars
(update 2012-04-07: now also available as blog post http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html)
- Netflix prize: 10% improvement in predicted rating
- Netflix went from dvd by mail company to a global streaming company that operates not only on the web, but on many other devices
- make user aware the results provided are personalized
- diversity in the top-10 recommendations (for you, for wife, for kids)
- social support – not only recommendation based on personal preferences, but how many friends have liked it
- note: in the US there’s a legal problem with showing friends’ likes/favourites
- ranking = scoring + sorting + filtering
- models used by Netflix: logistic/linear regression, elastic nets, SVD, restricted Boltzmann machines, gradient boosted decision trees, …
- recommending new movies: even a few more ratings beat more data
- Norvig’s statement that more data beats better algorithms is valid for language processing, but Netflix’s experience indicates that more training examples do not help when there are only a few relevant features
== Day 3 ==
Jonathan Gosier / metaLayer Inc. – Democratization of Data Platforms
- art used to live in the domain of specialists – paintings were commissioned by institutions
- until recently, math and science were also the domain of specialists
- data science is now in a similar position – it is for experts only
- there need to be tools available for general public
- examples: subprime mortgage market – how to understand who owns which loan, automatic trading systems – who understands how that works
- ushahidi – open source distress signal collection
- there should be an equivalent of a calculator for unstructured data
- “…take the best and spread it around to everybody…” – Steve Jobs
- metaLayer offers drag-and-drop insights
Luke Lonergan / Greenplum, a division of EMC – 5 Big Questions about Big Data
Coco Krumme / MIT Media Lab – The Trouble with Taste
- data analysis of the wine-tasting evening
Pete Warden / Jetpac – Embrace the Chaos
- building a business on somebody else’s data is very dangerous
- Google had to deal with a chaos of billions of web pages
- take support emails or some other seemingly chaotic data and try to extract value from it
Usman Haque / Pachube.com – Open Data and the Internet of Things
- general infrastructure for collecting data from various sensors
- Internet of things bill of rights
- people own the data they create
Gary Lang / MarkLogic – Big Data’s Next Step: Applications
Hal Varian / Google – Using Google Data for Short-term Economic Forecasting
- Google time-series for searches: vodka searches peak every Saturday and Dec 31st is a spike, hangover searches peak every Sunday and Jan 1st is a spike
- initial claims vs. unemployment level using government data
- peaks at the end of every recession
- challenges: simple correlations don’t work (as with unemployment)
- human judgement doesn’t scale
- baseball and auto sales both peak in summer, but just because of seasonality, not causality
- Google Research blog
- fat regression
- incremental predictability
- during a recession, coupon-related queries grow
- prediction for the University of Michigan consumer sentiment index
Theo Schlossnagle / OmniTI – Is this normal? Finding anomalies in real-time data
- author of “Scalable Internet Architectures”
- hard real-time systems: outputs are considered incorrect if the latency of their delivery is above a specified amount
- soft real-time systems: similar, but less useful instead of incorrect
- Big data systems:
- traditional: Oracle, Postgres, MySQL
- the shiny: hadoop, hive, hbase, pig, cassandra
- the real time: sql stream, s4, flumebase, truviso, esper, storm
- 300k datum/sec
- real-time queries done with Esper
- Holt-Winters – trivariate seasonal regression
- look at historic data
- use that to predict the immediate future with some quantifiable confidence
- implemented the Snowth for storage of data
- implemented a C/lua distributed system to analyze 4 weeks of data
- to ensure the system real-time – need to ensure that queries return in less than 2ms
- how to transform batch processing, offline analytics to online real-time analytics without crazy brainpower
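A much-simplified, non-seasonal sketch of the anomaly-detection idea above: smooth the series, treat the current level as the prediction, and flag points that deviate beyond a tolerance. The real system used full seasonal Holt-Winters with quantified confidence; the data and fixed tolerance below are invented:

```python
# Flag points that deviate too far from an exponentially smoothed
# level. Anomalous points don't update the level, so one spike
# doesn't poison the forecast. A fixed tolerance stands in for the
# quantified confidence interval the real system derives.

def detect_anomalies(series, alpha=0.5, tolerance=10.0):
    anomalies = []
    level = series[0]
    for i, x in enumerate(series[1:], start=1):
        if abs(x - level) > tolerance:  # prediction = current level
            anomalies.append(i)
            continue
        level = alpha * x + (1 - alpha) * level
    return anomalies

series = [10, 11, 10, 12, 50, 11, 10]
flagged = detect_anomalies(series)  # [4] -- the spike at index 4
```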
Jeremy Howard / Kaggle – From Predictive Modelling to Optimization: The Next Frontier
- predicting something doesn’t mean that you’ve made an impact on human lives
- 1. figure out the objective
- 2. figure out the levers to pull – what can we change
- 3. figure out what data we have
- 4. write a PhD on PageRank
- working in insurance – the objective is to make shitloads of money on each individual client
- to get necessary data – insurance companies had to start charging random prices
- to convince them to do that, a simulation was run showing what happens if that is done vs. not done
- how to apply this to marketing?
- objective: maximise lifetime value of a customer
- levers: recommendations, offers, discounts, care calls, etc.
- what data can we collect?
- Amazon recommends Pratchett books to those who already buy a lot of them – that’s not helpful
- to improve it – start offering random stuff to collect the data that can later be mined
Alyona Medelyan / Pingar, Anna Divoli / Pingar – Mining Unstructured Data: Practical Applications
- IDC: 17h/wk searching information, 14h/wk writing emails
- 17h/wk = $37,000/year
- pingar provides metadata extraction tools
Alasdair Allan / University of Exeter – Migratory data: the distributed data you carry with you
- calendar, phonebook, camera phone pictures, bookmarks, text messages – all that is carried with you
- photos are geotagged – and there’s a lot of other metadata collected
- the iOS Spotlight index contains a cache of your SMSes and contacts, even deleted ones
- web cache, sms draft cache – are all metadata
- facetime calls cache on iOS
- the Google search cache is not wiped when you clear your browser cache on iOS
- the GPS cache on iOS 4 is kept for a year; in iOS 5 the caching time is decreased
- the history of all Maps searches on iOS is cached
- keyboard cache is stored as well – text messages can be reconstructed from it
- cache of the last visible screen when switching between applications
- mitmproxy – an SSL-capable man-in-the-middle proxy
- siri proxy – proxy to listen in to siri traffic
- Angry Birds and Cut the Rope offer “look at my local community” – which uploads your phonebook to the internet
- data exhaust: broadcasting your location, rfid cards like oyster, tesco purchases tracking
- who owns that data? consumer or companies?
- bit.ly/FindingYourFriends – http://www.cs.rochester.edu/~sadilek/publications/Sadilek-Kautz-Bigham_Finding-Your-Friends-and-Following-Them-to-Where-You-Are_WSDM-12.pdf
- after you stop sharing your location, it’s possible to pinpoint you to within 100 m with 80% accuracy from the locations of your friends
- how to prevent CC fraud – compare postcode of the latest credit card transaction and iPhone location
- mapping receipts coming to email with online transaction
Robbie Allen / Automated Insights, Inc. – From Big Data to Big Insights
- big data by itself is not helpful
- visualizations are one possible answer, but the problem is that they still require interpretation
- automatic report writing by software – it can process more data and can be continuously improved
- examples: automatic stock market reports, insider trading reports
- parked domains: generating relevant content for those pages from a list of keywords
Cheryl Phillips / The Seattle Times – Exploring the Stories Behind the Data
- philip meyer
- seattle times visualisations of data brought to the users – illegal forest logging, casualties of war
- The Nutgraph: important and visual information that attracts attention and makes people explore data further
- never put more than one number into a paragraph of text
- don’t overwhelm the reader with detail that they don’t need