Here is some reading on BigData and telemetry. I have highlighted a couple of talks
that are very good.
BigData and Telemetry
Eugene Dvorkin provides an introduction to the Storm framework, explains how to build
real-time applications on top of Storm with Groovy, how to process data from
Twitter in real-time, and the architectural decision behind WebMD MedPulse.
Aaron Gardner pulls back the covers on the Etsy Search ecosystem and how they got
here — the good, the bad, and the funky.
Garrett Wampole describes an experimental methodology of applying Enterprise
Integration Patterns to the near real-time processing of surveillance radar
data, developed by MITRE.
Mike Keane presents how Conversant migrated to Flume, managing 1000 agents across 4
data centers, processing over 50B log lines per day with peak hourly averages
of over 1.5 million log lines/sec.
Randy Shoup describes KIXEYE’s analytics infrastructure from Kafka queues through
Hadoop 2 to Hive and Redshift, built for flexibility, experimentation,
iteration, componentization, testability, reliability and replay-ability.
Camille Fournier explains what projects ZooKeeper is useful for, the common challenges
running it as a service and advice to consider when architecting a system using it.
Neha Narkhede of Kafka fame shares the experience of building LinkedIn’s powerful
and efficient data pipeline infrastructure around Apache Kafka and Samza to
process billions of events every day.
Gabriel Gonzalez introduces TSAR (TimeSeries AggregatoR), a service for real-time event
aggregation designed to deal with tens of billions of events per day at
The authors discuss Netflix’s new stream processing system that supports a reactive
programming model, allows auto scaling, and is capable of processing millions
of messages per second.
Seth Juarez shares insight on how to create applications that use dashboards to
drive value, convert raw data into answers, and simplify business processes.
Steve Hoffman and Ken Dallmeyer share their experience integrating Hadoop into the
existing environment at Orbitz, creating a reusable data pipeline, ingesting,
transporting, consuming and storing data.
Brian Degenhardt discusses lessons that Twitter learned managing a high rate of
change and complexity, and how those can be applied anywhere. Brian Degenhardt
has founded a successful startup, worked on console game development, written
code that ran in the space shuttle, and scaled serving infrastructures on
Pentium 2s in the late 90’s. He currently works on Core Systems Libraries at
Twitter, lured by their scalability challenges and use of functional programming.
Nathan Marz introduces Twitter Storm, outlining its architecture and use cases, and
takes a look at future features to be made available.
Mark Harwood shows how anomaly detection algorithms can spot card fraud, incorrectly
tagged movies and the UK’s most unexpected hotspot for weapon possession.
Hadoop and Spark
Roman Shaposhnik expands on the previous year’s “Hadoop: Just the Basics for Big
Data Rookies”, diving deeper into the details of key Apache Hadoop
projects. He starts with a brief recap of HDFS and MapReduce, then discusses
more advanced features of HDFS, in addition to how YARN has enabled businesses
to massively scale their systems beyond what was previously possible.
Marcel Kornacker presents a case study of an EDW built on Impala running on 45 nodes,
reducing processing time from hours to seconds and consolidating multiple data
sets into one single view.
Matei Zaharia talks about the latest developments in Spark and shows examples of how
it can combine processing algorithms to build rich data pipelines in just a few
lines of code. Matei Zaharia is an assistant professor of computer science at
MIT, and CTO of Databricks, the company commercializing Apache Spark.
Dean Wampler argues that Spark/Scala is a better data processing engine than
MapReduce/Java because tools inspired by mathematics, such as FP, are ideal
tools for working with data.
Jeremy Stieglitz discusses best practices for a data-centric security, compliance and
data governance approach, with a particular focus on two customer use cases.
Bob Kelly presents case studies on how Platfora uses Hadoop to do analytics for
several of their customers.
In this solutions track talk, sponsored by Cloudera, Eva Andreasson discusses how
search and Hadoop can help with some of the industry’s biggest challenges. She
introduces the data hub concept.