Reading for BigData and Telemetry

02/06/2015

Here is some reading on BigData and telemetry. I have highlighted a couple of talks
that are very good.

BigData and Telemetry

Scalable Big Data Stream Processing with Storm and Groovy

Eugene Dvorkin provides an introduction to the Storm framework and explains how to
build real-time applications on top of Storm with Groovy, how to process data from
Twitter in real time, and the architectural decisions behind the WebMD MedPulse
mobile application.

Etsy Search: How We Index and Query 26 Million One-of-a-kind Items

Aaron Gardner pulls back the covers on the Etsy Search ecosystem and how they got
here -- the good, the bad, and the funky.

Applications of Enterprise Integration Patterns to Near Real-Time Radar Data Processing

Garrett Wampole describes an experimental methodology of applying Enterprise
Integration Patterns to the near real-time processing of surveillance radar
data, developed by MITRE.

1.5 Million Log Lines Per Second: Building and Maintaining Flume Flows at Conversant

Mike Keane presents how Conversant migrated to Flume, managing 1000 agents across 4
data centers, processing over 50B log lines per day with peak hourly averages
of over 1.5 million log lines/sec.

The Game of Big Data: Scalable, Reliable Analytics Infrastructure at KIXEYE

Randy Shoup describes KIXEYE's analytics infrastructure from Kafka queues through
Hadoop 2 to Hive and Redshift, built for flexibility, experimentation,
iteration, componentization, testability, reliability and replay-ability.

ZooKeeper for the Skeptical Architect

Camille Fournier explains which projects ZooKeeper is useful for, the common challenges
of running it as a service, and advice to consider when architecting a system that
uses it.

Samza in LinkedIn: How LinkedIn Processes Billions of Events Everyday in Real-time

Neha Narkhede of Kafka fame shares the experience of building LinkedIn's powerful
and efficient data pipeline infrastructure around Apache Kafka and Samza to
process billions of events every day.

TSAR: How to Count Tens of Billions of Daily Events in Real Time Using Open Source Technologies

Gabriel Gonzalez introduces TSAR (TimeSeries AggregatoR), a service for real-time event
aggregation designed to deal with tens of billions of events per day at
Twitter.

Mantis: Netflix's Event Stream Processing System

The authors discuss Netflix's new stream processing system that supports a reactive
programming model, allows auto scaling, and is capable of processing millions
of messages per second.

Dashboarding: The Developers’ Role in Data Analysis

Seth Juarez shares insight on how to create applications that use dashboards to
drive value, convert raw data into answers, and simplify business processes.

Building a Data Pipeline with the Tools You Have - An Orbitz Case Study

Steve Hoffman and Ken Dallmeyer share their experience integrating Hadoop into the
existing environment at Orbitz and creating a reusable data pipeline for ingesting,
transporting, consuming and storing data.

Real-Time Systems at Twitter

Brian Degenhardt discusses lessons that Twitter learned managing a high rate of
change and complexity, and how those can be applied anywhere. Brian Degenhardt
has founded a successful startup, worked on console game development, written
code that ran in the space shuttle, and scaled serving infrastructures on
Pentium 2s in the late 90's. He currently works on Core Systems Libraries at
Twitter, lured by their scalability challenges and use of functional
programming.

Storm: Distributed and Fault-Tolerant Real-time Computation

Nathan Marz introduces Twitter Storm, outlining its architecture and use cases, and
takes a look at future features to be made available.

Revealing the Uncommonly Common with Elasticsearch

Mark Harwood shows how anomaly detection algorithms can spot card fraud, incorrectly
tagged movies and the UK's most unexpected hotspot for weapon possession.

Hadoop and Spark

Hadoop 201 -- Deeper into the Elephant

Roman Shaposhnik expands on the previous year's "Hadoop: Just the Basics for Big
Data Rookies", diving deeper into the details of key Apache Hadoop
projects. He starts with a brief recap of HDFS and MapReduce, then discusses
more advanced features of HDFS, in addition to how YARN has enabled businesses
to massively scale their systems beyond what was previously possible.

The Next Wave of SQL-on-Hadoop: The Hadoop Data Warehouse

Marcel Kornacker presents a case study of an EDW built on Impala running on 45 nodes,
reducing processing time from hours to seconds and consolidating multiple data
sets into one single view.

Unified Big Data Processing with Apache Spark

Matei Zaharia talks about the latest developments in Spark and shows examples of how
it can combine processing algorithms to build rich data pipelines in just a few
lines of code. Matei Zaharia is an assistant professor of computer science at
MIT, and CTO of Databricks, the company commercializing Apache Spark.

Why Spark Is the Next Top (Compute) Model

Dean Wampler argues that Spark/Scala is a better data processing engine than
MapReduce/Java because tools inspired by mathematics, such as FP, are ideal
tools for working with data.

The Big Data Imperative: Discovering & Protecting Sensitive Data in Hadoop

Jeremy Stieglitz discusses best practices for a data-centric security, compliance and
data governance approach, with a particular focus on two customer use cases.

Customer Analytics on Hadoop

Bob Kelly presents case studies on how Platfora uses Hadoop to do analytics for
several of their customers.

Finding the Needle in a Big Data Haystack

In this solutions track talk, sponsored by Cloudera, Eva Andreasson discusses how
search and Hadoop can help with some of the industry's biggest challenges. She
introduces the data hub concept.
