Day Three of the QCon Conference

Day three of QCon focused on NoSQL. The afternoon sessions were less interesting, and I did not find much insight in them. Below are the details of the tracks:

 

Modern Big Data Systems

Creating Culture

Everything as a Service (XaaS) - Web APIs

Beyond JavaScript

Real World Functional Programming

Solutions Track

  • The Next Wave of SQL-on-Hadoop: The Hadoop Data Warehouse - Marcel Kornacker
  • Mentoring Humans and Engineers - Daniel Doubrovkine
  • Evolving REST for an IoT World - Todd Montgomery
  • TypeScript: Scaling up JavaScript - Jonathan Turner
  • End to End Reactive Programming at Netflix - Matthew Podwysocki & Jafar Husain
  • Enterprise Mobile App High-wire Act - Isaac Blaise
  • The Game of Big Data: Scalable, Reliable Analytics Infrastructure at KIXEYE - Randy Shoup
  • Climbing Off The Ladder, Before We Fall Off - Chris Angove
  • Canonical Models for API Interoperability - Ted Epstein
  • UI: The Functional Final Frontier - David Nolen
  • Real World Functional Programming - Open Space
  • Priming Java for Speed at Market Open - Gil Tene
  • Let Me Graph That For You: Building a Graph Database Application - Ian Robinson
  • GitHub Communications Culture and Tools - Matthew McCullough
  • 10 Reasons Why Developers Hate Your API - John Musser
  • Beyond JavaScript - Open Space
  • Caml Trading: Experiences with OCaml on Wall Street - Yaron Minsky
  • Mobile Dev Ops for Mobile App Excellence - Alex Gaber
  • How WebMD maintains operational flexibility with NoSQL - Rajeev Borborah & Matthew Wilson
  • Learnings from Building and Scaling Gilt - Michael Bryzek
  • Web APIs - Open Space
  • ECMAScript 6: what's next for JavaScript? - Dr. Axel Rauschmayer
  • Property-Based Testing for Better Code - Jessica Kerr
  • Humanoid robots need applications! - Bruno Maisonnier
  • Analyzing Big Data On The Fly - Shawn Gandhi
  • Creating Culture - Open Space
  • Clients Matter, Services Don't - Mike Amundsen
  • SpiderMonkey Parser API: A Standard For Structured JS Representations - Michael Ficarra
  • The Functional Programming Concepts in Facebook's Mobile Apps - Adam Ernst
  • Performance Testing Crash Course - Dustin Whittle
  • Big Data - Open Space
  • The Pivotal Way - Josh Knowles
  • All Your API Are Belong to Us - Paul Hill
  • CoffeeScript: The Good Parts - Azat Mardan
  • A day in the life of a functional data scientist - Richard Minerich

Booths

Aldebaran

Founded in 2005, Aldebaran is the result of a dream: to build humanoid robots, a new humane species, for the benefit of humankind. I saw a demo of NAO, the leading robot in the industry. More than 5,000 units have been sold, at a price of $8,000 ($6,000 for developers). NAO has:

  • 25 degrees of freedom, for movement
  • Two cameras, to see its surroundings
  • An inertial measurement unit, which lets him know whether he is upright or sitting down
  • Touch sensors to detect your touch
  • Four directional microphones, so he can hear you

This combination of technologies (and many others) gives NAO the ability to perceive its surroundings. The presenter said they want NAO to be not just a robot but a member of your family, and they want him to be more social and intelligent enough to understand people better. The company builds both the software and the hardware itself.

 

Crittercism

Crittercism is the world's first mobile application performance management (APM) solution, offering both error monitoring and service monitoring. Crittercism's products monitor every aspect of mobile app performance, allowing developers and IT operations to deliver high-performing, highly reliable, highly available mobile apps. Crittercism offers a real-time global view of app diagnostics and errors across iOS, Android, Windows Phone 8, hybrid, and HTML5 apps, monitoring 1 billion monthly active users.

 

New Relic

New Relic is a software analytics company that makes sense of billions of metrics about millions of applications in real time. New Relic's comprehensive SaaS-based solution provides one powerful interface for web and native mobile applications and consolidates the performance monitoring data for any chosen technology in an environment. Thousands of active accounts use New Relic's cloud solution every day to optimize more than 200 billion metrics across millions of applications. New Relic is pioneering a new category called Software Analytics.

 

Keynote

 

NoSQL Like There's No Tomorrow, from Amazon

Khawaja Shams (Head of Engineering for NoSQL, Amazon) and Swami Sivasubramanian (General Manager for NoSQL, Amazon) shared experience and suggestions about how to build your own DynamoDB-scale services; in practice, they focused on how Amazon runs NoSQL databases.

In the early days, Amazon had thousands of apps managing their state in RDBMSs. RDBMSs are appealing because they provide a simple SQL language, but managing so many of them was very painful. In the past, Q4 (the holiday season) meant hard work at Amazon: teams had to spend several months preparing, and sometimes needed to repartition databases.

Then the cool Dynamo project was born. (Amazon's Dynamo paper, which used many clever techniques to build an eventually consistent key-value store, had a big impact on many NoSQL database systems.) Dynamo is a key-value store that aims for high availability (five nines, with eventual consistency) and incremental scalability: 1) no more repartitioning, 2) no need to architect apps for peak load, 3) just add boxes to scale, and 4) a simpler querying model, which yields predictable performance.

But Dynamo was not perfect. Scaling became easier, but running it was hard (installation, benchmarking, and management), and it had a big learning curve, so it did not succeed internally. The main reason is that it was software, not a service.

 

In 2012, Amazon announced DynamoDB, a totally new, cloud-only NoSQL database service with:

  • Seamless scalability
  • Easy administration
  • Fast & predictable performance

It has been quite successful. The way Amazon plans a service: 1) pretend you are launching the service, 2) set the goals and identify the customers, 3) have the plan reviewed by architects, and 4) make sure it makes sense for customers.

 

Sacred Tenets in Services

  • Don't compromise durability for performance
  • Plan for success -> plan for scalability
  • Plan for failure -> fault tolerance is key
  • Consistent performance is important
  • Design -> think of the blast radius
  • Insist on correctness

 

About Testing:

Testing is a lifelong journey, but testing alone is not enough. There are four ways to ensure quality:

  • One-box testing
  • Gamma: simulate a real workload
  • Phased deployment
  • Monitoring: does it still work?

Measuring the customer experience is the key; don't measure only the average, also measure the 90th and 99th percentiles.
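The advice to look beyond averages is easy to act on. A minimal sketch in Python (the latency samples are invented for illustration):

```python
def percentile(samples, p):
    """Return the p-th percentile (0-100) of samples using the nearest-rank method."""
    ranked = sorted(samples)
    # Nearest-rank index: round(p/100 * n) - 1, clamped into the valid range.
    idx = max(0, min(len(ranked) - 1, int(round(p / 100.0 * len(ranked))) - 1))
    return ranked[idx]

# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 14, 15, 15, 16, 18, 20, 25, 90, 400]
average = sum(latencies_ms) / len(latencies_ms)   # 62.5 ms: hides the tail
p90 = percentile(latencies_ms, 90)                # 90 ms
p99 = percentile(latencies_ms, 99)                # 400 ms
```

The average (62.5 ms) looks acceptable while the 99th percentile shows some customers waiting 400 ms, which is exactly why the speakers insist on tail percentiles.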

  

The Game of Big Data: analytics at KIXEYE, by Randy Shoup

This is a must-watch talk if you are building a data pipeline for your team; if you work on Xbox or in gaming more broadly, it is also interesting for you. Randy is CTO at KIXEYE, former Director of Google App Engine, and former eBay Chief Engineer. He discussed in detail how he designed a near-real-time data pipeline, in a matter of months, using well-known techniques, with high extensibility, scalability, and reliability.

 

Part 1. Goals for the data analytics project

KIXEYE is a free-to-play, real-time strategy game company with more than 500 employees. Its main platforms are web and mobile. There were three high-level goals for the data analytics project to help the company grow:

  • User acquisition: acquire more users while making sure each user's estimated lifetime value exceeds the cost of acquiring that user. This is achieved by 1) publisher campaigns and 2) on-platform recommendations.
  • Game analytics: measure and optimize "fun" by defining and measuring many metrics (gameplay, features, etc.), though all metrics are just proxies for fun. This is achieved through game balance, match balance (two players are equally matched), economy management, player typology, etc.
  • Retention and monetization: a sustainable business, monetization drivers, and revenue recognition. This is achieved through pricing and bundling, tournament ("event") design, and recommendations.

 

Part 2. "Deep Thought", the new system

As you can see above, building a single telemetry system that achieves so many goals is quite a challenge. In this section, Randy discussed the design principles behind their new system, called "Deep Thought". Before describing it, he talked about their v1 analytics system (the old system):

It grew organically, was built for the single goal of user acquisition, and progressively accreted many more requirements. It was an idiosyncratic mix of languages, systems, and tools:

  • Log files -> Chukwa (from Yahoo) -> Hadoop -> Hive -> MySQL
  • PHP for reports and ETL
  • A single massive table with everything

And there were many issues:

  • Slow queries
  • No data standardization
  • Very difficult to add new games, reports, or ETL
  • Difficult to backfill the system after an error or outage
  • Difficult for analysts to use, impossible for PMs, designers, etc.

 

Then he covered the high-level goals of "Deep Thought":

  • Independent scalability: logically separate service components; independently scale the tiers
  • Stability and outage recovery (NOTE: being able to replay to catch up on missing data is very important)
    • Tiers can completely fail with no data loss
    • Every step is idempotent and replayable
  • Standardization: standardized event types, fields, queries, and reports (NOTE: if you know the Xbox team's telemetry system, you will know how important standardization is)
  • In-stream event processing: sessionization, dimensionalization, cohorting (see below for details)
  • Queryability
    • Structures are simple to reason about
    • Simple things are simple
    • Usable by analysts, data scientists, PMs, game designers, etc.
  • Extensibility: easy to add new games, events, fields, and reports

 

Core capabilities

  • Sessionization: all events are part of a "session"
    • Explicit start event, optional stop event
    • Game-defined semantics
  • Event batching
    • Events arrive in batches associated with a session
    • The pipeline computes batch-level metrics and disaggregates the events
    • It can optionally attach the batch-level metrics to each event
  • Time-series aggregations
    • Configurable metrics, accumulated in-stream (faster to calculate in-stream than with MapReduce)
    • E.g. 1-day X, 7-day X, lifetime X; total attacks, total time played
    • Computed as the v1 aggregate plus the batch delta

 

  • Dimensionalization
    • The pipeline assigns a unique numeric id to each string enum (e.g. the "Twigs" resource => ID 1234)
    • The system handles it automatically, with no change to how games log:
      • Games log strings
      • The pipeline generates and maps the ids
      • No configuration necessary
    • Fast dimensional queries: join on integers instead of strings
    • Metadata enumeration and manipulation: easy to enumerate all values for a field, and to merge multiple values to correct data-quality issues (TWIGS == twigs == Twigs)
  • Metadata tagging
    • Arbitrary tags can be assigned to metadata, e.g. "Panzer 05" is {tank, infantry, event prize}
    • Enables custom views
  • Cohorting: group players along any dimension or metric
    • Goes well beyond classic age-based cohorts
    • A core analytical building block: experiment groups, user-acquisition campaign tracking, prospective modeling, retrospective analytics
    • Set-based cohorting: overlapping groups (>100, >300, etc.) or exclusive groups (100-200)
    • Time-based cohorting
      • E.g. people who played in the last N days, or a "whale" ($$ > X in the last N days)
      • Players auto-expire from a group without explicit intervention
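A toy sketch of the in-stream steps described above: sessionization groups events under an explicit start event, dimensionalization interns repeated strings as integer ids (merging case variants like TWIGS/twigs), and set-based cohorting places a player in every group whose threshold they cross. All names and event shapes here are my own invention, not KIXEYE's schema:

```python
class Pipeline:
    def __init__(self):
        self.sessions = {}   # session_id -> list of events
        self.dim_ids = {}    # normalized string -> numeric id
        self.next_id = 1

    def dimensionalize(self, value):
        """Map a string enum to a stable numeric id; case variants
        (e.g. 'TWIGS' and 'twigs') merge onto the same id."""
        key = value.lower()
        if key not in self.dim_ids:
            self.dim_ids[key] = self.next_id
            self.next_id += 1
        return self.dim_ids[key]

    def ingest(self, event):
        """Sessionize: every event belongs to a session opened by a start event."""
        sid = event["session_id"]
        if event["type"] == "start":
            self.sessions[sid] = []
        self.sessions.setdefault(sid, []).append(
            {**event, "resource_id": self.dimensionalize(event.get("resource", ""))}
        )

def cohorts(player_spend, thresholds):
    """Set-based, overlapping cohorts: a player lands in every group
    whose spend threshold they exceed (e.g. >100 and >300)."""
    return {t: {p for p, s in player_spend.items() if s > t} for t in thresholds}
```

For example, cohorts({"a": 50, "b": 150, "c": 500}, [100, 300]) puts "b" and "c" in the >100 group and only "c" in the >300 group; the groups overlap, as the talk describes.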

 

Part 3. Implementation

Randy discussed the techniques used in the implementation; I did not record all the technical details. He made a couple of points after this session:

  • They brought in consultants recommended by Hortonworks to guide the implementation, and quickly trained employees on the new technologies.
  • He mentioned that he will evaluate Amazon S3 storage and Spark (a very popular real-time analytics engine for Hadoop). My sense is that because the whole system is loosely coupled and easy to extend, evaluating new technologies will be much easier.
  • Adopting new technologies is painful, but the whole company was begging him to build the new pipeline, and the team could even afford to delay commitments to customers by a couple of months to bring it online.

Ingestion: Logging Service

  • HTTP/JSON endpoint on the Play framework: non-blocking and event-driven
  • Responsibilities:
    • Message integrity via checksums
    • Durability via local disk persistence
    • Async batch writes to Kafka topics
    • Separating the different types of events ({valid, invalid, unauth}) into different queues
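A minimal sketch of that last routing step (topic names, validation rules, and field names are all hypothetical): each incoming batch is checked for integrity via a checksum and for authentication, then appended to the matching queue:

```python
import hashlib
import json

TOPICS = {"valid": [], "invalid": [], "unauth": []}  # stand-ins for Kafka topics

def checksum(payload: bytes) -> str:
    """Message integrity via a content checksum (here SHA-256)."""
    return hashlib.sha256(payload).hexdigest()

def route(batch: dict, declared_checksum: str, api_keys: set) -> str:
    """Classify a batch and append it to the matching queue."""
    payload = json.dumps(batch, sort_keys=True).encode()
    if batch.get("api_key") not in api_keys:
        topic = "unauth"              # unauthenticated sender
    elif declared_checksum != checksum(payload):
        topic = "invalid"             # corrupted in transit
    elif "events" not in batch:
        topic = "invalid"             # fails schema validation
    else:
        topic = "valid"
    TOPICS[topic].append(batch)
    return topic
```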

 

Event Log: Kafka

A persistent, replayable pipe of events. (Kafka was mentioned many times at QCon; I will provide some context about other teams inside Microsoft who are using Kafka in production.) Events are stored for 7 days.

Responsibilities:

  • Durability via replication and local disk storage
  • Replayability via commit logs
  • Scalability via partitions and brokers
  • Segmenting data for different types of processing
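The replayability property is what makes the recovery story above work: events are never mutated, and a consumer can re-read from any earlier offset after a downstream failure. A toy illustration of the idea (this is not Kafka's actual API):

```python
class CommitLog:
    """Toy append-only log illustrating Kafka-style replayability:
    events are immutable, and any consumer can re-read from an
    earlier offset to recover from a downstream failure."""

    def __init__(self):
        self._log = []

    def append(self, event) -> int:
        """Append an event and return its offset."""
        self._log.append(event)
        return len(self._log) - 1

    def read(self, from_offset: int = 0):
        """Replay every event at or after from_offset, in order."""
        return list(self._log[from_offset:])
```

A consumer that crashed after processing offset 1 simply calls read(2) to catch up, instead of losing data.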

 

Transformation

Exporter: consumes a Kafka topic and rebroadcasts it, e.g. consumes batches and rebroadcasts the individual events.

Responsibilities:

  • Batch validation against a JSON schema (both syntactic and semantic validation)
  • Batches -> events
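The "batches -> events" step can be sketched as follows (the field names are my own, not the real schema): a validated batch is disaggregated into individual events, each optionally carrying batch-level metrics:

```python
def explode_batch(batch: dict):
    """Disaggregate one validated batch into individual events,
    attaching batch-level metrics (here just the batch size and
    session id) to each event."""
    events = batch.get("events", [])
    batch_metrics = {"batch_size": len(events), "session_id": batch["session_id"]}
    return [{**event, **batch_metrics} for event in events]
```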

Importer: sessionization, dimensionalization, player metadata, and cohorting.

 

Session Store

A key-value store (Couchbase) giving fast, constant-time access to sessions and players.

Responsibilities:

  • Store sessions, players, dimensions, and config
  • Lookups
  • Idempotent updates
  • Store accumulated session-level metrics and player history
  • Use replay to recompute metrics
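Idempotent updates are what make replay safe: applying the same batch twice must leave the session unchanged. A sketch using per-session bookkeeping of applied batch ids (a stand-in for the talk's session-version; Couchbase's actual CAS mechanics differ):

```python
class SessionStore:
    """Toy key-value session store with idempotent updates: a batch is
    applied at most once, so replaying the pipeline is harmless."""

    def __init__(self):
        self.sessions = {}  # session_id -> {"attacks": ..., "applied": set()}

    def apply_batch(self, session_id, batch_id, attacks) -> bool:
        s = self.sessions.setdefault(session_id, {"attacks": 0, "applied": set()})
        if batch_id in s["applied"]:
            return False             # already applied; replay is a no-op
        s["attacks"] += attacks      # accumulate a session-level metric in-stream
        s["applied"].add(batch_id)
        return True
```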

 

Data Storage

  • Camus MapReduce jobs generate an append_events table from Kafka into HDFS every 3 minutes
    • An append-only log of events
    • Each event has a session-version for deduplication (remembering the event's history)
    • append_events -> base_events via MapReduce
      • A logical update of base_events (since Hadoop has no physical update)
      • Replayable from the beginning with no data duplication
    • base_events:
      • A de-normalized table of all events
      • Stores the original JSON plus decorations
      • Custom SerDes query/extract JSON fields without materializing entire rows
      • Standardized event types => lots of functionality for free
    • Events are updated with new metadata by swapping old partitions for new ones
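The append_events -> base_events compaction can be sketched as keep-the-highest-version deduplication on the event id, which is why a full replay produces no duplicates (the field names are my own invention):

```python
def compact(append_events):
    """Logical update over an append-only log: for each event id, keep
    only the row with the highest session-version. Replaying the whole
    log through this function yields the same base table, with no dupes."""
    base = {}
    for row in append_events:
        key = row["event_id"]
        if key not in base or row["session_version"] > base[key]["session_version"]:
            base[key] = row
    return list(base.values())
```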

 

Analytics and Visualization

  • Hive warehouse
    • Normalized event-specific and game-specific stores
    • Aggregated metric data for reports and analytics
    • Maintained through custom RTC
  • Amazon Redshift: fast ad-hoc querying
  • Tableau: simple, powerful reports

 

Hadoop for analytic workloads

Marcel Kornacker from Cloudera talked about recent changes in their Hadoop platform and presented benchmarks and user studies about how people use Hadoop for analytics workloads, i.e., mostly interactive, ad-hoc queries with tight response-time requirements. I think it is mostly interesting for DBMS people and the Hadoop/HDInsight teams.

 

HDFS as storage for an analytic warehouse:

The core Hadoop HDFS storage system has seen a couple of performance improvements:

  • High-performance data scans from disk and memory
  • Co-partitioned tables for distributed joins: affinity groups collocate blocks from different files on the same machine to create the co-partitions
  • Temp-FS: writes temp data to memory, bypassing disk; ideal for iterative, interactive data analysis
  • Short-circuit reads: bypass the DataNode protocol when reading from disk, reading at 100 MB per second
  • HDFS caching: access explicitly cached data without copies or checksums
    • Access memory-resident data at memory-bus speed (35 GB per second)
    • Enables in-memory processing

 

Parquet Columnar Storage for Hadoop

This is the new storage format. People familiar with column stores will know that storing data column by column can reduce its size significantly. Parquet is an open-source Apache project available for most engines, such as Impala, Hive, Pig, MR, and Cascading. It was co-created by Twitter and Cloudera (it is amazing that companies are working together) and is hosted on GitHub, with contributors from Criteo, Stripe, Berkeley, and LinkedIn; it is used in production at Twitter and Criteo.

 

It offers optimized storage of nested data structures, patterned after Dremel's ColumnIO format (good for Twitter's use case), and an extensive set of column encodings:

  • RLE and dictionary encoding in 1.2
  • Delta and optimized string encodings in 2.0
  • Embedded statistics: version 2.0 stores inlined column statistics to further optimize scan efficiency

In a demo with TPC-H 1TB data, the uncompressed file was 750GB; with Parquet and Snappy, the data was only 250GB. Note that they did not use any columnar-storage query-processing techniques (which we use heavily in SQL Server).
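Much of that saving comes from the column encodings listed above. A toy illustration of dictionary plus run-length encoding (Parquet's real encodings are bit-packed and considerably more involved):

```python
def dictionary_encode(column):
    """Replace repeated strings with small integer codes plus a dictionary."""
    dictionary, codes, index = [], [], {}
    for value in column:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes

def rle_encode(codes):
    """Run-length encode a code sequence into [(value, run_length), ...]."""
    runs = []
    for c in codes:
        if runs and runs[-1][0] == c:
            runs[-1] = (c, runs[-1][1] + 1)
        else:
            runs.append((c, 1))
    return runs
```

A column like ["US", "US", "US", "UK", "UK"] becomes the dictionary ["US", "UK"] with codes [0, 0, 0, 1, 1], which RLE collapses to [(0, 3), (1, 2)]: repeated values cost almost nothing to store.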

 

Impala: A modern, Open-Source SQL Engine
 

An MPP SQL query engine for the Hadoop environment (something our Cosmos team might find interesting; I think we are years ahead of them). Version 1.3 is available for CDH4 and CDH5.

It is designed for performance (written in C++) and maintains Hadoop flexibility by building on standard Hadoop components (HDFS, HBase, the Metastore, YARN). It plays well with traditional BI tools, exposing ANSI SQL-92 and running standard SQL. The execution engine is designed for efficiency and circumvents MapReduce completely. It comes from Cloudera and is open source.

 

In-memory execution

  • Aggregation results and the right-hand-side input of joins are cached

 

Run-time code generation

  • Uses LLVM to JIT-compile the runtime-intensive parts of a query
  • The effect is the same as a custom-coded query: branches are removed and function calls are inlined

 

Impala from The user's perspective

  • Create tables as virtual views over data in HDFS or HBase
  • Schema metadata is stored in the Metastore (shared with Hive, Pig, etc.)
  • Connect via ODBC/JDBC

 

Roadmap

  • Window functions
  • Nested data structures (many companies need this)

 

Use cases

There were a couple of use cases, and during the talk and offline people asked how Impala compares with Vertica and Spark.

 

  • Fraud detection in the finance industry: switched from DB2 to Cloudera, with 300TB on a 52-node cluster and 2TB of new data daily; overall a $30M saving, with storage cost going from $7K/TB to $0.5K/TB

 

  • Media match: a leading demand-side platform
    • Data aggregation in 30-minute intervals for reports
    • Evaluated Hive/Pig: 6-hour latency, not satisfactory
    • Decided to use CDH/Impala to replace Netezza

 

  • Allstate Insurance: interactive data exploration with Tableau
    • Test case: 1.1B rows, 800 columns, 2.3TB as CSV, on a 35-node cluster
    • A simple GROUP BY query: 120 seconds

 

Scaling Gilt with Micro-Services

Lessons and challenges Gilt has had with its micro-service architecture. Gilt was founded in 2007 and is a top-50 Internet retailer in the flash-sales business, with ~150 engineers. In the early days, the team used Ruby on Rails with Postgres and Memcached. Then, as the business grew, they hit several technical pains:

  • Spikes required thousands of Ruby processes
  • Postgres was overloaded
  • Routing traffic between Ruby processes sucked

Then three things happened that resolved 90% of the scaling issues:

  • Transitioning to the JVM
  • The start of the m(a/i)cro-service era
  • Dedicated data stores for each service: data was separated into different stores with different storage tools, such as H2 and Voldemort

But the team still had pain points in the development cycle:

  • 1,000 models/controllers, 200K LOC, hundreds of jobs
  • Lots of contributors + no ownership
  • Difficult deployments with long integration cycles
  • Hard to identify root causes

 

So the team adopted micro-services with the following goals:

  • Empower teams with ownership
  • Smaller scope
  • Simpler and easier deployments and rollbacks

Right now they have ~400 services in production and have begun transitioning to Scala and Play. The presenter showed a demo in which each interaction on their website is a call to a micro-service. A micro-service is defined as a functional scope tied to a small number of developers.

New challenges came out of the micro-services:

  • Development and testing are hard: there is no dev/integration environment, the hardware is not strong enough, and no one wants to compile 20 services
  • Ownership is still lacking
  • Monitoring many services is hard
  • Testing is hard:
    1. Hard to execute functional tests between services
    2. Frustrating to deploy semi-manually (Capistrano)
    3. Scary to deploy other teams' services

 

The presenter described a couple of solutions to the above problems, which I did not get into in detail due to my lack of technical knowledge. Here are a couple of key points that I think are important:

  • The team adopted "Go Reactive" with Docker, an extension of Linux Containers (LXC):
    1. Decentralization
    2. Simple configuration
    3. Much lighter than a VM

 

  • Code ownership: code stays in the company longer than the people do
  • You should be building tools in 2014