QCON New York 2014 Conference Day 1 Highlight

06/19/2014

Today I attended the first day of the QCon New York 2014 conference. Here is a brief introduction to the conference:

Software is Changing the World. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

QCon starts with 2 days of tutorials on Monday and Tuesday, June 9-10 followed by the full 3-day conference from Wednesday, June 11-13. The conference will feature over 100 speakers in 6 concurrent tracks daily covering the most timely and innovative topics driving the evolution of enterprise software development today. The setting is the beautiful, centrally-located Marriott at Brooklyn Bridge in New York City.

The main reasons I attended this conference are to learn about industry trends in software and service development, to see how more people use big data and machine learning techniques in their services, and to find existing Microsoft customers and understand their use cases.

Here are the highlights of day 1 of the conference. On day one there were 6 tracks running concurrently:

Hot Technologies behind Modern Finance

Continuous Delivery

The Hyperinteractive Client

Lean Product Design

Architectures you've always wondered about

Solutions Track

I mostly attended the Continuous Delivery and Architectures you've always wondered about tracks. I will share the slides and talks (if possible) in the coming days.

Highlight

Hot techniques:

LinkedIn Kafka
Apache ZooKeeper

New techniques:

Micro-services
Linux containers

Booths

Application monitoring and NoSQL/other database technologies dominated the booths. Nearly all booths were related to one of these, which is also a good indicator of what is hot in the industry.

Application Monitoring:

AppDynamics

I knew AppDynamics from the popular slides "Call of Duty: Dev Ops". AppDynamics is a very popular company for monitoring customers' applications. Their application performance management (APM) product allows people to monitor hybrid environments with Java, .NET and PHP (and a wide list of other environments). Their people showed me a demo of how they monitor an Oracle database and troubleshoot performance using wait stats, poor query plans, etc. This reminded me that our SQL Azure performance monitoring is not as good as it should be.

Compuware APR Development edition. Deep application transaction performance engineering

Tibco Software Inc: Real-time event-enabled solutions.

NoSQL databases:

Riak: a key-value store; Yammer uses it. They are talking with Azure about hosting on Azure. Moving to Seattle soon.

GridGain: in-memory computing (HPC), in-memory streaming, in-memory accelerator for Hadoop, in-memory database

Tibco ActiveSpaces: a distributed peer-to-peer in-memory data grid

Keynote

Today's keynote was "Whither Web Programming?" by Gilad Bracha, co-author of the Java spec. Gilad discussed several tools that compile interactively and show the result in the browser. Since I am not a frontend expert, I just list the tools here:

Try Dart (https://try.dartlang.org/): a full Dart compiler running in the browser

You can modify code and check syntax online. He mentioned reflection: reflection makes your output larger

Elm (https://elm-lang.org/blog/announce/0.12.elm): a purely functional language for UI construction. Runs on the web, with a live environment.
Lively (https://www.lively-kernel.org and https://lively-web.org/welcome.html): an explorative authoring environment. Running on JavaScript in a live environment, you can modify things on the fly and change the behavior. The editor itself can be changed as well

Has a live debugger, (IDE),

serializing a thread,

continuation to flow…

Leisure: a purely functional, lazy, dynamic language. Supports different themes
Newspeak: his hobby project. A live, modular, object-capability language. Syncs the application's code and data when it is online, so the app is always available offline

Live code editing + Mixins

Changing a mixin at runtime means changing all classes that mix it in.

His final thought:

Whither web programming:

Web apps should evolve to compete with and surpass native apps
The web platform must support offline work, stored programs, and many programming languages.

Foursquare: Evolving from check-ins to a recommendation engine

Currently the company has 140 employees. Highlights:

ZooKeeper to connect many services together
Adding a new API without a deploy is quite interesting

Part 1: scale the data storage

They started in 2009 on MySQL and switched to PostgreSQL due to requirements. They used the typical ways of scaling out: indexing and memory caches. When things became big, they either 1) split tables, replacing joins with several queries, or 2) replicated to read-only copies and redirected read traffic.

However, they were outgrowing their hardware:

Not enough RAM for indexes and the working data set
100 writes/second/disk

So they decided to use sharding and evaluated the following options:

Building their own sharding logic, but there are lots of issues to handle
Trying Cassandra, HBase, or Mongo

They selected Mongo due to

i. Geo-replication
ii. Schema flexible + secondary index on the data
iii. Auto-balancing feature on the roadmap
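To picture what the home-grown sharding option mentioned above involves, here is a minimal sketch of hash-based shard routing. The shard count, the key format, and the function name `shard_for` are assumptions for illustration only; this is not Foursquare's implementation.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count, chosen only for the example


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable hash."""
    # hashlib produces the same digest in every process,
    # unlike the built-in hash(), which is randomized for strings.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


# Route a few hypothetical user keys to shards.
routing = {user_id: shard_for(user_id) for user_id in ("user:1", "user:2", "user:3")}
```

With plain modulo hashing, changing the shard count remaps most keys. That kind of rebalancing work is one of the issues you have to handle yourself when you build your own sharding.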

From 2010 to 2011, they migrated one table at a time to Mongo. There are 15 clusters, and peak load is 1 million queries per second.

Besides Mongo, they use Memcache, and Elasticsearch for nearby venue search and user search. They also built two data services:

Read only key value server
In memory cache

The read-only key-value server (HFile) is a file index service. They use nightly MapReduce jobs to generate HFiles: prefix index files that precompute the results of commonly used queries.
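As a rough in-memory analogy for such a read-only, precomputed index, here is a minimal sketch. It is not the HFile format. The class name `ReadOnlyIndex` and the venue keys are made up for the example; the dictionary passed in stands in for the output of a nightly job.

```python
from bisect import bisect_left


class ReadOnlyIndex:
    """Immutable sorted key/value index, built once (e.g. by a nightly job)."""

    def __init__(self, items):
        pairs = sorted(items.items())
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, key, default=None):
        """Exact lookup by binary search over the sorted keys."""
        i = bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

    def scan_prefix(self, prefix):
        """Return all (key, value) pairs whose key starts with prefix."""
        i = bisect_left(self._keys, prefix)
        out = []
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            out.append((self._keys[i], self._values[i]))
            i += 1
        return out


# Hypothetical precomputed results keyed by a prefix-friendly string.
index = ReadOnlyIndex({
    "venue:nyc:1": "Cafe",
    "venue:nyc:2": "Bar",
    "venue:sf:1": "Park",
})
nyc_venues = index.scan_prefix("venue:nyc:")
```

Because the data never changes between builds, the server needs no locking and can simply swap in the next night's index.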

They use ZooKeeper to tie in the Hadoop cluster that runs these jobs. Caching services sit on top of Mongo to avoid operations that are very expensive for Mongo.

Part 2: application complexity

In 2009 they used PHP; the company then moved to Scala on the JVM, using the Lift web framework.

One interesting tool is RPC tracing in the API explorer. It is the least expensive way to get API insight (DB connections, performance, troubleshooting, and stack traces). It also shows RPC counts over the past week per API: if the number of RPC calls increases, it means something is wrong.

Another tool is called Throttles: it dynamically switches features on and off. Features can be turned on for specific IDs, internal users, etc., with different rules. It is also used for rolling out new features.
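The idea of a dynamic feature switch can be sketched as follows. This is only an illustration under my own assumptions: the class, method, and rule names are hypothetical and are not taken from Foursquare's Throttles tool.

```python
class Throttles:
    """In-memory feature switches with per-feature rules (hypothetical API)."""

    def __init__(self):
        self._rules = {}

    def set_rule(self, feature, enabled=False, user_ids=(), internal_users=False):
        """Define or replace the rule for a feature at runtime."""
        self._rules[feature] = {
            "enabled": enabled,                # on for everyone
            "user_ids": set(user_ids),         # on for specific IDs
            "internal_users": internal_users,  # on for internal users
        }

    def is_on(self, feature, user_id=None, is_internal=False):
        rule = self._rules.get(feature)
        if rule is None:
            return False  # unknown features default to off
        if rule["enabled"]:
            return True
        if user_id in rule["user_ids"]:
            return True
        return rule["internal_users"] and is_internal


throttles = Throttles()
# Roll out a made-up feature to one user ID and to internal users first.
throttles.set_rule("new_feed", user_ids=[42], internal_users=True)
```

Defaulting unknown features to off keeps a missing rule from accidentally exposing unfinished work.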

Remember the goats, i.e., the growing pains as developers:

Devs need to compile all the code all the time
Deploying all the code all the time
Hard to isolate the cause of performance regressions and resource leaks

So the solution: SOA infancy

Single codebase, multiple builds
Twitter's Scala-based RPC library (automatically generates client- and server-side APIs)

But the team still faced the following problems:

Duplication in packaging and deployment efforts
Hard to trace execution problems (correlating different traces)
Hard to define/change where things live (configs are hard-coded)
Networks aren't reliable (RPC calls will fail)

The solutions are:

Builds and deploys use 1) single service definition files, 2) consistent build packaging, and 3) simple deployment to canary & fleet

such as ./service_releaser -j service_name

Monitoring: each application is monitored in the same way
- Healthcheck endpoint over HTTP
- Consistent metric names
- Dashboard for every service
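A healthcheck endpoint over HTTP can be as small as the sketch below, which uses only the Python standard library. The `/healthcheck` path, the service name, and the individual checks are assumptions for illustration; the talk did not show the actual endpoint.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_payload(service_name, checks):
    """Build a (status_code, json_body) pair from named boolean checks."""
    healthy = all(checks.values())
    body = json.dumps(
        {"service": service_name, "healthy": healthy, "checks": checks},
        sort_keys=True,
    )
    return (200 if healthy else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthcheck":  # hypothetical path
            self.send_error(404)
            return
        # Hypothetical checks; a real service would probe its dependencies.
        status, body = health_payload("example-service", {"db": True, "cache": True})
        data = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)


def serve(port=8080):
    """Start the healthcheck server (blocks until interrupted)."""
    HTTPServer(("127.0.0.1", port), HealthHandler).serve_forever()
```

If every service exposes the same endpoint shape, one dashboard template and one alerting rule can cover all of them.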

Distributed tracing tools

Send all traces to a Kafka queue to summarize the traces

Each application passes a correlation ID from the parent down to its children
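Passing a correlation ID from parent to child can be sketched like this. The field names and the `call_service` helper are hypothetical, and a plain list stands in for the Kafka queue that collects the traces.

```python
import uuid


def new_trace(parent=None):
    """Create a trace context; children inherit the parent's correlation ID."""
    return {
        "correlation_id": parent["correlation_id"] if parent else uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:8],
        "parent_span_id": parent["span_id"] if parent else None,
    }


def call_service(name, parent, trace_log):
    """Simulate an RPC: record a trace entry tagged with the shared ID."""
    ctx = new_trace(parent)
    trace_log.append(
        (ctx["correlation_id"], name, ctx["span_id"], ctx["parent_span_id"])
    )
    return ctx


trace_log = []  # stand-in for the trace queue
root = call_service("api", None, trace_log)
child = call_service("venues", root, trace_log)
call_service("mongo", child, trace_log)
```

Grouping the collected entries by correlation ID then reconstructs the full call tree of one request.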

Exception aggregation

All exceptions are published to a single place, where full stacks can be seen very easily

Application discovery

Use ZooKeeper + Finagle server sets to dynamically handle hostnames, etc.

Network issue:

Circuit breaking:

Fast failing RPC calls after some error rate threshold

Loosely based on Netflix's Hystrix
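Here is a minimal sketch of the circuit breaking idea: fail fast once the error rate passes a threshold, then allow calls again after a cool-down. The thresholds and class names are assumptions, and this is neither Hystrix's nor Foursquare's actual algorithm.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, error_threshold=0.5, min_calls=4, reset_after=30.0,
                 clock=time.monotonic):
        self.error_threshold = error_threshold  # error rate that opens the circuit
        self.min_calls = min_calls              # calls needed before judging the rate
        self.reset_after = reset_after          # cool-down in seconds
        self._clock = clock
        self._calls = 0
        self._errors = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("failing fast")
            # Cool-down elapsed: reset counters and let calls through again.
            self._opened_at = None
            self._calls = self._errors = 0
        self._calls += 1
        try:
            return func(*args, **kwargs)
        except Exception:
            self._errors += 1
            if (self._calls >= self.min_calls
                    and self._errors / self._calls >= self.error_threshold):
                self._opened_at = self._clock()
            raise
```

The clock is injectable so the cool-down can be exercised in tests without sleeping. A caller wraps each RPC as `breaker.call(rpc_function, args)` and treats `CircuitOpenError` as an immediate failure.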

Organization changes

Smaller teams owning the front-to-back implementation of features

Desire to have quick deploy cycles on new API end points

Remote endpoints

Wouldn't it be cool if a developer could expose a new API without redeploying packages?

Some libraries register the endpoints through ZooKeeper. Thrift.

It takes minutes for a dev to have a new API running on the official site, by using a proxy to redirect traffic to the new API

Benefit: Tight contract for service interaction

JSON response
All HTTP parameters passed along

Clear path to breaking off more chunks from the API monolith
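The remote endpoint idea can be pictured with a small registry that a proxy consults before falling back to the main API. This is a sketch under my own assumptions: an in-memory dictionary stands in for ZooKeeper, and the class name, paths, and handlers are made up.

```python
class EndpointRegistry:
    """In-memory stand-in for a ZooKeeper-backed endpoint registry."""

    def __init__(self):
        self._routes = {}

    def register(self, path, handler):
        """A service announces that it now serves this path."""
        self._routes[path] = handler

    def unregister(self, path):
        self._routes.pop(path, None)

    def dispatch(self, path, params, default):
        """Proxy step: use a registered endpoint, else fall back to default."""
        handler = self._routes.get(path, default)
        return handler(params)


def main_api(params):
    # Hypothetical fallback: the existing API handles the request.
    return {"source": "main-api"}


def new_tips_service(params):
    # Hypothetical new endpoint; all parameters are passed along.
    return {"source": "new-service", "params": params}


registry = EndpointRegistry()
before = registry.dispatch("/v2/tips", {"q": "pizza"}, main_api)
registry.register("/v2/tips", new_tips_service)
after = registry.dispatch("/v2/tips", {"q": "pizza"}, main_api)
```

Because routing is decided at request time, registering a path is enough to shift traffic; nothing in the main API has to be redeployed.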

Part 3: future work

Further isolating services with independent storage layers.
Completely automated continuous deployment
Hybrid immutable/mutable data storage
- Mongo & hfile & cache service

jon@foursquare.com

Migrating to cloud native with micro-services by Adrian Cockcroft

Adrian recently left Netflix to help the IT industry adopt the practices developed at Netflix.

Here are the highlights of the lessons from Netflix:

Speed wins in the marketplace: even facing big competitors, speed will win
Remove friction from product development (key)
High trust, low process, no hand-offs between teams (reduce meetings, permissions, culture)
Freedom and responsibility culture (hard for other companies to replicate)
Don't do your own undifferentiated heavy lifting (reducing management cost).
Use simple patterns automated by tools (simple architecture review board).
Self-service cloud makes impossible things instant

Question: with the rapid change over the last year or six months, corporate IT has been learning cloud and getting up to speed.

I did not record everything, so here are some of the notes I took for your reference. He talked a lot about micro-services, which seem very popular recently. I will share the slides and talks when they are available.

Disruptors: take what used to be expensive and learn to "waste" it to save money somewhere else

Example 1: Solid state disk. Past: assume random reads are expensive. Now: random reads are free; immutable writes, log-merge

SSD packaged as disk, as PCI card, as memory storage.

Cloud native storage architecture (don’t build a separate SSD-based distributed system; embed the SSDs into the Hadoop machines).

Cassandra scalability:

Linear scale up

Hundreds of nodes per cluster in common use today

Thousands of nodes per cluster are tested.

One node: 300,000 IOPS read/write, 5.4T of SSD
100 nodes: 30 million IOPS and 640T

Example 2: Non-cloud product development

Hardware provisioning is undifferentiated heavy lifting; replace it with IaaS. IaaS-based product development allows you to develop in weeks. However, SaaS can allow you to develop in days.

The difference between big data and BI is that big data answers unplanned questions in hours.

Open Space discussion on Continuous delivery

I attended the open space discussion on CD. I was involved in two topics, both testing related. I guess how a testing strategy fits into the overall delivery pipeline of the whole service is still an open question that many people struggle with.

Automatic Performance testing on complex system

In this session, one developer described his problem. His team has a very good testing strategy: a strong focus on unit testing, some integration testing, and a couple of end-to-end acceptance tests with UI automation. He is worried about performance regressions and wanted to see whether performance can be tested in a cheap way. There were a couple of ideas from different people:

Testing each component might not be enough; you need to test end to end for performance
Have an environment similar to production and test with tools such as JMeter
Monitor production

I explained the three Ds of the Yammer team: Dark Release, Dogfood and Data Insight. I suggested turning features off by default to reduce risk, using dogfood to test your features, and building rich telemetry for the KPIs of the system.

Automatic vs. manual

One game company has many manual testers in its QA department to test new releases, which ship every 6 months. One person suggested reducing (or removing all of) the testers and investing more in test automation. The person who raised this question also mentioned that organization is the main issue, since you are working with different departments with different points of view. Again, I suggested they do A/B testing: even when they ship client applications to customers, they may still be able to use A/B testing to turn features on and off, and also do more dogfooding and telemetry.

Google backup cloud from Raymond Blum

Not a very interesting talk, but a couple of his highlights are very important for us:

Redundancy does not bring recoverability: it does not mean that your data is safe. Distributed processing imposes data consolidation
Local copies don't protect against site outages
Diversity of storage technologies further guards against bugs taking out data. Google uses tapes for backup; tapes are more durable than disk
The only good backup is one that you have restored
Run continuous restores
Run automatic comparisons
Alert us if there are unexpected types or rates of failures
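The restore, compare, and alert loop above can be sketched as follows. The function names, the in-memory "backup", and the 1% failure threshold are assumptions for illustration; the talk did not show Google's tooling.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def verify_restores(originals, restore):
    """Restore every item and compare checksums; return the names that failed."""
    failures = []
    for name, data in originals.items():
        try:
            restored = restore(name)
        except Exception:
            failures.append(name)  # the restore itself failed
            continue
        if checksum(restored) != checksum(data):
            failures.append(name)  # restored, but the content differs
    return failures


def should_alert(failures, total, max_failure_rate=0.01):
    """Alert when the failure rate is higher than expected."""
    return total > 0 and len(failures) / total > max_failure_rate


# Hypothetical data: one backup copy is silently corrupted.
originals = {"a": b"alpha", "b": b"beta"}
backup = {"a": b"alpha", "b": b"corrupted"}
failures = verify_restores(originals, backup.__getitem__)
alert = should_alert(failures, len(originals))
```

Running a check like this continuously is what turns "we have backups" into "we know we can restore".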

Replicated data has to be consolidated eventually; optimize for restore, even if that makes backup more complex
Backup and restore strategies have to scale
Automate!!! To reduce the costs
Coordinate the processing of thousands of tapes
MapReduce is really good at sharding, fault tolerance and blocking on dependencies
- DiRT! Google has a yearly exercise on data recovery and found lots of issues
- Be careful of cyclic dependencies. For example, the encryption key was backed up on tape, but without the key there is no way to read from the tape.
- The backup team only guarantees a couple of hours of data; other teams build on soft deletes or other techniques to meet their SLOs and SLAs

TestOps for Continuous Delivery

Acquia Cloud is a PaaS for PHP apps. It has:

Multiple environments: dev, stage, prod
Continuous integration environment for your app
Special sauce for Drupal

Obligatory impressive numbers, 03/2014

27 billion original hits per month
422 TB data transfer/month
8000+ EC2 instances

A release every 1.6 days on average

Each release alters the infrastructure under thousands of web apps that we don't control

Our customers hate downtime. The main issue is that server configuration is software, and it is hard to test. Puppet and Chef can assist you, but you still need to invest a lot in testing. Problem: reality is very messy! You might have launch failures or race conditions.

Unit tests vs. system tests

Unit tests work on isolated programs
The problem is that you cannot mock out the real workload and get accurate results
- Server configuration interacts with the OS, network, and services

System tests are for end to end

Apply code changes to real, running services

Exercise the infra as the apps will

System tests FTW:

For infrastructure, system tests are essential

Test in a clone of production is not right:

No back-doors to make tests "easier"
PCI, HIPAA security requirements
Tests operate just as admins do:
- Most tests operate "from a bastion":
- Ensures the code works in production

Testing Strategies

Basic build tests: Launch VMs, run Puppet

Replicate a functional production environment
Isolation from production
Scan syslog for errors
Test config files, daemons, users, cron jobs
Sample failure:
Incorrect Puppet dependencies work while iterating on a development instance but not on a clean launch

Functionality tests the moving parts:

Backup and restores

Message queues

Worker auto-scaling

Load balancing with up and down workload

ELB health check and recovery

Database failover

Monitoring and alerts

Self-healing

Application tests: Install and verify application(s)

Real site code
Real site db (scrubbed)
Cause app to exercise the infrastructure

- Write to database, message queue, etc.
  - Verify success on the backup
  - Operate app on degraded infrastructure

Reboot tests:

Reboot all test servers
Re-run build tests
Re-run functional and application tests
Sample failure

- File system mounted?
- Services restarted?

- Database quota daemon starts before MySQL daemon, then alert

Re-launch tests: Re-launch all test servers from base images

Simulate server crash and recovery
Persistent data retained?
Server rejoins servers?
Unexpected issues

Re-run build, functional and application tests

Sample failure:

Non-deployable customer application prevents relaunch from completing normally

Upgrade test

Also need to test upgrading existing servers: (you add a feature
The upgrade test dance:
Launch servers in test environments on current production code
Run smoke tests to ensure system is operating
Upgrade servers to latest development code

Testing in parallel: run tests in parallel to optimize running time

Workers may alter server-wide behavior
Each worker needs an isolated set of servers
Workers that break their services need to self-destruct, or they will cause false failures

Management issues:

Who writes the tests?

Our tests are as complex as, or more complex than, the production code
Subtle cases require white-box testing
First try: QA department
Now: Engineering

- Takes longer to write

- Triggering specific failure scenarios requires understanding OS and code details together

- Didn't work; they could not keep up or go deep

- Every dev writes unit and system tests for their own code.

Who fixes the tests:

Infrastructure system tests are fragile
Code review requires a "passing" run
Bugs often only occur post-commit
Permanent, rotating team handles failures

- The team must analyze any failure, confirm whether it is unrelated, and refer to or open a ticket for it.

- Authority to revert any commit causing a failure
  - Usually it is easier to fix it instead.

Who invests in the tests:

Management must accept that infrastructure system tests are:

Hard

Time-consuming

Essential

Worth it

Under-investing will bite you badly.
