Today I attended the first day of the QCon New York 2014 conference. Here is a brief introduction to the conference:
Software is Changing the World. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.
QCon starts with 2 days of tutorials on Monday and Tuesday, June 9-10 followed by the full 3-day conference from Wednesday, June 11-13. The conference will feature over 100 speakers in 6 concurrent tracks daily covering the most timely and innovative topics driving the evolution of enterprise software development today. The setting is the beautiful, centrally-located Marriott at Brooklyn Bridge in New York City.
The main reasons I am attending this conference are to learn the industry trends in software and service development, to see how more people are using big data and machine learning techniques in their services, and to find existing Microsoft customers and understand their use cases.
Here are the highlights of day 1 of the conference. On day one, 6 tracks ran concurrently.
I mostly followed the Continuous Delivery and "Architectures You've Always Wondered About" tracks. I will share the slides and talks (if possible) in the coming days.
- LinkedIn Kafka
- Apache ZooKeeper
- Linux containers
Application monitoring and NoSQL/other database techniques dominate the booths; nearly all booths are related to one of these techniques, which is also a good indicator of what is hot in the industry.
I knew AppDynamics from the popular slides "Call of Duty: Dev Ops". AppDynamics is a very popular company for monitoring customers' applications. Their application performance management (APM) product lets people monitor hybrid environments with Java, .NET and PHP (and a wide list of other environments). Their staff showed me a demo of how they monitor an Oracle database and troubleshoot performance on wait stats, poor query plans, etc., which reminded me that our SQL Azure performance monitoring is not as good as it should be.
- Compuware APR Development edition. Deep application transaction performance engineering
- Tibco Software Inc: real-time event-enabled solutions.
Riak: a key-value store; Yammer uses it. They are talking with Azure about hosting on Azure. Moving to Seattle soon.
GridGain: in-memory computing (HPC), in-memory streaming, in-memory accelerator for Hadoop, in-memory database
Tibco ActiveSpaces: a distributed peer-to-peer in-memory data grid
Today's keynote was "Whither Web Programming?" by Gilad Bracha, co-author of the Java spec. Gilad discussed several tools that compile code interactively and show the result in the browser. Since I am not a frontend expert, I just list the tools here:
- Try Dart (http://try.dartlang.org/): a full Dart compiler running in the browser
You can modify code and check syntax online. He mentioned reflection: reflection makes your compiled output larger
- Elm (http://elm-lang.org/blog/announce/0.12.elm). A pure functional language for UI construction. Runs on the web. With a live environment.
Has a live debugger (IDE),
serializing a thread,
continuation to flow…
- Leisure: a purely functional, lazy, dynamic language. Supports different themes
- Newspeak: his hobby project. A live, modular, object-capability language. It syncs the application's code and data when it is online, so the app is always available offline
Live code editing + Mixins
Changing a mixin at runtime means changing all classes that mix it in.
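To make the mixin remark concrete for myself, here is a minimal sketch in Python (not Newspeak, and not something shown in the keynote). The class names `Greeter`, `Button` and `Menu` are made up for illustration; the point is only that replacing a method on the shared mixin is immediately visible through every class that mixes it in.

```python
class Greeter:
    """A mixin: behaviour shared by every class that mixes it in."""

    def describe(self):
        return f"{type(self).__name__} v1"


class Button(Greeter):
    pass


class Menu(Greeter):
    pass


button, menu = Button(), Menu()
before = [button.describe(), menu.describe()]


# Replace the mixin's method while the program is running.
def describe_v2(self):
    return f"{type(self).__name__} v2"


Greeter.describe = describe_v2

# Existing instances of every class that mixes Greeter in see the change.
after = [button.describe(), menu.describe()]
```

This is also why live editing of a mixin is powerful and risky at the same time: one edit changes many classes at once.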
His final thought:
Whither web programming:
- Web apps should evolve to compete with and surpass native apps
- The web platform must support offline work, stored programs, and many programming languages.
Foursquare: evolving from check-ins to a recommendation engine
Currently the company has 140 employees. Highlights:
- ZooKeeper to connect many services together
- Adding a new API without a deploy is quite interesting
Part 1: scaling the data storage
They started in 2009 on MySQL and switched to PostgreSQL due to requirements. They used the typical ways of scaling out: indexing, in-memory caching and, when things became big, either 1) splitting tables and replacing joins with several queries, or 2) replicating to read-only copies and redirecting traffic.
However, they outgrew their hardware:
- Not enough RAM for indexes and the working data set
- 100 writes/second/disk
So they decided to use sharding, and evaluated the following options:
- Building their own sharding logic, but with a lot of issues to handle
- Trying Cassandra, HBase, Mongo
They selected Mongo due to
- i. Geo-replication
- ii. Flexible schema + secondary indexes on the data
- iii. Auto-balancing feature on the roadmap
From 2010 to 2011, they migrated one table at a time to Mongo. They now have 15 clusters, taking 1 million queries per second at peak.
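To see why "build your own sharding logic" comes with a lot of issues, here is a minimal sketch of application-level hash sharding. This is my own illustration, not Foursquare's code and not how Mongo balances data; the shard names and the `shard_for` helper are hypothetical.

```python
import hashlib


def shard_for(key, shards):
    """Map a key to a shard by hashing it (stable across processes)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


shards = ["shard-0", "shard-1", "shard-2"]
user_ids = [f"user-{i}" for i in range(1000)]
placement = {uid: shard_for(uid, shards) for uid in user_ids}

# Adding a shard changes the modulus, so many keys now map to a new home.
# Moving that data safely is one of the issues you have to handle yourself.
grown = shards + ["shard-3"]
moved = sum(1 for uid in user_ids if shard_for(uid, grown) != placement[uid])
```

With naive modulo hashing most keys move when a shard is added, which helps explain why an auto-balancing feature on the roadmap was attractive.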
Besides Mongo, they use Memcache, and Elasticsearch for nearby venue search and user search. They also built two data services:
- Read-only key-value server
- In-memory cache
The read-only key-value server (HFile) is a file index service. They use nightly MapReduce jobs to generate HFiles: prefix index files that pre-compute the results of commonly used queries.
They use ZooKeeper to tie together the Hadoop cluster that runs these jobs. The caching services sit on top of Mongo to avoid queries that are very expensive for Mongo.
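The idea of a read-only, pre-computed prefix index can be sketched in a few lines. This is only an illustration under my own assumptions: it is not the HFile format, and the `PrefixIndex` class and the venue data are made up. The keys are sorted once offline (the nightly job), and the server answers prefix queries with a binary search.

```python
import bisect


class PrefixIndex:
    """Read-only index: built once offline, then only queried."""

    def __init__(self, records):
        self._records = dict(records)
        self._keys = sorted(self._records)  # the expensive step, done offline

    def lookup(self, prefix):
        """Return (key, value) pairs whose key starts with prefix."""
        start = bisect.bisect_left(self._keys, prefix)
        results = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so no later key can match
            results.append((key, self._records[key]))
        return results


# Made-up venue names mapped to made-up venue ids.
index = PrefixIndex({
    "cafe grumpy": 101,
    "cafe habana": 102,
    "central park": 103,
    "carnegie hall": 104,
})
cafes = index.lookup("cafe")
```

Because nothing is written after the build, the server needs no locking and the files can simply be swapped when the next nightly build finishes.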
Part 2: application complexity
They started in 2009 with PHP, then moved to Scala on the JVM, using the Lift web framework.
One interesting tool is RPC tracing, which ties into their API explorer. It is the most inexpensive way to get insight into an API (DB connections, performance, troubleshooting and stack traces). It also shows RPC counts over the past week per API: if the number of RPC calls increases, something is probably wrong.
Another tool is called Throttles: it dynamically switches features on and off. Features can be turned on for specific IDs, internal users, etc., with different rules. It is also used for rolling out new features.
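I did not see the Throttles implementation, but the behaviour described (a switch per feature, with rules for specific IDs and internal users) could look roughly like the sketch below. The class, method and feature names are all hypothetical.

```python
class Throttles:
    """Hypothetical in-memory registry of feature switches."""

    def __init__(self):
        self._rules = {}

    def set_rule(self, feature, enabled=False, user_ids=(), internal_users=False):
        self._rules[feature] = {
            "enabled": enabled,                # on for everyone
            "user_ids": set(user_ids),         # on for these ids only
            "internal_users": internal_users,  # on for internal users
        }

    def is_on(self, feature, user_id, is_internal=False):
        rule = self._rules.get(feature)
        if rule is None:
            return False  # unknown features stay off
        if rule["enabled"] or user_id in rule["user_ids"]:
            return True
        return rule["internal_users"] and is_internal


throttles = Throttles()
# Dark launch: only user 42 and internal users see the feature.
throttles.set_rule("new_search", user_ids=[42], internal_users=True)
dark_launch = [
    throttles.is_on("new_search", 42),
    throttles.is_on("new_search", 7, is_internal=True),
    throttles.is_on("new_search", 7),
]
# Full rollout: flip the switch without deploying code.
throttles.set_rule("new_search", enabled=True)
rolled_out = throttles.is_on("new_search", 7)
```

In a real system the rules would live in shared configuration rather than in process memory, so that a change reaches every server.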
Remember the goats, i.e., the growing pains for developers:
- Devs need to compile all the code all the time
- Deploying all the code all the time
- Hard to isolate the cause of performance regressions and resource leaks
So the solution: SOA in its infancy
- Single codebase, multiple builds
- Twitter's Scala-based RPC library (automatically generates client/server side APIs)
But the team still faced the following problems:
- Duplication in packaging and deployment efforts
- Hard to trace execution problems (correlating different traces)
- Hard to define/change where things live (configs are hard-coded)
- Networks aren't reliable (RPC calls will fail)
The solutions are:
- Builds and deploys using 1) single service definition files, 2) consistent build packaging, 3) simple deployment of canary & fleet,
such as ./service_releaser -j service_name
- Monitoring: each application is monitored the same way
- Healthcheck endpoint over HTTP
- Consistent metric names
- Dashboard for every service
- Distributed tracing tools
Send all traces to a Kafka queue to summarize the traces
Each application passes a correlation ID from the parent down to its children
- Exception aggregation
All exceptions are published to a single spot, where full stacks are very easy to see
- Application discovery
Use ZooKeeper + Finagle server sets to dynamically handle hostnames, etc.
- Network issues:
Fail RPC calls fast once an error rate threshold is crossed
Loosely based on Netflix's Hystrix
- Organization changes
Smaller teams owning the front-to-back implementation of a feature
Desire to have quick deploy cycles on new API endpoints
Wouldn't it be cool if a developer could expose a new API without redeploying packages?
Some libraries register the endpoints through ZooKeeper. Thrift.
It takes minutes for a dev to have a new API running on the official site, using a proxy to redirect traffic to the new API
Benefit: tight contract for service interaction
- JSON response
- All HTTP parameters passed along
Clear path to breaking more chunks off the API monolith
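Going back to the network item in the list above (fail RPC calls fast after an error threshold), here is a minimal circuit-breaker sketch. It is not Foursquare's code or Hystrix itself, and for brevity it counts consecutive failures instead of tracking an error rate. The class names and parameters are my own assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling a service that keeps failing."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self._clock = clock
        self._failures = 0
        self._opened_at = None

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("failing fast; remote call skipped")
            # Cool-down elapsed: allow one trial call. A single failure
            # re-opens the circuit because the count is one below the limit.
            self._opened_at = None
            self._failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success closes the circuit fully
        return result
```

A caller would wrap each RPC, for example `breaker.call(client.get_venue, venue_id)` with a hypothetical client, and treat `CircuitOpenError` as a signal to serve a fallback instead of waiting on a dead dependency.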
Part 3: future work
- Further isolating services with independent storage layers.
- Completely automated continuous deployment
- Hybrid immutable/mutable data storage
- Mongo & hfile & cache service
Migrating to Cloud Native with Microservices, by Adrian Cockcroft
Adrian recently left Netflix to help the IT industry adopt the practices developed at Netflix.
Here are the highlights of what was learned at Netflix:
- Speed wins in the marketplace: you face big competitors, but speed will win
- Remove friction from product development (key)
- High trust, low process, no hand-offs between teams (reduce meetings, permissions, culture)
- Freedom and responsibility culture (hard for other companies to replicate)
- Don't do your own undifferentiated heavy lifting (reducing management cost).
- Use simple patterns automated by tools (simple architecture review board).
- Self-service cloud makes impossible things instant
Question: with the rapid change over the last year or six months, corporate IT has been learning cloud and getting up to speed.
I did not record everything, so here are some of the notes I took, for your reference. He talked a lot about microservices, which seem very popular recently. I will share the slides and talks when they are available.
Disruptors: take what used to be expensive and learn to "waste" it to save money somewhere else
Example 1: Solid state disk. Past: assume random reads are expensive. Now: random reads are free, immutable writes, log-merge
SSD packaged as disk, as PCI card, as memory storage.
Cloud native storage architecture (don't build a separate SSD-based distributed system; embed the SSDs in the Hadoop machines).
Linear scale up
Hundreds of nodes per cluster in common use today
Thousands of nodes per cluster are tested.
- One node: 300,000 IOPS read/write, 5.4T of SSD
- 100 nodes: 30 million IOPS and 640T
Example 2: non-cloud product development
Hardware provisioning is undifferentiated heavy lifting; replace it with IaaS. IaaS-based product development allows you to develop in weeks; SaaS, however, can allow you to develop in days.
The difference between big data and BI is that big data answers unplanned questions in hours.
Open Space discussion on Continuous Delivery
I attended the open space discussion on CD. I was involved in two topics, both testing related. I guess how a testing strategy fits into the overall delivery pipeline of a whole service is still an open question that many people struggle with.
Automatic performance testing on complex systems
In this session, one developer described his problem. His team has a very good testing strategy: a strong focus on unit testing, some integration testing, and a couple of end-to-end acceptance tests with UI automation. He is worried about performance regressions and wanted to know whether performance can be tested cheaply. Different people offered a couple of ideas:
- Testing each component might not be enough; you need to test end to end for performance
- Have an environment similar to production and test with tools such as JMeter
- Monitoring production
I explained the three Ds of the Yammer team: Dark Release, Dogfood and Data Insight. I suggested turning features off by default to reduce risk, using dogfood to test features, and building rich telemetry for the KPIs of the system.
Automatic vs. manual
One game company has many manual testers in its QA department to test new releases, which ship every 6 months. One person suggested reducing (or removing) the testers and investing more in test automation. The person who raised the question also mentioned that organization is the main issue, since you work with different departments with different points of view. Again, I suggested A/B testing: even though they ship client applications to customers, they may still be able to use A/B testing to turn features on and off, and also do more dogfooding and telemetry.
Google backup cloud, by Raymond Blum
Not a very interesting talk, but a couple of his highlights are very important for us:
- Redundancy does not bring recoverability: it does not mean that your data is safe. Distributed processing imposes data consolidation
- Local copies don't protect against site outages
- Diversity of storage technologies further guards against bugs taking out data. Google uses tapes for backup; tapes are more durable than disk
- The only good backup is one that you have restored
- Run continuous restores
- Run automatic comparisons
- Alert us if there are unexpected types or rates of failures
- Replicated data has to be consolidated eventually and optimized for restore, even if that makes backup more complex
- Backup and restore strategies have to scale
- Automate! To reduce the costs
- Coordinate the processing of thousands of tapes
- MapReduce is really good at sharding, fault tolerance and blocking on dependencies
- DiRT! Google has a yearly exercise on data recovery and found a lot of issues
- Be careful of cyclic dependencies. For example, the encryption key was backed up on tape, but without the key there is no way to read from the tape.
- The backup team only guarantees a couple of hours of data, and other teams build on soft deletion or other techniques to meet the SLO and SLA
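The "only good backup is one that you have restored" point above (continuous restores plus automatic comparisons) can be sketched on a small scale. This is my own illustration, not Google's tooling: `verify_backup` is a hypothetical helper, and copying a directory stands in for a real restore.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def checksum(path):
    """SHA-256 of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def verify_backup(source_dir, backup_dir):
    """Restore backup_dir into a scratch directory and compare it with
    source_dir. Returns the relative paths that are missing or differ."""
    source_dir = Path(source_dir)
    problems = []
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / "restored"
        shutil.copytree(backup_dir, restored)  # stand-in for a real restore
        for src in sorted(source_dir.rglob("*")):
            if not src.is_file():
                continue
            rel = src.relative_to(source_dir)
            copy = restored / rel
            if not copy.is_file() or checksum(copy) != checksum(src):
                problems.append(rel.as_posix())
    return problems
```

Run on a schedule, a non-empty result (or an unusual rate of them) is what should trigger the alert mentioned in the list.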
TestOps for Continuous Delivery
Acquia Cloud is a PaaS for PHP apps. It has
- Multiple environments: dev, stage, prod
- Continuous integration environment for your app
- Special sauce for Drupal
Obligatory impressive numbers, 03/2014
- 27 billion original hits per month
- 422 TB data transfer/month
- 8000+ EC2 instances
A release every 1.6 days on average
- Each release alters the infrastructure under thousands of web apps that we don't control
Our customers hate downtime. The main issue is that server configuration is software, and it is hard to test. Puppet and Chef can assist you, but you still need to invest a lot in testing it. Problem: reality is very messy! You might have launch failures or race conditions.
Unit tests vs. system tests
- Unit tests work on isolated programs
- The problem is that you cannot mock out the real workload and get accurate results
- Server configuration interacts with the OS, network, and services
System tests are for end to end
Apply code changes to real, running services
Exercise the infra as the apps will
System tests FTW:
For infrastructure, system tests are essential
Test in a clone of production is not right:
- No back-doors to make tests "easier"
- PCI, HIPAA security requirements
- Tests operate just as admins do:
- Most tests operate "from a bastion":
- Ensures the code works in production
- Basic build tests: launch VMs, run Puppet
- Replicate a functional production environment
- Isolation from production
- Scan syslog for errors
- Test config files, daemons, users, cron jobs
- Simple failures:
- Incorrect Puppet dependencies work while iterating on a development instance but not on a clean launch
- Functional tests cover the moving parts:
Backups and restores
Load balancing with up and down workload
ELB health check and recovery
Monitoring and alerts
- Application tests: install and verify application(s)
- Real site code
- Real site db (scrubbed)
- Cause app to exercise the infrastructure
- Write to database, message queue, etc.
- Verify success on the backup
- Operate app on degraded infrastructure
- Reboot tests:
- Reboot all test servers
- Re-run build tests
- Re-run functional and application tests
- Sample failure
- File system mounted?
- Services restarted?
- Database quota daemon starts before MySQL daemon, then alert
- Re-launch tests: Re-launch all test servers from base images
- Simulate server crash and recovery
- Persistent data retained?
- Server rejoins servers?
- Unexpected issues
- Re-run build, functional and application tests
- Non-deployable customer application prevents relaunch from completing normally
- Upgrade test
- Also need to test upgrading existing servers: (you add a feature
- The upgrade test dance:
- Launch servers in test environments on current production code
- Run smoke tests to ensure system is operating
- Upgrade servers to latest development code
- Testing in parallel: run tests in parallel to optimize running time
- Workers may alter server-wide behavior
- Each worker needs an isolated set of servers
- Workers that break their services need to self-destruct, or they will cause false failures
- Who writes the tests?
- Our tests are as complex as, or more complex than, production
- Subtle cases require white-box testing
- First try: QA department
- Now: Engineering
- Takes longer to write
- Triggering specific failure scenarios requires understanding OS and code details together
- Didn't work: they could not keep up or go deep
- Every dev writes unit and system tests for their own code.
- Who fixes the tests:
- Infrastructure system tests are fragile
- Code review requires a "passing" run
- Bugs often only occur post-commit
- Permanent, rotating team handles failures
- Must analyze any failure, confirm whether it is unrelated, and refer to or open a ticket for it.
- Authority to revert any commit causing a failure
- Usually it is easier to fix it instead.
- Who invests in the tests:
Management must accept that infrastructure system tests are:
Under-investing will bite you badly.
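As a small illustration of one of the basic build tests listed earlier ("scan syslog for errors"), here is a sketch of what such a check might look like. The pattern, the allowlist idea and the sample log lines are all my own inventions, not Acquia's tests.

```python
import re

# Words that usually indicate trouble in a system log (an assumed list).
ERROR_PATTERN = re.compile(r"\b(error|fail(ed|ure)?|panic|segfault)\b", re.IGNORECASE)


def scan_syslog(lines, allowlist=()):
    """Return log lines that look like errors and contain no allowlisted text."""
    hits = []
    for line in lines:
        if ERROR_PATTERN.search(line) and not any(ok in line for ok in allowlist):
            hits.append(line.rstrip("\n"))
    return hits


# Made-up log lines for the example.
sample = [
    "Jun 11 10:00:01 web-1 puppet-agent[812]: Finished catalog run in 12.3 seconds",
    "Jun 11 10:00:02 web-1 mysqld[901]: ERROR 1045: Access denied for user 'app'",
    "Jun 11 10:00:03 web-1 kernel: EXT4-fs error (device xvdf): unable to read superblock",
    "Jun 11 10:00:04 web-1 ntpd[77]: failed to resolve pool.example (known, harmless)",
]
suspicious = scan_syslog(sample, allowlist=["known, harmless"])
```

A build test would fail the run when `suspicious` is non-empty after launching a clean server.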