These are my thoughts on building and running a service. I separate the two because it is possible to have a poorly built service that is nevertheless run well for a while, and the reverse is also true. The ideal case is to build a very good service and run it professionally, so that we can improve continuously.
How to build a good service?
1) Build a good team first; a good team builds a better service.
a) Have an architect group review major changes.
b) Let engineers decide what to build and how to build it.
c) Fewer teams, fewer people, and less communication overhead, with the whole team responsible for the service.
d) Keep people in similar positions longer to preserve expertise.
2) Invest more in tools and processes to enable fast delivery at low cost:
a) Use tools that prevent bugs from happening (rather than detect them afterward). Examples are FxCop, ShareCheker, XML parser/schema validation, watchdog rule checking, etc.
b) For each check-in, always ask "Does it work?" and "How do you know?" That is, every check-in should come with a test that validates it really works.
3) Treat config as code
a) Your service should still run correctly when the config is wrong.
b) Do your best to detect bad config up front.
c) Reduce the chance of bad config being deployed.
d) Fuzz test with wrong configurations.
4) Deployment automation
a) Prefer script-based deployment over code-based deployment.
b) Focus on production deployment.
5) Metric-driven release
a) Define or collect metrics before you start coding. Sometimes this means being more specific about customer impact and business impact.
b) Collect metrics in production.
c) Validate the metrics against your expectations and rework the code where they fall short.
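
Item 3 above (treat config as code) can be illustrated with a minimal sketch; it is not a definitive implementation. It assumes a hypothetical service with two settings, timeout_seconds and max_connections. The names, defaults, and limits are all made up for illustration. validate_config detects a bad config up front, load_config falls back to known-good defaults so the service keeps running, and fuzz_configs feeds deliberately wrong values through both.

```python
import random

# Hypothetical known-good defaults and allowed ranges (illustrative only).
DEFAULTS = {"timeout_seconds": 30, "max_connections": 100}
LIMITS = {"timeout_seconds": (1, 300), "max_connections": (1, 10_000)}


def validate_config(config):
    """Return a list of human-readable problems; an empty list means valid."""
    if not isinstance(config, dict):
        return ["config must be a mapping"]
    problems = []
    for key, (low, high) in LIMITS.items():
        value = config.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(f"{key}: expected an integer, got {value!r}")
        elif not low <= value <= high:
            problems.append(f"{key}: {value} is outside [{low}, {high}]")
    for key in config:
        if key not in LIMITS:
            problems.append(f"{key}: unknown setting")
    return problems


def load_config(config):
    """Return (effective_config, problems); fall back to defaults if invalid."""
    problems = validate_config(config)
    if problems:
        return dict(DEFAULTS), problems
    return dict(config), []


def fuzz_configs(rounds=200, seed=0):
    """Feed deliberately wrong configs through load_config.

    Returns the number of bad configs that were correctly rejected.
    """
    rng = random.Random(seed)
    bad_values = [None, "", "30", -1, 0, 10**9, 3.5, [], {}, True]
    rejected = 0
    for _ in range(rounds):
        config = dict(DEFAULTS)
        key = rng.choice(list(DEFAULTS))
        config[key] = rng.choice(bad_values)
        effective, problems = load_config(config)
        # The service must keep a usable config no matter what came in.
        assert validate_config(effective) == []
        if problems:
            rejected += 1
    return rejected
```

Running the same validate_config as a check-in or pre-deployment step is one way to reduce the chance of a bad config being deployed, while the fallback in load_config covers the case where one slips through anyway.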
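
Item 5 above (metric driven release) can be sketched in a similar way. This is a minimal illustration under stated assumptions: the metric names, the targets, and the Expectation type are hypothetical, and the production values would come from whatever telemetry the service already has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Expectation:
    """A target defined before coding starts (hypothetical structure)."""
    name: str
    target: float
    higher_is_better: bool = True


def evaluate_release(expectations, observed):
    """Compare production metrics with expectations.

    Returns the names of metrics that are missing or miss their target;
    an empty list means the release met every expectation.
    """
    failures = []
    for exp in expectations:
        value = observed.get(exp.name)
        if value is None:
            failures.append(exp.name)  # Not collected: treat as a failure.
        elif exp.higher_is_better and value < exp.target:
            failures.append(exp.name)
        elif not exp.higher_is_better and value > exp.target:
            failures.append(exp.name)
    return failures


# Step a) define metrics up front, phrased in terms of customer impact.
EXPECTATIONS = [
    Expectation("login_success_rate", 0.999),
    Expectation("p95_latency_ms", 250.0, higher_is_better=False),
]

# Step b) values collected in production (made-up numbers).
observed = {"login_success_rate": 0.9995, "p95_latency_ms": 310.0}

# Step c) validate; any failure sends the feature back for rework.
print(evaluate_release(EXPECTATIONS, observed))
```

With the made-up numbers above, only p95_latency_ms is reported, so that is the metric to rework before the next release.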
How to run a service?
1) Balance schedule, features, and quality.
a) Live site first: no matter how many features you have, you won’t win more customers if your service is not stable.
b) It is OK to catch the next train; don’t rush onto the current train and slow it down.
2) Focus on "how can I ship faster?" rather than on just shipping faster.
3) Use different release channels with different release schedules.
4) Build a robust service health model with an easy-to-use dashboard.
5) Decision makers should know the service inside out:
a) Know how many clusters there are, and how many machines and users each cluster has.
b) Have some technical knowledge of the service, such as how deployment works and how a feature works, and at least try using a feature or deploying to a test cluster yourself.
c) Be a customer of your own service.
6) In the long term, the ship room might not be necessary, or its focus could shift from deployment, QFEs, and bugs to decisions driven by service health and KPIs.
7) Get rid of release notes. Why not integrate everything in the release notes into deployment automation and do policy-based deployment?
8) The ops team works very closely with the dev team: no gaps, no communication issues.
9) Every live site incident needs a root cause analysis.
10) Know where your money comes from, and make sure the money pipeline is not broken. For example, create/drop database should have a very high pass rate.
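
Items 7 and 10 above can be combined in one small sketch: a deployment policy, expressed as data instead of release notes, that refuses to roll out while a revenue-critical operation falls below its required pass rate. Everything named here is hypothetical (the Policy fields, the channel name, the 99.9% threshold, the operation names); it is only meant to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Deployment rules that would otherwise live in release notes."""
    channel: str
    min_pass_rate: float          # Required pass rate for critical operations.
    critical_operations: tuple    # Operations on the money pipeline.


def pass_rate(passed, total):
    """Fraction of successful runs; no data counts as 0.0, not as healthy."""
    return passed / total if total else 0.0


def can_deploy(policy, results):
    """Return (allowed, blocking_operations).

    `results` maps an operation name to a (passed, total) pair.
    """
    blocking = [
        op for op in policy.critical_operations
        if pass_rate(*results.get(op, (0, 0))) < policy.min_pass_rate
    ]
    return (not blocking, blocking)


production = Policy(
    channel="production",
    min_pass_rate=0.999,
    critical_operations=("create_database", "drop_database"),
)

# Made-up test results: drop_database is below the 99.9% bar.
results = {"create_database": (9995, 10000), "drop_database": (9900, 10000)}
print(can_deploy(production, results))
```

With these made-up results the deployment is blocked by drop_database. Treating missing results as a failure is a deliberate choice in this sketch: an operation on the money pipeline that was never tested should not be assumed healthy.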