QCon Beijing 2010 Notes: Signs of Impending Failure

Aiming for the wrong target

Assumption #1
Users care about the things they do (features), not the software or hardware you run.
Assumption #2: Faults and errors will occur.
You can choose to engineer safe failure modes into your system or accept whatever random failure modes naturally occur.

Engineering Failure Modes
Tolerance: Absorb shocks, but do not transmit them
Severability: Limit functionality instead of crashing completely
Recoverability: Allow component-level restarts instead of rebooting the world
Resilience: Recover from transient effects automatically
These produce consistent availability of features

Stability Antipatterns

1. Integration Points
Integrations are the #1 risk to stability
Your first job is to protect against integration points
Every socket, process, pipe, or remote procedure call can and will eventually kill your system
Even database calls can hang, in obvious and not-so-obvious ways

“In Spec” vs. “Out of Spec”
“In Spec” failures
TCP connection refused
HTTP response code 500
Error message in XML response

“Out of Spec” failures
TCP connection
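
A minimal sketch of guarding one integration point, assuming a plain TCP peer. The function name `probe`, the `ping` payload, and the 0.5 second default are all made up for illustration. The point is only that a refused connection fails fast and visibly, while a peer that accepts the connection and then never answers will hang the caller unless a timeout is set.

```python
import socket

def probe(host, port, timeout_s=0.5):
    """Classify one call to a remote endpoint (hypothetical helper).

    Returns "ok", "closed", "refused" (a fast, visible failure) or
    "timeout" (the peer accepted the connection but never answered).
    """
    try:
        # The timeout applies to the connect and to later send/recv calls.
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(b"ping\n")
            data = conn.recv(1024)
            return "ok" if data else "closed"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
```

Without the timeout, the `recv` call would block for as long as the peer stays silent, and the calling thread with it.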

Remember this

Large systems fail faster than small ones

2. Chain Reaction
Cascading Failure: Failure in one system causes calling systems to be jeopardized

Remember this
Prevent cascading failures to stop cracks from jumping the gap
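
One common way to keep a failing downstream system from dragging its callers down is a circuit breaker. The notes do not name it, so treat this as a swapped-in illustration rather than what the talk prescribed. The class name, thresholds, and the injectable `clock` parameter are all assumptions, and the sketch is not thread-safe.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call the downstream system."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures    # illustrative threshold
        self.reset_after_s = reset_after_s  # illustrative cool-down
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Fail fast instead of tying up the caller on a sick system.
                raise CircuitOpenError("downstream marked unhealthy")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` they can turn into a fallback response, so the failure stays on one side of the gap.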

3. Users: Can’t live with them…
First type of “bad” user
Front-page viewer: creates useless sessions, ties up memory for no reason
Application servers are all fragile to sessions: Users can ….

Handle traffic surges gracefully: turn off expensive features when the system is busy. Divert or throttle users. Preserve a good experience for some when you can’t serve all. Reduce the burden of serving each user. Be especially …
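
A rough sketch of what "turn off expensive features" and "throttle users" could look like in a request handler. `LoadShedder`, `handle`, and both thresholds are hypothetical names and values, not anything from the talk. The idea is to count in-flight requests, switch to a cheaper page above one threshold, and turn requests away above a second one.

```python
import threading

class LoadShedder:
    def __init__(self, max_in_flight, degrade_at):
        self.max_in_flight = max_in_flight  # hard cap: reject beyond this
        self.degrade_at = degrade_at        # soft cap: serve cheap pages beyond this
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_enter(self):
        """Return "full" or "degraded", or None when the request is turned away."""
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return None
            self._in_flight += 1
            return "degraded" if self._in_flight > self.degrade_at else "full"

    def leave(self):
        with self._lock:
            self._in_flight -= 1

def handle(shedder, render_full, render_cheap):
    mode = shedder.try_enter()
    if mode is None:
        return 503, "busy, please retry shortly"
    try:
        page = render_full() if mode == "full" else render_cheap()
        return 200, page
    finally:
        shedder.leave()
```

Some users still get the full experience, some get a lighter page, and only the overflow is refused.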

Second type of “bad” user
Buyers: the most expensive type of user to serve (secure pages require more CPU, more pages, external integrations). A high conversion rate is bad for the system! Your sponsors may not agree.

Blocked Threads: Request handling threads are precious. Protect them.
Most common form of “crash”: all request threads blocked. Very difficult to test for. Best bet: keep threads isolated. Use well-tested, high-level constructs for cross-thread communication.
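
As one possible reading of "keep threads isolated", the sketch below hands a slow dependency call to a small dedicated pool from Python's standard `concurrent.futures` module and waits with a deadline. The pool size, the `call_with_deadline` name, and the fallback value are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# A small dedicated pool for one dependency (size is an illustrative guess).
backend_pool = ThreadPoolExecutor(max_workers=4)

def call_with_deadline(func, deadline_s, fallback):
    """Run func on the dedicated pool; give up waiting after deadline_s."""
    future = backend_pool.submit(func)
    try:
        return future.result(timeout=deadline_s)
    except FutureTimeout:
        future.cancel()  # only has an effect if the task has not started yet
        return fallback
```

The request thread is released at the deadline. The worker thread stays busy until `func` returns, so the pool size is what bounds how many threads a hung dependency can hold.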

Attacks of Self-Denial: Good marketing can kill your system at any time

Defending the ramparts: avoid deep links, set up static landing pages, only allow the user’s second click to reach application servers. Allow throttling of …

Remember this
Keep lines of communication open, protect shared resources, expect ..

Scaling Effects
*QA and Dev balance?

Unbalanced Capacities
Traffic floods sometimes start inside the data center walls.

SLA Inversion: Surviving by luck alone.
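
A quick back-of-the-envelope calculation shows why this is luck. It assumes a caller that needs every dependency to be up and assumes failures are independent, which real systems rarely are. The function name and the 99.9% figures are illustrative only.

```python
import math

def composite_availability(dependency_availabilities):
    """Upper bound on availability when every dependency must be up.

    Assumes failures are independent, which real systems rarely satisfy.
    """
    return math.prod(dependency_availabilities)

# Five hard dependencies at 99.9% each cap the caller at roughly 99.5%.
best_case = composite_availability([0.999] * 5)
print(f"{best_case:.4%}")
```

A service that promises more than this bound is relying on its dependencies doing better than their own SLAs.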

Unbounded Result Sets: Limited resources, unlimited data volumes
Remember: test with realistic data volumes
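
A small sketch of capping a result set on the caller's side, using the standard `sqlite3` module as a stand-in for whatever database is in play. `fetch_capped` and the cap of 100 rows are hypothetical, not values from the talk.

```python
import sqlite3

MAX_ROWS = 100  # illustrative cap, not a value from the talk

def fetch_capped(conn, sql, params=(), max_rows=MAX_ROWS):
    """Run a query but never hold more than max_rows + 1 rows in memory.

    Returns (rows, truncated); truncated is True when more rows existed.
    """
    cur = conn.execute(sql, params)
    rows = cur.fetchmany(max_rows + 1)  # one extra row reveals truncation
    truncated = len(rows) > max_rows
    cur.close()
    return rows[:max_rows], truncated
```

The `truncated` flag lets the caller page or warn instead of silently loading a table that grew from ten rows in QA to millions in production.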
