Bobby, thanks for the detailed unpacking of how a supposedly value-adding, technology-aided release morphed into a value-erosion nightmare, and how MRC recovered.
I'd like to add a few points based on my experience dealing with similar problems, both as an operator and as a consultant, some of them forged in high-stakes situations like Black Friday traffic spikes.
1. Most engineering teams are happy to 'code the design' without thinking about how the architecture will behave under the range of production conditions it may plausibly face.
2. For many teams, Non-Functional Requirements (NFRs) are an afterthought rather than something carefully considered before design begins.
3. Lack of load visualization, and a tendency to ignore the need for performance testing.
4. Lack of traceability built into connected code, networks, and servers (web, application, caches, databases) to catch runtime issues.
5. Lack of durable processes, including but not limited to what-if scenario SOP preparedness and fire drills.
I also think, based on the information I gathered from the video, that one crucial action could have enabled a quick recovery instead of a prolonged production outage: a thoughtful rollback plan. One should accompany every release; even when a release follows most of the best practices, a rollback plan remains the most effective tool for containing these production fiascos.
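As a minimal sketch of what "baked into the release" can mean in practice (the deploy script, health endpoint, and version-pinning flag below are hypothetical placeholders, not anything from the MRC case):

```python
import subprocess
import sys
import time
import urllib.request

# Hypothetical commands and endpoint -- substitute your own tooling.
DEPLOY_CMD = ["./deploy.sh"]                 # deploys the new build
ROLLBACK_FLAG = "--version"                  # pins a specific prior build
HEALTH_URL = "https://example.com/healthz"   # service health endpoint

def healthy(url: str, attempts: int = 5, delay: float = 3.0) -> bool:
    """Require a sustained run of HTTP 200s, not a single lucky probe."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status != 200:
                    return False
        except OSError:  # covers URLError, HTTPError, and timeouts
            return False
        time.sleep(delay)
    return True

def main() -> int:
    # Capture the last known-good version BEFORE deploying, so the
    # rollback path never depends on state the new release may corrupt.
    last_good = subprocess.run(
        ["git", "describe", "--tags", "--always"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()

    subprocess.run(DEPLOY_CMD, check=True)

    if healthy(HEALTH_URL):
        print("release verified healthy")
        return 0

    print(f"health checks failed; rolling back to {last_good}", file=sys.stderr)
    subprocess.run(DEPLOY_CMD + [ROLLBACK_FLAG, last_good], check=True)
    return 1

if __name__ == "__main__":
    sys.exit(main())
```

The two design choices that matter here: the known-good version is recorded before the deploy starts, and the go/no-go decision is automated, so rollback doesn't wait on a war-room debate.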
Whenever a tech leader takes charge of a new org or integrates a bolt-on (PE-driven or not), one of the first actions in the initial 30-day plan should be to quiz the teams on the STATE OF PRODUCTION. This will uncover many of the repeated instability issues, sources of customer dissatisfaction, inconsistent or nonexistent best practices, etc. While the corrections will take time, simply having these conversations helps the teams, and some of the leads under you, start focusing and taking action, which can eventually lead to a better state. Would love to hear your thoughts and others'.