A Social Business Reaction to the Amazon EC2 Outage

On April 21st, Amazon Web Services experienced a service interruption that disrupted servers for many large and prominent businesses. Those affected included numerous social services and websites such as Foursquare, HootSuite and Reddit. At Stuzo, we leverage EC2 for some of our hosting services and experienced temporary issues with our server infrastructure, as the Amazon backup systems did not kick in. This affected client applications that were live on Facebook, on corporate websites, and elsewhere on the web.

Many companies have multi-zone failovers and other disaster recovery plans. On the technical side, we had most of our clients back up and running within an hour of the Amazon outage by leveraging our backup and internal recovery plans.

What did we do to respond to the situation from a Social Business perspective?

PLAN: First, any work with technology requires a disaster recovery plan. Technology does fail and will continue to fail. How well you are prepared internally for these emergencies, and how you execute when one occurs, makes all the difference. We’ve seen hardware and software fail before and know the importance of having a chain of command, operations, and procedures in these situations. Ours are documented and communicated through internal collaboration tools such as a corporate wiki and server documentation, with the information readily accessible on the web to the appropriate team members. Generally, this just boils down to good business practices, and the tools our organization has adopted enable plans to be developed and documented more effectively.

SIGNAL: Through our server monitoring systems and communication measures, the appropriate team members were notified instantly at critical points throughout the outage. This began with server monitoring tools at the root-issue level (sending emails, SMS, etc.), and the signal traveled all the way to our account and executive teams through our collaboration tools and communication practices.
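For readers who want to picture how a signal can travel from monitoring tools up to account and executive teams, here is a minimal sketch in Python. It is not our actual tooling: the tier names, the 15-minute escalation interval, and the `send` callback are all assumptions made purely for illustration.

```python
from dataclasses import dataclass

# Hypothetical escalation tiers, ordered from the root-issue level upward.
# These names are illustrative assumptions, not an actual team structure.
ESCALATION_TIERS = [
    "on_call_engineer",
    "engineering_lead",
    "account_team",
    "executive_team",
]


@dataclass
class Alert:
    source: str              # which monitoring tool raised the alert
    message: str             # human-readable description of the problem
    minutes_unresolved: int  # how long the issue has been open


def recipients_for(alert, minutes_per_tier=15):
    """Return the tiers that should have been notified so far.

    The first tier is notified immediately; each further tier is added
    after another `minutes_per_tier` minutes without resolution
    (the interval is an assumed value).
    """
    tiers_reached = 1 + alert.minutes_unresolved // minutes_per_tier
    return ESCALATION_TIERS[:min(tiers_reached, len(ESCALATION_TIERS))]


def notify(alert, send):
    """Call `send(recipient, text)` for every tier the alert has reached."""
    notified = recipients_for(alert)
    for recipient in notified:
        send(recipient, f"[{alert.source}] {alert.message}")
    return notified
```

In practice, `send` would wrap whatever channel a team already uses (email, SMS, or a collaboration tool); passing it in as a callback keeps the escalation logic independent of any particular provider.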

DECIDE & ACT: We had to make necessary and demanding decisions for many of our clients before the sun was up. Do we wait for EC2 to come back, migrate only particular environments, notify all users? Thanks to our preparation and communication, we had the right people with the proper knowledge available and briefed to make the best decision for our clients. On our end, we made the critical decision to migrate all live client applications associated with the Amazon EC2 environment in a time frame scheduled with the team (who had already been notified through the signal stage and, through our planning, understood the common options, responsibilities, and steps).

COMMUNICATE: On the account side, we communicated internally the timeline and demands of the migration, along with a communication strategy for each client. All engineers, project managers, account managers, and other appropriate team members were fully briefed on the status and progress by seeing the alerts, briefings, and communication streams on their mobile devices before arriving in the office. Each client application that was moved to the new environment was tested by our QA engineers. Notification plans were set up so that clients were aware of the situation, our response to it, and how we would manage it in the coming days. Most of this happened through emails to clients coupled with direct phone calls early in the morning, to provide transparency and clarify responsibilities moving forward.

Effective collaboration, planning, and communication strategies made the situation much less stressful and much clearer across the team. We of course still experienced some stressful moments (and long hours), which is expected in emergency scenarios. However, each and every decision and action was made with the goal of stabilizing our clients’ environments in the available time with a cohesive plan and workforce. I thank the Stuzo | Dachis Group team, who were by my side during this process, the dynamic signals that woke me very early in the morning, and the understanding clients who appreciated our mission and the decisions we made to ensure the best user experience and account services we could provide given the outage.
