On September 28 and 29 this week, a number of Microsoft customers worldwide were hit by a cascading series of problems that left many unable to access their Microsoft apps and services. On October 1, Microsoft posted its post-mortem about the outages, outlining what happened and the next steps it plans to take to head off this kind of issue in the future.
Starting around 5:30 p.m. ET on Monday, September 28, customers began reporting they couldn’t sign in to Microsoft and third-party applications that use Azure Active Directory (Azure AD) for authentication. (Yes, this means Office 365 and other Microsoft cloud services.) Those who were already signed in were less likely to have had issues. According to Microsoft’s report, users in the Americas and Australia were likely to be impacted more than those in Europe and Asia.
Microsoft acknowledged that a service update targeting an internal validation test ring caused a crash in Azure AD backend services. “A latent code defect in the Azure AD backend service Safe Deployment Process (SDP) system caused this to deploy directly into our production environment, bypassing our normal validation process,” officials said.
Azure AD is designed to be geo-distributed and deployed with multiple partitions across multiple data centers around the world, and is built with isolation boundaries. Microsoft normally applies changes first to a validation ring that doesn’t include customer data, followed by four additional rings over the course of several days before they hit production. But this week, because of a defect, the SDP didn’t correctly target the validation ring, and all rings were targeted concurrently, causing service availability to degrade, Microsoft’s report says.
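To make the difference concrete, here is a minimal Python sketch of a ring-gated rollout versus the all-rings-at-once failure mode described above. It is an illustration only, not Microsoft’s SDP: the ring names, the `is_healthy` probe and both function names are hypothetical, and the health check stands in for whatever validation each ring actually runs.

```python
from typing import Callable, List

# Hypothetical ring names: one validation ring plus four more, per the article.
RINGS = ["validation", "ring-1", "ring-2", "ring-3", "ring-4"]


def staged_rollout(rings: List[str], is_healthy: Callable[[str], bool]) -> List[str]:
    """Deploy one ring at a time and stop at the first unhealthy ring.

    Returns the list of rings the change actually reached.
    """
    touched: List[str] = []
    for ring in rings:
        touched.append(ring)  # stand-in for "deploy the change to this ring"
        if not is_healthy(ring):
            break  # gate: a bad change never moves past the ring that caught it
    return touched


def concurrent_rollout(rings: List[str], is_healthy: Callable[[str], bool]) -> List[str]:
    """The failure mode: every ring is targeted at once, so no gate can fire."""
    return list(rings)


def bad_change(ring: str) -> bool:
    """A change that crashes the service in whichever ring it lands."""
    return False


print(staged_rollout(RINGS, bad_change))      # only the validation ring is hit
print(concurrent_rollout(RINGS, bad_change))  # all five rings are hit
```

With a working gate, a crashing update stops at the validation ring, which holds no customer data. When every ring is targeted at once, the same update reaches all of them before any check can intervene.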
Microsoft engineering knew within five minutes that something was wrong. Over the next 30 minutes, Microsoft started taking steps to expedite mitigation: scaling out some Azure AD services to handle the load once a mitigation had been applied, and failing over certain workloads to a backup Azure AD authentication system.
Unfortunately, Microsoft’s automated rollback failed because the SDP metadata had been corrupted. So the team began manually updating the service configuration, bypassing the SDP system. Microsoft says the entire operation was completed by around 8 p.m. ET. Microsoft says “all service instances with residual impact were recovered” more than two hours after that.
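The recovery path described above, an automated rollback that depends on deployment metadata with a manual fallback when that metadata is unusable, can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Microsoft’s tooling: the `last_known_good` key, the version strings and both function names are hypothetical.

```python
from typing import Dict, Optional, Tuple


def automated_rollback(metadata: Dict[str, str]) -> Optional[str]:
    """Return the version to roll back to, or None if the metadata is unusable.

    'last_known_good' is a hypothetical key standing in for whatever the
    deployment system records about the previous good state.
    """
    version = metadata.get("last_known_good")
    return version if version else None


def recover(metadata: Dict[str, str], manual_version: str) -> Tuple[str, str]:
    """Try the automated path first; fall back to an operator-supplied version."""
    version = automated_rollback(metadata)
    if version is not None:
        return ("automated", version)
    # Metadata is missing or corrupted: operators apply the configuration by hand.
    return ("manual", manual_version)


print(recover({"last_known_good": "v41"}, manual_version="v40"))
print(recover({}, manual_version="v40"))
```

The sketch also shows why rollback drills matter: the manual branch is exercised only when the automated one has already failed, so it is the path least likely to have been tested.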
Microsoft officials said they’ve fixed the latent code defect in the Azure AD backend SDP system; fixed the existing rollback system; and expanded the scope and frequency of rollback operation drills. The team still needs to apply more protections to the Azure AD SDP system to prevent these kinds of issues. It also needs to expedite the rollout of the Azure AD backup authentication system to all key services, and to onboard Azure AD scenarios to the automated communications pipeline so that affected customers are told what’s going on within 15 minutes of impact.
Microsoft’s report doesn’t mention a key problem noted by a number of users on Twitter this week: Microsoft’s admin dashboards for Office 365 and Azure require authentication to sign in and see them. Many users who were locked out couldn’t see the updates Microsoft was providing in the admin portals.
Microsoft’s report also doesn’t mention that over the past couple of days, customers in various geographies have been reporting problems with Exchange Online and Outlook on their mobile devices. (There also was a SharePoint Online glitch that affected some users yesterday.) Microsoft attributed that problem to a situation involving Exchange ActiveSync, saying “a recent configuration update to components that route user requests was the cause of impact.”
Today, October 1, Exchange and Outlook were again causing issues, primarily for users located in Europe. Microsoft officials cited a recent configuration update as the cause of today’s issues.