- Google Cloud blames an API service for the widespread outage
- Most regions were back online within 40 minutes, but some took longer
- The company has promised to guard against future failures and improve communication
After the recent Google Cloud outage, which took sites such as Spotify, Cloudflare and Discord offline, the company has published a detailed report explaining exactly why it failed customers.
The company says the root cause was a code problem in Service Control, the part of its API management system that handles policy checks.
More specifically, a faulty automated quota update and a lack of proper error handling triggered a global crash loop, with 503 errors seen not only on Google Cloud services but also on services that use its APIs.
Google Cloud outage caused by API problem
The outage affected Google Cloud infrastructure, as well as popular Google Workspace apps such as Drive, Docs, Gmail and Calendar. Third-party sites that rely on Google Cloud APIs, including the popular music streaming platform Spotify, which has 678 million users, and certain Cloudflare services, were also affected.
“On May 29, 2025, a new feature was added to Service Control for additional quota policy checks,” the company wrote in its incident report. “The issue with this change was that it did not have appropriate error handling nor was it feature flag protected.”
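For readers wondering what those two missing safeguards look like in practice, here is a minimal sketch in Python. It is not Google's code: the `QuotaPolicy` record, the `FLAGS` dictionary, the `check_quota` function and the choice to allow requests when policy data is bad are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuotaPolicy:
    # Hypothetical policy record; `limit` is None when a field was left blank.
    name: str
    limit: Optional[int]


# Hypothetical feature flag: the new check stays off until it is switched on.
FLAGS = {"extra_quota_checks": False}


def check_quota(policy: QuotaPolicy, usage: int) -> bool:
    """Return True if the request should be allowed."""
    if not FLAGS["extra_quota_checks"]:
        # Flag off: skip the new code path entirely and behave as before.
        return True
    try:
        if policy.limit is None:
            raise ValueError(f"policy {policy.name!r} has a blank limit")
        return usage <= policy.limit
    except ValueError:
        # Error handling: bad policy data is caught here instead of crashing
        # the process. Allowing the request is an assumption of this sketch.
        return True


if __name__ == "__main__":
    FLAGS["extra_quota_checks"] = True
    print(check_quota(QuotaPolicy("api-calls", 100), 150))   # over the limit
    print(check_quota(QuotaPolicy("api-calls", None), 150))  # blank field
```

With the flag off, the new check never runs, so a bad rollout can be contained by leaving it disabled. With the flag on, a blank field is handled as an ordinary error rather than taking the service down.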
Google Cloud boasted that its Site Reliability Engineering team started triaging the incident within two minutes and identified the root cause within 10 minutes. “The red button [to disable the serving path] was ready to roll out ~25 minutes from the start of the incident,” said Google, with the full rollout completed within 40 minutes.
Although smaller regions recovered relatively quickly, larger regions such as us-central1 took longer to come back online: about two hours and 40 minutes in the case of that particular region.
In its mini incident reports on the day of the outage, Google Cloud promised to “do better”. Its more detailed report promises the usual remedies, such as improving static analysis and testing practices, and auditing and modularizing the Service Control architecture to contain future incidents. The company has also committed to “improve [its] external communications” to keep customers better informed, and to ensure that its communication infrastructure stays online even during outages like this one.