On November 17th from 15:55 ET to 16:27 ET we experienced 15 minutes of downtime and 17 minutes of extreme slowness (7-10 seconds P95 response time).
The issue was a hardware issue in one of our load balancers provided by our infrastructure provider. All our load balancers have a fail-safe mechanism that fallback faulty load balancers to the backup one.
After the incident was fixed, over the past day we’ve worked closely with our infrastructure provider engineers to figure out the reason the load balancer did not fallback to the backup LB. We can confirm that our infrastructure provider found out the issue and have patched their infrastructure and added tests to prevent similar issue from happening.