What Happened
At approximately 13:50, a critical hardware malfunction occurred on a data center server. The error message—“A PCI parity error was detected on a component”—pointed to a defective network card. As a result, the host machine became inoperable, and all virtual machines on this host went offline.
Complicating matters, this server was the designated failover server for the primary machine that had experienced a similar malfunction earlier in the day. Because the failover server was still recovering and performing a file system check, it was not fully prepared to handle the additional load. Consequently, our operations team had to manually redistribute services to other servers, resulting in extended downtime.
Conclusion
This was the second hardware failure on the same day, an exceptional occurrence given our record of zero server faults over the past five years. Going forward, we are focusing on faster automated recovery and more robust load and stress testing to prevent similar incidents.
Assurance to Our Customers
We sincerely apologize for the inconvenience this second outage caused and are committed to preventing similar incidents. Our team is actively refining both automated and manual failover processes to ensure quicker recovery times. Through enhanced monitoring, comprehensive testing, and ongoing infrastructure improvements, we will continue delivering reliable, high-quality service. If you have any concerns or questions, please reach out to our support team at any time.
Detection
Our monitoring systems detected multiple connectivity alerts originating from the same physical host and immediately notified our Network Operations Center (NOC).
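For readers curious how alerts from several virtual machines can be traced back to one physical host, here is a minimal illustrative sketch. It is not our production monitoring code. The `Alert` record, the `hosts_to_escalate` function, the host and VM names, and the threshold of three are all hypothetical, chosen only to show the grouping idea.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    vm: str    # virtual machine that raised the alert (hypothetical name)
    host: str  # physical host the VM runs on (hypothetical name)
    kind: str  # alert type, e.g. "connectivity"


def hosts_to_escalate(alerts, threshold=3):
    """Return hosts where at least `threshold` distinct VMs report connectivity alerts."""
    vms_by_host = defaultdict(set)
    for alert in alerts:
        if alert.kind == "connectivity":
            # A set ensures repeated alerts from the same VM count once.
            vms_by_host[alert.host].add(alert.vm)
    return sorted(h for h, vms in vms_by_host.items() if len(vms) >= threshold)


# Example data: three VMs on host-a lose connectivity, one VM on host-b does.
alerts = [
    Alert("vm-01", "host-a", "connectivity"),
    Alert("vm-02", "host-a", "connectivity"),
    Alert("vm-03", "host-a", "connectivity"),
    Alert("vm-09", "host-b", "connectivity"),
    Alert("vm-03", "host-a", "disk"),
]
print(hosts_to_escalate(alerts))  # ['host-a']
```

Counting distinct VMs per host, rather than raw alert volume, helps separate a single noisy VM from a fault in the underlying hardware.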
Response Actions Taken
User Impact
While the automated failover partially succeeded, the event highlighted critical areas where additional failover capacity and faster manual procedures can further minimize future downtime.
Short-Term Actions
Long-Term Actions