What Happened
At approximately 00:50, a critical hardware malfunction occurred on a physical server in our data center. The error message “A bus fatal error was detected on a component at slot 1” indicated a failure in one of the PCI network interfaces. The failure made the primary host inoperable and took all virtual machines on that host offline. Our backup server automatically took over services, so the majority of customers remained operational.
Conclusion
A hardware failure led to a partial outage, but our monitoring systems and backup infrastructure responded quickly to minimize disruption. Our team identified the root cause and resolved the incident. As part of our continuous improvement efforts, we are implementing additional steps to improve both our response times and our automated failover processes, so that any similar event is contained and resolved more quickly.
Assurance to Our Customers
We sincerely apologize for any inconvenience caused by this incident. Nuacom takes every disruption seriously, and we remain committed to providing a reliable, high-quality service at all times. Our team’s rapid response and the successful automatic failover process underscore our dedication to proactively managing issues. We will continue to strengthen our systems and procedures to prevent similar incidents from occurring in the future.
If you have any questions or need further information, please reach out to our support team at any time. Your trust in Nuacom is greatly appreciated, and we are steadfast in our mission to keep your communications running seamlessly.
Detection
Our monitoring tools detected multiple connectivity alerts related to a single physical host, which triggered a high-priority alert for our Network Operations Center (NOC). A subsequent review of logs and diagnostic data confirmed that the network interface card (NIC) in PCI slot 1 was the point of failure.
Response Actions Taken
User Impact
Virtual machines on the affected host went offline during the incident. Our failover safeguards ensured that most customers experienced minimal disruption, and the incident was contained and addressed as quickly as possible.
Short-Term Actions
Long-Term Actions