Investigating issues with Cloud PBX, Severity: major

Incident Report for NUACOM

Postmortem

Incident Summary

  • Date and Time (Europe/Dublin): 27th Feb 2025, 00:50 – 09:06
  • Affected Services: Voice Services

Incident Description

What Happened
At approximately 00:50, a critical hardware malfunction occurred on a physical server in our data center. An error message—“A bus fatal error was detected on a component at slot 1”—indicated a failure in one of the PCI network interfaces. This caused the primary host to become inoperable, leading all virtual machines on that host to go offline. However, our backup server automatically took over services, ensuring that the majority of customers remained operational.

Conclusion
A hardware failure led to a partial outage, but our monitoring systems and backup infrastructure responded quickly to minimize disruption. Our team swiftly identified the root cause and resolved the incident. As part of our continuous improvement efforts, we are implementing additional steps to improve both our response times and our automated failover processes, ensuring that any such event is contained and resolved even more promptly in the future.

Assurance to Our Customers
We sincerely apologize for any inconvenience caused by this incident. Nuacom takes every disruption seriously, and we remain committed to providing a reliable, high-quality service at all times. Our team’s rapid response and the successful automatic failover process underscore our dedication to proactively managing issues. We will continue to strengthen our systems and procedures to prevent similar incidents from occurring in the future.

If you have any questions or need further information, please reach out to our support team at any time. Your trust in Nuacom is greatly appreciated, and we are steadfast in our mission to keep your communications running seamlessly.

Detection and Response

Detection
Our robust monitoring tools detected multiple connectivity alerts related to a single physical host, triggering a high-priority alert for our Network Operations Center (NOC). A subsequent review of logs and diagnostic data confirmed that the NIC on PCI Slot 1 was the failure point.

Response Actions Taken

  • 00:55 – Automated failover kicked in, restoring service for most affected clients.
  • 06:50 – NOC team began a detailed investigation.
  • 07:30 – Root cause pinpointed and verified as hardware failure.
  • 08:10 – Primary host restarted to stabilize operations.
  • 08:16 – Remaining services were manually transferred to the backup host, ensuring minimal further interruption.
  • 09:06 – Service containers on the primary host came back online once the hardware issue was resolved.
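
For readers who want the intervals behind this timeline, the following is a minimal sketch that derives them from the timestamps listed above. The event labels and helper names are hypothetical, chosen for illustration only. The 00:50 start is the approximate failure time given in the incident description, and all times are treated as local Europe/Dublin times on 27 Feb 2025.

```python
from datetime import datetime

# Timestamps copied from the timeline above (Europe/Dublin, 27 Feb 2025).
# The dictionary keys are hypothetical labels, not names used by the NOC.
DAY = "2025-02-27"
EVENTS = {
    "failure": "00:50",  # approximate, per the incident description
    "automated_failover": "00:55",
    "investigation_started": "06:50",
    "root_cause_verified": "07:30",
    "primary_restarted": "08:10",
    "manual_transfer": "08:16",
    "primary_back_online": "09:06",
}


def at(label: str) -> datetime:
    """Return the timestamp of a timeline event as a naive local datetime."""
    return datetime.strptime(f"{DAY} {EVENTS[label]}", "%Y-%m-%d %H:%M")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline events."""
    return int((at(end) - at(start)).total_seconds() // 60)


time_to_failover = minutes_between("failure", "automated_failover")
time_to_investigation = minutes_between("failure", "investigation_started")
time_to_manual_transfer = minutes_between("failure", "manual_transfer")
total_duration = minutes_between("failure", "primary_back_online")

print(f"Failure to automated failover: {time_to_failover} min")
print(f"Failure to start of investigation: {time_to_investigation} min")
print(f"Failure to manual transfer: {time_to_manual_transfer} min")
print(f"Failure to primary back online: {total_duration} min")
```

Under these assumptions, the automated failover took effect 5 minutes after the failure, and the full incident window was 496 minutes (8 hours 16 minutes).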

Impact Assessment

User Impact

  • Partial Outage – Approximately 7.2% of our customer accounts experienced a temporary loss of calling functionality during low-traffic hours and partial service degradation during early business hours. Some clients were unable to place or receive calls until the failover and manual transfers were completed.

Despite these challenges, our failover safeguards ensured that most customers experienced minimal disruption and the incident was contained and addressed as quickly as possible.

Lessons Learned

  1. Enhanced Disaster Recovery Drills – We will conduct more frequent and comprehensive failover tests to further reduce potential downtime.

Preventative Measures

Short-Term Actions

  • Confirm the full operational status of all services immediately following the incident.
  • Heighten our monitoring protocols to proactively detect any new hardware anomalies.

Long-Term Actions

  • Schedule additional, more frequent disaster recovery exercises.
  • Expand our redundancy testing suite with new scenarios drawn from this incident to better anticipate future failures.
  • Add night-shift NOC coverage to respond to incidents 24/7.
Posted Mar 09, 2025 - 21:58 GMT

Resolved

This incident has been resolved.
Posted Feb 27, 2025 - 10:37 GMT

Identified

The issue has been identified and a fix is being implemented.
Posted Feb 27, 2025 - 08:49 GMT

Investigating

We are investigating an incident affecting Cloud PBX. We will provide updates via email and Statuspage shortly.
Posted Feb 27, 2025 - 08:16 GMT