For the past couple of days we have been seeing partial outages: network endpoints failing, routing dropping out for five to ten minutes at a time, and high latency. Unfortunately, this happened while I was traveling for work, and neither I nor the other tech who helps with the NOC had time to triage it and figure out what was going on.
After finally getting a moment to look at the network, I found that three of our routers were maxing out their allotted CPU limits and failing to route traffic. On closer inspection of the affected routers, it turned out that the last firmware upgrade had reset our logging settings to defaults, so every router on the network was gradually running out of disk space, which in turn caused them to malfunction and max out their other resources. It also took the GUI offline, meaning nothing could be fixed until I was able to tunnel in and pop a shell.
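For anyone curious what that looks like in practice, here is a rough sketch of the kind of check you can run from a shell to confirm a full disk. The management IP, log path, and availability of standard `df`/`du` are assumptions; the exact commands depend on the router's firmware.

```python
# Rough sketch: confirm a router's disk is full over SSH (no GUI needed).
# The hostname and log path below are placeholders, not our real values.
import subprocess

ROUTER = "192.0.2.10"    # placeholder management IP
LOG_DIR = "/var/log"     # placeholder log location

def run_on_router(command: str) -> str:
    """Run a single command on the router via SSH and return its output."""
    result = subprocess.run(
        ["ssh", f"admin@{ROUTER}", command],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Overall filesystem usage, then the size of the log directory specifically.
print(run_on_router("df -h"))
print(run_on_router(f"du -sh {LOG_DIR}"))
```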
The fix was to remote into each router, delete the offending log data, and reboot them one by one. Once that was handled, log limits were reconfigured in the GUI, and services were brought back online and tested to make sure every part of the network was functional again. Alert rules were also added to the NMS to fire off both email (SMS) and Discord notifications should any router's disk go above 65% used, so we can catch issues like this before they affect the network's usability.
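As a rough illustration of the cleanup step, here is a sketch of scripting the same log purge and reboot across several routers. The addresses, credentials, log path, and reboot command are all placeholders, and in reality each router was handled and verified one at a time rather than blindly looped over.

```python
# Sketch: clear oversized logs and reboot a list of routers one at a time.
# Addresses, log path, and the reboot command are placeholders.
import subprocess
import time

ROUTERS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]   # placeholder IPs
LOG_GLOB = "/var/log/*.log*"                            # placeholder log files

def ssh(host: str, command: str) -> None:
    """Run a command on a router via SSH, raising if it fails."""
    subprocess.run(["ssh", f"admin@{host}", command], check=True)

for host in ROUTERS:
    ssh(host, f"rm -f {LOG_GLOB}")   # free the disk first
    ssh(host, "reboot")              # then restart so routing comes back cleanly
    time.sleep(300)                  # give it time to recover before touching the next one
```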
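The real threshold rule lives in the NMS, but as a stand-in, the logic amounts to something like the sketch below: poll each router's disk usage and post a warning when it crosses 65%. The router list, SSH access, and the Discord webhook URL are assumptions for illustration.

```python
# Stand-in for the NMS alert rule: warn on Discord when a router's disk
# passes 65% used. Webhook URL and router list are placeholders; the real
# rule is configured in the NMS and also sends email (SMS).
import subprocess
import requests

ROUTERS = ["192.0.2.10", "192.0.2.11", "192.0.2.12"]   # placeholder IPs
WEBHOOK_URL = "https://discord.com/api/webhooks/..."    # placeholder webhook
THRESHOLD = 65                                          # percent used

def root_disk_usage(host: str) -> int:
    """Return the root filesystem usage percentage reported by df."""
    out = subprocess.run(
        ["ssh", f"admin@{host}", "df -P /"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Last line of POSIX df output: filesystem, blocks, used, available, use%, mount
    return int(out.strip().splitlines()[-1].split()[4].rstrip("%"))

for host in ROUTERS:
    usage = root_disk_usage(host)
    if usage > THRESHOLD:
        requests.post(WEBHOOK_URL, json={
            "content": f"Disk usage on {host} is at {usage}% (threshold {THRESHOLD}%)"
        })
```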
