Month: April 2024

Outages and what not…

We had an extended outage today that started around 0400EST and lasted
until around 1500EST, caused by either a kernel bug or a hardware fault
that rebooted the primary physical host without warning. When these
reboots happen, one of our volunteer staff has to log into the
management console and restart some services on our routing gear
because of a bug in those VMs, which means routing stays down until
someone wakes up to deal with it. Not good for anyone relying on our
network for transit or hosting.

We reached out to the data center and they tested both the memory and
the CPU in our “Adrian” host for faults, and both came back good. So on
a hunch, we updated the board firmware to a newer release and will be
monitoring Adrian very closely for the next few days to see if this
fixes the issue.

Operational Again

Our team has the network operational once again
and will continue to keep an eye on things for the next
few hours. In the event that we experience downtime
again, project members are encouraged to follow
what our team is doing over on our status page.

Unexpected Reboot

Combing through the physical host logs line by line turned up a set
of errors that might be a concern for our operations, but we aren’t
sure just yet.

Apr 25 12:04:34 adrian kernel: mce: [Hardware Error]: Machine check events logged
Apr 25 12:04:34 adrian kernel: microcode: CPU12: patch_level=0x08701021
Apr 25 12:04:34 adrian kernel: microcode: CPU13: patch_level=0x08701021
Apr 25 12:04:34 adrian kernel: microcode: CPU14: patch_level=0x08701021
Apr 25 12:04:34 adrian kernel: microcode: CPU15: patch_level=0x08701021
Apr 25 12:04:34 adrian kernel: mce: [Hardware Error]: CPU 12: Machine Check: 0 Bank 5: bea0000000000108
Apr 25 12:04:34 adrian kernel: mce: [Hardware Error]: TSC 0 ADDR 1ffffa5e675c2 MISC d012000100000000 SYND 4d000000 IPID 500b000000000 
Apr 25 12:04:34 adrian kernel: mce: [Hardware Error]: PROCESSOR 2:870f10 TIME 1714061070 SOCKET 0 APIC 9 microcode 8701021
Apr 25 12:04:34 adrian kernel: microcode: CPU0: patch_level=0x08701021
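
To make lines like these easier to read, here is a minimal sketch of how the machine check status word and the TIME field could be decoded. It assumes the log format shown above and only looks at the top status bits (63 down to 57), which as far as we know are laid out the same way across x86 vendors. The function names (parse_mce, decode_status, event_time_utc) are our own for illustration and not part of any existing tool, and the bank-specific error code in the low bits is not interpreted here.

```python
import re
from datetime import datetime, timezone

# Top bits of the machine check status register (assumed common x86 layout).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was not corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # MISC register holds valid data
    58: "ADDRV",  # ADDR register holds valid data
    57: "PCC",    # processor context may be corrupt
}

# Matches e.g. "CPU 12: Machine Check: 0 Bank 5: bea0000000000108"
MCE_LINE = re.compile(r"CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]{16})")
# Matches e.g. "TIME 1714061070"
TIME_FIELD = re.compile(r"\bTIME (\d+)\b")


def decode_status(status):
    """Return the names of the flag bits that are set in a status word."""
    return [name for bit, name in STATUS_FLAGS.items() if (status >> bit) & 1]


def parse_mce(line):
    """Extract CPU, bank and decoded flags from one log line, or None."""
    match = MCE_LINE.search(line)
    if match is None:
        return None
    status = int(match.group(3), 16)
    return {
        "cpu": int(match.group(1)),
        "bank": int(match.group(2)),
        "flags": decode_status(status),
    }


def event_time_utc(line):
    """Convert the TIME field (Unix seconds) to a UTC datetime, or None."""
    match = TIME_FIELD.search(line)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


if __name__ == "__main__":
    sample = ("Apr 25 12:04:34 adrian kernel: mce: [Hardware Error]: "
              "CPU 12: Machine Check: 0 Bank 5: bea0000000000108")
    print(parse_mce(sample))
```

Run against the Bank 5 line above, this reports the UC and PCC flags as set, which would be consistent with an uncorrected error, though it does not tell us on its own whether the cause is hardware, firmware or something else.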

We are continuing to check on the machine and watching for problems
that may be affecting us.

We got in touch with the data center and their response was:
“As these are unmanaged servers, we do not actively monitor
customer services, so we do not know when or why a services
goes offline.” We understand that; the operation of the
hardware is on us, but from the looks of it this is
something outside of our project.

Our team will continue to work on restoring network services
and will monitor the system for the next few hours to see
if any issues arise.

Our team received alarms around 1214EST this afternoon
reporting that large portions of our network had gone offline
without warning. After logging into our management consoles
and looking things over, it seems that our physical host
“Adrian” rebooted.

We have reached out to the data center to see what may have
happened and our team is in the process of restoring our
network functionality.