Parodius Da! 'Takosuke' image ©1992 Konami Co., Ltd.

No unsolicited advertisements, no banners, no spam; just like it was in 1991...

February's downtime post-mortem

First and foremost, sorry for the delayed post. I know some of you have been wanting to know what happened back in February; I haven't had the time to finish the write-up until now. My apologies.

During the evening (and following morning) of February 6th, we relocated all of our servers to a brand new datacenter which our co-location provider spent the past few years completing.

The improvements in the new datacenter are incredible; you'd have to visit it to truly understand. One data point worth noting is that our ambient server temperature dropped almost 15°C as a result of the move! The datacenter's cold aisle containment implementation really does work. There are so many improvements in the new datacenter that it'd be excessive to list them all here.

The relocation itself took approximately 7 hours, with an additional 5-6 hours spent reconfiguring all our networking equipment, individual servers, and DNS. There were only two of us doing the work.

The trickiest part was dealing with the fact that we'd been given a new network IP block. Our previous block was larger than what we originally anticipated needing, and due to IPv4 shortages, we returned the block and got something smaller.

Obviously this affected DNS. Those of you who have domains hosted here through a registrar of your own choice probably received an Email from me, requesting that you update your nameserver records to reflect the change.
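For anyone who wants to double-check their own records after a renumbering like this, here is a minimal sketch of the idea: test whether each nameserver address falls inside the new block. The two networks and the hostnames below are hypothetical placeholders taken from the reserved documentation ranges, not our real address blocks.

```python
import ipaddress

# Hypothetical blocks (RFC 5737 documentation ranges), NOT the real ones.
OLD_BLOCK = ipaddress.ip_network("198.51.100.0/24")
NEW_BLOCK = ipaddress.ip_network("192.0.2.0/26")


def classify_address(addr):
    """Return "new", "old", or "unknown" depending on which block holds addr."""
    ip = ipaddress.ip_address(addr)
    if ip in NEW_BLOCK:
        return "new"
    if ip in OLD_BLOCK:
        return "old"
    return "unknown"


def stale_records(records):
    """Given {hostname: address}, list hostnames not pointing into NEW_BLOCK."""
    return sorted(
        name for name, addr in records.items()
        if classify_address(addr) != "new"
    )
```

Feeding it a mapping of nameserver hostnames to the addresses your registrar currently holds should return only the entries that still need updating.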

We also took the opportunity to make the following changes:

  • Migrated tertiary DNS from XName to EveryDNS. Multiple XName DNS servers were unavailable for several days, supposedly due to DDoS attacks. EveryDNS also offers more geographic redundancy
  • Upgraded our network switch from 100Mbit to gigabit
  • Upgraded most of our servers from 2-4GB of RAM to 8GB ECC RAM
  • Upgraded two of our servers to 64-bit FreeBSD 8.0-STABLE
  • Added a new 4-disk server which provides up to 30 days' worth of data retention (backups) for all our systems

I hope this provides a thorough run-down of all the changes we made during the migration.

MySQL table corruption

This morning, before maintenance had started, our MySQL server began acting oddly. The server showed signs of NFS-related issues. Anyone familiar with UNIX knows how NFS timeouts can more or less indefinitely stall a userland program, and we found many stalled processes. We've since found the root cause and fixed it, but by that time the damage had been done.

We had to reboot the MySQL server without cleanly shutting things down. Specifically, shutdown, reboot, etc. would all cause disk buffers to get flushed -- and that includes NFS -- so we had to tell the kernel to shut down without flushing any I/O buffers (i.e. any cached I/O transactions would be lost) using reboot -q -n. This is a big no-no in the BSD world, but the circumstances justified it.

Sadly, this had a major effect on MySQL. mysqlcheck reported 8 or 9 tables as corrupted. Using the --repair flag fixed them, but some rows were lost, so some tables may have integrity problems.

To date, three users have reported problems with their sites: two reported missing forum posts on their phpBB-based forums, and one reported that their entire WordPress site was down.
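For users who would like to check their own databases, the check-then-repair sequence can be sketched roughly as below. This is an illustration and not the exact procedure we ran: the helper names and the forum_db database name are made up, and it assumes mysqlcheck is on the PATH with credentials supplied by your own client configuration (such as ~/.my.cnf).

```python
import subprocess


def mysqlcheck_command(database=None, repair=False):
    """Build a mysqlcheck argument list (hypothetical helper).

    Checks by default; pass repair=True to use --repair instead.
    With no database given, every database is covered.
    """
    mode = "--repair" if repair else "--check"
    if database is None:
        target = ["--all-databases"]
    else:
        target = ["--databases", database]
    return ["mysqlcheck", mode] + target


def run_mysqlcheck(database=None, repair=False):
    """Run mysqlcheck and return its standard output as text."""
    result = subprocess.run(
        mysqlcheck_command(database, repair),
        capture_output=True,
        text=True,
        check=False,  # inspect the report rather than raising on failure
    )
    return result.stdout
```

A sensible order is to run the check first, read the report, take a dump of anything still readable, and only then run with repair=True on the affected database. Note that --repair issues REPAIR TABLE, which applies to MyISAM-style tables.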

This is the first time we've seen data loss of this severity. It was not caused by a hardware malfunction -- the MySQL table corruption was caused by the reboot command described above, which the NFS problems on the server forced us to use.

Steps are being taken to ensure this situation does not recur in the future.

Co-location provider site-wide outage (1 hour)

Between 12:46 and 13:47 PDT (UTC-0700), our co-location provider appeared to experience a massive full-site outage. The provider's telephone support and Email were also knocked offline. We therefore could not escalate the issue, nor contact any support or management staff regarding the outage.

During this outage, Parodius users and visitors would have witnessed timeouts when attempting to access hosted sites or fetch Email. Any Email sent to your Parodius account or domain name hosted by us would have been delayed by approximately 75-120 minutes.

All Parodius servers and services remained functional during the outage. The hour-long incident was on our co-location provider's side.

We have escalated this situation to multiple members of senior management, in an attempt to ensure it does not recur in the future. Additionally, per our SLA with our provider, we have requested a service credit.

Maintenance postponed, MySQL server upgraded

The previously-mentioned maintenance has been postponed until a later date. Scheduling issues were the cause of the delays.

Regarding MySQL services: the MySQL server is once again up and functional, and has been upgraded to FreeBSD 7.2-PRERELEASE amd64. Previously, it ran an older OS version on i386.

We apologise for the downtime.

Datacenter maintenance

We are currently in the process of performing maintenance on nearly all of our servers, which includes hardware upgrades and further addition of remote management capabilities, as well as some operating system upgrades.

At this time, standard HTTP/Web services are functional, but anything that relies on MySQL will time out or otherwise return errors.

In a short while, HTTP/Web services will be unavailable as we perform said hardware upgrades.

POP3/IMAP service interruption

From approximately 05:25 to 13:25 PDT, the POP3/IMAP service used for obtaining mail was intermittently unavailable. Your mail client may have returned authentication failures or other error messages during this time.

The SMTP service (mail from the Internet sent to your account) was not impacted -- only the service used for retrieving mail from your account via POP3/IMAP.

The root cause appears to be some sort of bug in FreeBSD's OpenPAM framework, but we are still in the process of figuring out what ultimately happened and why.

We have made changes to our POP3/IMAP service configuration, removing use of OpenPAM entirely, so this situation should not recur in the future.

Primary web/shell server failure — bad RAM

A few minutes after midnight, our primary web/shell server began behaving erratically -- random daemons were segfaulting, and periodic system scripts were erroring in bizarre ways (individual bytes in system reports being corrupted). The cause of the problem was apparent: one of the RAM modules in the system had gone bad.

The problem went from minor to severe at approximately 04:00 PST. From that point on, web content was affected, and visitors may have witnessed odd behaviour on all sites.

I caught the problem shortly after waking up at around 05:00, and began working to mitigate impact. None of my mitigation ideas worked, so I was forced to migrate all accounts to a new box. The new server runs FreeBSD 7.1, has upgraded hardware, faster disks, uses ZFS to detect filesystem corruption, and is 64-bit.
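To illustrate why checksumming helps with this kind of failure: ZFS keeps a checksum for every block and verifies it on read, so a flipped bit is noticed rather than silently served. The sketch below shows only the general idea, not ZFS's on-disk format; SHA-256 from Python's standard library is used purely for illustration, and the helper names are made up.

```python
import hashlib


def checksum(block):
    """Return a hex digest for a block of bytes (SHA-256, for illustration)."""
    return hashlib.sha256(block).hexdigest()


def is_intact(block, stored_checksum):
    """True if the block still matches the checksum recorded at write time."""
    return checksum(block) == stored_checksum


def flip_bit(block, byte_index, bit):
    """Simulate memory or disk corruption by flipping one bit in the block."""
    damaged = bytearray(block)
    damaged[byte_index] ^= 1 << bit
    return bytes(damaged)
```

Recording checksum(data) at write time and calling is_intact(data, stored) on every read is enough to catch the single-byte corruption described above.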

Note that the migration from a 32-bit to a 64-bit system may require some users to recompile programs/software they have developed. Old binaries will not work. Some web boards, such as Matt's WWWBoard, often rely on C programs to "colourise" posts; these will need to be rebuilt.
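If you are unsure which of your programs are affected, one rough way to find out is to look at the ELF header of each binary: the fifth byte records whether the file was built as 32-bit or 64-bit. The sketch below reads just that byte; the function names are made up for illustration, and anything that is not an ELF file (shell scripts, Perl, PHP) is reported as None.

```python
ELF_MAGIC = b"\x7fELF"
# EI_CLASS values from the ELF specification.
ELF_CLASSES = {1: "32-bit", 2: "64-bit"}


def elf_class_from_header(header):
    """Classify a file from its first bytes.

    Returns "32-bit", "64-bit", "unknown" for an ELF file with an
    unrecognised class byte, or None if the data is not ELF at all.
    """
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return None
    return ELF_CLASSES.get(header[4], "unknown")


def elf_class(path):
    """Read the first five bytes of a file on disk and classify it."""
    with open(path, "rb") as fh:
        return elf_class_from_header(fh.read(5))
```

Running elf_class over the files in your cgi-bin directory and rebuilding anything reported as "32-bit" should cover most cases.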

Additionally, the new server uses a completely different Apache MPM for content serving: suPHP and cgiwrap are no longer needed to ensure PHP and CGI security. This should allow users to run CGI binaries from wherever they wish; they are no longer limited to their /cgi-bin/ directory (although that directory should still function as before).

Users are urged to thoroughly test the new system, especially with regard to PHP and CGI scripts, to ensure things are working properly. If you find anything broken, please contact me immediately and I will do my best to fix the issue.