Parodius Da! 'Takosuke' image ©1992 Konami Co., Ltd.

No unsolicited advertisements, no banners, no spam; just like it was in 1991...

MySQL table corruption

This morning, before maintenance had actually started, our MySQL server began acting oddly, showing signs of NFS-related issues. Anyone familiar with UNIX knows how NFS timeouts can more or less indefinitely stall a userland program, and we found many such stalled programs. We've since found the root cause and fixed it, but by that time the damage had been done.

We had to reboot the MySQL server without cleanly shutting things down. Specifically, shutdown, reboot, etc. would all cause disk buffers to get flushed -- and that includes NFS -- so we had to tell the kernel to shut down without flushing any I/O buffers (i.e. any cached I/O transactions would be lost) using reboot -q -n. This is a big no-no in the BSD world, but the circumstances justified it.

Sadly, this had a major effect on MySQL. There were 8 or 9 tables which mysqlcheck reported as corrupted, and using the --repair flag fixed them, but some rows were lost. Thus, there could be some table integrity problems.
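For users who want to run the same check-then-repair sequence against their own databases, here is a minimal Python sketch. It only builds the mysqlcheck command lines and does not execute them. The helper name mysqlcheck_argv, the database name "forum", and the choice to scan all databases by default are assumptions made for illustration, not a record of the exact commands we ran.

```python
import shlex


def mysqlcheck_argv(repair=False, databases=None):
    """Build a mysqlcheck command line (hypothetical helper; nothing is run).

    repair=False -> --check  (report problems only)
    repair=True  -> --repair (attempt a fix; rows may still be lost)
    """
    argv = ["mysqlcheck", "--repair" if repair else "--check"]
    if databases:
        # --databases treats every following name as a database name.
        argv += ["--databases", *databases]
    else:
        # Assumption: with no list given, scan every database on the server.
        argv.append("--all-databases")
    return argv


# Check first, and only repair once you know which tables are affected.
check_cmd = mysqlcheck_argv()
repair_cmd = mysqlcheck_argv(repair=True, databases=["forum"])  # made-up name
print(shlex.join(check_cmd))
print(shlex.join(repair_cmd))
```

Taking a copy of the data files before running a repair is a sensible precaution, since a repair can discard rows it cannot recover.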

To date 3 users have reported problems with their sites: two reported missing forum posts on their phpBB-based forums, and one reported an entire site outage through WordPress.

This is the first time we've seen data loss of this severity. This issue was not caused by a hardware malfunction -- the MySQL table corruption was caused by the above reboot command being executed, required as a result of NFS problems on the server.

Steps are being taken to ensure this situation does not recur in the future.

Co-location provider site-wide outage (1 hour)

Between 12:46 and 13:47 PDT (UTC-0700), our co-location provider appeared to experience a massive full-site outage. The provider's telephone support was also knocked offline, along with their Email. We therefore could not escalate the issue, nor contact any support or management staff regarding the outage.

During this outage, Parodius users and visitors would have witnessed timeouts when attempting to access hosted sites or fetch Email. Any Email sent to your Parodius account or domain name hosted by us would have been delayed by approximately 75-120 minutes.

All Parodius servers and services remained functional during the outage. The hour-long incident was with our co-location provider.

We have escalated this situation to multiple senior management individuals, in an attempt to ensure it does not recur in the future. Additionally, per our SLA with our provider, we have requested a service credit.

Maintenance postponed, MySQL server upgraded

The previously-mentioned maintenance has been postponed until a later date. Scheduling issues were the cause of the delays.

Regarding MySQL services: the MySQL server is once again up and functional, and has been upgraded to FreeBSD 7.2-PRERELEASE amd64. Previously, it ran an older OS release on i386.

We apologise for the downtime.

Datacenter maintenance

We are currently in the process of performing maintenance on nearly all of our servers, which includes hardware upgrades and further addition of remote management capabilities, as well as some operating system upgrades.

At this time, standard HTTP/Web services are functional, but anything that relies on MySQL will time out or otherwise result in errors.

In a short while, HTTP/Web services will be unavailable as we perform said hardware upgrades.

POP3/IMAP service interruption

From approximately 05:25 to 13:25 PDT, the POP3/IMAP service used for obtaining mail was intermittently unavailable. Your mail client may have returned authentication failures or other error messages during this time.

The SMTP service (mail from the Internet sent to your account) was not impacted -- only the service used for retrieving mail from your account via POP3/IMAP.

The root cause appears to be some sort of bug in FreeBSD's OpenPAM framework, but we are still in the process of figuring out what ultimately happened and why.

We have made changes to our POP3/IMAP service configuration, removing use of OpenPAM entirely, so this situation should not recur in the future.

Primary web/shell server failure — bad RAM

A few minutes after midnight, our primary web/shell server began behaving erratically -- random daemons were segfaulting, and periodic system scripts were erroring in bizarre ways (individual bytes in system reports being corrupted). The cause of the problem was apparent: one of the RAM modules in the system had gone bad.

The problem went from minor to severe at approximately 04:00 PST. From that point on web content was affected, and visitors may have witnessed odd behaviour with all sites.

I caught the problem shortly after waking up at around 05:00, and began working to mitigate impact. None of my mitigation ideas worked, so I was forced to migrate all accounts to a new box. The new server runs FreeBSD 7.1, has upgraded hardware, faster disks, uses ZFS to detect filesystem corruption, and is 64-bit.
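The idea behind checksum-based corruption detection can be shown in a few lines of Python. This is only a sketch of the general principle (store a checksum alongside each block, verify it on every read). It uses SHA-256 from the standard library for convenience and does not reflect how ZFS is actually implemented. The names store_block and read_block are made up for the example.

```python
import hashlib


def store_block(data: bytes):
    """Return the block together with a checksum computed at write time."""
    return data, hashlib.sha256(data).digest()


def read_block(data: bytes, checksum: bytes) -> bytes:
    """Return the data only if it still matches the stored checksum."""
    if hashlib.sha256(data).digest() != checksum:
        raise IOError("checksum mismatch: block is corrupted")
    return data


block, checksum = store_block(b"periodic system report")

# Simulate what bad RAM or a bad disk can do: flip one bit in one byte.
damaged = bytearray(block)
damaged[0] ^= 0x01
damaged = bytes(damaged)

assert read_block(block, checksum) == block  # intact data reads back fine
try:
    read_block(damaged, checksum)
    detected = False
except IOError:
    detected = True
print("corruption detected:", detected)
```

Without a stored checksum, the damaged block would simply be handed back to the application as if nothing had happened.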

Note that the migration from a 32-bit to a 64-bit system may require some users to recompile programs/software they have developed. Old binaries will not work. Some web boards, such as Matt's WWWBoard, often rely on C programs to "colourise" posts; these will need to be rebuilt.

Additionally, the new server uses a completely different Apache MPM for content serving: suPHP and cgiwrap are no longer needed to ensure PHP and CGI security. This should allow users to run CGI binaries from wherever they wish; they are no longer limited to their /cgi-bin/ directory (although that directory should still function as before).

Users are urged to thoroughly test the new system, especially with regards to PHP and CGI scripts, to ensure things are working properly. If you find anything broken, please contact me immediately and I will do my best to fix the issue.

Greylisting feature removed

As a result of numerous user complaints and concerns over mail being delayed for long durations, or in some cases, mail never arriving (which we believe is the fault of other providers' SMTP servers not respecting the temporary failure codes that greylisting induces), we have completely removed our greylisting service on all mail.

The trade-off is that the amount of spam you receive will very likely increase. We're continuing to tune our spam detection software as a result of the above change.

However, incoming mail should no longer be delayed.
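For those curious why greylisting delays mail in the first place, the mechanism can be sketched roughly as follows. The class name, the 300-second delay, and the exact reply strings are illustrative assumptions, not our former configuration. All addresses are placeholders.

```python
class Greylist:
    """Toy greylist keyed on (client IP, sender, recipient)."""

    def __init__(self, delay_seconds=300):
        self.delay = delay_seconds   # assumed value, for illustration only
        self.first_seen = {}         # triplet -> time of first attempt

    def check(self, client_ip, sender, recipient, now):
        triplet = (client_ip, sender, recipient)
        # Remember when this triplet was first seen (no-op on later attempts).
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.delay:
            # 4xx = temporary failure; a well-behaved server queues and retries.
            return "450 try again later"
        return "250 OK"


g = Greylist()
args = ("192.0.2.10", "alice@example.org", "user@example.com")
first_try = g.check(*args, now=0)      # unknown triplet: deferred
early_retry = g.check(*args, now=60)   # retried too soon: still deferred
late_retry = g.check(*args, now=900)   # retried after the delay: accepted
print(first_try, early_retry, late_retry, sep=" | ")
```

In this model, a sending server that treats the 450 reply as permanent, or never retries, would not get past the first branch, which is consistent with the reports of mail never arriving.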

Migration to OpenBSD spamd

For quite some time we've been using a form of greylisting on our public mail server known as postgrey. It's been fairly reliable, but spammers have adapted to it quite a bit over the past few years.

Today, we migrated to OpenBSD spamd, which works in an entirely different manner. One drawback to using OpenBSD spamd is that there will be no more X-Greylist header added to mails (useful for determining how long a mail was delayed due to greylisting or other SMTP-related problems).

Another drawback is that users will not be able to use our mail server as an SMTP server, since OpenBSD spamd is what will be answering to connections on port 25. You should ideally be using your ISP's mail server for mail delivery. If this is a problem for you, and you really must use our mail server for outbound mail, let us know — we can work around this problem. :-)

If you encounter any substantial delays when receiving mail over the next few days, please let me or the Parodius staff know. We may have to add some specific SMTP servers to our whitelist configuration, but otherwise things should work smoothly.

Production server kernel panic

Our primary production server (that is to say, the web and mail server) experienced a kernel panic this morning at approximately 10:49 PST. No data was lost during the crash (except for a very long Email I was in the process of writing...). The server had been up for over 133 days before the panic.

Sadly the kernel panic did not generate a vmcore image, so we're not able to diagnose post-mortem what exactly caused the crash. Our best guess is that there was some form of inode or softupdate corruption occurring during a disk I/O write, but this could be a completely incorrect diagnosis. We are certain the issue was not caused by any form of hardware failure.

Specific details of the crash are publicly available.

We are currently in the process of rebuilding the operating system and related binaries, in hopes that within the past 133 days someone had intentionally or inadvertently fixed the issue we reported. There will be another brief outage due to this maintenance. We'll provide an update when we have completed the work.

UPDATE: We've finished the maintenance. It turns out there was indeed some form of soft update or inode corruption occurring, which has now hopefully been fixed. The results: 2 files were impacted (possibly corrupted), and 1 file was lost. All impact was to one specific user's data; no other accounts were impacted. Those files will be restored from backups later tonight, so ultimately no data was lost.

Migrated to a new registrar

A couple of months ago, we migrated all the domains we own/manage over to a new registrar named eNom. So far they've been reliable, the control panel interface has been decent, and we haven't seen any sign of our records being sold to third parties (as is the case with OpenSRS-based registrars, sadly).

Additionally, we added a couple nameservers to our list; big thanks to the folks over at XName for providing free slave zone services! (Yes, we dropped them a decently-sized donation. :) )

SPF records removed

In early May, we mentioned that we would be updating our DNS records to reflect support for SPF (Sender Policy Framework), in an attempt to circumvent future spam, and also work together with other providers and users who rely on SPF.

However, our findings were somewhat inconclusive; a few different Parodius users informed us that Emails to themselves were on the verge of being marked as spam (by SpamAssassin). As it turned out, these mails were actually being given a very high score due to the SPF lookup being done by SA. For some reason, our SPF setup "wasn't working right"... except that the evidence being presented to us made no sense -- everything was, in fact, how it should be.

We took the time to ask some of the more clueful individuals on the spf-users mailing list, in hopes that someone there could inform us as to what the mistake was. For further details, see our thread.

The users were not very clueful at all, and there was a lot of speculation as to our OUTGOING mail being passed through SpamAssassin (which is in no way, shape, or form being done, nor is it even possible with our setup). Language barriers also became a major problem (which is odd, since all SPF documentation and details are in English). Finally, no one managed to shed any light as to what was really going on, despite all evidence presented.

Since we can't accept such flaws in technology, our SPF records have been removed from our DNS zones, and will not be put back until someone takes the time to explain exactly what's going on.

For now, it seems SPF relies on some incredibly inane assumptions about server configuration -- from what we've seen, it's as if SPF expects you to have a machine physically named and dedicated to handling SMTP traffic. Systems using IP aliases seem to fall victim to strange assumptions being made by SPF; something somewhere is making the assumption that the IP of whatever is handling the SMTP traffic should resolve to the same name as whatever gethostname(3) returns. If this is indeed done within SPF detection systems (or possibly related to sendmail; who knows!), this is a VERY bad assumption, and will eventually be noticed and discussed by other system administrators.

Beware of insecure OpenSRS-based registrars

Recently we at Parodius have become somewhat disappointed by Weblaunching, our present registrar, due to changes to their domain management system and strange integrations with other registries such as Enom (we've been trying out their system as well; similar experiences). Due to this, we decided to look at other OpenSRS-sanctioned registrars to see who else was available... and we came upon SpyProductions.

While filing to transfer one of our domains (used solely as a web and hosting sandbox) to SpyProductions, we encountered quite a few "interesting" -- and downright insecure -- aspects of their transfer and billing processes:

  1. Login authentication is done using HTTP, not HTTP with SSL -- meaning, your login/password credentials are being sent over the Internet in plain-text.
  2. Domain transfers are done using HTTP, not HTTP with SSL -- meaning, all billing information is being sent over the Internet in plain-text. This includes your billing information, credit card number, and CVN.
  3. In addition, transfers use HTTP GET, where all contents of the form fields are placed into the URL for extraction. The side-effect of this is that your browser now has a page cached on your local hard disk which contains all of your billing information, including your CC and CVN. Using HTTP POST (with PHP sessions for the sensitive information) would be better.
  4. An SSL-based method of contact was found via their "make contact" link, under "Secure Contact Form". The certificate used hasn't been signed by a valid CA (choose View Certificate); instead, SpyProductions signed their own certificate, making it completely worthless as far as security goes. I guess they felt paying US$49 was unreasonable; I mean, who really needs a legitimate CA to sign their SSL cert? ;-)
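To make item 3 concrete, here is a small Python sketch showing where form fields end up with GET versus POST. No request is actually sent. The endpoint URL and the card details are fake placeholders, and the field names are assumptions chosen for the example, not the registrar's real form.

```python
from urllib.parse import urlencode
from urllib.request import Request

# All values below are fake and the URL is a placeholder.
fields = {"card": "4111111111111111", "cvn": "123"}
endpoint = "http://registrar.example/transfer"

# GET: the form fields become part of the URL itself, so they can end up
# in browser history, caches, and server logs.
get_req = Request(endpoint + "?" + urlencode(fields), method="GET")

# POST: the same fields travel in the request body instead of the URL.
post_req = Request(endpoint, data=urlencode(fields).encode("ascii"), method="POST")

print(get_req.full_url)
print(post_req.full_url)
```

Note that POST only keeps the data out of the URL. Over plain HTTP the body is still sent unencrypted, so items 1 and 2 would need SSL regardless.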

Using Google, it was interesting to note that this registrar has a history of being involved in legal battles where customers of theirs have induced legal situations by attempting to perform shady activities, such as registering domains like cocacola.info and other nonsense. Admittedly, this isn't up to the registrar to handle, but SpyProductions looks to be a one-man operation (you can find the owner's blog online).

He seems like a decent enough fellow, but regardless of that fact, I wouldn't bother registering a domain with them -- or if you already have, consider closing your account with them and getting your CC number changed. All of the above is an accident waiting to happen...

Support for Sender ID SPF records

Parodius is now publishing Sender ID SPF records. Our SPF records are presently using SOFTFAIL (~all); this means that mail which does not pass SPF tests will be marked as "potentially" being sent from an invalid sender, but does not induce a 100% failure. We are using SOFTFAIL "just in case" things don't work correctly.
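As a rough illustration of what the trailing ~all means, the Python sketch below reads the qualifier in front of the all mechanism and maps it to a result name. It assumes a classic v=spf1 style record. The record shown is hypothetical (the address is from a documentation range) and is not our actual DNS data, and default_result is a made-up helper name.

```python
# Meaning of the qualifier that can prefix an SPF mechanism.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def default_result(spf_record):
    """Return the result assigned by the record's 'all' mechanism, if any."""
    terms = spf_record.split()
    if not terms or terms[0].lower() != "v=spf1":
        raise ValueError("not an SPF version 1 record")
    for term in terms[1:]:
        if term[0] in QUALIFIERS:
            qualifier, name = term[0], term[1:]
        else:
            qualifier, name = "+", term  # no prefix is the same as "+"
        if name.lower() == "all":
            return QUALIFIERS[qualifier]
    return None  # no 'all' mechanism present


# Hypothetical record, for illustration only.
record = "v=spf1 ip4:192.0.2.25 ~all"
print(default_result(record))
```

Switching from ~all to -all would turn the softfail into a hard fail, which is the step we are deliberately holding off on.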

Our SPF records presently do not apply to "subdomains" (i.e. foobar.parodius.com). In addition, our SMTP servers are now configured to do SPF lookups as well.

Important notes for Parodius users:

  1. Individuals sending mail through their ISP's mail servers with a user@parodius.com address may find that some mail gets rejected by Internet mail servers using SPF. These individuals should contact us to configure sending mail for user@parodius.com through our mail servers instead.
  2. Individuals bouncing mail (forwarding it without changing the headers) without changing the From: header line to match their own address may find that such mail gets rejected by Internet mail servers using SPF. This is a known limitation of SPF (the link content refers to bouncing as "forwarding", and forwarding as "remailing"). Users should configure their mail client to change the From: line, or do it manually, before bouncing mail. In the future, we will likely adopt the SRS model, which should address this issue.
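Both notes come down to the same thing: the receiving server compares the IP address of whichever host hands over the message against the sender domain's published record. The Python sketch below shows this with a deliberately tiny evaluator that understands only the ip4 and all mechanisms. Real SPF has many more mechanisms, and the record, the addresses, and the function name spf_check are assumptions for illustration.

```python
import ipaddress


def spf_check(record, client_ip):
    """Evaluate only ip4 and all mechanisms (a deliberately tiny subset)."""
    results = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}
    ip = ipaddress.ip_address(client_ip)
    for term in record.split()[1:]:              # skip the leading v=spf1
        qualifier = term[0] if term[0] in results else "+"
        mechanism = term.lstrip("+-~?")
        if mechanism.startswith("ip4:"):
            if ip in ipaddress.ip_network(mechanism[4:], strict=False):
                return results[qualifier]
        elif mechanism == "all":
            return results[qualifier]
    return "neutral"                             # nothing matched


record = "v=spf1 ip4:192.0.2.25 ~all"   # hypothetical record for the sender's domain
direct = spf_check(record, "192.0.2.25")      # sent through the listed mail server
bounced = spf_check(record, "198.51.100.7")   # re-sent unchanged from another host
print(direct, bounced)
```

The bounced message fails the check only because it arrives from an unlisted address. Rewriting the sender, as SRS does, makes the check run against the forwarder's own domain instead.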