This morning, shortly before scheduled maintenance was due to start, our
MySQL server began acting oddly. The MySQL server showed signs of
NFS-related issues. Anyone familiar with UNIX knows how NFS timeouts
can more or less indefinitely stall a userland program, and we found
many of those. We've since found the root cause and fixed it, but by
that time the damage had been done.
We had to reboot the MySQL server without cleanly shutting things down.
Specifically, shutdown, reboot, etc. would all
cause disk buffers to get flushed -- and that includes NFS -- so we had
to tell the kernel to shut down without flushing any I/O buffers (i.e.
any cached I/O transactions would be lost) using
reboot -q -n. This is a big no-no in the BSD world, but
the circumstances justified it.
Sadly, this had a major effect on MySQL. There were 8 or 9 tables which
mysqlcheck reported as corrupted, and using the
--repair flag fixed them, but some rows were lost. Thus,
there could be some table integrity problems.
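The check-then-repair sequence described above can be sketched as a small Python wrapper around mysqlcheck. This is only a minimal sketch, not the exact commands that were run: `--check`, `--repair` and `--all-databases` are standard mysqlcheck options, but the helper names and the database and table names below are hypothetical.

```python
import subprocess


def build_mysqlcheck_cmd(repair=False, database=None, tables=()):
    """Build an argv list for mysqlcheck (hypothetical helper name)."""
    cmd = ["mysqlcheck", "--repair" if repair else "--check"]
    if database is None:
        # No database given: examine every database on the server.
        cmd.append("--all-databases")
    else:
        cmd.append(database)
        cmd.extend(tables)
    return cmd


def run_mysqlcheck(cmd):
    """Run the command and return its text output (needs a reachable server)."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # First pass: report problems only, changing nothing.
    print(" ".join(build_mysqlcheck_cmd()))
    # Second pass: repair one table that the first pass flagged.
    # "forum_db" and "phpbb_posts" are made-up names for illustration.
    print(" ".join(build_mysqlcheck_cmd(repair=True, database="forum_db",
                                        tables=["phpbb_posts"])))
```

Running the check pass first and repairing only the flagged tables keeps the repair step narrow. As noted above, a repair can discard rows it cannot recover, so a copy of the table files beforehand is worth having.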
To date 3 users have reported problems with their sites: two reported
missing forum posts on their phpBB-based forums, and one reported an
entire site outage through WordPress.
This is the first time we've seen data loss of this severity. This
issue was not caused by a hardware malfunction -- the MySQL table
corruption was caused by the above reboot command being
executed, required as a result of NFS problems on the server.
Steps are being taken to ensure this situation does not recur in the
future.
Between 12:46 and 13:47 PDT (UTC-0700), our co-location provider appeared to
experience a massive full-site outage. The provider's telephone support was
also knocked offline, as was their Email. We therefore could not escalate the
issue, nor contact any support or management staff regarding the outage.
During this outage, Parodius users and visitors would have witnessed timeouts
when attempting to access hosted sites or fetch Email. Any Email sent to your
Parodius account or domain name hosted by us would have been delayed by
approximately 75-120 minutes.
All Parodius servers and services remained functional during the outage. The
hour-long incident was with our co-location provider.
We have escalated the severity of this situation to multiple senior management
individuals, in an attempt to ensure it does not recur in the future. Additionally,
per our SLA with our provider, we have requested a service credit.
The previously-mentioned maintenance has been postponed until
a later date. Scheduling issues were the cause of the delays.
Regarding MySQL services: the MySQL server is once again up
and functional, and has been upgraded to FreeBSD 7.2-PRERELEASE
amd64. Previously, it ran an older OS release on i386.
We apologise for the downtime.
We are currently in the process of performing maintenance on
nearly all of our servers, which includes hardware upgrades
and further addition of remote management capabilities, as
well as some operating system upgrades.
At this time, standard HTTP/Web services are functional, but
anything that relies on MySQL will time out or otherwise
result in errors.
In a short while, HTTP/Web services will be unavailable as we
perform said hardware upgrades.
From approximately 05:25 to 13:25 PDT, the POP3/IMAP service
used for obtaining mail was intermittently unavailable. Your mail
client may have returned authentication failures or other error
messages during this time.
The SMTP service (mail from the Internet sent to your account)
was not impacted -- only the service used for retrieving mail from
your account via POP3/IMAP.
The root cause appears to be some sort of bug in FreeBSD's OpenPAM
framework, but we are still in the process of figuring out what
ultimately happened and why.
We have made changes to our POP3/IMAP service configuration,
removing use of OpenPAM entirely, so this situation should not recur
in the future.
A few minutes after midnight, our primary web/shell server
began behaving erratically -- random daemons were segfaulting, and periodic
system scripts were erroring in bizarre ways (individual bytes in system
reports being corrupted). The cause of the problem was apparent: one of
the RAM modules in the system had gone bad.
The problem went from minor to severe at approximately 04:00 PST. Web
content was affected, during which time visitors may have witnessed odd
behaviour with all sites.
I caught the problem shortly after waking up at around 05:00, and began
working to mitigate impact. None of my mitigation ideas worked, so I was
forced to migrate all accounts to a new box. The new server runs FreeBSD
7.1, has upgraded hardware, faster disks, uses ZFS to detect filesystem
corruption, and is 64-bit.
Note that the migration from a 32-bit to a 64-bit system may require some
users to recompile programs/software they have developed. Old binaries
will not work. Some web boards, such as Matt's WWWBoard, often rely on
C programs to "colourise" posts; these will need to be rebuilt.
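To find out which of your files are old 32-bit binaries that need rebuilding, something along these lines may help. It is a minimal sketch that only inspects the ELF header (the magic bytes plus the class byte at offset 4); the function names are hypothetical, and it says nothing about whether a given binary will actually run.

```python
import sys

ELF_MAGIC = b"\x7fELF"
EI_CLASS_OFFSET = 4  # e_ident[EI_CLASS]: 1 = 32-bit, 2 = 64-bit


def elf_class(header: bytes) -> str:
    """Classify the leading bytes of a file as 32-bit ELF, 64-bit ELF, or not ELF."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        return "not ELF"
    return {1: "32-bit", 2: "64-bit"}.get(header[EI_CLASS_OFFSET], "unknown")


def elf_class_of_file(path: str) -> str:
    """Read just enough of a file to classify it."""
    with open(path, "rb") as fh:
        return elf_class(fh.read(5))


if __name__ == "__main__":
    # Usage: python elfclass.py ~/public_html/cgi-bin/*
    for path in sys.argv[1:]:
        print(f"{path}: {elf_class_of_file(path)}")
```

Anything reported as "32-bit" is a candidate for recompiling on the new server; scripts (Perl, PHP, shell) show up as "not ELF" and are unaffected by the change of architecture.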
Additionally, the new server uses a completely different Apache MPM for
content serving: suPHP and cgiwrap are no longer needed to ensure PHP and
CGI security. Users should now be able to run CGI binaries from wherever
they wish, and are no longer limited to their /cgi-bin/
directory (although that directory should still function as before).
Users are urged to thoroughly test the new system, especially with
regard to PHP and CGI scripts, to ensure things are working properly.
If you find anything broken, please contact me immediately and I will
do my best to fix the issue.
As a result of numerous user complaints and concerns over mail being delayed
for long durations, or in some cases, mail never arriving (which we believe
is the fault of other providers' SMTP servers not respecting the temporary
failure codes that greylisting induces), we have completely removed our
greylisting service on all mail.
The trade-off is that the amount of spam you receive will very likely
increase. We're continuing to tune our spam detection software as a result
of the above change.
However, incoming mail should no longer be delayed.
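For readers wondering why greylisting delays mail (and loses it when the sending server misbehaves), the idea can be sketched as follows. This is a simplified illustration, not our actual configuration: the class name, the 300-second delay and the reply strings are assumptions chosen for the example.

```python
import time

TEMPFAIL = "451 4.7.1 Greylisted, please try again later"
ACCEPT = "250 OK"


class Greylist:
    """Temporarily reject each new (client IP, sender, recipient) triplet."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay   # seconds a new triplet must wait (assumed value)
        self.clock = clock           # injectable so the demo needs no real waiting
        self.first_seen = {}         # triplet -> time of the first delivery attempt

    def check(self, client_ip, sender, recipient):
        triplet = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Remember the first attempt; later attempts reuse the stored time.
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.min_delay:
            return TEMPFAIL          # a well-behaved sender queues and retries
        return ACCEPT


if __name__ == "__main__":
    fake_now = [0.0]
    gl = Greylist(clock=lambda: fake_now[0])
    args = ("192.0.2.1", "someone@example.org", "user@example.com")
    print(gl.check(*args))   # first attempt: temporary failure
    fake_now[0] = 600.0
    print(gl.check(*args))   # retry after the delay: accepted
```

The mail only gets through if the sender retries after the 4xx reply. A server that treats the temporary failure as permanent never comes back, which matches the "mail never arriving" reports above.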
For quite some time we've been using a form of greylisting on our public
mail server known as postgrey.
It's been fairly reliable, but spammers have adapted to it quite a bit over
the past few years.
Today, we migrated to OpenBSD spamd,
which works in an entirely different manner. One drawback to using OpenBSD spamd
is that there will be no more X-Greylist header added to mails (useful
for determining how long a mail was delayed due to greylisting or other SMTP-related
problems).
Another drawback is that users will not be able to use our mail server as an SMTP
server, since OpenBSD spamd is what will be answering to connections on port 25.
You should ideally be using your ISP's mail server for mail delivery. If this is
a problem for you, and you really must use our mail server for outbound mail,
let us know -- we can work around this problem. :-)
If you encounter any substantial delays when receiving mail over the next few
days, please let me or the Parodius staff know. We may have to add some
specific SMTP servers to our whitelist configuration, but otherwise things
should work smoothly.
Our primary production server (that is to say, the web and mail server)
experienced a kernel panic this morning at approximately 10:49 PST.
No data was lost during the crash (except for a very long Email I was in
the process of writing...). The server remained up for over 133 days.
Sadly the kernel panic did not generate a vmcore image, so we're not
able to diagnose post-mortem what exactly caused the crash. Our best
guess is that there was some form of inode or softupdate corruption
occurring during a disk I/O write, but this could be a completely
incorrect diagnosis. We are certain the issue was not caused by
any form of hardware failure.
Specific details of the crash
are publicly available.
We are currently in the process of rebuilding the operating system
and related binaries, in hopes that within the past 133 days someone
had intentionally or inadvertently fixed the issue we reported. There
will be another brief outage due to this maintenance. We'll provide an
update when we have completed the work.
UPDATE: We've finished the maintenance. It turns out there was indeed
some form of soft update or inode corruption occurring, which has now
hopefully been fixed. The results: 2 files were impacted (possibly
corrupted), and 1 file was lost. All impact was to one specific user's
data; no other accounts were impacted. Those files will be restored from
backups later tonight, so ultimately no data was lost.
A couple months ago, we migrated all the domains we own/manage over to a new
registrar named eNom. So far they've been
reliable, the control panel interface has been decent, and we haven't seen any
sign of our records being sold to third parties (as is the case with
OpenSRS-based registrars, sadly).
Additionally, we added a couple nameservers to our list; big thanks to
the folks over at XName for providing
free slave zone services! (Yes, we dropped them a decently-sized donation. :) )
In early May, we mentioned that we would be updating our DNS records to reflect support for
SPF (Sender Policy Framework), in an attempt to circumvent future spam, and also work together
with other providers and users who rely on SPF.
However, our findings were somewhat inconclusive; a few different Parodius users informed us
that Emails to themselves were on the verge of being marked as spam (by SpamAssassin). As it
turned out, these mails were actually being given a very high score due to the SPF lookup being
done by SA. For some reason, our SPF setup "wasn't working right"... except that the evidence
being presented to us made no sense -- everything was, in fact, how it should be.
We took the time to ask some of the more clueful individuals on the spf-users mailing list, in
hopes that someone there could inform us as to what the mistake was. For further details,
see our thread.
The users were not very clueful at all, and there was a lot of speculation as to our OUTGOING
mail being passed through SpamAssassin (which is in no way, shape, or form being done, nor is it
even possible with our setup). Language barriers also became a major problem (which is odd,
since all SPF documentation and details are in English). Finally, no one managed to shed
any light as to what was really going on, despite all evidence presented.
Since we can't accept such flaws in technology, our SPF records have been removed from our
DNS zones, and will not be put back until someone takes the time to explain exactly what's
going on.
For now, it seems that SPF relies on some incredibly inane assumptions about server
configuration -- from what we've seen, it's as if SPF expects you to have a machine
physically named and dedicated to handling SMTP traffic. Systems using IP aliases seem to
fall victim to strange assumptions being made by SPF; something somewhere is
making the assumption that the IP of whatever is handling the SMTP traffic should resolve
to the same name as whatever gethostname(3) returns. If this is indeed done within SPF
detection systems (or possibly related to sendmail; who knows!), this is a VERY bad
assumption, and will eventually be noticed and discussed by other system administrators.
Recently we at Parodius have become somewhat disappointed by
Weblaunching,
our present registrar, due to changes to their domain management system and strange integrations with other
registries such as Enom (we've been trying out their system as well; similar experiences). Due to this, we
decided to look at other
OpenSRS-sanctioned
registrars to see who else was available... and we came upon
SpyProductions.
While filing to transfer one of our domains (used solely as a web and hosting sandbox) to SpyProductions, we
encountered quite a few "interesting" -- and downright insecure -- aspects of their transfer and billing processes:
- Login authentication is done using HTTP, not HTTP with SSL -- meaning, your login/password credentials are
being sent over the Internet in plain-text.
- Domain transfers
are done using HTTP, not HTTP with SSL -- meaning, all billing information is being sent over the Internet
in plain-text. This includes your billing information, credit card number, and CVN.
- In addition, transfers use HTTP GET, where all contents of the form fields are placed into the URL for
extraction. The side-effect of this is that your browser now has a page cached on your local hard disk which
contains all of your billing information, including your CC and CVN. Using HTTP POST (with PHP sessions for the
sensitive information) would be better.
- An SSL-based method of contact was found via their
"make contact"
link, under
"Secure Contact Form".
The certificate used hasn't been signed by a valid CA (choose View Certificate); instead, SpyProductions signed
their own certificate, making it completely worthless as far as security goes. I guess they felt
paying US$49
was unreasonable; I mean, who really needs a legitimate CA to sign their SSL cert? ;-)
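The GET-versus-POST point in the third item above can be illustrated with a short Python sketch. Nothing is sent over the network; the requests are only constructed so the difference is visible. The URL, field names and card details are made up for the example and have nothing to do with the registrar's real form.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and form fields, for illustration only.
TRANSFER_URL = "https://registrar.example/transfer"
FORM = {"domain": "example.com", "cc_number": "4111111111111111", "cvn": "123"}


def build_get(url, fields):
    """GET: every field ends up in the URL itself."""
    return Request(f"{url}?{urlencode(fields)}", method="GET")


def build_post(url, fields):
    """POST: the fields travel in the request body; the URL stays clean."""
    return Request(url, data=urlencode(fields).encode("ascii"), method="POST")


if __name__ == "__main__":
    get_req = build_get(TRANSFER_URL, FORM)
    post_req = build_post(TRANSFER_URL, FORM)
    print("GET  URL :", get_req.full_url)
    print("POST URL :", post_req.full_url)
    print("POST body:", post_req.data.decode("ascii"))
```

With GET, the card number becomes part of the URL, so it lands in browser history, caches and server access logs. POST avoids that, but note that POST over plain HTTP is still readable on the wire; only SSL protects the data in transit.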
Using Google, it was interesting to note that this registrar has had a history of being involved in legal
battles where customers of theirs have induced legal situations by attempting to perform shady activities,
such as registering domains like cocacola.info and other nonsense. Admittedly, this isn't
up to the registrar to handle, but SpyProductions looks to be a one-man operation (you can find the owner's
blog online).
He seems like a decent enough fellow, but regardless of that fact, I wouldn't bother registering a domain
with them -- or if you already have, consider closing your account with them and getting your CC number changed.
All of the above is an accident waiting to happen...
Parodius is now publishing
Sender ID SPF
records. Our SPF records are presently using SOFTFAIL (~all); this means that mail
which does not pass SPF tests will be marked as "potentially" being sent from an invalid sender, but does not
induce a 100% failure. We are using SOFTFAIL "just in case" things don't work correctly.
Our SPF records presently do not apply to "subdomains" (i.e. foobar.parodius.com). In addition, our SMTP
servers are now configured to do SPF lookups as well.
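To show what SOFTFAIL means in practice, here is a minimal sketch of how a receiving server might evaluate a record ending in ~all. It handles only the `ip4:` and `all` mechanisms, and the record and addresses are documentation examples, not our real DNS data. A real checker also resolves `a`, `mx`, `include` and so on via DNS.

```python
import ipaddress

# SPF qualifier characters and the result each one produces on a match.
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def evaluate_spf(record: str, client_ip: str) -> str:
    """Evaluate a (very small subset of an) SPF record against a client IP."""
    terms = record.split()
    if not terms or terms[0].lower() != "v=spf1":
        return "none"
    ip = ipaddress.ip_address(client_ip)
    for term in terms[1:]:
        qualifier = "+"                      # default qualifier is "pass"
        if term[0] in QUALIFIERS:
            qualifier, term = term[0], term[1:]
        mechanism = term.lower()
        if mechanism == "all":               # matches every sender
            return QUALIFIERS[qualifier]
        if mechanism.startswith("ip4:"):
            network = ipaddress.ip_network(mechanism[4:], strict=False)
            if ip in network:
                return QUALIFIERS[qualifier]
        # Other mechanisms need DNS lookups and are skipped in this sketch.
    return "neutral"                         # nothing matched


if __name__ == "__main__":
    record = "v=spf1 ip4:192.0.2.0/24 ~all"  # hypothetical record
    print(evaluate_spf(record, "192.0.2.10"))     # listed server
    print(evaluate_spf(record, "198.51.100.7"))   # unlisted server
```

A listed server gets "pass"; anything else falls through to ~all and gets "softfail", which receivers typically treat as a hint rather than grounds for rejection. Changing ~all to -all would turn that result into a hard "fail".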
Important notes for Parodius users:
- Individuals sending mail through their ISP's mail servers with a
user@parodius.com address
may find that some mail is rejected by Internet mail servers using SPF. These individuals
should contact us to configure sending mail for user@parodius.com through our mail servers
instead.
- Individuals bouncing (forwarding without changing headers) mail without changing the
From:
header line to match their own address may find that such mail is rejected by Internet mail
servers using SPF. This is a
known limitation
of SPF (the link content refers to bouncing as "forwarding", and forwarding as "remailing"). Users should
configure their mail client to change the From: line, or do it manually, before bouncing mail.
In the future, we will likely adopt the
SRS
model, which should address this issue.
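For the curious, the rough idea behind SRS is sketched below: the forwarding server rewrites the envelope sender (the SMTP MAIL FROM address) into its own domain, embedding the original address plus a keyed hash so that bounces can be routed back. This is a simplified illustration under stated assumptions, not a wire-compatible implementation: the secret, the fixed "AB" timestamp placeholder and the hex-truncated HMAC are all stand-ins, and real SRS libraries encode the hash and timestamp differently.

```python
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical shared secret known only to the forwarder


def _tag(payload: str, secret: bytes) -> str:
    """Short keyed hash so that forged SRS addresses can be detected."""
    return hmac.new(secret, payload.lower().encode(), hashlib.sha1).hexdigest()[:4]


def srs_forward(envelope_sender, forwarder_domain, timestamp="AB", secret=SECRET):
    """Rewrite user@origin into an SRS0 address at the forwarder's domain."""
    local, _, domain = envelope_sender.rpartition("@")
    payload = f"{timestamp}={domain}={local}"
    return f"SRS0={_tag(payload, secret)}={payload}@{forwarder_domain}"


def srs_reverse(srs_address, secret=SECRET):
    """Recover the original sender from an SRS0 address, verifying the hash."""
    local_part = srs_address.rpartition("@")[0]
    prefix, tag, timestamp, domain, local = local_part.split("=", 4)
    if prefix != "SRS0":
        raise ValueError("not an SRS0 address")
    if not hmac.compare_digest(tag, _tag(f"{timestamp}={domain}={local}", secret)):
        raise ValueError("hash mismatch")
    return f"{local}@{domain}"


if __name__ == "__main__":
    rewritten = srs_forward("alice@example.org", "parodius.com")
    print(rewritten)
    print(srs_reverse(rewritten))
```

Because the rewritten address lives in the forwarder's domain, the next server's SPF check is made against the forwarder's records rather than the original sender's, which is why this approach should address the forwarding limitation.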