Saving Apache from unpredictable certain death
A few months ago we had a serious problem at the weekend: many of our production servers went down at 4am on Sunday.
We investigated, and found out that I'd made a Puppet configuration change that installed mod_wsgi on all of our servers that also had Munin installed.
Unfortunately, there is a previously unknown bug in mod_wsgi: if the module is loaded into a running server with a reload command, the server starts crashing on every request. Apparently this is a common problem with Apache modules.
It didn't start crashing immediately for us, because the RPM package that we used to install mod_wsgi doesn't actually restart the server after installing it. I think that's another bug. In some ways we were lucky that it doesn't, because otherwise many of our servers would have gone down in the middle of a working day. But at 4am on a Sunday morning, it took us a while to make the connection to a Puppet change that had been made during the previous week.
When something like this happens, I like to make at least two changes to our systems that would prevent it from happening again, or catch it and notify us immediately if it did. In this case, I opened the bug report above. But four months later it remains open, and so does our internal ticket. So I decided to take more immediate defensive action to protect us.
First I added the missing Puppet rules to restart Apache when mod_wsgi is installed for the first time:
    class python26 {
      case $operatingsystem {
        CentOS: {
          case $operatingsystemrelease {
            /^6\./: {
              package { python26_pkgs:
                name    => ['python', 'python-devel', 'python-tools', 'python-setuptools',
                            'mod_wsgi', 'python-virtualenv', 'python-pip',],
                ensure  => installed,
                require => Class['aptivate-repo'],
                # https://projects.aptivate.org/issues/3897
                notify  => Service['httpd'],
              }
            }
          }
        }
      }
    }
However, this only protects us from one way of triggering the bug. If mod_wsgi is removed for any reason, or any other faulty module is added, then Apache will fail in the same way. So I wrote a Puppet recipe that checks whether the list of modules loaded in Apache has changed since the last Puppet run and, if so, restarts Apache gracefully:
    class apache {
      # Check the list of modules loaded in Apache, and if it has changed
      # since the last run, then restart Apache.
      exec { restart_apache_if_modules_change:
        require   => Service[$apache],
        unless    => '/usr/sbin/httpd -M 2> /var/run/httpd.modules.new && diff -u /var/run/httpd.modules.{prev,new}',
        command   => 'diff -u /var/run/httpd.modules.prev /var/run/httpd.modules.new; /etc/init.d/httpd graceful; cp /var/run/httpd.modules.{new,prev}',
        logoutput => true,
      }
    }
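For anyone who finds the one-line shell commands in that recipe hard to follow, here is a rough Python sketch of the same compare-and-restart logic. It is only an illustration, not what we run in production. The function names (`sync_module_list`, `list_apache_modules`, `graceful_restart`) are made up for this example, and unlike the recipe, which saves only stderr, it captures both stdout and stderr from `httpd -M`. Note what happens on the first run: there is no saved list yet, so the check counts as "changed", Apache is restarted once, and the baseline file is created.

```python
import subprocess
from pathlib import Path


def sync_module_list(current_listing, prev_path, restart):
    """Call restart() if the module listing differs from the saved one.

    Returns True if a restart was triggered, False otherwise.
    A missing saved list (first run) counts as a change.
    """
    previous = prev_path.read_text() if prev_path.exists() else None
    if previous == current_listing:
        return False  # nothing changed, leave Apache alone
    restart()
    # Remember the new list so the next run has something to compare with.
    prev_path.write_text(current_listing)
    return True


def list_apache_modules():
    """Return the output of `httpd -M` (path taken from the recipe)."""
    result = subprocess.run(
        ["/usr/sbin/httpd", "-M"], capture_output=True, text=True
    )
    # Assumption: capture both streams, since which one carries the
    # module list can differ between Apache versions.
    return result.stdout + result.stderr


def graceful_restart():
    """Gracefully restart Apache; like the ';' in the recipe, the exit status is ignored."""
    subprocess.run(["/etc/init.d/httpd", "graceful"])
```

On a real server it would be called roughly as `sync_module_list(list_apache_modules(), Path("/var/run/httpd.modules.prev"), graceful_restart)`. Passing the restart action in as a function means the comparison logic can be tried out without touching Apache.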
Now I'm pretty confident that this bug won't bite us again on any servers that we control with Puppet.