May 28, 2016

Two sites on the same database server - PDOException: SQLSTATE[HY000][1129] Host 'nnn.nnn.nnn.nnn' is blocked because of many connection errors

A web search shows a number of forums and postings about the error message in the title of this post, but our particular problem involved two sites on the same web server and the same database server. One of the sites appeared to repeatedly bring down the other site.

We had had a working live site implemented in Drupal and wanted to add a separate staging site in order to have a more transparent workflow that is less risky.

So for the staging site, a separate Drupal instance was installed and a separate Drupal database was created using the same respective servers as for the live site.

The sites use the CiviCRM module, which has its own civicrm.settings.php file for configuration.

  - - - - -

After staging was created, testing of the staging site was showing very odd errors in the form of inconsistent user profiles and such.

With help from the primary developers, I was able to track down that the civicrm.settings.php file was incorrectly specifying the live database for use by the staging site. Instead, it should have been specifying the staging database.

I corrected the configuration in civicrm.settings.php so that staging was accessing its own database via its own database user.

Here are the original constants defined in civicrm.settings.php

// These ip addresses are not the actual ones.
define( 'CIVICRM_UF_DSN', 'mysql://live_db_user:user_password@
12.210.138.193/live_db?new_link=true' );

define( 'CIVICRM_DSN', 'mysql://live_db_user:user_password@
12.210.138.193/live_db?new_link=true' );

And the revised constants in civicrm.settings.php

// These ip addresses are not the actual ones.
define( 'CIVICRM_UF_DSN', 'mysql://staging_db_user:user_password@12.210.138.193/staging_db?new_link=true' );

define( 'CIVICRM_DSN', 'mysql://staging_db_user:user_password@12.210.138.193/staging_db?new_link=true' );


I then emailed our tester about the changes late at night, without doing any of my own testing. Mea culpa.

That's when things got interesting.

  - - - - -

The next morning, both the staging and the live sites were broken with the same error even though I had not changed any code on live.

PDOException: SQLSTATE[HY000][1129] Host '12.210.138.193' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts' in lock_may_be_available() (line 167 of /var/www/oursite.org/httpdocs/includes/lock.inc). 

Since the error message stated "unblock with 'mysqladmin flush-hosts' ", I logged onto the database server and did flush-hosts on the live database in order to keep the live site running. Live came back up. And I sighed with relief thinking that was enough.

Then I browsed to the staging site again to troubleshoot it. Now it came up with a different error.

PDOException: SQLSTATE[28000] [1045] Access denied for user 'staging_db_user'@'12.210.138.193' (using password: YES) in lock_may_be_available() (line 167 of /var/www/staging.oursite.org/httpdocs/includes/lock.inc).

A bit later, I browsed to the live site again. And it was broken again.

The pattern was: (1) fix the live site by running the flush-hosts command, (2) browse to the staging site, (3) and the live site is broken again.

Since keeping the live site running was a priority, this was driving me crazy!

Those of you who are more familiar with MySQL than I was may already know what had happened. With hindsight, the issue is so obvious that it makes me reflect on my own thinking processes.

  - - - - -

It was my colleague Luis who cracked this. The following explanation is due to him.

The basic problem is that the database user staging_db_user did not have privileges to access the database via the 12.210.138.193 host address (although it did have access through a different ip). So whenever I browsed to the staging site, connection errors would occur.

The number of connection errors would quickly accumulate and exceed the allowed maximum. The relevant variable in MySQL is max_connect_errors, with a default value of something like 1000.

When the maximum number of errors was reached, the MySQL server would then block any further connection requests from that host ip, including for the live site!

This explains why browsing to the staging site would result in access errors for the live site as well. Because of the common ip, the two sites were inadvertently coupled and mutually dependent.

In the end, we decided to use a different host ip address for the staging site to access the database precisely to avoid this cross-site brittleness.

Source:

Troubleshooting Problems Connecting to MySQL
https://dev.mysql.com/doc/refman/5.7/en/problems-connecting.html