
Service outage 11/10 2:30pm CST

Around 2:30pm CST today, a Firehost engineer was performing routine maintenance on one of our new management nodes. The node failed and was unable to restart or recover.

As of 3:00pm CST, they are still working with VMware (the software vendor) to sort out this issue.

All our nodes are HA (high availability), which should allow them to fail over gracefully and come back up on another machine without issue. They are also backed up every evening. Redundancy is built into the system, but in this edge case it is obviously not working. We are weighing the pros and cons of restoring from a backup versus continuing to attempt a repair of this node.

Unfortunately, the cause of this failure and its remedy are beyond our personal control. Thank you for your patience.

Update 4:00pm CST

We brought a new node online from a backup made earlier this morning. Full service has been restored. The entire VM is still locked, and Firehost is still investigating the cause and remedy. The VM is wedged, which prevented any of our automated failovers from working.

Update: Post Mortem (what happened)

Essentially, a mount operation failed and wedged the entire virtual machine. To make sure the other VPS on that VM were protected from problems, we had to take it slow in repairing the issue. Again, we are very sorry for the issue.

From Firehost:

…the VM was flagged for reconfiguring. The host and VM got into a bad state as a result of the missing volume containing the restore disk for ___. We were unable to recover control of the VM itself and were forced to contact VMWare for support. They assisted us in stabilizing the host and producing a workable clone of your system so that we could get your services back online. We realize that the clone operation took longer than desired. But given the potential issues surrounding bad state on hosts, we wanted to be certain we weren’t going to cause cascading failures.

Going forward, we are planning for all file-based restores to occur on our own internal VMs. We are also instituting stricter procedures and policy for those performing restores.

We sincerely apologize for this outage.

Comments
  1. This is the second time in at least two weeks something like this has happened. I’m concerned because I have recommended your service to a lot of clients and many use it. I’m sure it is frustrating to you as well. I know wpengine.com also uses firehost and it seems that they have a lot of problems because of that too. Is there some other upstream provider who is more reliable? I am pretty insistent with clients that they use WordPress-specific hosting, but firehost makes me look bad! Thanks…