I hear this all the time. Most people move out of a datacenter because something bad happened, and it’s usually a major power failure that causes the most trouble. In this article, I am going to outline and analyze a power failure event that occurred at an unnamed facility. This is a true story.
About 2 years ago, I fielded a call from someone who had lost power at their current datacenter provider. In addition to being down, they also had some equipment failures (power supplies and some RAM went bad in a few systems). Their provider told them that nothing was wrong with the UPS and that the problem was a utility issue caused by a brownout. As soon as I heard this, I told the person that this explanation was completely bogus.
Let’s recap the cardinal rules of a good UPS:
1. An online UPS setup should always provide clean line power regardless of supply.
2. If an online UPS fails, an auto-sync transformer bridges line power and utility within 1 Hz, so no power is lost; only backup capability is lost.
And let’s recap what you need to do to make sure the above rules always apply:
1. Check your batteries every 3 months.
2. Replace a battery as soon as its internal resistance rises by 10%.
3. Replace a battery as soon as it’s 4 years old, even if its internal resistance is still within spec.
4. Provide suitable cooling to the UPS.
5. CHECK THE BATTERIES.
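The replacement rules above are simple enough to automate against each quarterly reading. Here is a minimal Python sketch of that check. The `BatteryReading` fields, the milliohm units, and the idea of measuring the 10% rise against a baseline recorded at install time are my assumptions for illustration, not something taken from a vendor tool.

```python
from dataclasses import dataclass

# Thresholds come from the rules above. Measuring the rise against an
# install-time baseline is an assumption made for this sketch.
MAX_RESISTANCE_RISE = 0.10  # replace at a 10% rise in internal resistance
MAX_AGE_YEARS = 4.0         # replace at 4 years old, whatever the readings say


@dataclass
class BatteryReading:
    """One battery's numbers from a quarterly check (hypothetical layout)."""
    battery_id: str
    age_years: float
    baseline_resistance_mohm: float  # assumed: recorded when installed
    current_resistance_mohm: float   # measured during the latest check


def replacement_reasons(reading):
    """Return the rules this battery trips; an empty list means keep it."""
    reasons = []
    rise = (
        reading.current_resistance_mohm - reading.baseline_resistance_mohm
    ) / reading.baseline_resistance_mohm
    if rise >= MAX_RESISTANCE_RISE:
        reasons.append("resistance")
    if reading.age_years >= MAX_AGE_YEARS:
        reasons.append("age")
    return reasons
```

Run it over every battery in the cabinet after each check, and treat any non-empty result as a replacement order rather than a warning.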
I can’t stress enough how important batteries are. The entire UPS is built around the concept of having working batteries. Almost every line-affecting outage of a UPS is due to a battery problem. At Quonix, we use Liebert Series 300 UPS systems that have had inverter boards fail, induction coils burn out, and input filters short out, and we NEVER lost output line power. That’s why the Lieberts cost so much: they are designed to handle failures, but that requires good batteries.
Getting back to the story about the brownout. Any UPS that experiences a brownout, or any kind of dirty power, would immediately engage its batteries to provide clean power while it activates the GENSET cut-over. This requires the UPS to run on batteries for 5-7 seconds. If the batteries can’t hold, the UPS drops offline into bypass mode and auto-syncs to utility line power. Once a UPS goes into bypass and syncs to utility power, it no longer provides power protection or line conditioning, so all the dirty power goes straight through. If power was lost, GENSET power now comes straight through. And when utility power returns, the GENSET cuts out, causing another small blip. This is why the server power supplies and RAM went bad: the dirty, and possibly surging, power came right through the UPS into the rack cabinet.
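To make the failure sequence concrete, here is a small Python sketch of the decision the UPS effectively makes during a cut-over. It is an illustration only, not how any real UPS firmware works. The function name and mode labels are hypothetical, and the 7 second default is simply the upper end of the 5-7 second window mentioned above.

```python
def ups_mode_during_cutover(battery_hold_seconds, cutover_seconds=7.0):
    """Return the mode the UPS ends up in while the GENSET takes over.

    'on_battery': the batteries carried the load for the whole cut-over,
                  so the racks only ever saw clean power.
    'bypass':     the batteries gave out first, so raw utility or GENSET
                  power passed straight through to the racks.
    """
    if battery_hold_seconds >= cutover_seconds:
        return "on_battery"
    return "bypass"


# A healthy string rated for several minutes rides through easily.
print(ups_mode_during_cutover(600))  # on_battery

# A fouled string that collapses after 2 seconds drops the UPS into bypass.
print(ups_mode_during_cutover(2))    # bypass
```

The point of the sketch is how small the margin is: the batteries only need to hold for a few seconds, but if they can’t, everything downstream is exposed.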
Many providers don’t properly maintain their batteries. They just assume the batteries will last 4-5 years. Not the case. I’ve seen brand new battery cabinets have 1 battery go bad after as little as 1 year. Sometimes it’s just a random manufacturer defect. And in many cases, all it takes is 1 bad battery to foul the entire array.
Want to be sure your provider is on top of things? Easy: just ask for a copy of their UPS and battery preventative maintenance contract. If they have one, and they should, it should be easy for them to fax or email you a copy. You can even request a battery report. At Quonix, the vendor we use for our battery maintenance sends us a detailed graphical report with the health of each battery: voltage, impedance, internal resistance, temperature, and age.