System Failure is Not an Option

Author: Stephanie Maddocks
Publish Date: July 31, 2009

In the movie Apollo 13, when flight controller Gene Kranz declares that “Failure is not an option,” he is referring to the challenges of bringing a damaged Apollo spacecraft safely back to Earth. Luckily, most casino information technology departments don’t have to worry about a successful spacecraft splashdown, although the technology they administer could be considered as complex as many NASA systems. And with anything that complex, there are many opportunities for failure in the 24/7 operational environment of the casino world.

System failures are catastrophic for many reasons: the cost of equipment replacement, the cost of data recovery, the cost of lost business, the cost of upset customers, the cost of bad publicity—the list is endless. Unfortunately, systems are not immune to downtime, and it does seem to be true that downtime is more likely to happen on busy days, weekends and holidays. The only upside to a system failure is that it exposes your weaknesses—usually not in a very pretty fashion—and provides an opportunity to learn from past mistakes. The three dynamics that contribute to the vast majority of computer system failures can be categorized as environmental, hardware and the human factor.

While any IT system failure comes as a surprise, those caused by environmental factors tend to be the least foreseen. If the air conditioning goes out, if the fire suppression systems discharge, or if the power shuts off, you’ve experienced an environmental failure. Unless your casino’s locale has predictable weather patterns, you probably expect your environmental systems, like air conditioning and power, to just work. While your guests may not like temperatures creeping higher, your computer systems like it even less. Computers and network equipment are designed to operate efficiently between 50°F and 82°F, although the preferred temperature is below 70°F. If the main air conditioning systems go out, the server rooms and data closets will heat up in minutes. It takes only 10 to 15 minutes for a server room to reach 100°F or higher.
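To make those thresholds concrete, here is a minimal Python sketch of how a monitoring script might classify a server room temperature reading. The function and constant names are hypothetical, the thresholds are simply the figures quoted above, and the time estimate assumes a steady, linear temperature rise, which a real room will only approximate.

```python
from typing import Optional

# Thresholds taken from the figures quoted in the article (degrees Fahrenheit).
OPERATING_MIN_F = 50.0
PREFERRED_MAX_F = 70.0
OPERATING_MAX_F = 82.0


def classify_temperature(temp_f: float) -> str:
    """Return 'ok', 'warning' or 'critical' for a server room reading."""
    if temp_f < OPERATING_MIN_F or temp_f > OPERATING_MAX_F:
        return "critical"   # outside the designed operating range
    if temp_f >= PREFERRED_MAX_F:
        return "warning"    # inside the range, but above the preferred level
    return "ok"


def minutes_to_limit(current_f: float,
                     rise_per_min_f: float,
                     limit_f: float = OPERATING_MAX_F) -> Optional[float]:
    """Estimate minutes until the limit is reached, assuming a linear rise.

    Returns 0.0 if the limit is already reached, and None if the room
    is not heating up.
    """
    if current_f >= limit_f:
        return 0.0
    if rise_per_min_f <= 0:
        return None
    return (limit_f - current_f) / rise_per_min_f


if __name__ == "__main__":
    reading = 75.0
    print(classify_temperature(reading), minutes_to_limit(reading, 2.0))
```

In practice the reading would come from whatever sensor or building management system the property already has, and a "warning" result would page someone well before the room reaches the critical range.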

Power failures add another dimension to environmental system failures. While server rooms should be protected by uninterruptible power supply (UPS) devices to guard against power spikes and failures, these devices are designed to last only for a brief period, just long enough to allow IT personnel to shut down the systems safely rather than have them crash with each electrical spike or outage. This does not resolve the operational issues that casinos face when gaming devices and systems shut down.
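Because a UPS only buys a short window, it can help to check on paper whether an orderly shutdown actually fits inside that window. The Python sketch below illustrates the arithmetic only; the system names, shutdown times, battery runtime and safety margin are all made-up assumptions, and it assumes the systems are shut down one after another.

```python
from typing import Dict


def shutdown_fits(ups_runtime_min: float,
                  shutdown_steps_min: Dict[str, float],
                  safety_margin_min: float = 2.0) -> bool:
    """Return True if a sequential shutdown fits within the UPS runtime.

    shutdown_steps_min maps a (hypothetical) system name to the number of
    minutes it takes to shut that system down cleanly.
    """
    total_needed = sum(shutdown_steps_min.values()) + safety_margin_min
    return total_needed <= ups_runtime_min


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    steps = {"database server": 6.0, "application server": 4.0, "network core": 2.0}
    print(shutdown_fits(15.0, steps))  # 12 minutes of work plus a 2-minute margin
    print(shutdown_fits(10.0, steps))  # not enough battery for the same sequence
```

If the check fails, the options are the familiar ones: more battery capacity, a shorter shutdown sequence, or a generator behind the UPS.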

The second dynamic that creates system failures is computer and network hardware crashes. Best case scenario: a network component fails, you “hot swap” its replacement, and almost no one realizes there was a glitch. Worst case: the malfunction of a server or network component stops systems for days until replacement parts can be installed and servers and databases rebuilt. Monitoring software for computers, servers and network components can help to mitigate the risk of hardware failures, provided that someone is actually monitoring them and responding when components are at risk. Ensuring that backups are performed consistently, that they are actually successful in backing up the data, and that the data can be restored are additional mitigation strategies.
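One simple way to confirm that a backup can really be restored is to restore a file to a scratch location and compare its checksum with the original. The Python sketch below shows the idea using only the standard library; the function names and file paths are hypothetical, and a commercial backup product will usually offer its own verification tools that should be preferred where available.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original: Path, restored: Path) -> bool:
    """Return True only if the restored copy exists and matches the original."""
    return restored.is_file() and sha256_of(original) == sha256_of(restored)


if __name__ == "__main__":
    # Hypothetical paths: a source file and the copy restored from backup.
    source = Path("data/players.db")
    restored_copy = Path("restore_test/players.db")
    if source.is_file():
        print("restore verified:", restore_matches(source, restored_copy))
```

A check like this only proves that one file came back intact at one point in time, so it complements, rather than replaces, a periodic full restore test.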

The third dynamic, and the most uncontrollable, is the human factor. When 18th-century poet Alexander Pope wrote that “A little learning is a dangerous thing,” he was describing what happens when just enough knowledge is gained, and retained, to make a person’s resulting actions potentially troublesome. This is aptly demonstrated when a team member returns from a computer system training session and decides he or she wants to try out new skills on the production servers. With some systems, all it takes is one complex query to bring the database to its knees. And the human factor is not limited to internal casino employees. Many times, outside vendors trigger system failures because of poor planning, lack of knowledge or untested processes. Whenever possible, vendors should test their upgrades, fixes and patches on a test server setup that mirrors the casino’s production environment so that they can accurately predict the outcome of their changes on the system.

I was involved in the installation of a casino management system in the mid-’90s where we experienced all three primary system failure dynamics. First, the system servers were located in a non-ventilated room at the top of a riverboat. There were no windows, no airflow and no air conditioning, just a power strip to plug into. It was July in Louisiana, and “hot and humid” would be a large understatement for the atmosphere in the server room. When it was reported to management that the server room was hot, the first solution was to place a window air conditioner on a chair in the room, with the cold air blowing on the server. We all learned that while cold air came out the front, even hotter air came out the back, increasing the room’s temperature even more. Plan B was to cool the room by placing ice on the roof, so a bucket brigade began bringing plastic paint buckets filled with ice up three flights of stairs to spread on the roof. We soon learned that the room was not waterproof, and we watched as water leaked all over the computer equipment. After the third day, the server finally died a painful death. The environment laid a foundation for disaster, the human decisions failed to resolve the situation, and because of these failures, the hardware was destined to crash.

When Benjamin Franklin said “An ounce of prevention is worth a pound of cure,” he accurately described the value of planning and preparation. Secondary air handling systems, uninterruptible power supplies, routine maintenance of environmental systems, hardware redundancy, and training and more training are all safeguards designed to prevent system failures. Yet failures continue to happen. Because some failure is inevitable, the way the business reacts to it determines the ultimate outcome. Designing a disaster recovery action plan to minimize the impact on casino guests and departmental operations is paramount, and the plan should incorporate planning for failure, training for failure and practicing for failure.

Plan for Failure
A good disaster recovery plan includes action planning for each type of failure—environmental, hardware and human. Planning for failure involves identifying and understanding the root causes of the failure, knowing what is needed to resolve the issue, and expediting the implementation of the solution. With any plan, it’s all about communication between departments to ensure that when a crisis looms, the proper departments and management are notified and called to action to resolve the issue as quickly as possible. Contingency planning must also be considered to ensure that if the initial disaster recovery plan doesn’t work, there’s always a Plan B.

Train for Failure
Training not only each member of the IT team but also all departments on what to do in the event of a system failure will help ensure that the solution can be implemented in an organized and efficient manner, minimizing the impact on casino guests. It is critical to train team members on manual backup processes for each one of your system-dependent job skills. If the food and beverage point-of-sale system crashes, are there manual forms for order taking and guest payment? If the ticketing system fails, does the slot floor staff understand the hand pay process when games lock up? If there is an environmental failure, is the proper equipment available for air handling or power backups? Each department needs to be aware of its system fault points and have plans in place to compensate in the event of a failure.

Practice for Failure
Practicing for failure includes holding drills and periodically testing team members to confirm that they know what to do if there is a failure. For the IT department, rapid response is critical to prevent additional damage to systems and ensure system uptime. Cool air must be flowing and the power must be on, but how many IT departments actually go through environmental failure drills? Does the IT staff know where fans are located, or at a minimum, who to call to have some delivered to the server room? Do they know how to shut down all systems in the event of a power failure? For hardware failures, are there spare parts and team members with the skill sets to replace them? Are there good backups in place that can be restored in the event that a system succumbs to a human factor issue? One very meticulous director of cage operations had a “crash cart” ready just in case the ticketing system went out and was fully prepared to staff additional cashout locations throughout the casino.

What lessons can be learned from system failures? Planning, training and practicing response strategies for when the inevitable system failure occurs will prepare the entire casino operation to take proactive action. Quality guest service is paramount and is the primary goal of any risk mitigation strategy. If failure is not an option, then successful resolution of a system failure is the only option.

 

Stephanie Maddocks is President of Power Strategies, a Las Vegas-based technology consulting company that provides technology selection, planning and implementation, and business operations services. She can be reached at (702) 460-6600 or stephmaddocks[at]gmail.com.
