It was the summer of the system crash for Canada’s major financial institutions. Despite infrastructures backed by standby processing systems and run on fault-tolerant networks, parts of the debit and credit processing systems at three of Canada’s largest chartered banks still failed.
The Royal Bank led off the summer of disasters when an operational error made during a routine code change on May 31 stalled direct deposit payroll. The effects were felt across the country. The Province of New Brunswick said more than 10,000 government employees, including Premier Bernard Lord, weren’t paid on June 3. The glitch led the national news for a week. Royal said on June 2 that two days of transactions would have to be processed manually. Some 25 to 29 per cent of the bank’s 10 million clients were affected by the flop, says David Moorcroft, senior vice-president in charge of corporate communications for RBC Financial Group. Clients of other institutions that depend on RBC transactions also felt the ripple effect.
There was more to come. On July 29, one day before the last business day of the month, CIBC admitted that a system error was double-charging 60,000 personal line-of-credit accounts. Then, on Aug. 24, the Toronto Dominion Bank said some customer transactions made face to face with tellers in 500 branches in Ontario and British Columbia and by telephone weren’t reflected in their balances.
A SINGLE KEYSTROKE
The processing crashes affected a huge number of people. Cheques bounced. ATM cards couldn’t access cash. Each problem had a unique origin, but all showed the vulnerability of processing systems to human-induced shocks.
Problems at the Royal Bank began with a program change. The error, says Moorcroft, was in a single keystroke. “It was a small piece of code in a table in the main transaction base. The code was tested, but the compounding event was that it was not detected in the test environment into which it was put. The environment simulated the code, but the error was not picked up in the right number of environments,” Moorcroft added.
The error was discovered at 2 a.m. on June 1, deep into the data run for overnight transfers. The recovery was delayed until the bank was sure that the error did not endanger other systems or cheque processing alliances the Royal shares with other banks. Then everything had to be backed out and reprocessed, he said.
“Once the error had been found, since backup systems had also been coded in parallel with the main system, Royal found it had to back out of all transactions run the night of May 31 then rerun them with the corrected code before the next processing window opened the next morning,” Moorcroft says. “Once that new window opened, the automated balancing system could not be used without manual intervention to verify the sequencing.”
The rerun wasn’t finished before the next day’s window opened. The system had to sequence two days’ transactions at once. Automated debit and credit systems had to be manually overridden.
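The back-out-and-rerun sequence Moorcroft describes can be pictured with a short sketch. This is an illustration only, not the bank's actual process: the `Txn` record, the function names, the amounts and the dates are all hypothetical, and a real settlement system would involve far more than adjusting balances in a dictionary.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Txn:
    """Hypothetical transaction record, for illustration only."""
    account: str
    amount_cents: int    # positive = credit, negative = debit
    business_date: date
    seq: int             # position within that day's batch


def apply_batch(balances, txns):
    """Post transactions ordered by business date, then by sequence number."""
    for t in sorted(txns, key=lambda t: (t.business_date, t.seq)):
        balances[t.account] = balances.get(t.account, 0) + t.amount_cents
    return balances


def back_out(balances, txns):
    """Reverse a previously posted batch by applying the opposite amounts."""
    for t in txns:
        balances[t.account] = balances.get(t.account, 0) - t.amount_cents
    return balances


def rerun_with_backlog(balances, faulty_posted, corrected, next_day):
    """Undo the faulty run, then post the corrected batch together with
    the next day's batch so that both days are sequenced in one pass."""
    back_out(balances, faulty_posted)
    return apply_batch(balances, corrected + next_day)


# Illustrative dates and amounts (assumptions, not figures from the article).
day1, day2 = date(2000, 1, 3), date(2000, 1, 4)
balances = {"A": 1000}
faulty = [Txn("A", 700, day1, 1)]      # what the faulty run posted
apply_batch(balances, faulty)
corrected = [Txn("A", 500, day1, 1)]   # what should have been posted
next_day = [Txn("A", -200, day2, 1)]   # arrives before the rerun is done
rerun_with_backlog(balances, faulty, corrected, next_day)
```

The point of the sketch is the ordering problem: once a second day's batch arrives before the rerun is complete, both days have to be sequenced together, which is the step the bank says required manual verification.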
The bank played catch-up through the weekend of June 4 and 5, giving priority to payroll transactions. By June 6, the bank was back on its customary flow of transactions. Client balances and other institutions’ accounts were now correct, though the dating of the adjusted transactions was not necessarily right. The bank continued to make adjustments for weeks afterward.
The direct cost of the breakdown was $9 million, says Moorcroft. “There is another cost, lost revenue, but that we cannot quantify. We had staff dealing with the problem and, occupied with that, they could not sell other products.” He might have added that some customers may have been sufficiently irritated or inconvenienced that they took their business elsewhere.
Each bank resolved its technical difficulties within hours or, in the Royal’s case, a few days of the breakdowns. The PR damage, though, isn’t as quick a fix. Here’s what enterprises can learn from the Summer of the Glitch:
1. Timing is everything. Programming changes should be timed for periods of lowest impact and greatest recovery time, says Craig Ballance, president of Toronto-based consultancy e-Finity Group Inc. Saturday and Sunday nights are good times for changes. The end of the month is the worst of all possible times.
2. Keep users in the loop. IT does not serve itself. It serves other departments and, ultimately, the customer. If a department is going to be affected by a process change, let the managers in on the process and give them sign-off, Ballance says.
3. Document everything. Royal Bank IT managers were able to take correct steps because they knew how their problem arose. “You can’t back out of a programming error unless you know how you got there,” says Drew Parker, associate professor of information systems at Simon Fraser University’s Faculty of Business Administration.
4. No finger-pointing. Use a no-blame failure review process in which the goal is to improve things. “This is a cultural activity,” says Yogi Schulz, president of Calgary-based Corvelle Management Systems. When you play the blame game, “things get worse, not better.”
5. Reduce complexity. “In distributed systems, the number of layers and components rises, increasing complexity,” Schulz says. “If you can limit the number of layers architecturally, you can improve reliability.”
6. Manage to maximize system reliability. “Marketing departments want to differentiate the products of one institution from another,” Schulz noted. “But if the variations impact on system robustness and availability, they may not be well-advised. Too often, IT people are beaten down when they bring this up.”
7. Take the time to test system changes adequately. There is no substitute for rigorous impact analysis. We’ll accept a few dropped calls on a cell phone. Airplanes are held to more rigorous standards. “Use the appropriate standard of system availability,” Schulz says.
8. When building fault-tolerant systems, understand the full costs of failure. Companies such as banks can set a price on failure, but they may not be able to price in the full costs of massive, highly publicized breakdowns. Customers expect fast, precise service. Fail to deliver, and they may take a very long time to forget it.
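The timing advice in the first lesson (weekends are good, month-end is worst) lends itself to a simple scheduling check. The sketch below is one possible reading of that advice, not a rule stated by anyone quoted here: the function name and the three-day month-end buffer are assumptions chosen for illustration.

```python
import calendar
from datetime import date


def is_low_risk_change_date(d: date, month_end_buffer_days: int = 3) -> bool:
    """Return True for a Saturday or Sunday that is not close to month end.

    month_end_buffer_days is an assumed policy value: a date counts as
    "near month end" when fewer than this many days remain after it.
    """
    is_weekend = d.weekday() >= 5  # 5 = Saturday, 6 = Sunday
    last_day = calendar.monthrange(d.year, d.month)[1]  # days in the month
    near_month_end = (last_day - d.day) < month_end_buffer_days
    return is_weekend and not near_month_end
```

For example, a Saturday early in the month passes the check, while a Saturday that falls on the second-last day of the month does not. A real change calendar would also need to account for statutory holidays, payroll dates and each institution's own processing windows.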