Another massive IT problem in the UK

Capt'n Midnight · 27-05-2017 09:00PM

To save some of their bandwidth here is the message on their site.
The foreign language sites also have this message - in English :rolleyes:

http://www.britishairways.com/

Following the major IT system failure experienced earlier today, with regret we have had to cancel all flights leaving from Heathrow and Gatwick for the rest of today, Saturday, May 27.

We are working hard to get our customers who were due to fly today onto the next available flights over the course of the rest of the weekend. Those unable to fly will be offered a full refund.

The system outage has also affected our call centres and our website but we will update customers as soon as we are able to.

Most long-haul flights due to land in London tomorrow (Sunday, May 28) are expected to arrive as normal, and we are working to restore our services from tomorrow, although some delays and disruption may continue into Sunday.

Due to the lengths of the delays lots of people will be entitled to several hundred euro.

theteal · 27-05-2017 09:23PM

mikeybrennan wrote: »

I'm interested but confused about this outsourcing

Who's in charge of IT once the major outage hits? Do UK based contractors take over?

Surely they do?

And what is the problem with the outsourced IT? Did they actually create the issue or can't solve it quickly enough?

I genuinely don't know who's in charge of such issues. I don't work for them. I would assume so but I'm unaware of their ability and skill level. Where there, once upon a time, would have been a team of IT staff of different skills and levels on hand at site on different shift patterns, now there might be one guy there on a Saturday, maybe another on call (not precise numbers, just a hopefully extreme example) - I hope they know what they're doing and work well under pressure!

Outsourced call centres for crap like account lock-outs (I was going to give a few examples but I'm struggling to think of anything else I'd trust them with tbh) are fine, but anything beyond that is just frustrating for users and adds obstacles to quick resolution of issues.

markpb · 27-05-2017 11:01PM

degsie wrote: »

Have they never heard of UPSs?

I love these kinds of glib responses on boards. Do you think it's likely that they don't have UPS on their servers? Or do you think it's more likely to be a much more nuanced and complicated problem? AWS suffered a massive outage years ago despite the fact that they had plenty of battery backup; instead, a power circuit on the inside of the UPS failed.

Deleted User wrote: »

They obviously don't have a proper Disaster Recovery plan either. A complete power outage in a datacentre should mean switching Operations to the backup or DR site and being operational in a matter of hours.

DR tests are like restoring your backup tapes - it's good practice to do (like eating your greens) but it's rarely a real life solution to a problem. DR tests are normally very simple affairs like throwing the switch on the power supply or internet connection but real life problems don't happen like that. Internet connections flap or selectively lose traffic. Power supplies cut out but the fluctuation that happens at the same time fries the power circuits.

IT is complex, especially at the scale that BA operate. Glib statements from the masses ignore that complexity and ignore the good work done by the teams managing it. Pretending it's a simple mistake that the experts in boards.ie would never have made is childish.

degsie · 27-05-2017 11:52PM

markpb wrote: »

I love these kinds of glib responses on boards. Do you think it's likely that they don't have UPS on their servers? Or do you think it's more likely to be a much more nuanced and complicated problem? AWS suffered a massive outage years ago despite the fact that they had plenty of battery backup; instead, a power circuit on the inside of the UPS failed.

DR tests are like restoring your backup tapes - it's good practice to do (like eating your greens) but it's rarely a real life solution to a problem. DR tests are normally very simple affairs like throwing the switch on the power supply or internet connection but real life problems don't happen like that. Internet connections flap or selectively lose traffic. Power supplies cut out but the fluctuation that happens at the same time fries the power circuits.

IT is complex, especially at the scale that BA operate. Glib statements from the masses ignore that complexity and ignore the good work done by the teams managing it. Pretending it's a simple mistake that the experts in boards.ie would never have made is childish.

What would your proposed solution entail?

27-05-2017 11:56PM

degsie wrote: »

Have they never heard of UPSs?

Someone probably decided that it was unnecessary as there hadn't been any power cuts recently!

I remember working at a site in London where the IT manager told a story that back in the 1990s there was a phase of "downsizing" everything and auditors were brought in to determine what was "not required" for the core business.

All was fine until they had a "JCB-induced power interruption". The UPS was rapidly draining and the decision was made to fire up the standby generator, which failed to start. When the IT manager went to investigate, he discovered an empty space where it should have been: as part of the audit, it had been deemed unnecessary and disposed of.

Plan "B" was simply to gracefully shut everything down until power was restored. Business managers went ape!

28-05-2017 12:03AM

degsie wrote: »

What would your proposed solution entail?

These days, they really should look to clustering servers over the WAN in such a way that if an entire building was to go bang, the system continues to operate with minimal loss of performance. Data being constantly sync'd between the separate physical sites on a virtual RAID array.
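
A minimal sketch of that idea, assuming simple synchronous mirroring between two sites. The `Site` and `MirroredStore` classes, the site names and the booking record are all made up for illustration; a real setup would rely on storage-level or database replication rather than application code like this.

```python
class SiteDown(Exception):
    """Raised when a site cannot serve a request."""


class Site:
    """Hypothetical stand-in for one data centre holding a full copy of the data."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.data = {}

    def write(self, key, value):
        if not self.online:
            raise SiteDown(self.name)
        self.data[key] = value

    def read(self, key):
        if not self.online:
            raise SiteDown(self.name)
        return self.data[key]


class MirroredStore:
    """Writes go to every online site; reads fall back to any surviving site."""

    def __init__(self, sites):
        self.sites = sites

    def write(self, key, value):
        written = 0
        for site in self.sites:
            try:
                site.write(key, value)
                written += 1
            except SiteDown:
                continue  # a real system would queue a resync for this site
        if written == 0:
            raise SiteDown("all sites")
        return written  # number of sites that hold the new value

    def read(self, key):
        for site in self.sites:
            try:
                return site.read(key)
            except SiteDown:
                continue
        raise SiteDown("all sites")


site_a, site_b = Site("site-a"), Site("site-b")
store = MirroredStore([site_a, site_b])
store.write("booking-123", "LHR-DUB")   # invented record, stored at both sites
site_a.online = False                    # an entire building goes bang
survivor_value = store.read("booking-123")  # still served, from site-b
```

The sketch leaves out the hard part, which is bringing the failed site back in sync once it returns.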

cml387 · 28-05-2017 12:06AM

You misunderstand the purpose of a UPS. It's not intended to keep the system going, it's only to give time to allow the system to shut down gracefully.

degsie · 28-05-2017 12:09AM

Deleted User wrote: »

These days, they really should look to clustering servers over the WAN in such a way that if an entire building was to go bang, the system continues to operate with minimal loss of performance. Data being constantly sync'd between the separate physical sites on a virtual RAID array.

And what about the cost for such infrastructure? How would this impact the bottom line for shareholders?

28-05-2017 12:11AM

degsie wrote: »

And what about the cost for such infrastructure? How would this impact the bottom line for shareholders?

Do they gamble on the cheap solutions in current use not devaluing their shares? Yes, they do.

So I don't expect any decent technical solutions to get past the boardroom!

28-05-2017 12:14AM

cml387 wrote: »

You misunderstand the purpose of a UPS. It's not intended to keep the system going, it's only to give time to allow the system to shut down gracefully.

Or to get an alternative supply up and running, crank up the Lister!

listermint · 28-05-2017 12:15AM

[Deleted User] wrote: »

Or to get an alternative supply up and running, crank up the Lister!

You called

cml387 · 28-05-2017 12:16AM

Deleted User wrote: »

Or to get an alternative supply up and running, crank up the Lister!

An oldie from my time in DEC.

TallGlass · 28-05-2017 12:16AM

Raging_Ninja wrote: »

Ah, so it's probably exactly like the RBS incident - pay peanuts, get monkeys. The amount of horror stories of IT support outsourced to India is unreal, it's a wonder anyone still does it.

To be honest, the problem isn't the guys in India, I work with guys from India and they are some of the brightest workers I have come across.

The problem, from speaking to one of them, is their working laws, which are completely different, in fact worlds apart, from ours and the UK's.

As an example, a guy in India recently told me he came into work thinking it was going to be a normal shift and ended up working 3 days solid, day and night, barely sleeping, at his desk. If he tried to say no, that was him done and the company would replace him.

That is just asking for a disaster. Overworked staff make mistakes that have massive implications for the system. Computers don't just decide to make changes; someone has to tell them what to do. If that person is 3 days into a shift, something is going to go tits up, sooner rather than later.

IT doesn't change in Ireland, the UK or India, the computer follows the same protocols as it would in another part of the world. The problem is working conditions.

IT outsourcing works, and it works well. Looking for the cheapest person to do it, with a cheap workforce and non-existent working laws, does not.

VinLieger · 28-05-2017 12:23AM

degsie wrote: »

And what about the cost for such infrastructure? How would this impact the bottom line for shareholders?

Very likely less than rescheduling and paying for all the knock-on costs of a single day of all flights being cancelled.

However, as with the NHS updates, the costs are only seen as worthwhile after the fact, when the problems caused by not spending money as a precaution cost more than preventing them would have.

28-05-2017 12:37AM

TallGlass wrote: »

To be honest, the problem isn't the guys in India, I work with guys from India and they are some of the brightest workers I have come across.

The problem, from speaking to one of them, is their working laws, which are completely different, in fact worlds apart, from ours and the UK's.

As an example, a guy in India recently told me he came into work thinking it was going to be a normal shift and ended up working 3 days solid, day and night, barely sleeping, at his desk. If he tried to say no, that was him done and the company would replace him.

That is just asking for a disaster. Overworked staff make mistakes that have massive implications for the system. Computers don't just decide to make changes; someone has to tell them what to do. If that person is 3 days into a shift, something is going to go tits up, sooner rather than later.

IT doesn't change in Ireland, the UK or India, the computer follows the same protocols as it would in another part of the world. The problem is working conditions.

IT outsourcing works, and it works well. Looking for the cheapest person to do it, with a cheap workforce and non-existent working laws, does not.

True, many of the Indian IT staff are the cream of the crop, but in my experience they often end up as specialists in one area and know sweet FA about the other elements of the system. It's not simply a case of thinking of the other parts of the system as "black boxes", but of not even knowing they exist, as in the data gets from the hard disk of one server to another "by magic". They can spend a lot of time looking at the database structure while completely ignoring the fact that source data is missing, and not checking with someone else who deals with that server.

As for the silly hours, I have heard snoring on a conference call before....

Capt'n Midnight · 28-05-2017 01:55AM

Deleted User wrote: »

These days, they really should look to clustering servers over the WAN in such a way that if an entire building was to go bang, the system continues to operate with minimal loss of performance. Data being constantly sync'd between the separate physical sites on a virtual RAID array.

You can even do it at drive level. Fibre optics means mirrored drives can be on a different site in a different town.

cml387 wrote: »

You misunderstand the purpose of a UPS. It's not intended to keep the system going, it's only to give time to allow the system to shut down gracefully.

That's only true for organisations that don't need 24/7. For those that do, UPSs only need to keep the servers up long enough for the backup generators to kick in.
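
A back-of-the-envelope check of that bridging role might look like the sketch below. Every figure in it (battery capacity, load, inverter efficiency, generator start time) is invented for illustration and has nothing to do with BA's actual kit.

```python
def ups_runtime_minutes(battery_wh, load_w, inverter_efficiency=0.9):
    """Rough runtime estimate: usable battery energy divided by the load."""
    if load_w <= 0:
        raise ValueError("load_w must be positive")
    return battery_wh * inverter_efficiency / load_w * 60


def bridges_generator_start(battery_wh, load_w, generator_start_minutes):
    """True if the UPS can carry the load until the generator takes over."""
    return ups_runtime_minutes(battery_wh, load_w) >= generator_start_minutes


# Hypothetical figures: 50 kWh of battery, 200 kW of load, 2 minute generator start
runtime = ups_runtime_minutes(50_000, 200_000)
ok = bridges_generator_start(50_000, 200_000, generator_start_minutes=2)
print(f"Estimated runtime: {runtime:.1f} min, bridges generator start: {ok}")
```

This ignores battery ageing and discharge curves, so treat it as an upper bound at best.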

markpb wrote: »

I love these kinds of glib responses on boards. Do you think it's likely that they don't have UPS on their servers? Or do you think it's more likely to be a much more nuanced and complicated problem?

The simple answer is that they weren't properly prepared.
BA have a turnover of £11Bn a year, so aren't a fly-by-night operation.

They've had plenty of time to sort out legacy systems and there is no way in hell they should have outsourced mission critical stuff until they had resilient systems in place.

Capt'n Midnight · 28-05-2017 02:10AM

cml387 wrote: »

An oldie from my time in DEC.

LOL

different story, servers went down but they stayed up
http://i.imgur.com/Sb2M5Qo.jpg

Yourself isit · 28-05-2017 02:30AM

markpb wrote: »

I love these kinds of glib responses on boards. Do you think it's likely that they don't have UPS on their servers? Or do you think it's more likely to be a much more nuanced and complicated problem? AWS suffered a massive outage years ago despite the fact that they had plenty of battery backup; instead, a power circuit on the inside of the UPS failed.

DR tests are like restoring your backup tapes - it's good practice to do (like eating your greens) but it's rarely a real life solution to a problem. DR tests are normally very simple affairs like throwing the switch on the power supply or internet connection but real life problems don't happen like that. Internet connections flap or selectively lose traffic. Power supplies cut out but the fluctuation that happens at the same time fries the power circuits.

IT is complex, especially at the scale that BA operate. Glib statements from the masses ignore that complexity and ignore the good work done by the teams managing it. Pretending it's a simple mistake that the experts in boards.ie would never have made is childish.

At the scale that BA operate their experts should keep downtime to 1-2 hours a year, and there should be no data loss.
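
For a sense of what that target means as an availability figure, here is a quick bit of arithmetic. It assumes a plain 365-day year, and the function name is just illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability_percent(downtime_hours_per_year):
    """Convert annual downtime into an availability percentage."""
    return 100 * (1 - downtime_hours_per_year / HOURS_PER_YEAR)


for hours in (1, 2):
    print(f"{hours} h/year of downtime = {availability_percent(hours):.3f}% availability")
```

So 1-2 hours a year works out at roughly 99.98% to 99.99% availability.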
