
So what happened this time?


Comments

  • Registered Users Posts: 17,797 ✭✭✭✭hatrickpatrick


    Jaysus lads, I thought I'd been sitebanned without explanation.
    Given that I do things on a daily basis which would definitely justify such a ban, this incident unnerved me considerably. :D


  • Registered Users Posts: 3,745 ✭✭✭laugh


    Do you just have one big unsharded schema?

    How many read DBs do you guys use?


  • Closed Accounts Posts: 8,840 ✭✭✭Dav


    Tonight's kudos go to Chris, Colm, Conor and Alvis.

    We're flipping all the switches back to where they were this morning and we'll continue to monitor it for the next couple of hours.

    Our servers sit in Digiweb in Blanchardstown, there is no way I could have poured anything on them and I don't drink coffee :p


  • Posts: 0 [Deleted User]


    Seriously bad luck to have both disks in a RAID 1 fail at the same time. :(

    Well done for getting everything up and running again.


  • Moderators, Education Moderators Posts: 21,730 Mod ✭✭✭✭entropi


    GJ guys! Managed to work hard at it again to return us back to relative normality. Kudos :)


  • Moderators, Entertainment Moderators, Social & Fun Moderators Posts: 14,009 Mod ✭✭✭✭wnolan1992


    Well done to the tech team again. Certainly earning their paycheques this week. :pac:


    FYI, the Talk To... fora have reverted to the old style instead of the new swanky style.


  • Registered Users Posts: 17,399 ✭✭✭✭r3nu4l


    Dav wrote: »
    ...and I don't drink coffee :p
    Yeah, I'm gonna have to ask you to hand back your nerd badge. Sorry it had to come to this :(

    :pac:


    Fair play to one and all involved. Thanks for the hard work and effort :)


  • Registered Users Posts: 51,054 ✭✭✭✭Professey Chin


    Good work guys :)
    Horrible luck with the disks but nice to be back!


  • Registered Users Posts: 33,257 ✭✭✭✭Princess Consuela Bananahammock


    Dav wrote: »
    You go all year with no outages...

    What, 14 days?

    Everything I don't like is either woke or fascist - possibly both - pick one.



  • Moderators, Technology & Internet Moderators Posts: 4,621 Mod ✭✭✭✭Mr. G


    In fairness, it's unexpected and very rare for both disks to fail. Fair play for getting it all back up and running.


  • Moderators, Technology & Internet Moderators Posts: 4,621 Mod ✭✭✭✭Mr. G


    entropi wrote: »
    GJ guys! Managed to work hard at it again to return us back to relative normality. Kudos :)

    Here's for another Sheldon pic :D

    [image: sheldon-cooper-7.jpg]


  • Moderators, Motoring & Transport Moderators Posts: 6,521 Mod ✭✭✭✭Irish Steve


    Dav wrote: »
    You go all year with no outages and then 2 come along within 10 days.

    So what happened today?

    First of all, it wasn't me! :D
    Ya reckon?:D:D

    It was all those 40 million messages being put into one bucket last week.....

    polished all the oxide off the surface of the discs, it was only a matter of time.....:P

    Seriously, that was not good news, though it makes the MTBF concept of new discs a little challenging; they're not supposed to fail quite that close together. Any danger that it was power supply related, rather than mechanical? That could take several drives out at the same time.

    Whatever, well done to get it back that quickly.

    Shore, if it was easy, everybody would be doin it.😁



  • Registered Users Posts: 9,414 ✭✭✭irishgeo


    Are boards not using SSD drives?


  • Subscribers Posts: 4,075 ✭✭✭IRLConor


    laugh wrote: »
    Do you just have one big unsharded schema?

    I don't know about now, but as of 2 years ago sharding the boards.ie database would have been hilariously difficult to do. Pretty much every page served joined against the post table which accounts for the majority of the data. It's quite tricky to identify an axis along which the post table could be efficiently sharded without either rewriting large swaths of the code or creating maintenance nightmares.

    Ross and I learned a lot about sharding the data when Ross was building the search system and that was a much simpler schema, with no joins and no legacy code to convert.
    laugh wrote: »
    How many read DBs do you guys use?

    Two years ago it was one master and two slaves.
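    The sharding difficulty IRLConor describes can be sketched in a few lines. This is a hypothetical illustration, not boards.ie code (names and shard count are made up): shard the post table by thread id and a thread's posts stay together, but any query on another axis, such as a user's post history, has to fan out to every shard.

    ```python
    # Hypothetical sketch: routing posts to shards by thread id.
    NUM_SHARDS = 4

    def shard_for_thread(thread_id: int) -> int:
        """Posts in one thread land on one shard, so a thread page hits one DB."""
        return thread_id % NUM_SHARDS

    def shards_for_user_history(user_id: int) -> list[int]:
        """A user's posts are scattered across threads, so this query
        must fan out to every shard and merge the results."""
        return list(range(NUM_SHARDS))
    ```

    Every join against the post table on a non-thread axis pays the same fan-out cost, which is why avoiding it would mean rewriting large swaths of code.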


  • Registered Users Posts: 20,830 ✭✭✭✭Taltos


    Hi guys.

    When someone gets a chance can you please re-open the "Separation & Divorce" forum? Currently marked as closed.

    Cheers.


  • Registered Users Posts: 4,759 ✭✭✭cython


    Karsini wrote: »
    Seriously bad luck to have both disks in a RAID 1 fail at the same time. :(

    Well done for getting everything up and running again.

    Definitely. I presume that the possibility of a controller issue resulting in an earlier failure somehow not being reported has been ruled out? I've seen a lot stranger happen with RAID 1, to be fair, such as one of the disks being weeks out of date and suddenly being switched over to as the read source. It resulted in the (temporary) apparent loss of all data entered in the meantime until it could be identified that the disks had been out of sync, and the up to date one was still working, just not in use.


  • Registered Users Posts: 1,012 ✭✭✭route66


    Dav wrote: »
    One of the database slaves had a major failure with its hard disks. Before anyone asks, yes they were in RAID (1 to be exact), but both disks failed. It's rare that your redundancy fails at the same time as the main device, but not unheard of.

    For both disks to fail at the same time would be - I guess - a "winning the lotto" type chance.

    More common would be a failed shared component - a backplane, a disk controller, a cable, etc. If this is the case, then the failure may come back. :eek:

    Another common scenario with RAID 1 is that one disk (or bank of disks) fails, goes unnoticed/unreported, then the other disk fails - BANG!

    Must go now and check my Lotto numbers ;)


  • Boards.ie Employee Posts: 12,597 ✭✭✭✭✭Boards.ie: Niamh
    Boards.ie Community Manager


    Taltos wrote: »
    Hi guys.

    When someone gets a chance can you please re-open the "Separation & Divorce" forum? Currently marked as closed.

    Cheers.
    Alvis has re-opened that now :)


  • Registered Users Posts: 18,524 ✭✭✭✭kippy


    Mr. G wrote: »
    In fairness, it's unexpected and very rare for both disks to fail. Fair play for getting it all back up and running.

    They usually don't, all right.
    What tends to happen is one disk fails; there is then more pressure on the remaining disk, and it fails too within a short enough period of time. So it's critical to know as soon as possible that one disk has failed, in order to replace it before things get more awkward!
    Been caught like that myself in the past on a RAID 5.

    Well done on sorting it.
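    kippy's point, that losing the array usually means one failure followed quickly by a second rather than a truly simultaneous pair, can be put in rough numbers. The rates below are made up purely for illustration, and the independence assumption is generous: real disks from the same batch, same age and same workload fail in correlated ways, which makes the true odds worse.

    ```python
    # Back-of-the-envelope odds of RAID 1 array loss, with made-up numbers.
    annual_failure_rate = 0.03   # assumed per-disk chance of failing in a year
    rebuild_days = 3             # assumed window to replace a failed disk

    p_first_fails = annual_failure_rate
    # Chance the surviving disk also fails inside the replacement window,
    # assuming (unrealistically) independent failures.
    p_second_in_window = annual_failure_rate * (rebuild_days / 365)
    p_array_loss = p_first_fails * p_second_in_window
    print(f"~{p_array_loss:.2e} per year under independence")
    ```

    The punchline is that the small number above depends entirely on spotting the first failure quickly; an unnoticed dead mirror turns the window from days into months.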


  • Registered Users Posts: 1,012 ✭✭✭route66


    kippy wrote: »
    They usually don't, all right.
    What tends to happen is one disk fails; there is then more pressure on the remaining disk, and it fails too within a short enough period of time. So it's critical to know as soon as possible that one disk has failed, in order to replace it before things get more awkward!
    Been caught like that myself in the past on a RAID 5.

    Well done on sorting it.

    With RAID 1, if a disk or bank of disks fail, the remaining healthy one(s) just continue to do their normal work; the extra copy of data just doesn't get written anywhere.

    The exception is read activity on a RAID 1 setup: many make use of both sides of the setup to reduce read time. If there is a failure, then this extra efficiency is no longer available but I would expect this to just increase read time rather than causing the remaining healthy one(s) to die!

    RAID 5 is completely different with data and parity data being written across all disks in the array.
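    The RAID 1 behaviour route66 describes can be shown with a toy mirror. A minimal sketch, not a real RAID driver: writes go to every healthy disk, reads are served from any healthy one, so losing a disk costs redundancy and read parallelism but not data.

    ```python
    # Toy RAID 1 mirror: each "disk" is a dict of block -> data.
    class Raid1:
        def __init__(self, n_disks: int = 2):
            self.disks = [dict() for _ in range(n_disks)]
            self.healthy = [True] * n_disks

        def write(self, block: int, data: bytes) -> None:
            # The same data is mirrored to every healthy disk.
            for disk, ok in zip(self.disks, self.healthy):
                if ok:
                    disk[block] = data

        def read(self, block: int) -> bytes:
            # Any healthy disk can serve the read.
            for disk, ok in zip(self.disks, self.healthy):
                if ok and block in disk:
                    return disk[block]
            raise OSError("all mirrors failed")  # both disks gone: data loss

        def fail(self, i: int) -> None:
            self.healthy[i] = False
    ```

    Note the nastier failure mode lurking here: if a mirror silently drops out and is later read from without resyncing, it serves stale data, which is exactly the out-of-sync scenario cython describes earlier in the thread.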


  • Registered Users Posts: 22,646 ✭✭✭✭Sauve


    NNNEEERRRRRDDDSSSS :D

    (Sorry :p)


  • Registered Users Posts: 6,771 ✭✭✭knucklehead6


    Sauve wrote: »
    NNNEEERRRRRDDDSSSS :D

    (Sorry :p)


    says the mod of 5 different forums.....

    Pot.. kettle..... :p


  • Moderators, Category Moderators, Arts Moderators, Business & Finance Moderators, Entertainment Moderators, Society & Culture Moderators Posts: 18,279 CMod ✭✭✭✭Nody


    route66 wrote: »
    For both disks to fail at the same time would be - I guess - a "winning the lotto" type chance.
    If you want to talk lotto numbers, try this one on for size (yes it happened, yes I was impacted by it, as were several hospitals etc., and I got the official and unofficial reports from the event).

    A country site is due to go through its yearly emergency power test. The site is wired with 3 pairs of batteries (one pair is enough to run it for 30 min) and two diesel generators (one enough to power the whole site). Batteries are always on but only kick in if power gets broken; generators are set to be running within 1 min of power being cut.

    Power is cut as planned at 1am local and wham, all three pairs of batteries fail AND both generators refuse to start. Every single server and fibre connection goes down, inc. every MUX at customer sites etc. losing sync.

    Oh happy happy days (it took over 8h to get everything up and running once the main power was turned on again)...


  • Registered Users Posts: 1,012 ✭✭✭route66


    Nody wrote: »
    If you want to talk lotto numbers, try this one on for size (yes it happened, yes I was impacted by it, as were several hospitals etc., and I got the official and unofficial reports from the event).

    A country site is due to go through its yearly emergency power test. The site is wired with 3 pairs of batteries (one pair is enough to run it for 30 min) and two diesel generators (one enough to power the whole site). Batteries are always on but only kick in if power gets broken; generators are set to be running within 1 min of power being cut.

    Power is cut as planned at 1am local and wham, all three pairs of batteries fail AND both generators refuse to start. Every single server and fibre connection goes down, inc. every MUX at customer sites etc. losing sync.

    Oh happy happy days (it took over 8h to get everything up and running once the main power was turned on again)...

    Ooops ...


  • Registered Users Posts: 10,758 ✭✭✭✭TeddyTedson


    Why don't you guys just delete all the threads older than 5 years. They're Zombie threads and you'd save space on your hard drive :)


  • Moderators, Social & Fun Moderators, Society & Culture Moderators Posts: 30,873 Mod ✭✭✭✭Insect Overlord


    TeddyTedson wrote: »
    Why don't you guys just delete all the threads older than 5 years. They're Zombie threads and you'd save space on your hard drive :)

    Don't be silly. The social history, sense of community and plain old hilarity of some of the old content are what make this site so great. :)


  • Closed Accounts Posts: 31,967 ✭✭✭✭Sarky


    Dav wrote: »
    Our servers sit in Digiweb in Blanchardstown, there is no way I could have poured anything on them and I don't drink coffee :p

    Well obviously if you drank coffee you'd have none to pour over the servers. J'ACCUSE!


  • Registered Users Posts: 44 damned_junkie


    IRLConor wrote: »
    Ross and I learned a lot about sharding the data when Ross was building the search system and that was a much simpler schema, with no joins and no legacy code to convert.

    The sharded search set-up was ultimately ditched in favour of a single index with replication onto a second machine. Turns out the overhead from sharding is way higher than the gain from smaller indexes. AFAIK the second machine is a cold standby these days. The relatively low query load means you get better cache performance with a single node handling all the queries than with two nodes handling half each.

    But yeah sharding the post table... I put a lot of noodle scratching into that one too, there's no clear way to do it easily. Chucking RAM and faster disks at it will probably keep it working for another few years though!


  • Closed Accounts Posts: 8,840 ✭✭✭Dav


    TeddyTedson wrote: »
    Why don't you guys just delete all the threads older than 5 years. They're Zombie threads and you'd save space on your hard drive :)

    It's nothing to do with space, actually. The totality of the posts table is about 25GB, PMs take about 10GB (I think) and attachments are running around 15GB.

    So it's not huge volumes of data by any stretch of the imagination, but since the posts and PMs tables are plain text, that is a vast amount of information to be processed at any given time.

    For those of you who don't know, 1 character of plain ASCII text = 1 byte.
    1 Kilobyte = 1024 characters.
    1 Gigabyte = approx 1 billion characters

    So the boards post table contains about 25 billion characters of text.

    Imagine now that you have to try and work with all that in some meaningful way and you'll understand why our databases are so difficult to work with.
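    Dav's arithmetic checks out; as a quick sanity check of the figures above (1 byte per plain ASCII character, 25GB posts table):

    ```python
    # Sanity-checking the characters-per-table arithmetic from the post above.
    BYTES_PER_CHAR = 1      # plain ASCII: one character per byte
    KB = 1024               # 1 kilobyte = 1024 bytes
    GB = 1024 ** 3          # 1 gigabyte = 1,073,741,824 bytes

    posts_table_bytes = 25 * GB
    chars = posts_table_bytes // BYTES_PER_CHAR
    print(f"{chars:,} characters")  # ~26.8 billion, i.e. "about 25 billion"
    ```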

    Besides which, deleting the history of the site seems abhorrent to me. The notion of the thoughts, ideas, discussions, nonsense etc. of the many thousands of people who've used the site over the years just being gone fills me with a sense of dread and "wrongness" that I can't put into words.


  • Closed Accounts Posts: 16,396 ✭✭✭✭kaimera


    Still on PHP & vB, Dav?

    Do you have stats on how often old or 'archived' data is accessed? (What's considered old by the team?)

    By users and/or unregs? (Is it Google searches bringing views to threads, or current users searching boards?)

    Can anything over 3/5 years (example) be shunted off to a separate disk as 'archive' and given RO perms? (Save zombie threads being dug up, for eg.)

