
Mongo or MySQL

  • 24-10-2017 8:08pm
    #1
    Registered Users Posts: 6,250 ✭✭✭Buford T Justice


    3rd year project in college, and we're thinking of building a forum with threads, posts, thanks, users etc... just like boards but on a simpler level.

    Small enough scale, just to demonstrate principles and so on.
    I'm an RDB guy, and I've always used and thought in this way when it comes to databases. However, considering the purpose of the site, is there any argument for using a NoSQL DB instead, or is an RDB the obvious choice given the relationships between users, threads, posts and thanks?


Comments

  • Registered Users Posts: 6,012 ✭✭✭Talisman


    A relational database is the traditional and recommended solution for creating a forum application. That does not mean you can't create one using NoSQL.

    If you are determined to use MongoDB then look at the NodeBB code on GitHub.


  • Registered Users Posts: 6,250 ✭✭✭Buford T Justice


    No commitments either way tbf... just looking for opinions.


  • Registered Users Posts: 6,236 ✭✭✭Idleater


    From a learning perspective you could do either. There is no reason to specifically choose one over the other, as the data is just going to be stored and retrieved.

    However, from a project perspective, what you leverage from either solution may differ. Where I work, we have both MySQL and MongoDB. MongoDB holds a lot of the stats and some of the content (mostly the larger or freeform/less structured content), whereas MySQL holds most of the user data and modelled objects.

    In your case I doubt you'll have the transactional throughput to stress MongoDB filtering, streams etc. compared with standard SQL queries, but like I said, you could learn a few things along the way.

    Since you are more familiar with RDBMSs, I'd factor in more time for getting the project up and running with a NoSQL-style DB, as well as for understanding its vagaries and tricks.

    There might be less "google the answer" for your particular problem domain for MongoDB, though I wouldn't know for sure on that.


  • Registered Users Posts: 768 ✭✭✭14ned


    Buford T Justice wrote: »
    3rd year project in college, and we're thinking of building a forum with threads, posts, thanks, users etc... just like boards but on a simpler level.

    Small enough scale, just to demonstrate principles and so on.
    I'm an RDB guy, and I've always used and thought in this way when it comes to databases. However, considering the purpose of the site, is there any argument for using a NoSQL DB instead, or is an RDB the obvious choice given the relationships between users, threads, posts and thanks?

    Everything gets a unique id simply by cryptographically hashing it. So if you hash all content to a SHA-256 id, you can implement a complete forum with threads, posts, thanks, users, everything, exclusively using a key-value store.

    Obviously there will be zero transactional integrity and no way of garbage collecting; it would be an addition-only store. But you get very, very fast constant-time inserts and lookups, and automatic sharding over as many nodes as you like.

    You can see why most social media platforms use that exact design. If you don't particularly care about losing other people's data occasionally, it's cheap, scalable, and gets the job done quick.

    You can implement a key-value store yourself very easily: simply convert the 256-bit key into a hexadecimal string and store every value in a file with that name. You're done, and it'll scale linearly with CPU cores, which is as good as it gets.
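    As a rough illustration (file layout and function names invented here, and error handling omitted), such a store might look like this in Python:

        # A content-addressed store: the SHA-256 hex digest of the content is the
        # key, and each value lives in a file named after that key.
        import hashlib
        import os

        STORE_DIR = "kvstore"

        def put(content):
            """Store content under its SHA-256 hex digest and return the key."""
            key = hashlib.sha256(content).hexdigest()
            os.makedirs(STORE_DIR, exist_ok=True)
            path = os.path.join(STORE_DIR, key)
            if not os.path.exists(path):      # addition-only: never overwrite
                with open(path, "wb") as f:
                    f.write(content)
            return key

        def get(key):
            """Look a value up by its key (the hex digest)."""
            with open(os.path.join(STORE_DIR, key), "rb") as f:
                return f.read()

        # Example: store a post, then fetch it back by its hash.
        post_key = put(b"First post in the thread!")
        assert get(post_key) == b"First post in the thread!"

    Whether a value is a post, a user profile or a "thanks" record makes no difference to the store; they're all just hashed blobs.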

    Niall


  • Registered Users Posts: 1,148 ✭✭✭punk_one82


    14ned wrote: »
    You can implement a key-value store yourself very easily: simply convert the 256-bit key into a hexadecimal string and store every value in a file with that name. You're done, and it'll scale linearly with CPU cores, which is as good as it gets.

    No need to reinvent the wheel?

    I'd go with NoSQL. It's very quick and easy to get up and running and is perfect for stuff like this.


  • Moderators, Technology & Internet Moderators Posts: 1,334 Mod ✭✭✭✭croo


    With just a few exceptions, I've also used RDBMS systems for most of my career, though many times I've asked why. Usually the answer is that we already have the licence and/or the expertise. But I ask because an RDBMS can be an incredibly complex & resource-hungry beast... I use mostly Postgresql myself, if I have a choice, but I've worked a lot with Oracle & MS SQL Server over the years and they can gobble up so many resources. And why? I've not researched it enough to say definitively, but I suspect the simple answer is: to provide atomicity. That's very important for most (OLTP-based) business applications that are tracking various business transactions, but is it always necessary? Faced with the OP's question, I think I would ask myself, "is atomicity important for my project?" If not, I would seriously consider a non-RDBMS solution for persistence.


  • Registered Users Posts: 768 ✭✭✭14ned


    croo wrote: »
    And why? I've not researched it enough to say definitively, but I suspect the simple answer is: to provide atomicity.

    It's usually not the atomicity, unless you have a terrible database design where updates frequently collide. But that's down to the user's bad design choices if so.

    Almost all the complexity in all databases is there to work around how unpredictably long it takes to persist updates to non-volatile storage.

    Even top-end SSDs show a 1000x latency difference between the 99.999th percentile write and the best case, and that's with a single thread doing only writes and no other i/o. A hard drive might be 50 times worse again if you are lucky.

    Put this in the context of a website: if a single GET took 50,000x the time of all the other GETs, do you think admins would complain?

    Now bear in mind that which GET takes 50,000x longer is totally random. Sometimes GET /index.html takes 50,000x longer. Most of the time it does not.

    That's what database implementers write tons of complexity to work around. A lot of the NoSQL databases only appear fast because they sacrifice immediate durability for late durability. If you enforce the same durability guarantees, they are often actually slower than traditional SQL databases simply due to less mature implementations.

    A file system implements a good chunk of a NoSQL database; it just doesn't appear to at first glance. Views can be implemented as a directory of hard links to the content-addressed values: for example, a specific forum on a site is simply a directory with hard links to the content posted to that forum, and similarly a user account is a directory with hard links to all content posted by that user, and so on.

    Rendering a forum or user account into HTML is then simply a directory enumeration, followed by sorting on last-modified dates and chopping the result into pages. It's trivially easy to implement, and I'm not entirely sure it's actually much more work than using a NoSQL database, seeing as you'd actually have to install one of those.
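    To make that concrete, a rough Python sketch of the idea (paths, helper names and the page size are purely illustrative):

        # Each forum "view" is a directory of hard links to content-addressed post
        # files; a page is just an mtime-sorted, sliced directory enumeration.
        import os

        def add_post_to_forum(forum_dir, post_path):
            """Link an existing content-addressed post file into a forum view."""
            os.makedirs(forum_dir, exist_ok=True)
            os.link(post_path, os.path.join(forum_dir, os.path.basename(post_path)))

        def render_page(forum_dir, page=0, per_page=20):
            """Return the post files for one page, newest first."""
            entries = sorted(os.scandir(forum_dir),
                             key=lambda e: e.stat().st_mtime,
                             reverse=True)
            start = page * per_page
            return [e.path for e in entries[start:start + per_page]]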

    Niall


  • Moderators, Technology & Internet Moderators Posts: 1,334 Mod ✭✭✭✭croo


    14ned wrote: »
    It's usually not the atomicity, unless you have a terrible database design where updates frequently collide.
    I'd have to disagree with you there Niall, but given some of your other points, perhaps we are thinking of different problems. I specifically mentioned "(OLTP based) business applications" ... my world is that of ERPs. I cannot think of one such application that does not lean heavily on the concept of "transactions" with the database providing much of this.
    14ned wrote: »
    complexity in all databases is there to work around how unpredictably long it takes to persist updates to non-volatile storage
    I do agree with you here. Though I would consider much of this to be down to the fact that it needs to ensure a transaction happens in full or not at all. That is to say, yes, slow storage raises problems, but mostly because of the transactional nature; otherwise we could just queue up the writes and forget about them. So while bad design can definitely result in performance issues, the complexity & resources are consumed ensuring transactional integrity is maintained while persisting to the slow disks [be they HDD or SSD... or tape! :)]
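    To illustrate what I mean by "in full or not at all", here's a minimal sketch using SQLite purely as a stand-in (the tables are invented for the example): either both the thanks row and the counter update are persisted, or neither is.

        import sqlite3

        conn = sqlite3.connect(":memory:")
        conn.executescript("""
            CREATE TABLE posts  (id INTEGER PRIMARY KEY, thanks_count INTEGER DEFAULT 0);
            CREATE TABLE thanks (post_id INTEGER, user_id INTEGER,
                                 UNIQUE (post_id, user_id));
            INSERT INTO posts (id) VALUES (1);
        """)

        def thank_post(post_id, user_id):
            try:
                with conn:  # begins a transaction, commits on success, rolls back on error
                    conn.execute("INSERT INTO thanks VALUES (?, ?)", (post_id, user_id))
                    conn.execute("UPDATE posts SET thanks_count = thanks_count + 1 "
                                 "WHERE id = ?", (post_id,))
            except sqlite3.IntegrityError:
                pass  # duplicate thanks: the transaction rolled back, counter untouched

        thank_post(1, 42)
        thank_post(1, 42)  # second attempt changes nothing
        assert conn.execute("SELECT thanks_count FROM posts").fetchone()[0] == 1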
    14ned wrote: »
    in the context of a website
    And this is why I said before "perhaps we are thinking of different problems". I realise the OP's query related to websites, and that was my point. I didn't explicitly say it because it might be homework, so I wanted him to consider it himself. I doubt that one of the key functionalities provided by an RDBMS (transaction security) is a requirement for a forum website, and therefore he should consider other options. Note also that I don't say NoSQL specifically, but other options!

    Being a long-time Postgresql user (and Ingres many moons ago too), I was always impressed by the results of the work of Michael Stonebraker (their creator)! And I also note he has a new db now... VoltDB, an "In Memory" database! I'm sure there are many others too.

    I have an SME customer who I assume is a typical SME (it's the only one I've worked for, so I'm not certain!). They run a common business app on a Windows server with 32GB of RAM & an array of disks, running MS SQL Server. Performance can be pretty terrible. But their db is only about 8GB in size. I cannot help thinking that a huge proportion of the server's resources are being used to ensure the data is kept consistent ON THE DISKS... while in reality the whole thing could be in memory!

    I've not yet experimented with VoltDB, but I suspect that, in terms of performance, the "in memory" aspect will provide even bigger gains than NoSQL. Which reminds me of another database I heard about: ToroDB. It's interesting because it's 100% compatible with MongoDB, but it runs 10x faster than MongoDB (conservative estimates apparently!?). And this is the really interesting thing... ToroDB does this running on top of the Postgresql db. Which sounds crazy, but after listening to an explanation of how that is possible it made sense. My point being, NoSQL doesn't automatically translate to increased performance... if ToroDB running on Postgresql can run faster than MongoDB, it stands to reason that, with the correct design, Postgresql itself can run faster too!

    Anyway, I think I'm way off the OP's original question!
    If it were me, I would be considering solutions other than an RDBMS.

    Colin


  • Moderators, Society & Culture Moderators Posts: 9,689 Mod ✭✭✭✭stevenmu


    Buford T Justice wrote: »
    3rd year project in college, and we're thinking of building a forum with threads, posts, thanks, users etc... just like boards but on a simpler level.

    Small enough scale, just to demonstrate principles and so on.
    I'm an RDB guy, and I've always used and thought in this way when it comes to databases. However, considering the purpose of the site, is there any argument for using a NoSQL DB instead, or is an RDB the obvious choice given the relationships between users, threads, posts and thanks?

    I think the main question you have to ask yourself is whether the relationships between your data have value in your solution.

    I would suggest that for a forum they do: e.g. you want to associate thanks with users, posts with threads, and threads with forums. When you created your post here, you weren't just creating a post and possibly tagging it as a development post; you came specifically to the Development forum and created a post there. So to my mind at least, a forum is relational, and many or most of your operations will be relational in nature.
    14ned wrote: »
    You can see why most social media platforms use that exact design.
    Most seem to have actually realised that their data is highly relational, and that most of the value is actually in the relationships rather than the content. So the current trend in social media is to go to the opposite end of the scale: graph DBs are where they are heading.


  • Moderators, Society & Culture Moderators Posts: 9,689 Mod ✭✭✭✭stevenmu


    croo wrote: »
    but I've worked a lot with Oracle & MS SQL Server over the years and they can gobble up so many resources. And why?

    Unused resources are wasted resources.

    By default, most RDBMSs will grab any resources they can and use them to increase performance; if you want them to share with other services, you have to explicitly configure that.


  • Registered Users Posts: 768 ✭✭✭14ned


    croo wrote: »
    I'd have to disagree with you there Niall, but given some of your other points, perhaps we are thinking of different problems. I specifically mentioned "(OLTP based) business applications" ... my world is that of ERPs. I cannot think of one such application that does not lean heavily on the concept of "transactions" with the database providing much of this.

    You are absolutely right that atomicity of transactions is the thing which appears to consume all the effort. But in fact if you break down the implementation of transactional atomicity, it always reduces to the problem of durability. Durability is the "indivisible quantum" as it were.

    The reason why is how atomicity of transactions must be implemented if durability is to be maintained. Specifically, you need to enforce a strictly sequenced ordering of writes. So imagine a transaction which updates items A, B and C as an atomic update. One algorithm might be:
    1. Lock A, B and C for shared access
    2. Issue concurrent writes to storage for the new values
    3. Fsync the writes of the values
    4. Lock A, B and C for exclusive access
    5. Set the values for A, B and C to their new values
    6. Fsync the update of the values referred to by A, B and C
    7. Release the exclusive lock on A, B and C

    The above is a very simple algorithm which no commercial database would use, but it shows the problem: 99.9% of the time is spent in fsync, during which locks are held, blocking concurrency.
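    Sketched very roughly in Python (locking elided, file layout invented, and this only shows where the time goes rather than being a complete crash-safe scheme), the ordering above looks something like:

        import os

        def fsync_path(path):
            """fsync a file or directory by path (POSIX)."""
            fd = os.open(path, os.O_RDONLY)
            try:
                os.fsync(fd)
            finally:
                os.close(fd)

        def atomic_update(store, new_values):
            # Steps 2-3: write the new values to side files and fsync each one.
            for key, value in new_values.items():
                with open(os.path.join(store, key + ".new"), "wb") as f:
                    f.write(value)
                    f.flush()
                    os.fsync(f.fileno())
            # Steps 5-6: publish each new value over the old one, then fsync the
            # directory so the renames themselves are durable. Nearly all of the
            # wall-clock time in this function is spent inside the fsync calls.
            for key in new_values:
                os.replace(os.path.join(store, key + ".new"),
                           os.path.join(store, key))
            fsync_path(store)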

    There is a whole field of academia and research papers on different strategies for atomically updating more than one thing, and of course any recent filing system which offers any consistency guarantees has the same problem. But all those algorithms ever do is shuffle the fsyncs around in various ways. You can't get rid of them, and they are always the dominant cost on SSD- and HD-based storage.

    Now, there is a world of change coming soon in the form of SCM storage. With that, your storage appears to the CPU as RAM, and as soon as your uncached write to RAM completes (i.e. has left your CPU caches) your write is durable. To say this completely obsoletes all current NoSQL and SQL database designs is not an overstatement. I have a toy experimental transactional key-value store at https://github.com/ned14/afio/tree/master/programs/key-value-store which reaches 700k inserts/sec/core and 2M lookups/sec/core. That's a 100% pure filing system implementation too; as you'll see from the source code, it's just a few hundred lines and even a first-year compsci student will understand it readily. A less toy implementation could do a lot better again.

    So, big disruptive changes are coming, but it'll be quite a few years before anyone in the database world sees much. I certainly can see a whole load of old established vendors eventually going to the wall, and disruptive startups becoming the new dominant database vendors. It'll be very interesting times I'd imagine.

    Niall


  • Registered Users Posts: 768 ✭✭✭14ned


    stevenmu wrote: »
    So the current trend in social media is to go to the opposite end of the scale: graph DBs are where they are heading.

    No, they present their many databases as if a single graph database.

    Nobody actually stores any significant amount of data in a graph database, for a long list of very good reasons, but mostly because they are an arse to maintain and to keep performing reasonably as you scale out.

    One of the great strengths of an RDBMS is precisely that you are straitjacketed into a rigid design and structure. That lets the database use lots of performance tricks which take advantage of that rigidity as the content grows. In contrast, native graph DBs need to be constantly and manually rebalanced as they scale out. That's fine for static-sized datasets, but not for growing ones.

    Niall


  • Moderators, Technology & Internet Moderators Posts: 1,334 Mod ✭✭✭✭croo


    14ned wrote: »
    there is a world of change coming soon in the form of SCM storage. With that, your storage appears to the CPU as RAM
    It's a few years old now, but there's an interview with Michael Stonebraker, the creator of Ingres, Postgres and, more recently, the VoltDB I mentioned, and I found his explanation of the problems and his views on where we might be heading compelling.

    SCM is new to me... I must read up. Thanks for the pointer.

    ps. I've been reading some interesting articles about using blockchains to implement distributed ledgers, which might eventually impact how business apps treat transactions.


  • Registered Users Posts: 768 ✭✭✭14ned


    croo wrote: »
    It's a few years old now, but there's an interview with Michael Stonebraker, the creator of Ingres, Postgres and, more recently, the VoltDB I mentioned, and I found his explanation of the problems and his views on where we might be heading compelling.

    It was compelling once. But SCM completely upends things. For example, VoltDB, which is no slouch and a "NewSQL" database, might max out at 2700 tps/cpu (source: https://www.voltdb.com/blog/2017/10/02/comparing-fast-data-performance-a-comparison-of-voltdb-and-cassandra-benchmarks/)

    That toy transactional key-value store I posted achieves 700,000 tps/CPU, with linear scaling up to at least 32 CPUs. SCM is a game changer.
    croo wrote: »
    SCM is new to me... I must read up. Thanks for the pointer.

    The hardware support for it in Intel CPUs only landed this year; this is very, very new tech. I was given an early peek by Intel, and it has some really amazing latencies: still a lot slower than RAM, but far faster than a PCIe bus.

    From the point of view of how to use it in your code today, it's easy: simply format your drive with DAX enabled on supporting hardware, and then exclusively use memory maps for i/o. Works lovely.
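    In Python terms, a hedged sketch of what that looks like (the mount point is an assumption, and a plain mmap flush stands in for the cache-line writeback you would use on real DAX hardware):

        import mmap
        import os

        PATH = "/mnt/pmem/kvstore.bin"   # assumed DAX-mounted filesystem
        SIZE = 1 << 20                   # 1 MiB region

        fd = os.open(PATH, os.O_RDWR | os.O_CREAT, 0o644)
        os.ftruncate(fd, SIZE)
        with mmap.mmap(fd, SIZE) as m:
            m[0:5] = b"hello"            # ordinary memory writes, no write() syscalls
            m.flush(0, mmap.PAGESIZE)    # make the touched range durable
        os.close(fd)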
    croo wrote: »
    ps. I've been reading some interesting articles about using blockchains to implement distributed ledgers, which might eventually impact how business apps treat transactions.

    Yeah, the Bank of England is running an experimental blockchain with the major UK banks for overnight cash deposits. Commit latency is obviously awful, but it has unparalleled scalability; as in, every mobile phone on the planet could be a node. Storage and databases, I think, will be a good bet for long-term employability at fine day rates. Due to a lack of any other income this year, I've had to take a low-end onsite contract in Dublin fixing bugs all day, every day, for twelve months on a low, low day rate. I'm very glad of the income, but boy is it going to be a long twelve months.

    Niall


  • Registered Users Posts: 6,012 ✭✭✭Talisman


    croo wrote: »
    Which reminds me of another database I heard about: ToroDB. It's interesting because it's 100% compatible with MongoDB, but it runs 10x faster than MongoDB (conservative estimates apparently!?). And this is the really interesting thing... ToroDB does this running on top of the Postgresql db. Which sounds crazy, but after listening to an explanation of how that is possible it made sense. My point being, NoSQL doesn't automatically translate to increased performance... if ToroDB running on Postgresql can run faster than MongoDB, it stands to reason that, with the correct design, Postgresql itself can run faster too!
    In 2015 it was revealed that the MongoDB 3.2 BI Connector was actually the Postgresql database, so they're not averse to leveraging the power of Postgresql either.


  • Registered Users Posts: 6,012 ✭✭✭Talisman


    14ned wrote: »
    The hardware support for it in Intel CPUs only landed this year; this is very, very new tech. I was given an early peek by Intel, and it has some really amazing latencies: still a lot slower than RAM, but far faster than a PCIe bus.
    That's the power that Apache Kudu is leveraging based upon the pmem library.


  • Registered Users Posts: 768 ✭✭✭14ned


    Talisman wrote: »
    That's the power that Apache Kudu is leveraging based upon the pmem library.

    Thanks for the link. Looks like they offer the ability to store the block cache in NVRAM, which would include SCM. ZFS also lets you store its ARC and more importantly its write log in NVRAM. As to which approach is better, I look forward to the day I get to benchmark this stuff.

    Niall


  • Registered Users Posts: 7,304 ✭✭✭jmcc


    14ned wrote: »
    A file system implements a good chunk of a NoSQL database; it just doesn't appear to at first glance. Views can be implemented as a directory of hard links to the content-addressed values: for example, a specific forum on a site is simply a directory with hard links to the content posted to that forum, and similarly a user account is a directory with hard links to all content posted by that user, and so on.
    For small operations like a toy forum, this may be OK. But when it gets into directories with hundreds of thousands or millions of files, the choice of file system can be important from a scalability viewpoint. The workarounds that you mentioned will often be targeted at reducing the number of full directory reads where possible. It might be a bit esoteric for a college project, but the choice of file system might be something to consider.

    Regards...jmcc


  • Registered Users Posts: 768 ✭✭✭14ned


    jmcc wrote: »
    For small operations like a toy forum, this may be OK. But when it gets into directories with hundreds of thousands or millions of files, the choice of file system can be important from a scalability viewpoint. The workarounds that you mentioned will often be targeted at reducing the number of full directory reads where possible. It might be a bit esoteric for a college project, but the choice of file system might be something to consider.

    Oh for sure. It takes about 0.3 secs to enumerate a million item directory on ext4 or NTFS, so you would likely organise the storage as forum/topic/hash-of-content and user/hash-of-content. The likelihood of any one forum topic exceeding a few hundred thousand posts is probably low for most forums, so render latency should always remain below 0.1 secs.

    More problematic in practice will be block wastage. Some filing systems, e.g. ZFS, do an amazing job of packing small files with almost no wastage. Others, like ext4, have an optimisation for very small files but otherwise always round up to the block size (typically 4KB). This is why git, which stores each item in its own hash-identified file, packs older items into packfiles, which are just one large file of items plus an index.
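    The packing idea itself is simple enough to sketch (the format and names here are invented, nothing like git's actual packfile layout):

        import json
        import os

        def pack(store_dir, pack_path, index_path):
            """Concatenate every small value file into one pack file plus a JSON index."""
            index = {}
            with open(pack_path, "wb") as pack_file:
                for entry in sorted(os.scandir(store_dir), key=lambda e: e.name):
                    with open(entry.path, "rb") as f:
                        data = f.read()
                    index[entry.name] = [pack_file.tell(), len(data)]
                    pack_file.write(data)
            with open(index_path, "w") as f:
                json.dump(index, f)

        def read_packed(pack_path, index_path, key):
            """Fetch one value back out of the pack via its (offset, length) entry."""
            with open(index_path) as f:
                offset, length = json.load(f)[key]
            with open(pack_path, "rb") as f:
                f.seek(offset)
                return f.read(length)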

    So yes, I definitely concur that the choice of filing system matters a lot if you're going to scale out. For his college project, though, any filing system which permits hard links will be plenty.

    Niall

