Advertisement
If you have a new account but are having problems posting or verifying your account, please email us on hello@boards.ie for help. Thanks :)
Hello all! Please ensure that you are posting a new thread or question in the appropriate forum. The Feedback forum is overwhelmed with questions that are having to be moved elsewhere. If you need help to verify your account contact hello@boards.ie

Database discussion for 2015.

Options
  • 01-12-2015 8:07pm
    #1
    Registered Users Posts: 2,228 ✭✭✭


    Hey All,

    I am building a new platform and will shortly require a database backend. Right now i'm rocking various CSV files which works, but obviously need to move to a DB if I want to scale in any way.

    I have experience with MySQL and SQL Server. As this is a personal project I would like to try out a new system which is popular in the industry. I've been reading up on NoSQL but it appears it's for a slightly different type of system.

    I know asking for recommendations like this on a forum isn't always the best approach, I guess what i'm really looking for is a seed to my research.

    Secondly, i'm considering gathering LOTS of data and mining it etc, I know there are different types of backends for this, i'm considering Cassandra for this as it seems to be the talk of the day and Apache wares generally are quite good, any opinions on this?

    Cheers folk.


Comments

  • Registered Users Posts: 6,041 ✭✭✭Talisman


    Have you considered PostgreSQL?
    - It's different from MySQL in that it is an Object RDBMS. You can nest objects within other objects.
    - It has support for concurrency without read locks which is good for scalability.
    - It's extensible, installing the PL/v8 plugin gives you the V8 JavaScript Engine (Chrome, Node.js) in the database.
    - Combining the PL/v8 plugin and the JSON data type would allow you to use the database like MongoDB. There's still a bit of ironing out to do but the 9.4 release has been a major leap forward.
    - It's the database of choice for the likes of IMDB and TripAdvisor.


  • Registered Users Posts: 586 ✭✭✭Aswerty


    Secondly, i'm considering gathering LOTS of data and mining it etc

    I might just mention that it'd be good to know what LOTS means. A lot of people think they need some fancy large data tool when half the time the bog standard relational databases will do the job.

    I'd also +1 PostgreSQL. I'm a big fan of SQL Server (though not it's licensing) as well - though I see you want a change of scenery.


  • Registered Users Posts: 2,030 ✭✭✭colm_c


    If you haven't done it before, it might be worth looking at something like MongoDB.

    It's not for all situations or data, but does have some really nice features.

    Totally different concept to traditional relational databases.

    It can be used in conjunction with a relational database so you get the best of both worlds.


  • Registered Users Posts: 9,370 ✭✭✭Phoebas


    When choosing between different kinds of database, you need to take your data requirements into account.
    For example, if your data is highly relational in nature then maybe a nosql/document database isn't for you. That's not to say that nosql databases can't handle relational data - they can, but you might run into problems with stuff like transaction support.

    Also consider that the likes of MongoDB doesn't yet have the mature tooling that you'd take for granted with the likes of SQL Server. You'll probably be spending some time on a command line.
    My experience of Mongo in particular is that its great with LOTS of data, really quick to get data in and out and a really powerful aggregations engine, but its a different mind set to traditional RDBMSs.

    Also, LOTs is a relative term. They can all handle lots of data.


    Edit: If part of your requirement is to learn a new database technology, take into account that MongoDB in Dublin have regular MeetUps in their offices in Dublin where they'll give you free beer and access to their technical support people. They also do free online training with certification.


  • Moderators, Society & Culture Moderators Posts: 9,674 Mod ✭✭✭✭Manach


    My own recommendation: Nothing wrong at all with RDMS like Oracle or MySql but if OP wants a built-in means to handle very large data sets as well as provide good visualisation options perhaps the Elasticsearch stack: with Logstash and Kibana?


  • Advertisement
  • Registered Users Posts: 199 ✭✭grouchyman


    +1 for postgres as well. Depending on the amount of data you want to store would you use Oracle XE. You could also look at firebird.


  • Registered Users Posts: 2,228 ✭✭✭techguy


    Hey guys, thanks a million for the great responses.

    So far nobody has started a fight over the superior DB :D
    grouchyman wrote: »
    +1 for postgres as well. Depending on the amount of data you want to store would you use Oracle XE. You could also look at firebird.

    I do read positive notes about Postgres all the time so I think that is in the running i'll look into it more but probably give it a lash. Looks like there's a driver for Go, written in pure Go to might I add.
    Manach wrote: »
    My own recommendation: Nothing wrong at all with RDMS like Oracle or MySql but if OP wants a built-in means to handle very large data sets as well as provide good visualisation options perhaps the Elasticsearch stack: with Logstash and Kibana?

    Well I will want to handle a lot of data but I do have a regular requirement for a non Big Data DB too, so a regular RDBMS might work for this.
    Phoebas wrote: »
    When choosing between different kinds of database, you need to take your data requirements into account.
    For example, if your data is highly relational in nature then maybe a nosql/document database isn't for you. That's not to say that nosql databases can't handle relational data - they can, but you might run into problems with stuff like transaction support.

    Also consider that the likes of MongoDB doesn't yet have the mature tooling that you'd take for granted with the likes of SQL Server. You'll probably be spending some time on a command line.
    My experience of Mongo in particular is that its great with LOTS of data, really quick to get data in and out and a really powerful aggregations engine, but its a different mind set to traditional RDBMSs.

    Also, LOTs is a relative term. They can all handle lots of data.


    Edit: If part of your requirement is to learn a new database technology, take into account that MongoDB in Dublin have regular MeetUps in their offices in Dublin where they'll give you free beer and access to their technical support people. They also do free online training with certification.

    Well, I think I definitely need a relational DB from a business perspective of my software. That said, I might tinker with some NoSQL for storing my log output and see what it's like to play with. I also create a report every time one of my components runs and that could be cool in NoQSL as it's not retainable data, more of a retrospective view of my programs run.

    Also, i'm actually in London at the moment so the meetup won't work but sounds awesome, i'll see if they are in London too..
    colm_c wrote: »
    If you haven't done it before, it might be worth looking at something like MongoDB.

    It's not for all situations or data, but does have some really nice features.

    Totally different concept to traditional relational databases.

    It can be used in conjunction with a relational database so you get the best of both worlds.

    See above, i'll probably give it a lash.
    Talisman wrote: »
    Have you considered PostgreSQL?
    - It's different from MySQL in that it is an Object RDBMS. You can nest objects within other objects.
    - It has support for concurrency without read locks which is good for scalability.
    - It's extensible, installing the PL/v8 plugin gives you the V8 JavaScript Engine (Chrome, Node.js) in the database.
    - Combining the PL/v8 plugin and the JSON data type would allow you to use the database like MongoDB. There's still a bit of ironing out to do but the 9.4 release has been a major leap forward.
    - It's the database of choice for the likes of IMDB and TripAdvisor.
    Aswerty wrote: »
    I might just mention that it'd be good to know what LOTS means. A lot of people think they need some fancy large data tool when half the time the bog standard relational databases will do the job.

    I'd also +1 PostgreSQL. I'm a big fan of SQL Server (though not it's licensing) as well - though I see you want a change of scenery.

    Well now, Postgres does seem pretty awesome to be fair, I feel like I have already made my decision on this one.

    With regard to LOTS of data, I honestly don't know. I think I mean more in the sense that those type tools might be better oriented for mining. Basically I will be scraping lot's of pricing data from all over the web, very frequently. I want to collect data for about 1-2 years so I think the data might get quite big.

    I will then want to mine it etc and be able to create various reports for market use etc.

    Thanks guys


  • Registered Users Posts: 23,212 ✭✭✭✭Tom Dunne


    techguy wrote: »
    So far nobody has started a fight over the superior DB

    That's because it's an unwinnable fight.;)


  • Registered Users Posts: 2,781 ✭✭✭amen


    Tom Dunne wrote: »
    That's because it's an unwinnable fight.;)


    we could try to win or at least have a draw though....


  • Registered Users Posts: 23,212 ✭✭✭✭Tom Dunne


    amen wrote: »
    we could try to win or at least have a draw though....

    Do I need to play the CODASYL card? Do I? Do I really? :pac:


  • Advertisement
  • Registered Users Posts: 2,781 ✭✭✭amen


    No need


  • Registered Users Posts: 18,272 ✭✭✭✭Atomic Pineapple


    Phoebas wrote: »
    Edit: If part of your requirement is to learn a new database technology, take into account that MongoDB in Dublin have regular MeetUps in their offices in Dublin where they'll give you free beer and access to their technical support people. They also do free online training with certification.

    Hi Phoebas - would love to learn more about this, do you have links or contacts that could help out?


  • Moderators, Computer Games Moderators, Technology & Internet Moderators Posts: 19,240 Mod ✭✭✭✭L.Jenkins


    Just installed MongoDB and Postgre on my Linux box, so I might give them a try, before I try a few others.


  • Registered Users Posts: 1,275 ✭✭✭bpmurray


    You mentioned that you planned on capturing "LOTS" of data. If you really mean lots, i.e. billions or even trillions of transactions per day, your DB needs to be able to handle that. If it's a RDBMS it will have to be Oracle, DB2 or Postgresql - the latter is free, so that's a definite advantage. Assuming you go with one of these, you'll find that the NoSQL DBs won't actually be any faster - they're probably easier to use for certain applications, but no faster.

    You should also consider other aspects of the data mining, apart from the DB, which will probably bring you into contact with search technologies like Solr and Lucene and even machine learning like Spark. Basically, you can't isolate the data side from the mining applications.


  • Registered Users Posts: 2,781 ✭✭✭amen


    bpmurray wrote: »
    You mentioned that you planned on capturing "LOTS" of data. If you really mean lots, i.e. billions or even trillions of transactions per day, your DB needs to be able to handle that. If it's a RDBMS it will have to be Oracle, DB2 or Postgresql - the latter is free, so that's a definite advantage. Assuming you go with one of these, you'll find that the NoSQL DBs won't actually be any faster - they're probably easier to use for certain applications, but no faster.

    Why do you think you couldn't you use SQL Server ?

    Anyway if you are really capturing billions of transactions per day you better have decent hardware multiple redundancies etc


  • Registered Users Posts: 1,275 ✭✭✭bpmurray


    amen wrote: »
    Why do you think you couldn't you use SQL Server ?

    Anyway if you are really capturing billions of transactions per day you better have decent hardware multiple redundancies etc

    Note that I'm talking about extremes here, not normal workloads. And of course you need redundancy and failover - isn't that normal for high-availability? SQL Server doesn't scale well in these situations - only those 3 RDBMS's can handle such throughputs.


  • Closed Accounts Posts: 22,651 ✭✭✭✭beauf


    techguy wrote: »
    ....With regard to LOTS of data, I honestly don't know.....I want to collect data for about 1-2 years so I think the data might get quite big....

    I don't have much experience in this area. But any DB questions seem to always require much more detailed estimates of the data and numbers involved.

    Out of curiosity what sort of hardware (and budget) would be needed for this.


  • Registered Users Posts: 4,357 ✭✭✭robbiezero


    Influxdb or Cassandra might be worth a look if its time series data.


  • Registered Users Posts: 2,228 ✭✭✭techguy


    robbiezero wrote: »
    Influxdb or Cassandra might be worth a look if its time series data.

    It's time series data in the sense that it will be related to how an items price and sales volume over time.

    My intention is that I can record various fields of information and then try and extract meaningful information from it.

    I know nothing about data science and I though this could be an interesting way of getting into it.

    Is Cassandra something that could help?


  • Registered Users Posts: 9,370 ✭✭✭Phoebas


    Hi Phoebas - would love to learn more about this, do you have links or contacts that could help out?

    The meetups in the Dublin office are arranged via meetup.com
    http://www.meetup.com/DublinMUG/

    The online training is on MongoDB university. (There is a charge for the exam)
    https://university.mongodb.com/


  • Advertisement
  • Registered Users Posts: 119 ✭✭Psychosis


    MongoDb is great if you are getting data from 50 different sources and with different document structures, e.g. JSON or XML results back from web API's.

    Its fast to learn and fast to use!


  • Registered Users Posts: 4,357 ✭✭✭robbiezero


    techguy wrote: »
    It's time series data in the sense that it will be related to how an items price and sales volume over time.

    My intention is that I can record various fields of information and then try and extract meaningful information from it.

    I know nothing about data science and I though this could be an interesting way of getting into it.

    Is Cassandra something that could help?

    Probably not in that case. Time series databases are more suited to measurements at periodic intervals i.e. saving the CPU Load every 5 minutes.


  • Registered Users Posts: 2,228 ✭✭✭techguy


    robbiezero wrote: »
    Probably not in that case. Time series databases are more suited to measurements at periodic intervals i.e. saving the CPU Load every 5 minutes.

    Ok, what would be better then something like Postqres or just throw the data into MongoDB.

    while there may be a few things that related it will be generally flat and there won't be any real business oriented tasks required.

    It's basically a good use case to get started on Mongo right?


  • Registered Users Posts: 6,041 ✭✭✭Talisman


    techguy wrote: »
    Ok, what would be better then something like Postqres or just throw the data into MongoDB.

    while there may be a few things that related it will be generally flat and there won't be any real business oriented tasks required.

    It's basically a good use case to get started on Mongo right?
    If you plan on storing the data long term then using MongoDB would not be advisable. It will lose your data.

    For a rational comparison of the two have a read of Postgres vs Mongo.

    If you just want to throw data somewhere to store it until a later date then my suggestion would be to create a bucket on S3 and fill it with JSON files. When the time comes to process the data, pull the files out and stick them into which ever platform best meets your use case.


  • Closed Accounts Posts: 2,046 ✭✭✭democrates


    Besides the RDBMS choice is the choice of model, 3nF is not the only show in town. Haven't tried it but anchor modelling certainly looks interesting. Since you'll be harvesting over 2 years the schema would face changes in the datasets parsed from any given site. Anchor modelling handles this by giving each attribute its own table, while changes to the modelling tool apparently produce sql to upgrade your db. Going by the forum it's being used with Postgres.
    Could play with the online modelling tool to see how it fits.
    src: http://www.anchormodeling.com/

    PS. the storage engine is also an important decision, for example I ran into alarming drive threshing with InnoDB on an ext4 volume :pac:


  • Registered Users Posts: 2,228 ✭✭✭techguy


    democrates wrote: »
    Besides the RDBMS choice is the choice of model, 3nF is not the only show in town. Haven't tried it but anchor modelling certainly looks interesting. Since you'll be harvesting over 2 years the schema would face changes in the datasets parsed from any given site. Anchor modelling handles this by giving each attribute its own table, while changes to the modelling tool apparently produce sql to upgrade your db. Going by the forum it's being used with Postgres.
    Could play with the online modelling tool to see how it fits.
    src: http://www.anchormodeling.com/

    PS. the storage engine is also an important decision, for example I ran into alarming drive threshing with InnoDB on an ext4 volume :pac:

    Some going pointers there thanks, but to be honest, I think i'll give the anchor modelling a miss. I've had a look and it seems cool, but there would be a lot to learn and it's got an infant feel to it and don't want to run into general problems later on. Since this project will actually be going into production I don't want any experimental products.

    Question about Mongo though. When you guys say you will lose data. Do you mean older data that has been there for ages or just random data anywhere in the sense that when you instruct it to write data it won't do it (deferred writes) then the data just disappears?

    If the former that should be ok, but if the latter that's not really good. I mean ever for logs thats crap.

    I would like to track steps in the fulfilment of an ecommerce transaction. For example, over a 1-2 week period, the order goes from ordered and through various other states in between before being delivered.

    Would this suit?


Advertisement