Database Design

Stainless_Steel · 22-05-2013 11:35am #1

Hi guys,

Have a few queries relating to SQL database design which I would like help with;

1. Tables vs Databases

Let's say I am trying to make things fast. So that a front-end application gets results from the SQL (MS SQL) as quick as possible. The dataset will be inherently large (for example, 10k records per day).

I want to split the data up into either seperate tables or databases. For example, 90% of queries only require the data for yesterday, so I could have a seperate table/DB with just that data.

Any recommendations? (Let me know if this is too vague)

2. String queries vs stored procedures

Should my front-end application pass a query string or use an SP? What are the benefits/drawbacks?

3. Transaction logs in MS SQL

If I have a disaster recovery process whereby I can load lost data with a few queries (data is stored on a seperate server). So what is the benefit of transaction logs?

The reason I ask is because they have caused me hardship in terms of disk space. Of course this is likely because I am not good at setting up / working with MS SQL.

I would appreciate some general guidance on how to handle the logs.

Hope my queries make sense, as you can see I am very much an amateur!

SS

Colonel Panic · 22-05-2013 11:49am

Is this for an actual application? Is the data actually relational or are you just dumping loads of records some data store?

Making multiple databases to me defeats the purpose. Your issue seems to be one of those known problems that is solved with things like indexing, table partitioning and database clustering.

Regarding stored procs versus queries. I always panic a bit when people post that you should always use stored procs. It really depends, there's a risk of putting too much business logic in the database.

Good questions though, it pays to learn the right way to do things rather than just bang it out and repeat the same mistakes forever.

SalteeDog · 22-05-2013 12:24pm

Stick with a single database - I cannot see how multiple databases would help.

Regarding stored procs or application queries - have you thought about defining views for your more complex queries and keeping your application sql as simple as possible.

Cannot help you too much with tx logs - depending on activity levels they can be a disk hogger - I would think you need to have a strategy to write them to external storage.

Unless you have been given ample disk space for your database you are going to be in trouble.

seamus · 22-05-2013 12:29pm

Sounds a bit like an assignment question

1. 10k new records per day isn't that large. That's only 3.5m records per year. General idea with most database servers is to avoid rewriting data unless you need to. For example, in order to clear out yesterday-1's data, you still need to put it somewhere because you might still need it. This involves taking it out of the "Live" database and putting it into another database just in case you need to access it. That's big overhead for very little return. For performance improving tables where data sets are unevenly accessed, what you want is table partitioning. Basically what this does is allow you to move data onto different files while the data all still appears to reside in the same table. Data can be partitioned on basically any criteria.
So for example, you can set up 3 partitions - one for data less than a week old, one for data between a week and six months old, and one for all data over six months old. The data up to a week old can sit on an SSD (it's a small dataset), the stuff up to six months old could sit on a standard drive in the server, and the rest of the stuff (which is nearly never accessed) could sit on a 10Gb network link.
When you go to retrieve data from the table, you don't care how old it is, MSSQL will handle the retrieval. Likewise MSSQL can handle the dynamic partitioning - you don't need to go in every day and move data yourself.
Having separate tables/databases for different days is just a complete mess.

2. Long story short, SPs are faster and less likely to be compromised by SQL injection. From the front-end developer's point of view though they're less flexible and harder to debug. SP -v- query is a long running debate. SPs major drawback is that you're inexorably tied to the database platform and possibly even the version of the database software. If you try to move platform, you have to rewrite your SPs (or your frontend code) from the ground up, and software upgrades can kill your application if syntax or native functionality changes. I would tend to say that if you're a programmer who needs a database, avoid stored procedures. If you're a database developer providing functionality to a front-end programmer, use SPs.

3. Transaction logs provide point-in-time recovery of a database. So while a database backup allows you to restore the database to the time you backed it up, the transaction log allows you to "replay" all of the transactions which took place since the last full backup. Meaning that you can restore almost all of your data. Your current process allows you to copy data from a backup database, but presumably this is, "Restore data for row X in table Y". What would you do if your entire live database fell over and the data was unrecoverable? How would you use your secondary database to restore your live database to the condition it was just before it died?
If point-in-time recovery isn't that important, or you're confident that your secondary database could be used to recover the live one completely, then you can turn off transaction logging by changing the database recovery mode to SIMPLE*. This means that if you lose the database, you can only recover data back to the last full backup.
If you decide to stick with transaction logging, there are two ways to keep the size down:
1. Perform more frequent full database backups. A full DB backup empties the transaction log (because it's only needed to restore data from after the last full backup). This however is not ideal from a performance POV.
2. Perform frequent transaction log backups. You can backup the transaction log on its own, moving it to a backup disk or whatever. Transaction log backups don't interfere with performance because you don't need to access any database resources while you're doing it.

1 is quicker than 2 for performing full database restores.

*Unless you are using transaction log shipping for your backup database

Beano · 22-05-2013 1:07pm

seamus wrote: »

Sounds a bit like an assignment question

... SPs major drawback is that you're inexorably tied to the database platform and possibly even the version of the database software. If you try to move platform, you have to rewrite your SPs (or your frontend code) from the ground up, and software upgrades can kill your application if syntax or native functionality changes. I would tend to say that if you're a programmer who needs a database, avoid stored procedures. If you're a database developer providing functionality to a front-end programmer, use SPs.

I'm not quite sure what you are getting at here. If you have all your sql inline in the front end code then you are also tied to the database platform. similarly any version change that requires a rewrite will affect inline sql as much as SPs. Having it all in SPs also means that moving platforms is actually easier. All your database logic is in one place rather than scattered around inside client side code.

Think of a situation where an update to one table requires updates of associated records. would that do that using inline sql and handle the transactions in client side code or would you just write a sp that does this for in the language best suited to writing proper transactions?

Another point worth noting is that if your sql code is in SPs then hotfixing any database related problems is incredibly easy. No code rebuilds required.

There is nothing to be gained by using inline sql. It is the work of the devil and its practitioners should have their developer licenses removed.

Colonel Panic · 22-05-2013 1:26pm

If you want to play that game, database logic shouldn't be scattered around inside client code whether you use SPs or inline SQL queries.

Handling transactions client side is a perfectly acceptable.

ChRoMe · 22-05-2013 1:31pm

You really should be using an ORM like hibernate, writing raw SQL is madness unless its a case where a custom query is required.

Colonel Panic · 22-05-2013 1:34pm

I mostly agree. I'm more versed in Entity Framework as I come from a .Net background in managed language land but be it an ORM, a custom DAL with your own queries in there or whatever, there should never be direct SQL API code in anything other than some abstraction that your application consumes.

I don't think an ORM is mandatory any more than I do SPs.

seamus · 22-05-2013 1:40pm

Beano wrote: »

I'm not quite sure what you are getting at here. If you have all your sql inline in the front end code then you are also tied to the database platform. similarly any version change that requires a rewrite will affect inline sql as much as SPs. Having it all in SPs also means that moving platforms is actually easier. All your database logic is in one place rather than scattered around inside client side code.

Except that if it's decided to move from (for e.g.) MSSQL to NOSQL, then you have no way of porting over your SPs.

In a well written application, your database access can be encapsulated in in a data access layer which while using inline SQL, still provides abstraction between the application and the database, so your application knows nothing of how the DB works.
You can even use a completely abstracted inline SQL like Linq which basically allows complete portability because the data access layer is handled by the Linq engine.

I agree on principle that bare inline SQL dotted about the code is no more portable than SPs, but you can still use inline SQL without having actual SQL statements littered through your application.

Stainless_Steel · 22-05-2013 2:04pm

Thanks for all the replies! All the info posted has been useful and given me more to think about.

Colonel Panic wrote: »

Is this for an actual application? Is the data actually relational or are you just dumping loads of records some data store?

Making multiple databases to me defeats the purpose. Your issue seems to be one of those known problems that is solved with things like indexing, table partitioning and database clustering.

Yes, I am making an actual information system to be used by 50+ users. As you can guess, I am not a pro dev. It is a hobby, but I think I can do a sufficient job on this application rather than paying for it.

Thanks for the hint re table partitioning - I didn't even know I could do that!

seamus wrote: »

Sounds a bit like an assignment question

1. 10k new records per day isn't that large. That's only 3.5m records per year. General idea with most database servers is to avoid rewriting data unless you need to. For example, in order to clear out yesterday-1's data, you still need to put it somewhere because you might still need it. This involves taking it out of the "Live" database and putting it into another database just in case you need to access it. That's big overhead for very little return. For performance improving tables where data sets are unevenly accessed, what you want is table partitioning. Basically what this does is allow you to move data onto different files while the data all still appears to reside in the same table. Data can be partitioned on basically any criteria.
So for example, you can set up 3 partitions - one for data less than a week old, one for data between a week and six months old, and one for all data over six months old. The data up to a week old can sit on an SSD (it's a small dataset), the stuff up to six months old could sit on a standard drive in the server, and the rest of the stuff (which is nearly never accessed) could sit on a 10Gb network link.
When you go to retrieve data from the table, you don't care how old it is, MSSQL will handle the retrieval. Likewise MSSQL can handle the dynamic partitioning - you don't need to go in every day and move data yourself.
Having separate tables/databases for different days is just a complete mess.

2. Long story short, SPs are faster and less likely to be compromised by SQL injection. From the front-end developer's point of view though they're less flexible and harder to debug. SP -v- query is a long running debate. SPs major drawback is that you're inexorably tied to the database platform and possibly even the version of the database software. If you try to move platform, you have to rewrite your SPs (or your frontend code) from the ground up, and software upgrades can kill your application if syntax or native functionality changes. I would tend to say that if you're a programmer who needs a database, avoid stored procedures. If you're a database developer providing functionality to a front-end programmer, use SPs.

3. Transaction logs provide point-in-time recovery of a database. So while a database backup allows you to restore the database to the time you backed it up, the transaction log allows you to "replay" all of the transactions which took place since the last full backup. Meaning that you can restore almost all of your data. Your current process allows you to copy data from a backup database, but presumably this is, "Restore data for row X in table Y". What would you do if your entire live database fell over and the data was unrecoverable? How would you use your secondary database to restore your live database to the condition it was just before it died?
If point-in-time recovery isn't that important, or you're confident that your secondary database could be used to recover the live one completely, then you can turn off transaction logging by changing the database recovery mode to SIMPLE*. This means that if you lose the database, you can only recover data back to the last full backup.
If you decide to stick with transaction logging, there are two ways to keep the size down:
1. Perform more frequent full database backups. A full DB backup empties the transaction log (because it's only needed to restore data from after the last full backup). This however is not ideal from a performance POV.
2. Perform frequent transaction log backups. You can backup the transaction log on its own, moving it to a backup disk or whatever. Transaction log backups don't interfere with performance because you don't need to access any database resources while you're doing it.

1 is quicker than 2 for performing full database restores.

*Unless you are using transaction log shipping for your backup database

Excellent info, thanks. I think I will look into your suggested partitioning strategy of having 3 partitions for the different date ranges.

It looks like I will go the Stored Procedure route. The last time I did such a project I used all inline code. It became a bit of a mess to read/debug to be honest.

Another question I have is related to indexes. I ran a query on
sys.dm_db_index_physical_stats on the DB from my last project and the average fragmentation % is 27%!! Is there a way to prevent this or is it just a case of reorganising indexes frequently? Any tips in this regard?

Thanks again to all who have posted.

SS

Beano · 22-05-2013 2:16pm

Colonel Panic wrote: »

If you want to play that game, database logic shouldn't be scattered around inside client code whether you use SPs or inline SQL queries.

Handling transactions client side is a perfectly acceptable.

Was that a reply to me? what "game" am i playing? and who mentioned anything database logic scattered around inside client code?

I would have thought that a DAL was assumed. Inline SQL does not below anywhere inside your client code. anywhere.

Beano · 22-05-2013 2:19pm

seamus wrote: »

Except that if it's decided to move from (for e.g.) MSSQL to NOSQL, then you have no way of porting over your SPs.
...

I cant understand why anyone would move to that rubbish. I thought we were only talking about serious databases.

shanec1928 · 22-05-2013 2:25pm

Stainless_Steel wrote: »

Thanks for all the replies! All the info posted has been useful and given me more to think about.

Yes, I am making an actual information system to be used by 50+ users. As you can guess, I am not a pro dev. It is a hobby, but I think I can do a sufficient job on this application rather than paying for it.

Thanks for the hint re table partitioning - I didn't even know I could do that!

Excellent info, thanks. I think I will look into your suggested partitioning strategy of having 3 partitions for the different date ranges.

It looks like I will go the Stored Procedure route. The last time I did such a project I used all inline code. It became a bit of a mess to read/debug to be honest.

Another question I have is related to indexes. I ran a query on
sys.dm_db_index_physical_stats on the DB from my last project and the average fragmentation % is 27%!! Is there a way to prevent this or is it just a case of reorganising indexes frequently? Any tips in this regard?

Thanks again to all who have posted.

SS

when creating the index adding padding to it should reduce the fragmentation when adding new data to the tables.

Colonel Panic · 22-05-2013 2:33pm

Beano wrote: »

Was that a reply to me? what "game" am i playing? and who mentioned anything database logic scattered around inside client code?

I would have thought that a DAL was assumed. Inline SQL does not below anywhere inside your client code. anywhere.

Yeah okay...

Stainless_Steel · 22-05-2013 3:36pm

shanec1928 wrote: »

when creating the index adding padding to it should reduce the fragmentation when adding new data to the tables.

How do I add padding?

Colonel Panic · 22-05-2013 3:40pm

It's an option when you create the index in the first place.

Giblet · 22-05-2013 7:59pm

Beano wrote: »

I cant understand why anyone would move to that rubbish. I thought we were only talking about serious databases.

Eh, what? NoSQL databases tend to work better for denormalised views. They are used by Google and the like.

amen · 23-05-2013 1:32pm

when creating the index adding padding to it should reduce the fragmentation when adding new data to the tables.

Depending on how often the data in your table changes you should really have weekly/monthly jobs to rebuild your indexs, defrag the tables, recompile your sps etc.

A Frag of 27% is not good. You should be able to keep it below 5% and really below 1%.

Except that if it's decided to move from (for e.g.) MSSQL to NOSQL, then you have no way of porting over your SPs.

I often hear this but when I ask for examples where someone moved their main application DB from MS SQL->Oracle or MySQL->MS SQL or MySQL->Oracle there are never any good examples.

neil_hosey · 27-05-2013 3:53pm

Please steer clear of stored procedures..

Use an ORM.

You could do this very simply using hibernate or entity framwork depending on whether you are using java or .NET.

Entity Framework for example will generate a very basic DAL for you, which will done fine for something as small as what you are doing.

neil_hosey · 27-05-2013 3:55pm

amen wrote: »

Depending on how often the data in your table changes you should really have weekly/monthly jobs to rebuild your indexs, defrag the tables, recompile your sps etc.

A Frag of 27% is not good. You should be able to keep it below 5% and really below 1%.

I often hear this but when I ask for examples where someone moved their main application DB from MS SQL->Oracle or MySQL->MS SQL or MySQL->Oracle there are never any good examples.

Today! I moved a large application from SQL Server to MySql with very little change needed.. It is a Code first EF solution, using the repository pattern and service meaning the only changes I needed was the one new reference to MySql, and the writing of a DB initializer to generate the database for migration.

Database Design

Comments