Paul van Oordt  SQL Services

Freelance SQL Server specialist and troubleshooter, Utrecht / Antwerp

0627400408 - WhatsApp - info@vanoordt.nl
LinkedIn - curriculum vitae
vanoordt.nl - Nederlands
newsletter

Newsletter February 2022

Dealing with historical data is a topic that will come up sooner or later. Hopefully you will find it interesting to read about the challenges involved and about two strategies for two classes of data.

If you don't want to receive my newsletter, let me know.

Best regards,

Paul van Oordt

Dealing with historical data

Frequently recurring questions in my practice are: When is it time to remove historical data from the database? And: How can that be done efficiently?

For tables with tens of millions of rows, removing a significant chunk of data, whether or not it is moved elsewhere, can be a tricky business. Not only are the delete statements expensive and often blocking, issues can also arise with foreign keys that refer to the data. And once archived data has been moved, existing applications and code can no longer consult it, since the data now lives in another database.
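
A common way to keep the delete statements themselves manageable is to remove rows in small batches rather than in one huge transaction. A minimal sketch in T-SQL; the table dbo.OrderHistory and the cutoff date are hypothetical:

    -- delete historical rows in batches of 5,000 to limit blocking and log growth
    DECLARE @rows int = 1;
    WHILE @rows > 0
    BEGIN
        DELETE TOP (5000)
        FROM dbo.OrderHistory              -- hypothetical table
        WHERE OrderDate < '2017-01-01';    -- hypothetical cutoff
        SET @rows = @@ROWCOUNT;
    END;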

Database size can become a problem, with backups taking longer and longer. Because removing old data is often difficult and expensive, it makes sense to first look for easier ways to reduce the size of the database, for example removing redundant or unused indexes, or reclaiming free space created by deletes and updates. It is also good to be aware of the ways to speed up backups or make them smaller, such as differential or filegroup backups, striping and backup compression. And of course compression of the data itself is also worth investigating.
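
As an illustration, a quick sketch of some of these options; the database name, file paths and table are hypothetical:

    -- compressed full backup, striped over two files
    BACKUP DATABASE SalesDB                -- hypothetical database
    TO DISK = 'D:\Backup\SalesDB_1.bak',
       DISK = 'E:\Backup\SalesDB_2.bak'
    WITH COMPRESSION;

    -- much smaller differential backup in between full backups
    BACKUP DATABASE SalesDB
    TO DISK = 'D:\Backup\SalesDB.dif'
    WITH DIFFERENTIAL, COMPRESSION;

    -- compress the data itself with row or page compression
    ALTER TABLE dbo.OrderHistory
    REBUILD WITH (DATA_COMPRESSION = PAGE);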

Query performance can also be a reason to purge historical data, although data that queries never touch need not be an obstacle, provided the indexing is really good. Obviously, the larger a table, the greater the effect of a badly tuned query doing a full table or index scan.

Below I describe two strategies for limiting data size, usable for two different classes of tables.

'Primary' tables

By 'primary' tables I mean tables that are intertwined in the network of foreign keys, that are read and modified on a large scale by all kinds of business events, and that therefore usually carry a considerable number of indexes. Examples of such tables are 'customer', 'product', 'order', 'invoice' et cetera.

The approach to limiting the footprint of these tables starts by marking data as historical, for example with a non-nullable bit column 'archived'. In anticipation of the possible deletion of that data, queries are rewritten so that by default they only search non-archived data, by systematically adding the condition archived = 0. Changing the queries is quite easy when stored procedures are used; optionally, a view with this filter can be used instead. The filter must contain the literal value 0, not a parameter as in archived = @archived.

Existing indexes can then be filtered on archived = 0, which makes them much smaller. (Note that even though we filter on archived, an index with that column as its key makes no sense, since it is not selective at all.) The result is queries that touch less data and are therefore faster, and much smaller indexes. You can even add some extra included columns to these smaller indexes to speed up queries further. As a bonus, the rows marked as archived can be removed more quickly if that is ever deemed necessary, since fewer indexes need to be updated by the deletes.
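
A minimal sketch of this approach; the table dbo.CustomerOrder, its columns and the index are hypothetical, and @CustomerID would be a parameter of the surrounding stored procedure:

    -- mark rows as historical with a non-nullable bit column
    ALTER TABLE dbo.CustomerOrder          -- hypothetical table
    ADD archived bit NOT NULL
        CONSTRAINT DF_CustomerOrder_archived DEFAULT (0);

    -- rebuild an existing index as a filtered index on live data only,
    -- with room for some extra included columns
    CREATE NONCLUSTERED INDEX IX_CustomerOrder_CustomerID
    ON dbo.CustomerOrder (CustomerID)
    INCLUDE (OrderDate, TotalAmount)
    WHERE archived = 0
    WITH (DROP_EXISTING = ON);

    -- queries filter on the literal 0, not on a parameter
    SELECT OrderID, OrderDate, TotalAmount
    FROM dbo.CustomerOrder
    WHERE CustomerID = @CustomerID
      AND archived = 0;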

Log data

For the sake of convenience I call the other category of tables 'log data'. These tables usually have no foreign keys pointing to them, the data is rarely queried, and there are few or no indexes. Furthermore, the old data can often really be deleted and does not need to be moved.

For this data the only problem is the expensive and blocking delete statements. Table partitioning is then ideal: in a table in which the data and indexes are partitioned, a month's worth of data can be deleted instantaneously as a metadata operation. It is recommended to set up partitioning when creating the table, rather than waiting for the problem to occur, because partitioning an existing large table is difficult (expensive and blocking). In practice, that foresight is not always present. Fortunately, solutions are still conceivable in which a partitioned version of the table is gradually phased in while the old, unpartitioned version is phased out.
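
A sketch of what such a partitioned log table could look like, with monthly partitions; all names and boundary values are hypothetical, and the partition-level TRUNCATE requires SQL Server 2016 or later:

    -- monthly boundaries; in practice new ones are split in ahead of time
    CREATE PARTITION FUNCTION pf_LogMonth (datetime2)
    AS RANGE RIGHT FOR VALUES ('2022-01-01', '2022-02-01', '2022-03-01');

    CREATE PARTITION SCHEME ps_LogMonth
    AS PARTITION pf_LogMonth ALL TO ([PRIMARY]);

    CREATE TABLE dbo.AppLog                -- hypothetical log table
    (
        LogID    bigint IDENTITY NOT NULL,
        LoggedAt datetime2 NOT NULL,
        Message  nvarchar(4000) NULL
    ) ON ps_LogMonth (LoggedAt);

    -- remove the oldest month instantaneously, as a metadata operation
    TRUNCATE TABLE dbo.AppLog WITH (PARTITIONS (1));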

Handling historical data: something to explore further? I am happy to assist you.

Now also working from Antwerp

I like to work at the customer's premises, with a few people and a large screen in a room, or together behind a laptop. Because my assignments usually last only one or a few days, I don't make a fuss about travel times. Good to know: for the past few months I have been staying in Antwerp regularly, and I'd be happy to come to you from there as well.

Thanks for reading.

February 2022; the newsletter appears a couple of times per year; subscribe / unsubscribe