
Archive for the ‘Performance’ Category

With the constantly increasing data volumes that need to be handled in data warehouses, SQL Server data compression can be a good way to save storage – and perhaps even gain throughput capacity.

As with most of my other posts, the purpose is not to cover the how-to details of a manual implementation in SQL Server, nor will I cover the technical differences between row and page compression. The purpose of this post is to make you aware of the existence of data compression, what it does and how it is implemented in timeXtender.

What is data compression?

At the risk of being too simplistic: if you think of data compression as being to the contents of a database what zip is to the file system, you are pretty close. It does exactly what the name indicates – it compresses the data and thereby saves storage, just as zipping a file does.

There are two types of data compression: row compression and page compression. Row compression saves storage by using a variable length storage format even for fixed length data types such as CHAR, and by not storing zero and NULL values. Page compression uses a more complex technique that, among other things, takes advantage of redundancy in the data (it includes row compression plus prefix and dictionary compression).
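If you were to apply the two compression types manually with T-SQL, it would look similar to the statements below. This is purely for illustration – timeXtender generates and runs the required statements for you – and the table names are made up for the example.

ALTER TABLE dbo.StagingTable REBUILD WITH (DATA_COMPRESSION = ROW);          -- row compression
ALTER TABLE dbo.FactInvoices REBUILD WITH (DATA_COMPRESSION = PAGE);         -- page compression
ALTER INDEX ALL ON dbo.FactInvoices REBUILD WITH (DATA_COMPRESSION = PAGE);  -- indexes are compressed separately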

Using data compression

Considering the nature of the two database types, page compression is well suited for data warehouse databases. Row compression is better suited for staging databases, due to the many changes (updates) normally performed in the data cleansing process.

For ease of use, there are only two parameters in timeXtender that control the use of data compression:

  • Table classification: Any table classified as a large dimension table or large fact table uses data compression if supported by the target
  • Deployment target: The deployment target for the staging or data warehouse database needs to be an Enterprise Edition of SQL Server 2008 or later

And just to clarify: a staging database table will always use row compression, whereas a data warehouse table will always use page compression.

Storage savings

Our field experience shows the following average storage savings:

Compression type         Storage savings
Row (staging)            40-60 %
Page (data warehouse)    50-90 %
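If you want to estimate the savings for one of your own tables before enabling compression, SQL Server ships with the system procedure sp_estimate_data_compression_savings. The example below uses a hypothetical Invoices table in the dbo schema.

EXEC sp_estimate_data_compression_savings
    @schema_name      = 'dbo',
    @object_name      = 'Invoices',
    @index_id         = NULL,       -- NULL = all indexes
    @partition_number = NULL,       -- NULL = all partitions
    @data_compression = 'PAGE';     -- or 'ROW'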

In addition we have seen performance gains that can be explained by the reduced disk IO. If the bottleneck in general is IO and there is available memory on the server, you can expect a performance gain.

A word of caution

When you first use data compression in a database, SQL Server changes the internal storage format of the database. This will prevent you from restoring a backup of a compression-enabled database onto a SQL Server instance running an edition that does not support data compression.
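If you need to check whether a database already contains compressed objects, for example before moving a backup to another server, a query against the sys.partitions catalog view similar to the one below will tell you.

SELECT OBJECT_NAME(p.object_id) AS table_name,
       p.index_id,
       p.partition_number,
       p.data_compression_desc
FROM sys.partitions AS p
WHERE p.data_compression <> 0;    -- 0 = NONE, 1 = ROW, 2 = PAGE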

Software prerequisites

Data compression is a feature that requires SQL Server Enterprise Edition, and it therefore also requires an Enterprise Edition license of timeXtender. It is supported from SQL Server 2008 and timeXtender version 4.5 onwards.

References

If you are interested in the behind-the-scenes details of data compression, SQL Server Central is a good source. Search for data compression (and make sure not to confuse it with backup compression).


In this fourth and final part in the incremental load series, I will explain the sliding window incremental load strategy. As mentioned in the first post in the series, this is a hybrid incremental load strategy. It can combine source based incremental load, target based incremental load and custom selection rules in scenarios where it might not be possible to reliably identify new, modified or deleted records in the source, and where, at the same time, the full extract required by target based incremental load is too much data.

A couple of scenarios where this might come in handy are:

  • The field values used for incremental selection are not assigned incrementally
  • The field values used for incremental selection can be manually modified (user entered value vs. system field)
  • Long running transactions on the source system cause transactions to be “committed behind me”

Common to all sliding window load strategies is that I need to decide on a safe window that will prevent gaps in data. As mentioned in other posts, one of the worst things that can happen to a data warehouse is gaps in data, because this will slowly but surely lead to loss of confidence in the data warehouse.

Example 1 – 90 days window from last max date

In this scenario I will be using the invoice date from the last extraction as the fixed point for a sliding window of 90 days. I will use the subtraction feature under source based incremental load to dynamically subtract 90 days from whatever the last maximum value of the invoice date was.

Using a sliding window like this is often a very effective way to get a well performing incremental load strategy, while still ensuring that any records that might be modified “back in time” are covered. In this scenario it is assumed that no records are modified further back than 90 days from the last transaction date, as defined by the invoice date field. The effective filter can be sketched as shown below.
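In this sketch, @LastMaxInvoiceDate stands in for the persisted maximum invoice date from the previous execution; the variable name is made up and the actual SQL generated by timeXtender will differ:

DECLARE @LastMaxInvoiceDate DATETIME = '2011-06-30';   -- last persisted maximum (example value)

SELECT [HotelID], [Number], [InvoiceDate], [ExchangeRate]
FROM [dbo].[Invoices]
WHERE [InvoiceDate] >= DATEADD(D, -90, @LastMaxInvoiceDate);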

Example 2 – 90 days from current date

In this scenario I would like to use the current date as the fixed point for a 90 day sliding window. The business rules have established that no invoices can be created or modified further back than 90 days from the current date. I do not know in advance how many of the records in this window already exist in the destination, but it is assumed that processing 90 days of data will be much faster than processing the entire table.

This uses a table level custom selection rule, which is essentially just a part of the SQL WHERE clause used when extracting from the source.

The custom selection rule above will cause the table SELECT statement to be implemented as:

SELECT [HotelID], [Number], [InvoiceDate], [ExchangeRate]
FROM [dbo].[Invoices]
WHERE [InvoiceDate] >= DATEADD(D, -90, GETDATE())

The selection rule will appear under the source table

Combining source based and target based incremental load strategies

It is also possible to implement a hybrid load strategy using both source based incremental load and target based incremental load. If both load strategies are used on the same table, the following separation of responsibilities applies:

  • Source based incremental selection rules control the extraction, and thus determine the contents of the raw table instance
  • Target based incremental keys and events control the destination, and thus determine which records are inserted, updated and deleted

Warning: Be very careful if you consider combining source based incremental load for extraction with delete event handling on target based incremental load. Keeping in mind that records that do not exist in the raw table will be deleted from the valid table, this can lead to unintended behavior: with source based extraction, the raw table only contains the new and modified records, so everything else in the valid table would be deleted.


After having been sidetracked by other topics for a few weeks, I have finally gotten around to continuing the series on incremental load strategies. In this third part I will explain the target based incremental load strategy in detail. As in the previous post, I will first explain the data flow and then do a step by step implementation of this load strategy for an existing table in timeXtender.

Target based incremental load really only comes into play when there is no way to implement a source based incremental load strategy. In other words, target based incremental load is used when there is no reliable way to identify new and/or modified records in the source. One very common scenario is a source system that does not have a guaranteed incremental sequence number, created date/time or modified date/time. This is seen with more or less the entire range of the most common ERP systems.

With this in mind, I expect a load pattern like this:

  • Every extract from the source will be a full load
  • Every load, in terms of records, will be equal to the last load + new records – deleted records
  • 80-90 % of the extracted records will already exist in the valid table instance
  • Every load will be incrementally larger than the previous load, as more records are added to the source

Keeping the flow from the first post in the series in mind, target based incremental load is all about optimizing the load process after extraction. This is done by minimizing the number of records that need to be inserted into, updated in (and deleted from) the valid table instance. It should be obvious that if the write operations on the valid table instance can be limited to the 10-20 % of affected data, the overall load process should run much faster than a full load.

 

And finally, the logical data flow for the target based incremental load strategy is:

  1. Delete all records from the raw table
  2. Insert all data from the source into the raw table
  3. Perform data cleansing on all records in the raw table
  4. Perform data quality checks on all records in the raw table
  5. Insert all new records into the valid table
  6. Update all existing records in the valid table if needed
  7. Delete all records from the valid table that were deleted in the source

As an important note, please be aware that this is not the exact physical flow, as it is implemented using set based SQL stored procedures. A simplified sketch of the insert and update steps is shown below.
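A minimal sketch of steps 5 and 6 in plain T-SQL could look like this, assuming the raw and valid instances are named Invoices_R and Invoices_V (the names and the exact generated code will differ in timeXtender):

-- Step 5: insert records that exist in the raw table but not in the valid table
INSERT INTO dbo.Invoices_V ([HotelID], [Number], [InvoiceDate], [ExchangeRate])
SELECT r.[HotelID], r.[Number], r.[InvoiceDate], r.[ExchangeRate]
FROM dbo.Invoices_R AS r
WHERE NOT EXISTS (SELECT 1 FROM dbo.Invoices_V AS v
                  WHERE v.[HotelID] = r.[HotelID] AND v.[Number] = r.[Number]);

-- Step 6: update existing records when a non-key value has changed (NULL handling omitted for brevity)
UPDATE v
SET    v.[InvoiceDate]  = r.[InvoiceDate],
       v.[ExchangeRate] = r.[ExchangeRate]
FROM dbo.Invoices_V AS v
JOIN dbo.Invoices_R AS r
  ON  v.[HotelID] = r.[HotelID] AND v.[Number] = r.[Number]
WHERE v.[InvoiceDate] <> r.[InvoiceDate] OR v.[ExchangeRate] <> r.[ExchangeRate];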

How to enable target based incremental load in timeXtender

In the following I will implement a target based incremental load strategy on a single table. For comparison with other load strategies, I will use the Invoices table from the Happy Hotels demo and training database. A couple of assumptions about the scenario:

  • Invoice numbers are not assigned sequentially
  • Identical number series for invoice numbers can be used across hotels
  • Invoice date can be changed manually in the source system and is therefore not reliable for incremental selection

Step 1 – Enable target based incremental load

On the destination: To enable incremental load, right-click the table and select the Advanced Settings option on the context menu. In the Advanced Settings dialog, mark the Target Based checkbox as shown below.

Step 2 – Define an incremental key

Once target based incremental load has been enabled, the destination table will have a new yellow colored field group Incremental Settings as seen below.

Selecting this new field group opens the following dialog, pinned to the right side of the application.

Similar to the source based incremental example, the natural key is a composite key made of the fields HotelID and Number. I will specify these fields in the Target Based Incremental Keys section of the dialog. This key will be used to compare records in the raw table with records in the valid table, so that it is possible to determine if a record exists or not.

Step 3 – Select incremental events

In the last configuration step, I need to select which of the insert, update and delete events should be handled. I need to select at least one, but can also select all of them, making the target based incremental load logically identical to a full load.

As an important note, please be aware that the delete event is handled by comparing all records in the valid table with all records in the raw table. Records that do not exist in the raw table, based on the key specified in step 2, will be deleted from the valid table. The sketch below illustrates the comparison.
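Sticking with the assumed Invoices_R/Invoices_V naming from above, the delete event corresponds to something like:

-- Delete valid records that have no matching record in the raw table
DELETE v
FROM dbo.Invoices_V AS v
WHERE NOT EXISTS (SELECT 1 FROM dbo.Invoices_R AS r
                  WHERE r.[HotelID] = v.[HotelID] AND r.[Number] = v.[Number]);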

Step 4 – Deploy and execute

On the destination: To deploy and execute, right-click the table and select the deploy and execute context menu option.

A new schema changes step will appear in the deployment process after incremental load has been enabled. This extra step protects the table from unnecessary future deployments and will run as the first step of any future deployment. When a structure change that requires a new full load is made to the table, it will warn us.

Step 5 – Verify

I will see the immediate results after the first incremental load, as it will only have inserted/updated those records in the valid table that meet the criteria for inserts and updates. As I need to see changes in the source data to really verify the effect of target based incremental load, I will simply verify the performance impact by comparing the execution times of a full load vs. my target based incremental load.

When would you not use a target based incremental load strategy?

Before answering this question, I will explain a bit about how the incremental key is implemented behind the scenes. If I select the show system control fields context-menu option on the same table as shown in step 2, it reveals a field called Incremental Hash Key in the incremental settings field group. This field is the actual implementation of the incremental key. It is implemented as a hashed, computed column that is persisted in the table so that a non-clustered index can be created on it. This provides me with a well performing single field that is used for comparison between incoming and existing records on the destination. A sketch of what this could look like in plain T-SQL is shown below.
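The following is purely illustrative and assumes the valid table is named Invoices_V; the hash algorithm, conversions and index name are my own choices, not necessarily what timeXtender generates:

ALTER TABLE dbo.Invoices_V
ADD [Incremental Hash Key] AS
    CAST(HASHBYTES('MD5',
         CAST([HotelID] AS NVARCHAR(20)) + N'|' + CAST([Number] AS NVARCHAR(20))
    ) AS BINARY(16)) PERSISTED;   -- persisted so it can be indexed

CREATE NONCLUSTERED INDEX IX_Invoices_V_IncrementalHashKey
    ON dbo.Invoices_V ([Incremental Hash Key]);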

This implementation is very beneficial for large tables and in general, the larger the better. However, the computed column described above does come with a bit of overhead, and I would normally not recommend using a target based incremental load strategy on tables smaller than 5-10 million records. This is not an exact science, as many factors on the actual system have a high impact on this. One of the most important factors is the performance of the IO subsystem (disks), where, for example, poor write performance lowers the table size at which target based incremental load becomes worthwhile.


In this post I will talk about surrogate keys and how to use them for dimension keys in a data warehouse. As one of the key concepts of data warehousing, it is often at the center of heated discussions between “true believers” in the Kimball recommendations and proponents of what we can call a more pragmatic approach.

If you do not really care about all the nitty-gritty details and just want to learn how to implement surrogate keys, please feel free to dive straight into the how-to video.

Why use a surrogate key?

We will start by defining a surrogate key as a single field, integer based, non-meaningful key. A surrogate key is a brilliant way to handle some of the challenges you will face in data warehousing. Some of these are:

  • Identification of a specific instance of a dimension member (SCD type 2)

It is hard to imagine an SCD type 2 dimension without surrogate keys. In a type 2 dimension, multiple instances of a dimension member will exist over time. A surrogate key is the way to identify a specific instance of a member in this context.

  • Simplification of a composite business key (wide, multiple fields)

The simplification of dealing with a single field vs. a composite key to identify a single record in a table should be obvious.

  • Shielding from changes in meaningful business keys

Using surrogate keys is an effective way to shield the data model of the data warehouse against changes in the source system. One cause of this could be a replacement of a line of business system. With a surrogate based data warehouse, the entire data model would be unaffected by such a change.

  • Join optimization

If you are able to base every join between fact and dimension tables on a single integer value, you will experience a significant query performance gain over equivalent joins made on multiple fields, especially if they are non-integer types.

Performance Impact

Following the definition of a surrogate key as a single field, integer based, non-meaningful key, it will outperform a business key implemented as a non-numeric data type or, even worse, as a composite key, as seen in many source systems.

We will not go into the discussion and many details of SQL Server performance (heap tables vs. clustered tables, indexes, pages etc.), as you can find entire blogs and web sites dedicated to this. For now we will simply conclude that it is considered best practice to have the smallest possible, single field, integer based primary key. And this is exactly what the DW_Id field in timeXtender is. It is an integer based field, used as the single field in the physical primary key on any table in timeXtender. It is also an identity field, meaning that it is assigned a non-meaningful, incremental value automatically when a record is inserted into the table. And there we have it, a built-in well performing surrogate key on any table.
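As a minimal sketch, a dimension table with a built-in surrogate key of this kind could be declared as follows (the table and column names are just an example, not what timeXtender actually generates):

CREATE TABLE dbo.DimHotel (
    [DW_Id]     INT IDENTITY(1,1) NOT NULL,   -- non-meaningful, automatically assigned surrogate key
    [HotelID]   INT NOT NULL,                 -- business key from the source system
    [HotelName] NVARCHAR(100) NULL,
    CONSTRAINT PK_DimHotel PRIMARY KEY CLUSTERED ([DW_Id])
);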

When not to use surrogate keys

I am aware that even suggesting that there can be cases where a surrogate key is both unnecessary and unwanted is an open invitation to criticism. Nevertheless I am going to take the risk and give you a couple of examples:

  • The maintainability factor

What is sometimes overlooked or ignored is that many data warehouses these days are built and maintained by people who are not full time data modelers or data warehouse developers. If the person(s) maintaining the data warehouse feel at home in the line of business systems feeding the data warehouse, it might be safer and more efficient to keep the business keys.

  • Performance and storage is the main driver

Although it is most often true that implementing surrogate keys will increase join performance and reduce storage, it might not always be the case. Or it might be insignificant to the overall performance or storage. If you, for example, consider a date dimension, both bigint and datetime data types have a storage size of 8 bytes. If all your joins can be based on a single datetime based field, what is the value of the integer based surrogate keys? And if you consider the price you have to pay when performing the lookup required to replace the datetime based key in your fact tables with the surrogate key, what is the effect on overall performance?
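The storage claim is easy to verify yourself; both expressions below return 8 (bytes):

SELECT DATALENGTH(CAST('2011-06-30' AS DATETIME)) AS DatetimeBytes,   -- 8
       DATALENGTH(CAST(20110630 AS BIGINT))       AS BigintBytes;     -- 8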

  • The warehouse is only used to feed an OLAP database

If all you really use your data warehouse for is to feed an OLAP database, it might not be worth implementing surrogate keys. Best practice documents from Microsoft indicate that, to improve processing time and reduce the dimension size, you should consider using numeric keys for dimension keys when the number of members of a single attribute exceeds 1 million. I find it safe to assume that most businesses will have dimensions way smaller than this.

And finally, the how-to video

Please refer to the following video, showing you both how to implement this on the data warehouse and how to use it in an OLAP dimension.


In this second part of the series about incremental load, we will examine the source based incremental load strategy in detail. First we will look into the data flow of this load strategy and then we will do a step by step implementation of this load strategy on an existing table in timeXtender. It is assumed that you already know what a full load is, but if you don’t, please see the first post in this series.

Source based incremental load limits the number of records that are loaded into the raw table, hence decreasing the time spent on data transfer. Since only new or modified records are present in the raw table, the time spent on data cleansing and data quality checks will also decrease compared to a full load strategy. Source based incremental load cannot handle deletes from the data source. Records deleted in the source system will be present in timeXtender until another full load is made.

The foundation for source based incremental load is one or more guaranteed incremental value field(s) of numeric or datetime data types. With this load strategy, we persist the maximum values for these incremental fields after each execution and use them as dynamic filters for the next execution. Automatic incremental value fields, also known as identity or auto number fields, are great candidates for incremental load, just as automatic fields such as created date or modified date are equally good. On the other hand, you need to be careful when using “application values” that can be entered manually. With manually entered values, we are very often not guaranteed that the values will be incremental.

If source based incremental load is based on a non-guaranteed incremental value, it will very likely lead to gaps in the extracted data. The direct consequence of data gaps is loss of confidence in the data warehouse.

The data flow for the source based incremental load strategy is:

  1. Delete all records from the raw table.
  2. If a Last Max Value for the incremental rules field(s) exists, it is applied as a WHERE clause in the next step (see the sketch after this list). If not, a full load is made.
  3. Insert new or modified data from the source table in the raw table.
  4. Perform data cleansing on all records in the raw table.
  5. Perform data quality checks on all records in the raw table.
  6. If a record exists only in the Raw table, then the record is inserted in the valid table.
  7. If a record exists in both Raw and Valid, then the record in the valid table is updated.
  8. Insert/Update Last Max Value for the Incremental Rules field(s).
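A simplified sketch of steps 2 and 3 for a single incremental selection rule on the invoice date could look like this; the variable and raw table names are assumptions, and the SQL timeXtender generates will differ:

DECLARE @LastMaxInvoiceDate DATETIME = '2011-06-30';   -- persisted after the previous execution

INSERT INTO dbo.Invoices_R ([HotelID], [Number], [InvoiceDate], [ExchangeRate])
SELECT [HotelID], [Number], [InvoiceDate], [ExchangeRate]
FROM [dbo].[Invoices]
WHERE [InvoiceDate] > @LastMaxInvoiceDate;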

How to enable source based incremental load in timeXtender

In the following we will implement a source based incremental load strategy on a single table. We will use the Invoices table from the Happy Hotels demo and training database. A couple of assumptions about the scenario:

  • Invoices are by definition considered as incremental, as we wouldn’t expect an invoice to be modified.
  • We expect that the numbers for invoices are assigned sequentially
  • We expect that invoices are created sequentially, thus we wouldn’t create an invoice with an earlier invoice date than the latest created invoice

Step 1 – Enable source based incremental load

On the destination: To enable incremental load, right-click the table and select the Advanced Settings option on the context menu. In the Advanced Settings dialog, mark the Source Based checkbox as shown below.

Step 2 – Define a primary key

A primary key is needed to be able to determine if an incoming record is an insert or an update (if the record is new or an existing record was modified in the source). In this example it is a composite key, made of the fields HotelID and Number. Please note that a primary key in this context is solely used to check for uniqueness and will not be implemented as a physical key in the database.

On the destination: To select/deselect a field as a primary key, right-click the field and select the Include in primary key or Exclude from primary key context menu option. This is done on the table instance in the destination area, as it affects the destination table.

Step 3 a – Add incremental selection rule

Incremental selection rules represent the field(s) that contain the incremental value(s). The rule of thumb is that everything on a single line represents a single rule.

On the source: To add a rule, right-click the table and select the Add incremental selection rule context-menu option. Then select the field(s) that represent a single rule.

Step 3 b – Add incremental selection rule

On the source: Add the second rule the same way – right-click the table, select the Add incremental selection rule context-menu option and select the field(s) that represent the rule.

To define an overlap of one day, specify a value of 1 in the days field under subtraction.

Once both incremental selection rules have been defined, the table in the source area will look similar to this

Note: The ability to define overlaps is a very important part of source based incremental loading. It is used to prevent data gaps caused by long running transactions and dirty reads, by subtracting a fixed value from the maximum value persisted after the last execution. The subtraction forces the next execution to start from an earlier point than indicated by the last maximum value. The size of the subtraction should be determined by measuring the actual length of transactions, but this will be a separate topic in another post.

Step 4 – Deploy and execute

On the destination: To deploy and execute, right-click the table and select the deploy and execute context menu option.

A new schema changes step will appear in the deployment process after incremental load has been enabled. This extra step protects the table from unnecessary future deployments and will run as the first step of any future deployment. When a structure change that requires a new full load is made to the table, it will warn us.

We will also see this information, telling us that a full load is needed to establish a reference load for future incremental loads.

Step 5 – Verify

We should see the immediate results after the first incremental load, as it will only have extracted and processed those records meeting the selection rules for inserts and updates. An easy way to verify the results after the last execution is to use the preview feature on the destination table.

The valid table for this example contains 10000 records

The raw table contains only 12 records, which are the result of the minus one day overlap

And finally the incremental table displays the persisted maximum values for the incremental rules

In the next post you will learn in detail about the target based incremental load strategy and how to implement this in timeXtender.


This is the first of a four part series about incremental load in timeXtender. The primary focus of this series is to introduce the common load strategies used in timeXtender. The four parts in the series are:

  1. Overview
  2. Source based incremental load
  3. Target based incremental load
  4. Sliding window incremental load

Load strategies

The diagram below illustrates the performance of the different load strategies listed above. A full load is slowest, as it extracts, transforms and loads all records for every execution. As data volume will normally increase over time, full loads will keep getting slower over time. At the other end, a true source based incremental load, for which you are able to clearly identify only the inserted records, will run much faster, as only new data is extracted, processed and loaded into the destination.

Default load strategy

The default load strategy for any table in timeXtender is called full load, which means that all records are extracted, transformed and loaded from the source into its destination. This default data flow executes in the following main steps:

  1. Delete all records from the raw table.
  2. Insert all data from the source into the raw table.
  3. Perform data cleansing on all records in the raw table.
  4. Perform data quality checks on all records in the raw table.
  5. Delete all records in the valid table.
  6. Insert all valid records into the valid table.
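A simplified sketch of this flow in plain T-SQL could look like the following, assuming raw and valid table instances named Invoices_R and Invoices_V and a source database called SourceDb (all names are made up for the example):

DELETE FROM dbo.Invoices_R;                                   -- 1. empty the raw table

INSERT INTO dbo.Invoices_R ([HotelID], [Number], [InvoiceDate], [ExchangeRate])
SELECT [HotelID], [Number], [InvoiceDate], [ExchangeRate]
FROM SourceDb.dbo.Invoices;                                   -- 2. full extract from the source

-- 3. and 4. data cleansing and data quality checks run against dbo.Invoices_R here

DELETE FROM dbo.Invoices_V;                                   -- 5. empty the valid table

INSERT INTO dbo.Invoices_V ([HotelID], [Number], [InvoiceDate], [ExchangeRate])
SELECT [HotelID], [Number], [InvoiceDate], [ExchangeRate]
FROM dbo.Invoices_R;                                          -- 6. load all valid records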

This load strategy is sometimes also called snapshot load or point-in-time load. In most cases this load strategy is sufficient, but with large amounts of data you should consider implementing an incremental load strategy for at least some tables to decrease the load time for your data warehouse.

Incremental loads

The main reason to implement any kind of incremental load strategy is to optimize the load performance of a data warehouse – in other words, to get the data as fast as possible from the data source into the data warehouse. Although ETL and data warehousing are by nature batch-oriented, optimizing the load performance will in most cases allow near real time updates of the data warehouse. Keeping in mind the three steps of ETL – Extract, Transform and Load – we want to optimize as many of these steps as possible.

The trick behind an incremental load is quite simply to know which records in the data source have been inserted, modified or deleted since the last extraction. If you can minimize the load to only process this delta data, it should be obvious that the process should run faster than a full load. How (much) you can optimize using incremental load strategies highly depends on the source(s) you are extracting from.

Source based incremental load

This load strategy handles inserted and updated data in the source, but doesn’t handle deletes. It relies on one or more incremental value fields of a numeric or date/time data type, such as transaction number, transaction date or inserted date. It uses a primary key on the destination to determine if an incoming record is an insert or update. To prevent gaps in data caused by long running transactions in the source system, an optional overlap can be added to each incremental selection rule.

Target based incremental load

This load strategy can be configured to handle inserts, updates and deletes in the source. It assumes that there are no reliable incremental value fields available in the source, and therefore compares the incoming records to the existing records to determine whether each is an insert, update or delete. To optimize performance, the comparison is done on a hashed value that is often the result of a composite key.

Sliding window incremental load

This is a hybrid that combines dynamic selection rules with source based and/or target based incremental load techniques. It is often used in scenarios where there is no clear way to identify inserted or updated data in the source. A business rule might then define that you will, for example, load all data for the past 30 days, the past quarter or the past year for every execution. This will process more data than logically needed, as the majority of data will already be at the destination. It will still be significantly faster than a full load, as it is assumed that the load size will be much smaller than the entire data set.

When to implement incremental loads?

Although nothing would prevent you from implementing an incremental load strategy for all tables, this would normally be considered overdoing it. Some rules of thumb:

  • Incremental loads are well suited for large fact tables with a relatively low percentage of inserts/updates between each execution of the ETL process.
  • If you have a large fact table with almost every record being updated, it might be better to use a full load strategy.
  • If you have a large fact table and need to check for deletes, it might be better to use a full load strategy.

In the next post you will learn in detail about the source based incremental load strategy and how to implement this in timeXtender.

