Archive for June, 2011

In this fourth and final part in the incremental load series, I will explain the sliding window incremental load strategy. As mentioned in the first post in the series, this is a hybrid incremental load strategy. It can combine source based incremental load, target based incremental load and custom selection rules in scenarios where it might not be possible to reliably identify new, modified or deleted records in the source, and where a full extract from the source, as implemented by target based incremental load, is too much data.

A couple of scenarios where this might come in handy are:

  • The field values used for incremental selection are not assigned incrementally
  • The field values used for incremental selection can be manually modified (user entered value vs. system field)
  • Long running transactions on the source system cause transactions to be “committed behind me”

Common to all sliding window load strategies is that I need to decide on a safe window size that will prevent gaps in data. As mentioned in other posts, one of the worst things that can happen to a data warehouse is gaps in data, because this will slowly but surely lead to a loss of confidence in the data warehouse.

Example 1 – 90 day window from last max date

In this scenario I will be using the invoice date from the last extraction as the fixed point for a sliding window of 90 days. I will use the subtraction feature under source based incremental load to dynamically subtract 90 days from the last maximum value of the invoice date.

Using a sliding window like this is often a very effective way to get a well performing incremental load strategy, while still ensuring that any records that might be modified “back in time” are covered. In this scenario, it is assumed that no records are modified further back than 90 days from the last transaction date, as defined by the invoice date field.
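
To make the idea concrete, the extraction ends up with a WHERE clause along these lines. This is only an illustrative sketch, not the exact SQL generated by timeXtender, and the date 2011-05-31 is simply an assumed last maximum invoice date from the previous load:

SELECT [HotelID], [Number], [InvoiceDate], [ExchangeRate]
FROM [dbo].[Invoices]
WHERE [InvoiceDate] >= DATEADD(D, -90, '2011-05-31')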

Example 2 – 90 days from current date

In this scenario I would like to use the current date as the fixed point for a 90 day sliding window. The business rules have established that no invoices can be created or modified more than 90 days back from the current date. I do not know in advance how many of the records in this window already exist, but it is assumed that processing 90 days of data will be much faster than processing the entire table.

This uses a table level selection rule, which is essentially just part of the SQL WHERE clause used when extracting from the source.

The custom selection rule above will cause the table SELECT statement to be implemented as:

SELECT [HotelID], [Number], [InvoiceDate], [ExchangeRate]
FROM [dbo].[Invoices]
WHERE [InvoiceDate] >= DATEADD(D, -90, GETDATE())

The selection rule will appear under the source table

Combining source based and target based incremental load strategies

It is also possible to implement a hybrid load strategy using both source based incremental load and target based incremental load. If both load strategies are used on the same table, the following separation of responsibilities applies:

  • Source based incremental selection rules control the extraction and thus determine the contents of the raw table instance
  • Target based incremental keys and events control the destination and thus determine which records are inserted, updated and deleted

Warning: Be very careful if you consider combining source based incremental load for extraction with delete event handling on target based incremental load. Keeping in mind that records that do not exist in the raw table will be deleted from the valid table, this can lead to unintended behavior.
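
To see why, consider a rough sketch of what the delete handling boils down to. The raw and valid table names Invoices_R and Invoices_V are assumptions used for illustration, not the exact SQL generated by timeXtender:

-- Delete valid records that have no matching record in the raw table
DELETE V
FROM [dbo].[Invoices_V] V
WHERE NOT EXISTS (SELECT 1 FROM [dbo].[Invoices_R] R
                  WHERE R.[HotelID] = V.[HotelID] AND R.[Number] = V.[Number])

If a source based selection rule limits the extraction to, say, the last 90 days of invoices, every valid record older than 90 days will be missing from the raw table and will therefore be deleted, even though it still exists in the source.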

Read Full Post »

After having been sidetracked by other topics for a few weeks, I have finally gotten around to continuing the series on incremental load strategies. In this third part, I will explain the target based incremental load strategy in detail. As in the previous post, I will first explain the data flow and then do a step by step implementation of this load strategy for an existing table in timeXtender.

Target based incremental load really only comes into play when there is no way at all to implement a source based incremental load strategy. In other words, target based incremental load is used when there is no way at all to identify new and/or modified records in the source. One very common scenario where this is the case is when the source system does not have a guaranteed incremental sequence number, created date/time or modified date/time. This is seen with more or less the entire range of the most common ERP systems.

With this in mind, I will expect a load pattern like this:

  • Every extract from the source will be a full load
  • Every load, in terms of records, will be equal to the last load + new records – deleted records
  • 80-90 % of the extracted records will already exist in the valid table instance
  • Every load will be incrementally larger than the previous load, as more records are added to the source

Keeping this flow from the first post in the series in mind, target based incremental load is all about optimizing the load process after extraction. This is done by minimizing the number of records that need to be inserted into, updated in (and deleted from) the valid table instance. It should be obvious that if the write operations on the valid table instance can be limited to the 10-20% of data that is actually affected, the overall load process should run much faster than a full load.

 

And finally, the logical data flow for target based incremental load strategy is:

  1. Delete all records from the raw table
  2. Insert all data from the source into the raw table
  3. Perform data cleansing on all records in the raw table
  4. Perform data quality checks on all records in the raw table
  5. Insert all new records into the valid table
  6. Update all existing records in the valid table if needed
  7. Delete all records from the valid table that were deleted in the source

As an important note, please be aware that this is not the exact physical flow, as it is implemented using set based SQL stored procedures.
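
Still, a rough set based sketch of steps 5-7 may help clarify the intention. The table names Invoices_R and Invoices_V and the HotelID/Number key are assumptions borrowed from the example later in this post, not the actual generated stored procedure:

-- Step 5: insert raw records that do not yet exist in the valid table
INSERT INTO [dbo].[Invoices_V] ([HotelID], [Number], [InvoiceDate], [ExchangeRate])
SELECT R.[HotelID], R.[Number], R.[InvoiceDate], R.[ExchangeRate]
FROM [dbo].[Invoices_R] R
WHERE NOT EXISTS (SELECT 1 FROM [dbo].[Invoices_V] V
                  WHERE V.[HotelID] = R.[HotelID] AND V.[Number] = R.[Number])

-- Step 6: update existing valid records where one or more attributes have changed
UPDATE V
SET V.[InvoiceDate] = R.[InvoiceDate], V.[ExchangeRate] = R.[ExchangeRate]
FROM [dbo].[Invoices_V] V
JOIN [dbo].[Invoices_R] R
  ON R.[HotelID] = V.[HotelID] AND R.[Number] = V.[Number]
WHERE V.[InvoiceDate] <> R.[InvoiceDate] OR V.[ExchangeRate] <> R.[ExchangeRate]

-- Step 7: delete valid records that no longer exist in the (full) raw extract
DELETE V
FROM [dbo].[Invoices_V] V
WHERE NOT EXISTS (SELECT 1 FROM [dbo].[Invoices_R] R
                  WHERE R.[HotelID] = V.[HotelID] AND R.[Number] = V.[Number])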

How to enable target based incremental load in timeXtender

In the following I will implement a target based incremental load strategy on a single table. For comparison with other load strategies, I will use the Invoices table from the Happy Hotels demo and training database. A couple of assumptions about the scenario:

  • Invoice numbers are not assigned sequentially
  • Identical number series for invoice numbers can be used across hotels
  • Invoice date can be changed manually in the source system, thus is not reliable for incremental selection

Step 1 – Enable target based incremental load

On the destination: To enable incremental load, right-click the table and select the Advanced Settings option on the context menu. In the Advanced Settings dialog, mark the Target Based checkbox as shown below.

Step 2 – Define an incremental key

Once target based incremental load has been enabled, the destination table will have a new yellow colored field group Incremental Settings as seen below.

Selecting this new field group opens the following dialog, pinned to the right side of the application.

Similar to the source based incremental example, the natural key is a composite key made of the fields HotelID and Number. I will specify these fields in the Target Based Incremental Keys section of the dialog. This key will be used to compare records in the raw table with records in the valid table, so that it is possible to determine if a record exists or not.

Step 3 – Select incremental events

In the last configuration step, I need to select which of the insert, update and delete events should be handled. I need to select at least one, but can also select all of them, making the target based incremental load logically identical to a full load.

As an important note, please be aware that the delete event is handled by comparing all records in the valid table with all records in the raw table. Records that do not exist in the raw table, based on the key specified in step 2, will be deleted from the valid table.

Step 4 – Deploy and execute

On the destination: To deploy and execute, right-click the table and select the deploy and execute context menu option.

After incremental load has been enabled, a new schema changes step will appear in the deployment process. This extra step protects the table from unnecessary future deployments and will run as the first step of any future deployment. If a structural change that requires a new full load is made to the table, this step will warn us.

Step 5 – Verify

I will see the immediate results after the first incremental load, as it will only have inserted/updated the records in the valid table that meet the criteria for inserts and updates. As I need changes in the source data to really verify the effect of target based incremental load, I will simply verify the performance impact by comparing the execution times of a full load vs. my target based incremental load.

When would you not use a target based incremental load strategy?

Before answering this question, I will explain a bit about how the incremental key is implemented behind the scenes. If I select the show system control fields context menu option on the same table as shown in step 2, it reveals a field called Incremental Hash Key in the incremental settings field group. This field is the actual implementation of the incremental key. It is implemented as a hashed, computed column that is persisted in the table so that a non clustered index can be created on it. This provides me with a single, well performing field that is used for comparison between incoming and existing records on the destination.
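
To give a rough idea of the concept, a persisted hash key over the HotelID and Number fields from step 2 could look something like this. The Invoices_V table name is an assumption, and the actual column and index definitions generated by timeXtender will differ:

ALTER TABLE [dbo].[Invoices_V] ADD [IncrementalHashKey] AS
    CAST(HASHBYTES('SHA1', CAST([HotelID] AS NVARCHAR(20)) + N'|' + CAST([Number] AS NVARCHAR(20))) AS BINARY(20)) PERSISTED

CREATE NONCLUSTERED INDEX [IX_Invoices_V_IncrementalHashKey]
    ON [dbo].[Invoices_V] ([IncrementalHashKey])

Comparing a single, narrow, indexed binary value is considerably cheaper than comparing the composite key field by field, which is where the performance gain on large tables comes from.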

This implementation is very beneficial for large tables and in general, the larger the better. However, the computed column described above does come with a bit of overhead, and I would normally not recommend using a target based incremental load strategy on tables smaller than 5-10 million records. This is not an exact science, as many factors on the actual system have a high impact on this. One of the most important factors is the performance of the IO subsystem (disks), where e.g. poor write performance lowers the point at which target based incremental load becomes worthwhile.

Read Full Post »

After having traveled for the past almost two weeks, including a trip to Karlsruhe, Germany for the IT & Dev Connections conference, I was working my way down the list of unread emails. I came across an email from a partner, who was complaining that it is difficult to find a specific table in the timeXtender meta model once a project reaches a certain size. Apparently he did not know about the search function that has been in timeXtender for quite some time now. This made me realize that we probably haven’t been too good at telling people about it, so I will take this opportunity to explain how this small, basic and hard-to-live-without function works.

In short, the search function can be used to search for a word or a sentence in the timeXtender meta model. This includes not only all standard objects and attributes within the model, such as field names or table names, but also the user defined documentation added to the meta model. The following example of a simple search should give you a pretty good idea of how to use this function to be even more productive with timeXtender.

A search example

In the following example, I will search the meta model for the word “customer”. I have selected the “Search in descriptions” option, which means that it will search both system attributes and user defined documentation. The search result will use the same icons as used in the tree views in timeXtender, thus it should be clear which found objects are tables, fields, dimensions etc.

Step 1 – Start the search dialog

The search can be found via the Tools.Show Shortcuts menu option. This will show you that the search can be accessed using the CTRL + F keyboard shortcut, which is how I am going to fire it up. The first time, it will look like this

 

Step 2 – Search criteria

In this example I will search for a single word, ignoring case differences but including the descriptions added as documentation.

 

Step 3 – Search results

As you can see below, the search word was found in several objects, including a table name, field name, lookup field name, dimension and dimension level. It even found the cube “Sales”, which clearly does not have “customer” as part of its name. This cube was found because the description added to it (note the bold font) contains the search word.

To work with any object, either double-click it in the search results using the mouse or use the “Select in tree” button. This will select the object in the main dialog, without you having to scroll up and down the tree view.

Read Full Post »

In this post I will talk about surrogate keys and how to use them for dimension keys in a data warehouse. As one of the key concepts of data warehousing, it is often the center of heated discussions between “true believers” of the Kimball recommendations and what we can call a more pragmatic approach.

If you do not really care about all the nitty gritty details and just want to learn how to implement surrogate keys, please feel free to dive straight into the how to video.

Why use a surrogate key?

We will start by defining a surrogate key as a single field, integer based, non-meaningful key. A surrogate key is a brilliant way to handle some of the challenges you will face in data warehousing. Some of these are:

  • Identification of members in slowly changing (SCD type 2) dimensions

It is hard to imagine an SCD type 2 dimension without surrogate keys. In a type 2 dimension, multiple instances of a dimension member will exist over time. A surrogate key is the way to identify a specific instance of a member in this context.

  • Simplification of a composite business key (wide, multiple fields)

The simplification of dealing with a single field vs. a composite key to identify a single record in a table should be obvious.

  • Shielding from changes in meaningful business keys

Using surrogate keys is an effective way to shield the data model of the data warehouse against changes in the source system. One cause of this could be a replacement of a line of business system. With a surrogate based data warehouse, the entire data model would be unaffected by such a change.

  • Join optimization

If you are able to base every join between fact and dimension tables on a single integer value, you will see a significant query performance improvement over equivalent joins made on multiple fields, especially if they are non-integer types.

Performance Impact

Following the definition of a surrogate key as a single field, integer based, non-meaningful key that will improve join performance, it will outperform a business key implemented as either non-numeric data types or, even worse, composite keys as they are implemented in many source systems.

We will not go into the discussion and many details of SQL Server performance (heap tables vs. clustered tables, indexes, pages etc.), as you can find entire blogs and web sites dedicated to this. For now we will simply conclude that it is considered best practice to have the smallest possible, single field, integer based primary key. And this is exactly what the DW_Id field in timeXtender is. It is an integer based field, used as the single field in the physical primary key on any table in timeXtender. It is also an identity field, meaning that it is assigned a non-meaningful, incremental value automatically when a record is inserted into a table. And there we have it, a built-in, well performing surrogate key on any table.
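
As a minimal sketch of that pattern, assuming a hypothetical DimCustomer dimension and FactInvoice fact table (the names and fields below are for illustration only):

-- Dimension table with a DW_Id style identity surrogate key as the primary key
CREATE TABLE [dbo].[DimCustomer] (
    [DW_Id] INT IDENTITY(1, 1) NOT NULL,
    [CustomerKey] NVARCHAR(20) NOT NULL,    -- business key from the source system
    [CustomerName] NVARCHAR(100) NULL,
    CONSTRAINT [PK_DimCustomer] PRIMARY KEY CLUSTERED ([DW_Id])
)

-- Fact to dimension join on the single integer surrogate key
-- (FactInvoice and its DW_CustomerId column are likewise hypothetical)
SELECT D.[CustomerName], F.[Amount]
FROM [dbo].[FactInvoice] F
JOIN [dbo].[DimCustomer] D ON D.[DW_Id] = F.[DW_CustomerId]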

When not to use surrogate keys

I am aware that even suggesting that there can be cases where a surrogate key is both unnecessary and unwanted is an open invitation to criticism. Nevertheless I am going to take the risk and give you a couple of examples:

  • The maintainability factor

What is sometimes overlooked or ignored is that many data warehouses these days are built and maintained by people who are not full time data modelers or data warehouse developers. If the person(s) maintaining the data warehouse feel at home in the line of business systems feeding the data warehouse, it might be safer and more efficient to keep the business keys.

  • Performance and storage is the main driver

Although it is most often true that implementing surrogate keys will increase join performance and reduce storage, it might not always be the case. Or the difference might be insignificant to the overall performance or storage. If you, for example, consider a date dimension, both the bigint and datetime data types have a storage size of 8 bytes. If all your joins can be based on a single datetime field, what is the value of the integer based surrogate keys? And if you consider the price you have to pay when performing the lookup required to replace the datetime based key in your fact tables with the surrogate key, what is the effect on overall performance?

  • The warehouse is only used to feed an OLAP database

If all you really use your data warehouse for is to feed an OLAP database, it might not be worth implementing surrogate keys. Best practice documents from Microsoft indicate that, to improve processing time and reduce the dimension size, you should consider using numeric keys for dimension keys when the number of members of a single attribute exceeds 1 million. I find it safe to assume that most businesses will have dimensions way smaller than this.

And finally, the how to video

Please refer to the following video, showing you both how to implement this on the data warehouse and how to use it in an OLAP dimension.

Read Full Post »

Before getting into the how-to details of the implementation, let’s start by defining what early arriving facts are. An early arriving fact is what happens when a fact record arrives at the data warehouse before its full dimension context. Or in other words, when we look up the surrogate key in our dimension table using the fact table natural key, the dimension member doesn’t exist yet.

In many cases when we load our data warehouse, we would assume that dimensions and facts arrive at the same time. Or at least we would expect that dimension members exist for the fact records being loaded. Although there are many reasons why we have to be able to handle early arriving facts, multiple source systems and/or different load frequencies seem to be the dominating factors. We might, for example, be loading our dimensions using a full load strategy with one load schedule, while using an incremental load strategy for our fact table with a close to real-time frequency.

For the sake of this post, consider a company selling products to customers, using invoices to capture the sale. We can illustrate this using the following simplified star schema.
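
As a minimal sketch of that star schema (all table and field names below are assumptions made for the example):

-- Customer dimension, keyed by the natural customer key from the source
CREATE TABLE [dbo].[Customer] (
    [CustomerKey] NVARCHAR(20) NOT NULL,
    [CustomerName] NVARCHAR(100) NULL
)

-- Product dimension
CREATE TABLE [dbo].[Product] (
    [ProductKey] NVARCHAR(20) NOT NULL,
    [ProductName] NVARCHAR(100) NULL
)

-- Invoice fact table referencing both dimensions by their natural keys
CREATE TABLE [dbo].[Invoices] (
    [InvoiceNumber] NVARCHAR(20) NOT NULL,
    [InvoiceDate] DATETIME NOT NULL,
    [CustomerKey] NVARCHAR(20) NOT NULL,
    [ProductKey] NVARCHAR(20) NOT NULL,
    [Amount] DECIMAL(18, 2) NOT NULL
)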

How to enable early arriving facts in timeXtender

In the following we will implement early arriving fact handling for the above scenario. Please be aware of the following default behaviour in timeXtender:

  • Unless early arriving fact handling has been explicitly added to a fact table, a missing dimension key is assumed to be an error
  • The relationship settings determine whether an error is of severity warning or error (foreign key behavior)
  • Relationship settings are inherited from the project default when a relationship is created

Steps 1-5 shown below are for a single dimension and should be repeated for every dimension.

Step 1 – Open the Add Related Records Dialog

On the dimension table, right-click and select the Add Related Records context menu option. This will open the dialog shown below, in which the remaining configuration will take place.

Step 2 – Table level parameters

  • In the Create Records from table drop down list, select the source table for the records to be created in this dimension table.
  • Set the Record Condition to Not Exists.
  • Set the Data Destination Table to the Raw instance of the table.

These parameters instruct timeXtender to create distinct keys in the raw instance of the dimension table when the dimension key from the fact table doesn’t exist in the dimension table.

Step 3 – Field mapping

  • Map the natural key fields from the fact table to the corresponding fields in the dimension table
  • Map any additional attributes from the fact table to the corresponding fields on the dimension table
  • Define default values for remaining fields on the dimension table

Step 4 – Define the match condition

In the Conditions section, define the joins between the fact table and the dimension table. These are used together with the record condition to determine whether there is a match or not, and what to do about it. These joins should be the same as the ones used when we perform the lookup of our surrogate key in the dimension table. In the example above, the join is made on a single field, the customer key.
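
Conceptually, the configuration in steps 2-4 boils down to a statement like the following, using the assumed table names from the schema sketch above with Customer_R and Invoices_R as the raw instances of the customer dimension and the fact table. This is not the exact SQL generated by timeXtender:

-- Create a placeholder member for every customer key in the raw fact table
-- that does not yet exist in the raw customer dimension table
INSERT INTO [dbo].[Customer_R] ([CustomerKey], [CustomerName])
SELECT DISTINCT F.[CustomerKey], N'Unknown'
FROM [dbo].[Invoices_R] F
WHERE NOT EXISTS (SELECT 1 FROM [dbo].[Customer_R] C
                  WHERE C.[CustomerKey] = F.[CustomerKey])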

Step 5 – Confirm

Click the OK button to confirm and close the dialog.

As can be seen from the screenshot below, a new group, Table Transformations, appears in the timeXtender UI under the receiving dimension table.

Dimension processing

Since we created our missing dimension keys in the raw table instance, it is up to the SCD type 1 and 2 settings for the dimension table to determine what to do with those extra records. In most cases we simply replace the dummy/default/missing attributes on the dimension with the true values coming from the source (type 1 updates). As an alternative, we can choose to create the missing dimension keys in the valid table instance, but then we assume full responsibility for the insert, e.g. making sure there are no violations of the primary key.
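
A minimal sketch of that type 1 behaviour, again using the assumed Customer_R and Customer_V raw and valid table names rather than the actual timeXtender procedure:

-- Overwrite placeholder (or otherwise changed) attributes with the values from the source
-- (NULL handling omitted for brevity)
UPDATE V
SET V.[CustomerName] = R.[CustomerName]
FROM [dbo].[Customer_V] V
JOIN [dbo].[Customer_R] R ON R.[CustomerKey] = V.[CustomerKey]
WHERE V.[CustomerName] <> R.[CustomerName]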

References
Kimball talks about early arriving facts in his 2004 design tips.

Read Full Post »
