
In this second part of the series about incremental load, we will examine the source based incremental load strategy in detail. First we will look at the data flow of this load strategy, and then we will do a step by step implementation of it on an existing table in timeXtender. It is assumed that you already know what a full load is; if you don’t, please see the first post in this series.

Source Based Incremental Load limits the number of records that are loaded into the raw table, hence decreasing the time spent on data transfer. Since only new or modified records are present in the raw table, the time spent on data cleansing and data quality checks will also decrease compared to a full load strategy. Source based incremental load cannot handle deletes from the data source: records deleted in the source system will remain in timeXtender until another full load is made.

The foundation for source based incremental load is one or more guaranteed incremental value fields of numeric or datetime data types. With this load strategy, we persist the maximum values for these incremental fields after each execution and use them as dynamic filters for the next execution. Automatic incremental value fields, also known as identity or auto number fields, are great candidates for incremental loads, as are automatically maintained fields such as created date or modified date. On the other hand, you need to be careful when using “application values” that can be entered manually, because manually entered values are very often not guaranteed to be incremental.

If source based incremental load is based on a non guaranteed incremental value, it will very likely lead to gaps in the extracted data. The direct consequence of data gaps is a loss of confidence in the data warehouse.

The data flow for the source based incremental load strategy is as follows (a simplified SQL sketch follows the list):

  1. Delete all records from the raw table.
  2. If a Last Max Value exists for the incremental rule field(s), it is applied as a where-clause in the next step. If not, a full load is made.
  3. Insert new or modified data from the source table into the raw table.
  4. Perform data cleansing on all records in the raw table.
  5. Perform data quality checks on all records in the raw table.
  6. If a record exists only in the Raw table, then the record is inserted in the valid table.
  7. If a record exists in both Raw and Valid, then the record in the valid table is updated.
  8. Insert/Update Last Max Value for the Incremental Rules field(s).
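
To make the flow more concrete, here is a simplified T-SQL sketch of what the steps above amount to for the Invoices example used later in this post. The object names (Invoices_R for the raw table, Invoices_V for the valid table, Invoices_I for the persisted maximum values, and the columns HotelID, Number, InvoiceDate and Amount) are illustrative assumptions only; timeXtender generates and manages the actual code itself.

```sql
-- Simplified sketch only; all object names are assumed.

-- 1. Delete all records from the raw table.
TRUNCATE TABLE dbo.Invoices_R;

-- 2. + 3. Read the persisted Last Max Value (if any) and use it to filter the extraction.
DECLARE @LastMaxNumber INT = (SELECT LastMaxNumber FROM dbo.Invoices_I);

INSERT INTO dbo.Invoices_R (HotelID, Number, InvoiceDate, Amount)
SELECT HotelID, Number, InvoiceDate, Amount
FROM   src.Invoices
WHERE  @LastMaxNumber IS NULL        -- no previous value: full load
   OR  Number > @LastMaxNumber;      -- otherwise only new records

-- 4. + 5. Data cleansing and data quality checks run against the raw table here.

-- 6. + 7. Insert new records into the valid table, update records that already exist.
MERGE dbo.Invoices_V AS v
USING dbo.Invoices_R AS r
   ON v.HotelID = r.HotelID AND v.Number = r.Number        -- the logical primary key
WHEN MATCHED THEN
    UPDATE SET v.InvoiceDate = r.InvoiceDate, v.Amount = r.Amount
WHEN NOT MATCHED THEN
    INSERT (HotelID, Number, InvoiceDate, Amount)
    VALUES (r.HotelID, r.Number, r.InvoiceDate, r.Amount);

-- 8. Persist the new maximum value for the next execution (an upsert in practice).
UPDATE dbo.Invoices_I
SET    LastMaxNumber = (SELECT MAX(Number) FROM dbo.Invoices_R)
WHERE  EXISTS (SELECT 1 FROM dbo.Invoices_R);   -- keep the old value when nothing new arrived
```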

How to enable source based incremental load in timeXtender

In the following we will implement a source based incremental load strategy on a single table. We will use the Invoices table from the Happy Hotels demo and training database. A couple of assumptions about the scenario:

  • Invoices are by definition considered incremental, as we wouldn’t expect an invoice to be modified.
  • We expect that invoice numbers are assigned sequentially.
  • We expect that invoices are created sequentially, so we wouldn’t create an invoice with an earlier invoice date than the latest created invoice.

Step 1 – Enable source based incremental load

On the destination: To enable incremental load, right-click the table and select the Advanced Settings option on the context menu. In the Advanced Settings dialog, mark the Source Based checkbox as shown below.

Step 2 – Define a primary key

A primary key is needed to determine whether an incoming record is an insert or an update (that is, whether the record is new or an existing record was modified in the source). In this example it is a composite key, made up of the fields HotelID and Number. Please note that a primary key in this context is solely used to check for uniqueness and will not be implemented as a physical key in the database.

On the destination: To select or deselect a field as a primary key, right-click the field and select the Include in primary key or Exclude from primary key context menu option. This is done on the table instance in the destination area, as the key is used when matching incoming records against the valid table.
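
As an illustration of how the composite key is used for uniqueness only, a check like the following would reveal duplicate key values in the raw table. The raw table name Invoices_R is an assumption; timeXtender performs its own uniqueness handling.

```sql
-- Hypothetical uniqueness check on the composite key (HotelID, Number) in an assumed raw table.
SELECT HotelID, Number, COUNT(*) AS Occurrences
FROM   dbo.Invoices_R
GROUP BY HotelID, Number
HAVING COUNT(*) > 1;    -- any rows returned violate the logical primary key
```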

Step 3a – Add the first incremental selection rule

Incremental selection rules represent the field(s) that contain the incremental value(s). The rule of thumb is that everything on a single line represents a single rule.

On the source: To add a rule, right-click the table and select the Add incremental selection rule context menu option, then select the field(s) that make up the rule. In this example, the first rule is based on the invoice number.

Step 3b – Add the second incremental selection rule

On the source: Repeat the procedure for the second rule: right-click the table, select the Add incremental selection rule context menu option and select the field(s) for this rule (in this example, the invoice date).

To define an overlap of one day, specify a value of 1 in the Days field under Subtraction.

Once both incremental selection rules have been defined, the table in the source area will look similar to this:

Note: The ability to define overlaps is a very important part of source based incremental loading. It is used to prevent data gaps caused by long running transactions and dirty reads, by subtracting a fixed value from the maximum value recorded after the last execution. The subtraction forces the next execution to start from an earlier point than the last maximum value indicates. The size of the subtraction is determined by measuring the actual length of transactions in the source system, but that will be a separate topic in another post.
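
Conceptually, the overlap simply moves the filter value back before it is applied. A sketch, again with assumed object names:

```sql
-- The persisted maximum invoice date, minus the configured one-day overlap.
DECLARE @LastMaxInvoiceDate DATETIME = (SELECT LastMaxInvoiceDate FROM dbo.Invoices_I);

SELECT HotelID, Number, InvoiceDate, Amount
FROM   src.Invoices
WHERE  @LastMaxInvoiceDate IS NULL                              -- first load: take everything
   OR  InvoiceDate > DATEADD(DAY, -1, @LastMaxInvoiceDate);     -- re-read the overlapping day
```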

Step 4 – Deploy and execute

On the destination: To deploy and execute, right-click the table and select the deploy and execute context menu option.

A new schema changes step will appear in the deployment process after incremental load has been enabled. This extra step protects the table from unnecessary future deployments and will run as the first step in any future deployment. When a structural change that requires a new full load is made to the table, this step will warn us.

We will also see this information, telling us that a full load is needed to establish a reference load for future incremental loads.

Step 5 – Verify

We should see immediate results after the first incremental load, as it will only have extracted and processed the records that meet the selection rules for inserts and updates. An easy way to verify the results after the last execution is to use the preview feature on the destination table.

The valid table for this example contains 10,000 records.

The raw table contains only 12 records, which is the result of the minus one day overlap.

And finally, the incremental table displays the persisted maximum values for the incremental rules.
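
If you prefer to verify with queries instead of the preview feature, something along these lines would show the same picture (table names are assumed):

```sql
SELECT COUNT(*) AS ValidRows FROM dbo.Invoices_V;  -- 10,000 records in this example
SELECT COUNT(*) AS RawRows   FROM dbo.Invoices_R;  -- 12 records, caused by the one-day overlap
SELECT *                     FROM dbo.Invoices_I;  -- the persisted maximum values
```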

In the next post you will learn in detail about the target based incremental load strategy and how to implement this in timeXtender.


This is the first of a four part series about Incremental Load in timeXtender. The primary focus of this series is to introduce the common load strategies used in timeXtender. The four parts in the series are:

  1. Overview
  2. Source based incremental load
  3. Target based incremental load
  4. Sliding window incremental load

Load strategies

The diagram below illustrates the relative performance of the load strategies listed above. A full load is slowest, as it extracts, transforms and loads all records on every execution. As data volume normally increases over time, full loads will keep getting slower. At the other end, a true source based incremental load, where you are able to clearly identify only the inserted records, will run much faster, as only new data is extracted, processed and loaded into the destination.

Default load strategy

The default load strategy for any table in timeXtender is called full load, which means that all records are extracted, transformed and loaded from the source into the destination. This default data flow executes in the following main steps (a simplified SQL sketch follows the list):

  1. Delete all records from the raw table.
  2. Insert all data from the source into the raw table.
  3. Perform data cleansing on all records in the raw table.
  4. Perform data quality checks on all records in the raw table.
  5. Delete all records in the valid table.
  6. Insert all valid records into the valid table.
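
Expressed as simplified T-SQL for a single table, the full load flow looks roughly like this. The table names (Customer_R for the raw table, Customer_V for the valid table) and columns are illustrative assumptions only.

```sql
-- 1. Delete all records from the raw table.
TRUNCATE TABLE dbo.Customer_R;

-- 2. Insert all data from the source into the raw table.
INSERT INTO dbo.Customer_R (CustomerID, Name, City)
SELECT CustomerID, Name, City
FROM   src.Customer;

-- 3. + 4. Data cleansing and data quality checks run against the raw table here.

-- 5. Delete all records in the valid table.
TRUNCATE TABLE dbo.Customer_V;

-- 6. Insert all valid records into the valid table.
INSERT INTO dbo.Customer_V (CustomerID, Name, City)
SELECT CustomerID, Name, City
FROM   dbo.Customer_R;
```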

This load strategy is sometimes also called snapshot load or point-in-time load. In most cases this load strategy is sufficient, but with large amounts of data you should consider implementing an incremental load strategy for at least some tables to decrease the load time for your data warehouse.

Incremental loads

The main reason to implement any kind of incremental load strategy is to optimize the load performance of a data warehouse, or in other words, to get the data from the data source into the data warehouse as fast as possible. Although ETL and data warehousing are by nature batch-oriented, optimizing the load performance will in most cases allow near real time updates of the data warehouse. Keeping in mind the three steps of ETL (Extract, Transform and Load), we want to optimize as many of these steps as possible.

The trick behind an incremental load is quite simply to know which records in the data source have been inserted, modified or deleted since the last extraction. If you can limit the load to only this delta, the process will obviously run faster than a full load. How much you can optimize using incremental load strategies depends heavily on the source(s) you are extracting from.

Source based incremental load

This load strategy handles inserted and updated data in the source, but doesn’t handle deletes. It relies on one or more incremental value fields of a numeric or date/time data type, such as transaction number, transaction date or inserted date. It uses a primary key on the destination to determine if an incoming record is an insert or update. To prevent gaps in data caused by long running transactions in the source system, an optional overlap can be added to each incremental selection rule.

Target based incremental load

This load strategy can be configured to handle inserts, updates and deletes in the source. It assumes that there are no reliable incremental value fields available in the source, and therefore compares the incoming records to the existing records to determine whether each is an insert, update or delete. To optimize performance, the comparison is done on a hashed value, often computed from a composite of columns.
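
A rough illustration of the idea (not timeXtender’s actual mechanism): hash the non-key columns and compare rows that share the same key. Table and column names are assumed.

```sql
-- Classify incoming rows as insert, update or unchanged by comparing hashes of the non-key columns.
SELECT s.CustomerID,
       CASE
           WHEN t.CustomerID IS NULL THEN 'insert'
           WHEN HASHBYTES('SHA1', s.Name + '|' + s.City)
             <> HASHBYTES('SHA1', t.Name + '|' + t.City) THEN 'update'
           ELSE 'unchanged'
       END AS Action
FROM src.Customer AS s
LEFT JOIN dbo.Customer_V AS t
       ON t.CustomerID = s.CustomerID;

-- Deletes are found the other way round: valid-table rows with no matching source row.
```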

Sliding window incremental load

This is a hybrid that combines dynamic selection rules with source based and/or target based incremental load techniques. It is often used in scenarios where there is no clear way to identify inserted or updated data in the source. A business rule might then define that you load, for example, all data for the past 30 days, the past quarter or the past year on every execution. This will process more data than is logically needed, as the majority of the data will already be at the destination, but it will still be significantly faster than a full load, as the load size is assumed to be much smaller than the entire data set.
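
A minimal sketch of a 30-day sliding window selection, assuming a TransactionDate column in a source table:

```sql
-- Reload everything from the past 30 days on every execution; the window size is a business rule.
DECLARE @WindowStart DATE = DATEADD(DAY, -30, CAST(GETDATE() AS DATE));

SELECT *
FROM   src.SalesTransactions
WHERE  TransactionDate >= @WindowStart;
```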

When to implement incremental loads?

Although nothing prevents you from implementing an incremental load strategy for all tables, this would normally be considered overkill.

  • Incremental loads are well suited for large fact tables with a relatively low percentage of inserts/updates between executions of the ETL process.
  • If you have a large fact table where almost every record is updated, it might be better to use a full load strategy.
  • If you have a large fact table and need to check for deletes, it might be better to use a full load strategy.

In the next post you will learn in detail about the source based incremental load strategy and how to implement this in timeXtender.


Having the right permissions is a fundamental requirement of any tool, and timeXtender is no exception. In a world where we often have administrative rights in our development environments, we sometimes forget some of the basic permissions needed. One such example is the permission to read the list of all System DSN entries on a machine, which timeXtender needs when you create an ODBC data source.

If you receive an error message similar to “Requested registry access is not allowed”, please follow the steps below to grant permission to read the list of System DSN entries in the registry.

WARNINGS:
You need administrative rights to complete this guide.

As stated in any Microsoft knowledge base article involving changes to the registry, if you use Registry Editor incorrectly, you may cause serious problems that may require you to reinstall your operating system. We cannot guarantee that you can solve problems that result from using Registry Editor incorrectly. Use Registry Editor at your own risk.

  1. Determine which account is used by timeXtender.
  2. From the Start menu, click Run, type regedit, and then click OK to start Registry Editor.
  3. In Registry Editor, browse to the HKEY_LOCAL_MACHINE\SOFTWARE\ODBC\ODBC.INI key. The ODBC.INI key contains a subkey for each ODBC System DSN that is registered on the machine.
  4. Right-click the ODBC.INI key and from the context menu, click Permissions.
  5. Add the user account that you identified in step 1 to the list of users who can access this key, and select the Read check box for this user.
  6. Click Apply, and then close Registry Editor.
