Archive for the ‘Data Warehouse’ Category

In this post I will show you how to use prototyping to demonstrate business requirements and get rapid, actionable feedback. How you arrived at the business requirements is not important for the purpose of this post, but if you have read some of my earlier blog posts, it should be clear that I recommend taking an agile approach.

The goals for the prototype

  • Demonstrate business requirements
  • Deliver fast, using minimum resources
  • Minimize rework

I am going to build a prototype using the dimensional model approach as per best practice. Although not strictly necessary for the purpose of a prototype, I am even using surrogate keys, as I will most likely reuse the prototype for the permanent implementation after one or more very short iterations.

The four easy steps

  • Build the data warehouse star schema
  • Enter sample data in the model
  • Build the dimensional model in OLAP
  • Build a test front end in Excel using Pivot Tables

Step 1 – Build the data warehouse star schema

In a dimensional model, the business requirements can be illustrated as the star schema shown below (also known as a star join schema). For those unfamiliar with dimensional modeling, the yellow table is called a fact table. It sits at the center of a star schema, represents the numeric measurements of an event and has keys (links) to the dimensions that describe the event. The blue tables are called dimension tables. In this example the event is a sale, described in the context of the customer, the product sold and the point in time at which the sale took place.
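
For readers who prefer to see the model as tables rather than a diagram, a minimal T-SQL sketch of such a star schema could look like the following. All table and column names are illustrative, not what timeXtender generates.

    -- A minimal sketch of the star schema above; table and column names are illustrative
    CREATE TABLE dbo.DimCustomer (
        CustomerKey  int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
        CustomerNo   varchar(20)  NOT NULL,                   -- natural (business) key
        CustomerName varchar(100) NOT NULL
    );

    CREATE TABLE dbo.DimProduct (
        ProductKey   int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key
        ProductNo    varchar(20)  NOT NULL,
        ProductName  varchar(100) NOT NULL
    );

    CREATE TABLE dbo.DimDate (
        DateKey      int  NOT NULL PRIMARY KEY,               -- e.g. 20110705
        CalendarDate date NOT NULL
    );

    CREATE TABLE dbo.FactSales (
        DateKey      int NOT NULL REFERENCES dbo.DimDate (DateKey),
        CustomerKey  int NOT NULL REFERENCES dbo.DimCustomer (CustomerKey),
        ProductKey   int NOT NULL REFERENCES dbo.DimProduct (ProductKey),
        Quantity     int           NOT NULL,
        SalesAmount  decimal(18,2) NOT NULL
    );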

Remembering that the purpose is to demonstrate the business requirements as fast as possible and with a minimum of rework, I will implement the dimensional model above with manually entered sample data in the data warehouse database. When I build the dimensional model in the timeXtender meta layer, it will look like this. Please note the red fields at the bottom of the fact table. These lookup fields translate the natural key joins between my dimension and fact tables into surrogate key based joins.
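
Behind the scenes, resolving such lookup fields boils down to joining on the natural key and picking up the surrogate key during the fact load. A minimal sketch, reusing the illustrative tables above and assuming a hypothetical staging.Sales table:

    -- During the fact load, the natural key from the source rows is translated
    -- into the surrogate keys of the dimensions (staging.Sales is hypothetical)
    INSERT INTO dbo.FactSales (DateKey, CustomerKey, ProductKey, Quantity, SalesAmount)
    SELECT s.DateKey,
           c.CustomerKey,                    -- looked up via the natural key
           p.ProductKey,
           s.Quantity,
           s.SalesAmount
    FROM   staging.Sales AS s
    JOIN   dbo.DimCustomer AS c ON c.CustomerNo = s.CustomerNo
    JOIN   dbo.DimProduct  AS p ON p.ProductNo  = s.ProductNo;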

Step 2 – Enter sample data in the model

Using the custom data feature in timeXtender, I will enter just enough data to be able to demonstrate the model. As a side note, you would be amazed at the difference it makes to your business stakeholders if you use recognizable data instead of more or less random values.
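
If you prefer to seed the demo data with SQL instead of the custom data grid, a handful of recognizable rows is all it takes. The values below are purely illustrative:

    -- A few recognizable sample rows beat random values when demonstrating the model
    INSERT INTO dbo.DimCustomer (CustomerNo, CustomerName)
    VALUES ('C001', 'Contoso Retail'),
           ('C002', 'Fabrikam Wholesale');

    INSERT INTO dbo.DimProduct (ProductNo, ProductName)
    VALUES ('P100', 'Mountain Bike'),
           ('P200', 'Road Helmet');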

At this point in time, the model, including the manually entered demo data, only exists in the timeXtender meta model in memory. If I use the deploy and execute features at this point, timeXtender will generate all necessary SQL code in the physical data warehouse database, load it with the demo data and save the project definition to the timeXtender project repository database. The processed fact table will look like this.

  • If you do not need a multidimensional representation of your data (an OLAP cube), the data is at this point already accessible from your front end of choice.

Step 3 – Build the dimensional model in OLAP (Analysis Services)

I will use a drag and drop approach to create a very simple OLAP representation of the star schema. It will include standard dimensions and standard measures, and will look like this in timeXtender.

Step 4 – Build a test front end in Excel

I will use the most common of all BI front end tools to build a view on the data model I have just created. A Pivot Table is an excellent way to get a feel for the dimensions and measures of a cube.

What is next?

The Excel workbook I just created can now be delivered to my business stakeholders. They can use it to verify the business requirements and provide instant feedback. The business stakeholders might come back with extensions or modifications to the dimensional model after seeing the first result. This might take me through one or more additional iterations of the above steps, adjusting the model to more detailed requirements. This is all fine and exactly why prototyping is such an efficient way for data warehouse developers and stakeholders to interact. The time I spend at this early point in the process will save me from late and costly rework.

Once the model has been verified, I can continue to identify the data sources needed to load actual data into the dimensional model just built. I will most likely need to make minor changes to the dimensional model in the data warehouse, such as different data types for non-system fields. Except for these minor changes, the entire dimensional model, including the cube, can be reused and put into production after proper testing with real life data.

 

Video link

My colleague Thomas Lørup Duun has made a video on this topic available on YouTube. The video is actually only 15 minutes in real time, so the header in this post is a bit conservative :-)

Read Full Post »

I will be the first to admit it: documentation is not fun. Talking to the business, understanding their requirements, building the dimensional model, developing the physical data warehouse and delivering the results to the business is both fun and valuable. Documentation is simply something I have to do.

I do it for the internal and external auditors. I do it for the next developer on the project. I do it for myself, because I cannot remember every little detail of everything I did, and unlike what some developers might say, not all code is self-explanatory. I do documentation because I have to, and because it is invaluable to the business and to the continued development, expansion and enhancement of the data warehouse.

Since there is no doubt about the importance of documentation, and since it is in no way fun, it is fortunate that I can document an entire project in five minutes.

As you are most likely well aware by now, timeXtender is a meta data driven tool that generates code for SQL, SSIS and SSAS. Or to put it more strongly: timeXtender already holds all the meta data, so it just needs to put it into a document.

The elements of documentation

The automated PDF/XPS documentation that can be generated in timeXtender contains the following elements

  • Automatic meta data. Every field, table, data source, dimension, cube, measure etc. is automatically included in the documentation.
  • Custom documentation. Virtually every element within a timeXtender project can be described further using custom documentation in Rich Text format.

Add a custom description to an element

If you want to add what we at timeXtender call inline documentation, you simply

  1. Right click an element such as a table, field or measure
  2. Select the Description option in the context menu
  3. Add the description (which can be formatted using the simple editor or you can even paste in images from the clipboard)

Inline documentation is powerful, fast and easy to add, simply because you describe/document an element on the element itself. There are no other tools involved. When working with a project you can also instantly see whether documentation has been added to an element.

Custom Documentation

Generating the document

Creating the actual document is as easy as this

  1. Right click the project
  2. Select the Documentation option in the context menu
  3. Specify the document format, path and description settings
  4. Specify any optional settings such as colors and font
  5. Hit the OK button to generate the document

The documentation setup

 

The document will open once it has been created. The screen shot below shows a PDF formatted document. Please notice the attachment icon to the left. Any custom descriptions that have been added will be attached inline to the document (depending on the settings in the dialog above).

 

Documentation PDF

Read Full Post »

With the constantly increasing data volumes that need to be handled in data warehouses, SQL Server data compression can be a good way to save on storage, and perhaps even get a performance gain in throughput capacity.

As with most of my other posts, the purpose is not to cover the how-to details of a manual implementation in SQL Server, just as I will not cover the technical differences between row and page compression. The purpose of this post is to make you aware of the existence of data compression, what it does and how it is implemented in timeXtender.

What is data compression?

At the risk of being too simplistic: if you think of data compression as being to the contents of a database what zip is to the file system, you are pretty close. It does exactly what the name indicates, it compresses the data and thereby saves storage, just like zipping a file.

There are two types of data compression: row compression and page compression. Row compression saves storage by using a variable length storage format even for fixed length data types such as CHAR, and by not storing zero and NULL values. Page compression uses a more complex technique that also considers redundancy in the data.

Using data compression

Considering the nature of the two database types, page compression is well suited for data warehouse databases, while row compression is better suited for staging databases due to the many changes (updates) normally performed during the data cleansing process.

For ease of use, there are only two parameters in timeXtender that control the use of data compression

  • Table classification – Any table classified as a large dimension table or a large fact table uses data compression if supported by the target
  • Deployment target – The deployment target for the staging or data warehouse database needs to be an Enterprise Edition of SQL Server 2008 or later

And just to clarify: a staging database table will always use row compression, whereas a data warehouse table will always use page compression.
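
If you were to apply the same compression manually, the equivalent T-SQL looks roughly like this. The table names are illustrative, and sp_estimate_data_compression_savings lets you check the expected saving before you rebuild:

    -- Estimate the expected saving before enabling compression
    EXEC sp_estimate_data_compression_savings
         @schema_name      = 'dbo',
         @object_name      = 'FactSales',
         @index_id         = NULL,
         @partition_number = NULL,
         @data_compression = 'PAGE';

    -- Page compression for a data warehouse table
    ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);

    -- Row compression for a staging table
    ALTER TABLE staging.Sales REBUILD WITH (DATA_COMPRESSION = ROW);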

Data Compression

Deployment Target

Storage savings

Our field experience shows the following average storage savings

  Compression type         Storage savings
  Row (staging)            40-60 %
  Page (data warehouse)    50-90 %

In addition, we have seen performance gains that can be explained by the reduced disk IO. If the bottleneck in general is IO and there is available memory on the server, you can expect a performance gain.

A word of caution

When you first use data compression in a database, SQL Server changes the internal storage format of the database. This will prevent you from restoring a backup of a compression enabled database onto a SQL Server instance that does not support data compression.
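
If you are unsure whether compression is in use before moving a backup, a quick check against the catalog views could look like this:

    -- Lists every table/index partition that has row or page compression enabled
    SELECT OBJECT_NAME(p.object_id) AS table_name,
           p.index_id,
           p.data_compression_desc
    FROM   sys.partitions AS p
    WHERE  p.data_compression <> 0;   -- 0 = NONE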

Software prerequisites

Data compression is a feature that requires SQL Server Enterprise Edition, thus also requires an Enterprise Edition license of timeXtender. It is supported from SQL Server 2008 and timeXtender version 4.5.

References

If you are interested in the behind the scenes details of data compression, SQL Server Central is a good source. Search for data compression (and make sure not to confuse it with backup compression).

Read Full Post »

Inspired by a question I got the other day about data types and data type conversion in timeXtender, I guess it is about time to cover that subject in this blog. Although the title of this post talks about data type conversion, I will spend just as much time explaining data type handling in general. This includes a general description of how source objects are transformed into the common, standardized SQL Server structure we implement with our code generation.

Staging database

With the exception of custom tables, every table in the staging database is derived directly from its source table, making it logically a 1:1 copy of the source table.

Object names (tables and fields) are kept as in the source, unless they contain characters that are illegal in the destination SQL Server. Any such character will normally be replaced with an underscore to make the object name valid.

The data type of the source will automatically be converted into the SQL Server equivalent chosen by timeXtender. With the exception of ODBC and text file data sources, this data type mapping cannot be changed.

Custom tables and fields

In both staging and data warehouse databases, it is possible to add custom tables, just as it is possible to add custom fields to any table. For these fields you decide the data type, using what I will call abstract data types. By this I mean that you do not specify actual SQL data types; instead, the parameters you specify determine the SQL data types used behind the scenes.

Tip: If you hover the mouse over a field in the tree view in timeXtender, a tool tip will show the actual SQL data type for the field.

 

  SQL Data type     Field Parameters
  varchar(50)       Data Type varchar, length 50
  nvarchar(20)      Data Type nvarchar, length 20
  decimal(38,3)     Data Type decimal, precision 38, scale 3

 

Staging to data warehouse mapping

When a field is first copied to a data warehouse table, it inherits the data type from its source, normally a staging database. Unlike the fields derived from a data source table, the data type of these fields can be changed. To edit the data type of a field, simply right click the field and choose edit.

The following shows a field that was copied from a staging field of type varchar and that has been changed to the datetime type.

Data Type Edit

 

Although the data type conversion is not visually clear in this screen shot, the following shows a field of data type varchar being copied from a staging database table into a field of type datetime in a data warehouse table.

When a field is copied from a staging database to a data warehouse like this, most implicit conversions are done automatically.

Data Movement Different Types

 

Mapping to another field

The following shows a custom transformation used to copy a field with a character based data type (varchar) into a field of type datetime. In this example it is a simple assignment of a value, but you can also use CONVERT and CAST statements if needed, as sketched below.

Custom Transformation
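
Such a CONVERT or CAST based transformation could, for example, use expressions along these lines. The field and table names are illustrative:

    -- Convert a character based value such as '2011-07-05' into a datetime
    SELECT CONVERT(datetime, InvoiceDateText, 120) AS InvoiceDate   -- style 120 = yyyy-mm-dd hh:mi:ss
    FROM   dbo.Invoices_R;

    -- CAST works just as well when no explicit style is needed
    SELECT CAST(InvoiceDateText AS datetime) AS InvoiceDate
    FROM   dbo.Invoices_R;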

 

Same field transformation

Although the transformations that can be done on a field are virtually unlimited, the one limitation that does exist is that the data type cannot be changed. As an example, the following shows a date only transformation on a datetime field. It removes the time part of a datetime value such as 2011-07-05 22:54 and stores it in the same field as 2011-07-05 00:00, still persisted as datetime.

DateOnly Transformation
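
The equivalent expression for a date only transformation could look like this; again, the field and table names are illustrative:

    -- Remove the time part of a datetime value while keeping the data type datetime
    SELECT DATEADD(DAY, DATEDIFF(DAY, 0, InvoiceDateTime), 0) AS InvoiceDateOnly
    FROM   dbo.Invoices_R;

    -- On SQL Server 2008 and later, this is equivalent to
    SELECT CONVERT(datetime, CONVERT(date, InvoiceDateTime)) AS InvoiceDateOnly
    FROM   dbo.Invoices_R;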

Read Full Post »

After having been sidetracked by other topics for a few weeks, I have finally gotten around to continuing the series on incremental load strategies. In this third part I will explain the target based incremental load strategy in detail. As in the previous post, I will first explain the data flow and then do a step by step implementation of this load strategy for an existing table in timeXtender.

Target based incremental load really only comes into play when there is no way at all to implement a source based incremental load strategy, in other words when there is no way to identify new and/or modified records in the source. One very common scenario is a source system that does not have a guaranteed incremental sequence number, created date/time or modified date/time. This is the case for more or less the entire range of the most common ERP systems.

With this in mind, I expect a load pattern like this

  • Every extract from the source will be a full load
  • Every load, in terms of records, will be equal to the last load + new records – deleted records
  • 80-90 % of the extracted records will already exist in the valid table instance
  • Every load will be incrementally larger than the previous load, as more records are added to the source

Keeping the flow from the first post in the series in mind, target based incremental load is all about optimizing the load process after extraction. This is done by minimizing the number of records that need to be inserted into, updated in and deleted from the valid table instance. It should be obvious that if the write operations on the valid table instance can be limited to the 10-20 % of the data actually affected, the overall load process should run much faster than a full load.

 

And finally, the logical data flow for target based incremental load strategy is:

  1. Delete all records from the raw table
  2. Insert all data from the source into the raw table
  3. Perform data cleansing on all records in the raw table
  4. Perform data quality checks on all records in the raw table
  5. Insert all new records into the valid table
  6. Update all existing records in the valid table if needed
  7. Delete all records from the valid table, that was deleted in the source

As an important note, please be aware that this is not the exact physical flow, as it is implemented using set based SQL stored procedures.
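
To make the logical flow concrete, a simplified, set based sketch of steps 5 to 7 could look like the following. I am borrowing the Invoices table and its HotelID/Number key from the example further down; the _R/_V (raw/valid) table names and the attribute fields are illustrative, and this is not the code timeXtender generates.

    -- Step 5: insert records that do not yet exist in the valid instance
    INSERT INTO dbo.Invoices_V (HotelID, Number, InvoiceDate, Amount)
    SELECT r.HotelID, r.Number, r.InvoiceDate, r.Amount
    FROM   dbo.Invoices_R AS r
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.Invoices_V AS v
                       WHERE v.HotelID = r.HotelID AND v.Number = r.Number);

    -- Step 6: update existing records that have changed
    UPDATE v
    SET    v.InvoiceDate = r.InvoiceDate,
           v.Amount      = r.Amount
    FROM   dbo.Invoices_V AS v
    JOIN   dbo.Invoices_R AS r
           ON r.HotelID = v.HotelID AND r.Number = v.Number
    WHERE  v.InvoiceDate <> r.InvoiceDate OR v.Amount <> r.Amount;

    -- Step 7: delete records that no longer exist in the source (and therefore not in the raw table)
    DELETE v
    FROM   dbo.Invoices_V AS v
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.Invoices_R AS r
                       WHERE r.HotelID = v.HotelID AND r.Number = v.Number);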

How to enable target based incremental load in timeXtender

In the following I will implement a target based incremental load strategy on a single table. For comparison with other load strategies, I will use the Invoices table from the Happy Hotels demo and training database. A couple of assumptions about the scenario:

  • Invoice numbers are not assigned sequentially
  • Identical number series for invoice numbers can be used across hotels
  • Invoice date can be changed manually in the source system and is therefore not reliable for incremental selection

Step 1 – Enable target based incremental load

On the destination: To enable incremental load, right-click the table and select the Advanced Settings option on the context menu. In the Advanced Settings dialog, mark the Target Based checkbox as shown below.

Step 2 – Define an incremental key

Once target based incremental load has been enabled, the destination table will have a new yellow colored field group, Incremental Settings, as seen below.

Selecting this new field group opens the following dialog, pinned to the right side of the application.

Similar to the source based incremental example, the natural key is a composite key made of the fields HotelID and Number. I will specify these fields in the Target Based Incremental Keys section of the dialog. This key will be used to compare records in the raw table with records in the valid table, so that it is possible to determine if a record exists or not.

Step 3 – Select incremental events

In the last configuration step, I need to select which of the insert, update and delete events should be handled. I need to select at least one, but can also select all of them, making the target based incremental load logically identical to a full load.

As an important note, please be aware that the delete event is handled by comparing all records in the valid table with all records in the raw table. Records that do not exist in the raw table, based on the key specified in step 2, will be deleted from the valid table.

Step 4 – Deploy and execute

On the destination: To deploy and execute, right-click the table and select the deploy and execute context menu option.

A new schema changes step will appear in the deployment process after incremental load has been enabled. This extra step protects the table from unnecessary future deployments and will run as the first step of any future deployment. It will warn us if a structural change that requires a new full load is made to the table.

Step 5 – Verify

I will see the immediate results after the first incremental load, as it will only have inserted/updated the records in the valid table that meet the criteria for inserts and updates. Since I need changes in the source data to really verify the effect of target based incremental load, I will simply verify the performance impact by comparing the execution times of a full load with those of my target based incremental load.

When would you not use a target based incremental load strategy?

Before answering this question, I will explain a bit about how the incremental key is implemented behind the scenes. If I select the show system control fields context menu option on the same table as shown in step 2, it reveals a field called Incremental Hash Key in the Incremental Settings field group. This field is the actual implementation of the incremental key. It is implemented as a hashed, computed column that is persisted in the table so that a non clustered index can be created on it. This gives me a well performing single field that is used for comparison between incoming and existing records on the destination.
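
A simplified version of such a key could be built like this. The hash algorithm, separator and all names are assumptions for the sake of illustration; the generated definition may differ.

    -- A persisted, hashed representation of the composite natural key (HotelID, Number)
    ALTER TABLE dbo.Invoices_V
    ADD IncrementalHashKey AS
        HASHBYTES('SHA1', CONVERT(nvarchar(20), HotelID) + N'|' + CONVERT(nvarchar(50), Number))
        PERSISTED;

    -- A non clustered index on the persisted column turns the raw/valid comparison
    -- into a single field lookup
    CREATE NONCLUSTERED INDEX IX_Invoices_V_IncrementalHashKey
        ON dbo.Invoices_V (IncrementalHashKey);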

This implementation is very beneficial for large tables and, in general, the larger the better. However, the computed column described above does come with a bit of overhead, and I would normally not recommend using a target based incremental load strategy on tables smaller than 5-10 million records. This is not an exact science, as many factors on the actual system have a high impact on this. One of the most important factors is the performance of the IO subsystem (disks), where e.g. bad write performance lowers the point at which target based incremental load becomes worthwhile.

Read Full Post »

In this post I will talk about surrogate keys and how to use them for dimension keys in a data warehouse. As one of the key concepts of data warehousing, it is often the center of heated discussions between “true believers” in the Kimball recommendations and proponents of what we can call a more pragmatic approach.

If you do not really care about all the nitty gritty details and just want to learn how to implement surrogate keys, please feel free to dive straight into the how to video.

Why use a surrogate key?

We will start by defining a surrogate key as a single field, integer based, non-meaningful key. A surrogate key is a brilliant way to handle some of the challenges you will face in data warehousing. Some of these are

  • Handling slowly changing dimensions (SCD type 2)

It is hard to imagine an SCD type 2 dimension without surrogate keys. In a type 2 dimension, multiple instances of a dimension member will exist over time, and a surrogate key is the way to identify a specific instance of a member in this context.

  • Simplification of a composite business key (wide, multiple fields)

The simplification of dealing with a single field vs. a composite key to identify a single record in a table should be obvious.

  • Shielding from changes in meaningful business keys

Using surrogate keys is an effective way to shield the data model of the data warehouse against changes in the source systems. One cause of such changes could be the replacement of a line of business system. With a surrogate key based data warehouse, the entire data model would be unaffected by such a change.

  • Join optimization

If you are able to base every join between fact and dimension tables on a single integer value, you will see a significant query performance gain over equivalent joins made on multiple fields, especially if they are non-integer types.

Performance Impact

Following the definition of a surrogate key as a single field, integer based, non-meaningful key that will improve join performance, it will outperform a business key implemented as a non-numeric data type or, even worse, a composite key as found in many source systems.

We will not go into the discussion and the many details of SQL Server performance (heap tables vs. clustered tables, indexes, pages etc.), as you can find entire blogs and web sites dedicated to this. For now we will simply conclude that it is considered best practice to have the smallest possible, single field, integer based primary key. And this is exactly what the DW_Id field in timeXtender is. It is an integer based field, used as the single field in the physical primary key on every table in timeXtender. It is also an identity field, meaning that it is assigned a non-meaningful, incremental value automatically when a record is inserted into a table. And there we have it: a built-in, well performing surrogate key on every table.
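
If you were to create the equivalent pattern manually, it would look something like this sketch (not the exact DDL timeXtender generates):

    -- A single field, integer based, non-meaningful (surrogate) key on a table
    CREATE TABLE dbo.Customer (
        DW_Id        int IDENTITY(1,1) NOT NULL,   -- assigned automatically on insert
        CustomerNo   varchar(20)  NOT NULL,        -- business (natural) key from the source
        CustomerName varchar(100) NOT NULL,
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED (DW_Id)
    );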

When not to use surrogate keys

I am aware that even suggesting that there can be cases where a surrogate key is both unnecessary and unwanted is an open invitation to criticism. Nevertheless I am going to take the risk and give you a couple of examples

  • The maintainability factor

What is sometimes overlooked or ignored is that many data warehouses these days are built and maintained by people who are not full time data modelers or data warehouse developers. If the person(s) maintaining the data warehouse feel at home in the line of business systems feeding the data warehouse, it might be safer and more efficient to keep the business keys.

  • Performance and storage is the main driver

Although it is most often true that implementing surrogate keys will increase join performance and reduce storage, it might not always be the case, or the gain might be insignificant to the overall performance or storage. If you consider a date dimension, for example, both the bigint and datetime data types have a storage size of 8 bytes. If all your joins can be based on a single datetime field, what is the value of an integer based surrogate key? And if you consider the price you have to pay when performing the lookup required to replace the datetime based key in your fact tables with the surrogate key, what is the effect on overall performance?

  • The warehouse is only used to feed an OLAP database

If all you really use your data warehouse for is to feed an OLAP database, it might not be worth implementing surrogate keys. Best practice documents from Microsoft indicate that, to improve processing time and reduce dimension size, you should consider using numeric keys for dimension keys when the number of members of a single attribute exceeds 1 million. I find it safe to assume that most businesses have dimensions way smaller than this.

And finally, the how to video

Please refer to the following video, which shows you both how to implement this in the data warehouse and how to use it in an OLAP dimension.

Read Full Post »

Before getting into the how-to details of the implementation, let’s start by defining what early arriving facts are. An early arriving fact occurs when a fact record arrives at the data warehouse before its full dimension context. Or in other words, when we look up the surrogate key in our dimension table using the fact table’s natural key, the dimension member doesn’t exist yet.

In many cases when we load our data warehouse, we assume that dimensions and facts arrive at the same time, or at least that dimension members will exist for the fact records being loaded. Although there are many reasons why we have to be able to handle early arriving facts, multiple source systems and/or different load frequencies seem to be the dominating factors. We might, for example, be loading our dimensions using a full load strategy on one schedule, while using an incremental load strategy with a close to real-time frequency for our fact table.

For the sake of this post, consider a company selling products to customers, using invoices to capture the sales. We can illustrate this with the following simplified star schema.

How to enable early arriving facts in timeXtender

In the following we will implement early arriving fact handling for the above scenario. Please be aware of the following default behaviour in timeXtender

  • Unless early arriving fact handling has been explicitly added to a fact table, a missing dimension key is assumed to be an error
  • The relationship settings determine whether an error is of severity warning or error (foreign key behavior)
  • Relationship settings are inherited from the project default, when a relationship is created

Steps 1-5 below are for a single dimension and should be repeated for every dimension.

 Step 1 – Open the Add Related Records Dialog

On the dimension table, right-click and select the Add Related Records context menu option. This will open the dialog shown below, in which the remaining configuration will take place.

Step 2 – Table level parameters

  • In the Create Records from table drop down list, select the source table for the records to be created in this dimension table.
  • Set the Record Condition to Not Exists.
  • Set the Data Destination Table to the Raw instance of the table.

These parameters instruct timeXtender to create distinct keys in the raw instance of the dimension table when the dimension key from the fact table doesn’t exist in the dimension table.

Step 3 – Field mapping

  • Map the natural key fields from the fact table to the corresponding fields in the dimension table
  • Map any additional attributes from the fact table to the corresponding fields on the dimension table
  • Define default values for remaining fields on the dimension table

Step 4 – Define the match condition

In the Conditions section, define the joins between the fact table and the dimension table. Together with the record condition, these determine whether there is a match and what to do about it. The joins should be the same as those used when we perform the lookup of our surrogate key in the dimension table. In the example above the join is made on a single field, the customer key.

Step 5 – Confirm

Click the OK button to confirm and close the dialog.

As can be seen from the screen shot below, a new group, Table Transformations, appears in the timeXtender UI under the receiving dimension table.
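
Logically, the Add Related Records setup above corresponds to something like the following. Table names, field names and the default value are illustrative; this is a sketch of the logic, not the code timeXtender generates.

    -- Create the missing (early arriving) customer keys in the raw instance of the dimension
    INSERT INTO dbo.Customer_R (CustomerNo, CustomerName)
    SELECT DISTINCT i.CustomerNo,
           'Unknown - awaiting dimension load'      -- illustrative default value for remaining fields
    FROM   dbo.Invoices_R AS i
    WHERE  NOT EXISTS (SELECT 1 FROM dbo.Customer_R AS c
                       WHERE c.CustomerNo = i.CustomerNo);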

Dimension processing

Since we created the missing dimension keys in the raw table instance, it is up to the SCD type 1 and 2 settings for the dimension table to determine what to do with those extra records. In most cases we simply replace the dummy/default/missing attributes on the dimension member with the true values coming from the source (type 1 updates). As an alternative, we can choose to create the missing dimension keys in the valid table instance, but then we assume full responsibility for the insert, e.g. making sure there are no violations of the primary key.

References
Kimball talks about early arriving facts in his 2004 design tips.

Read Full Post »

