
Archive for the ‘Dimensional Modeling’ Category

My last post touched the sensitive topic of Data Vault vs. Star Schemas and, as expected, it was like putting my hand into a hornet's nest. It is clearly a topic that stirs up strong emotions. The traffic on my blog went crazy (probably aided by Dan Linstedt’s post on his blog) and I am expecting a call from WordPress, trying to sell me an upgrade to a dedicated server :-)

As one of the comments on the last post stated, there seems to be agreement that star schemas are best for reporting. If the Data Vault is to be considered the backend Data Warehouse, I guess we would then be discussing the work done between the data sources and the dimensional model.

A couple of the key words from the discussion were repeatable, automatable and generatable, along with words like auditability, flexibility and adaptability. I do not know many organizations that would not like to have all of the above in everything they do.

Assumption: A dimensional model can be automated too

I will try to dig a little deeper into understanding Data Vault and its added value over a pure Kimball Star Schema dimensional model. The assumption I am making here is that Data Vault is not seen as the only way to automate the generation of a dimensional model.

At timeXtender we promote an agile approach (methodology) and automation to speed up implementation. The latter is what our software does. It is a metadata-driven design and build tool, centered around a dimensional model and including staging, data cleansing and optionally also cubes and much more.

It seems that there is also agreement on automation and agile, which also aim at delivering fast and frequently to the business.

Data Vault: You never have to go back to the source

This is a very frequent statement from the Data Vault community.

I am going to assume that behind that statement are a number of assumptions: that the entire source model was discovered in the first attempt, that no new sources have been added, that the sources have not been modified since last time and, perhaps more importantly, that a change in the reporting model is not related to a process change in the source system(s).

Data Vault: You can change your mind without impact to the data model

This is also sometimes presented as Data Vault being the only forgiving data model, one that allows you to make mistakes and change your mind. One of the common examples used for this is a change of keys in a Kimball type 2 history-enabled dimension. If you have such a dimension based on a Data Vault, you can change the keys and simply recreate the dimension with a historically correct result.

I completely agree that this is one of the challenges in a Kimball model, as it is difficult to recreate history, especially because the Data Warehouse is often the only place this history is tracked. This is probably also the one scenario where I see Data Vault modeling adding true value to dimensional modeling.

But again, I am assuming that some of the same assumptions as above lie behind it. I have also dealt with changes like that countless times, both in my time as a customer and from the vendor side of the ecosystem. In my experience, a change like that almost always happens concurrently with a change of a business process that involves the way the source is used, and very often also with a change or extension of the source system, such as adding new fields or tables.

Where do the hours go in Data Vault?

What I am trying to get at with the above is that, in my humble opinion, there are some major and understated assumptions behind some of the key benefits of the Data Vault model.

So what I am wondering about is this: With the level of automation and the ease with which the “round trips between the Data Vault and the star schema” are made, where do the hours go in Data Vault?

A word on comments

I really appreciate the discussion my last post caused, and I have of course also read the posts that Dan has since published on his blog. Once again I welcome your thoughts and comments, but I would appreciate keeping it in a mature tone. If you cannot comment in a non-hostile, non-demeaning way, I will not waste my time responding.


If you follow the Kimball dimensional modeling approach, most dimensions will be star dimensions and some might be snowflaked. Sometimes a business process will not fit this model, in which each fact is associated with a single member of each dimension. This is when many to many dimensional modeling can be applied.

The example for this post will be the joint bank account scenario (accounts with multiple owners). I know this has been used over and over again, but hey, it gives you the chance to make a direct comparison between the implementation in timeXtender and the one in BIDS.

The joint accounts example

I will look at Jane and John, a couple with both personal and joint accounts:

  • Jane has a personal checking account (balance 1000)
  • John has a personal checking account (balance 1000)
  • Jane and John have a joint savings account (balance 1000)

In this example I want the balance of the joint savings account to appear under both Jane and John. The main purpose here is to show how the many to many model affects the aggregation over the dimensions (the total is not equal to simply the sum of all members).

The model can be further refined by e.g. adding an ownership calculation to split the balance, but this will not be the focus of this post.

The following Excel view of the finished cube shows that Jane and John have a balance of 1000 each on their personal checking accounts, and both have access to the 1000 on the joint savings account. At the same time, the total balance of all accounts is still only 3000.

 

M2M Excel Pivot

 

The data model

The data model for the joint accounts can be seen below. In a real-life scenario I would use surrogate keys, but to keep the focus on solving the main problem here, I will be using natural keys.

Examining the structure, it appears that there is no direct relationship between the fact table and the customer dimension table (fact Account Balance and dim Customers). This relationship can only be established using the many to many join table (factLess CustomerAccount).

 

M2M DW Tables
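
To make the model concrete, here is a minimal T-SQL sketch of the tables and sample data, together with queries that mimic the aggregation behavior the many to many relationship gives you in the cube. The table and column names are my own approximations of the ones in the screenshots, so treat this as an illustration rather than the exact model.

    -- Sketch of the tables from the data model, using natural keys as described above
    CREATE TABLE dimAccount (
        AccountNo   int         NOT NULL PRIMARY KEY,
        AccountName varchar(50) NOT NULL
    );

    CREATE TABLE dimCustomer (
        CustomerNo   int         NOT NULL PRIMARY KEY,
        CustomerName varchar(50) NOT NULL
    );

    -- The factless bridge table resolving the many to many relationship
    CREATE TABLE factlessCustomerAccount (
        CustomerNo int NOT NULL REFERENCES dimCustomer (CustomerNo),
        AccountNo  int NOT NULL REFERENCES dimAccount (AccountNo),
        PRIMARY KEY (CustomerNo, AccountNo)
    );

    CREATE TABLE factAccountBalance (
        AccountNo int   NOT NULL REFERENCES dimAccount (AccountNo),
        Balance   money NOT NULL
    );

    -- Two personal accounts and one joint account, 1000 on each
    INSERT INTO dimCustomer VALUES (1, 'Jane'), (2, 'John');
    INSERT INTO dimAccount  VALUES (10, 'Jane checking'), (20, 'John checking'), (30, 'Joint savings');
    INSERT INTO factlessCustomerAccount VALUES (1, 10), (2, 20), (1, 30), (2, 30);
    INSERT INTO factAccountBalance VALUES (10, 1000), (20, 1000), (30, 1000);

    -- Per customer, the joint account is counted under both Jane and John (2000 each)
    SELECT c.CustomerName, SUM(f.Balance) AS Balance
    FROM   factAccountBalance f
    JOIN   factlessCustomerAccount ca ON ca.AccountNo = f.AccountNo
    JOIN   dimCustomer c ON c.CustomerNo = ca.CustomerNo
    GROUP  BY c.CustomerName;

    -- The grand total is taken from the fact table alone to avoid double counting (3000)
    SELECT SUM(Balance) AS TotalBalance
    FROM   factAccountBalance;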

The OLAP model

I trust that you already know the basics of how to create dimensions and cubes in timeXtender, thus I will only cover the specifics needed to add the many to many relationship.

The OLAP model will consist of the following

  • Account dimension
  • Customer dimension
  • Account Balance cube with fact tables (measure groups)
    • fact Account Balance (for balance measure)
    • factLess CustomerAccount (intermediate fact w/dummy measure)

Prior to adding the many to many relationship, the model will look like this in timeXtender


 

Please notice that the Customer dimension is not related to the fact Account Balance fact table.

M2M Dimension Relations

 

Adding the many to many relationship

The steps to add the many to many relationship are

  1. Right-click the Customer dimension under cube dimensions. On the context menu, select the Add Many To Many Relation option
  2. In the dialog that appears, select the fact and factless tables as shown below

M2M Many 2 Many Join

 

The many to many relationship now appears under the cube dimension

 

M2M OLAP Model Complete


Agile and rapid are some of the most frequently used buzzwords in the data warehousing community these days, so I thought it was about time to put them in the context of timeXtender.

Those who know our tool and company are well aware that timeXtender was created from day one for agile implementation of ETL and data warehouses and that it is a rapid development tool.

Just to set expectations right, this and the following posts will not attempt to cover any specific agile methodology like Scrum in full, just as they will not be a complete coverage of the Kimball dimensional modeling approach.

Agile methodologies

In software development there are several iterative and incremental development methodologies commonly referred to as agile software development methodologies, including Scrum and Extreme Programming.

In agile methodologies, teams are usually cross-functional and will most often include a customer/business representative. There is no fixed team size, but teams are generally quite small, typically 3-5 people.

User stories are one of the artifacts commonly used in agile development, and one that I will emphasize for use with agile dimensional modeling.

Dimensional modeling

Kimball encourages using the four-step dimensional design process, in which four design decisions are considered in a particular order. In short, the steps are

  1. Select the business process to model
  2. Define the grain of the fact table
  3. Define the dimensions for the fact
  4. Define the numeric measures of the fact

Consistency in the model, using conformed dimensions and facts, is key to a scalable data warehouse. The star schema(s) resulting from the dimensional design process is usually considered the smallest deliverable and often represents several days or weeks of work for developers to build.
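
To make the four steps a little more tangible, here is a small hypothetical example for a retail sales process, expressed as the fact table the process would result in. The names and data types are purely illustrative.

    -- 1. Business process: retail sales
    -- 2. Grain: one row per order line
    -- 3. Dimensions: date, product and customer (conformed, reusable across facts)
    -- 4. Measures: quantity and sales amount
    CREATE TABLE factRetailSales (
        DW_DateKey     int           NOT NULL,   -- surrogate key to the date dimension
        DW_ProductKey  int           NOT NULL,   -- surrogate key to the product dimension
        DW_CustomerKey int           NOT NULL,   -- surrogate key to the customer dimension
        Quantity       int           NOT NULL,
        SalesAmount    decimal(18,2) NOT NULL
    );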

Agile + Dimensional Modeling

The ever changing requirements for data warehouses today make it natural to adopt elements from agile methodologies in order to keep up. If you are not able to deliver in very fast, short iterations, chances are that your dimensional models will be outdated before they get implemented.

I do not remember a single project where the customer did not have a few requirements that were key to the business and represented urgent needs. It usually does not take much analysis and discussion to agree that these early identifiable requirements should be implemented first.

It is about getting actionable information as fast as possible, mainly because it serves an urgent business need, but in today's economy projects are also under pressure to deliver return on investment faster than ever before.

Agile + Iterative

Agile and iterative really go hand in hand, as agile development promotes a set of techniques using iterative development for rapid delivery. As agile traditionally focuses only on the development part of the data warehouse lifecycle, I suggest focusing more on being agile than on using agile. This includes being agile and using iterative techniques in the requirements specification phase as well.

At a high level the agile approach can be illustrated by the following Iterative Lifecycle

In the next few posts, I will introduce user stories and prototyping around dimensional modeling as a way to remain true to a scalable dimensional model, while using a deliver fast, deliver often approach.


In this post I will show you how to use prototyping to demonstrate business requirements and get rapid and actionable feedback. How you got the business requirements to demonstrate is not important for the purpose of this post, but assuming that you have read some of my earlier blog posts, it should be clear that I recommend taking an agile approach.

The goals for the prototype

  • Demonstrate business requirements
  • Deliver fast, using minimum resources
  • Minimize rework

I am going to build a prototype using the dimensional model approach as per best practice. Although not strictly necessary for the purpose of a prototype, I am even using surrogate keys, as I will most likely reuse the prototype for the permanent implementation after one or more very short iterations.

The four easy steps

  • Build the data warehouse star schema
  • Enter sample data in the model
  • Build the dimensional model in OLAP
  • Build a test front end in Excel using Pivot Tables

Step 1 – Build the data warehouse star schema

In a dimensional model, the business requirements can be illustrated as this star schema (also known as a star join schema). For those unfamiliar with dimensional modeling, the yellow table is called a fact table. It is the center of a star schema, represents the numeric measurements of an event and has keys (links) to the dimensions that describe the event. The blue tables are called dimension tables. In this example the event is a sale, described in the context of the customer, the product sold and the point in time at which the sale took place.

Remembering that the purpose is to demonstrate the business requirements as fast as possible and with a minimum of rework, I will implement the dimensional model above with manually entered sample data in the data warehouse database. When I build the dimensional model in the timeXtender meta layer it will look like this. Please note the red fields at the bottom of the fact table. These fields are lookup fields that are used to translate the natural key joins between my dimension and fact tables into surrogate based joins.
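
For readers without the screenshots, the following T-SQL is a rough sketch of what the physical star schema and the surrogate key lookup amount to, shown for the customer dimension only (product and date follow the same pattern). The table and column names are assumptions on my part; timeXtender generates and maintains the real tables and lookups for you.

    -- A dimension carries both a surrogate key (DW_Id, an identity column) and the natural key
    CREATE TABLE dimCustomer (
        DW_Id        int IDENTITY (1, 1) NOT NULL PRIMARY KEY,
        CustomerNo   int                 NOT NULL,   -- natural (business) key from the source
        CustomerName varchar(50)         NOT NULL
    );

    -- The fact table receives natural keys from the source; the surrogate key columns
    -- (the red lookup fields mentioned above) are populated by a lookup against the dimension
    CREATE TABLE factSales (
        CustomerNo     int           NOT NULL,   -- natural key as delivered by the source
        DW_CustomerKey int           NULL,       -- lookup field, filled in below
        SalesAmount    decimal(18,2) NOT NULL
    );

    -- The lookup that turns the natural key join into a surrogate based join
    UPDATE f
    SET    f.DW_CustomerKey = d.DW_Id
    FROM   factSales AS f
    JOIN   dimCustomer AS d ON d.CustomerNo = f.CustomerNo;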

Step 2 – Enter sample data in the model

Using the custom data feature in timeXtender, I will enter just enough data to be able to demonstrate the model. As a side note, you would be amazed at the difference it makes to your business stakeholders if you use recognizable data rather than more or less random values.

At this point in time, the model, including the manually entered demo data, only exists in the timeXtender meta model in memory. If I use the deploy and execute features at this point, timeXtender will generate all necessary SQL code in the physical data warehouse database, load it with demo data and save the project definition to the timeXtender project repository database. The processed fact table will look like this.

  • If you do not need a multidimensional representation for your data (an OLAP cube), the data are at this point already accessible from your front end of choice.
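
Continuing the sketch from step 1, entering a few recognizable demo rows and querying the star directly could look roughly like this. In timeXtender the data entry happens in the custom data dialog and the SQL is generated for you; the names and values below are still purely illustrative.

    -- Recognizable demo data beats random values when stakeholders review the prototype
    INSERT INTO dimCustomer (CustomerNo, CustomerName)
    VALUES (101, 'Contoso Ltd'), (102, 'Fabrikam Inc');

    INSERT INTO factSales (CustomerNo, SalesAmount)
    VALUES (101, 1250.00), (102, 980.00);

    -- Populate the surrogate lookup field, then query the star directly from any front end
    UPDATE f SET f.DW_CustomerKey = d.DW_Id
    FROM factSales AS f JOIN dimCustomer AS d ON d.CustomerNo = f.CustomerNo;

    SELECT d.CustomerName, SUM(f.SalesAmount) AS SalesAmount
    FROM   factSales AS f
    JOIN   dimCustomer AS d ON d.DW_Id = f.DW_CustomerKey
    GROUP  BY d.CustomerName;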

Step 3 – Build the dimensional model in OLAP (Analysis Services)

I will use a drag and drop approach to create a very simple OLAP representation of the star schema. This will include standard dimensions, standard measures and will look like this in timeXtender

Step 4 – Build a test front end in Excel

I will use the most common of all BI front end tools to build a view on the data that I have just built. A Pivot Table is an excellent way to get a feel for the dimensions and measures of a cube.

What is next?

The Excel workbook I just created can now be delivered to my business stakeholders. They can then use this to verify the business requirements and provide instant feedback. The business stakeholders might come back with extensions or modifications to the dimensional model after seeing the first result. This might lead me down one or more additional iterations of the above steps, adjusting the model according to more detailed requirements. This is all fine and exactly why prototyping is such an efficient way for data warehouse developers and stakeholders to interact. The time I spend at this early point in the process will save me from late and costly rework.

Once the model has been verified I can continue to identify the data sources needed to load actual data into the dimensional model just built. I will most likely need to make minor changes to the dimensional model on the data warehouse, such as different data types for non-system fields. Except for these minor changes, the entire dimensional model including the cube can be reused and put into production after proper testing with real life data.

 

Video link

My colleague Thomas Lørup Duun has made a video on this topic available on YouTube. The video is actually only 15 minutes in real time, so the header in this post is a bit conservative :-)


In this post I will talk about surrogate keys and how to use them as dimension keys in a data warehouse. As one of the key concepts of data warehousing, it is often the center of heated discussions between “true believers” in the Kimball recommendations and proponents of what we can call a more pragmatic approach.

If you do not really care about all the nitty gritty details and just want to learn how to implement surrogate keys, please feel free to dive straight into the how to video.

Why use a surrogate key?

We will start by defining a surrogate key as a single-field, integer-based, non-meaningful key. A surrogate key is a brilliant way to handle some of the challenges you will face in data warehousing. Some of these are

  • Identification of a specific member instance in a type 2 slowly changing dimension

It is hard to imagine an SCD type 2 dimension without surrogate keys. In a type 2 dimension, multiple instances of a dimension member will exist over time. A surrogate key is the way to identify a specific instance of a member in this context.

  • Simplification of a composite business key (wide, multiple fields)

The simplification of dealing with a single field vs. a composite key to identify a single record in a table should be obvious.

  • Shielding from changes in meaningful business keys

Using surrogate keys is an effective way to shield the data model of the data warehouse against changes in the source system. One cause of this could be a replacement of a line of business system. With a surrogate based data warehouse, the entire data model would be unaffected by such a change.

  • Join optimization

If you are able to base every join between fact and dimension tables on a single integer value, you will see a significant query performance improvement over equivalent joins made on multiple fields, especially if they are non-integer types.

Performance Impact

Following the definition of a surrogate key as a single-field, integer-based, non-meaningful key, it will outperform a business key implemented as a non-numeric data type or, even worse, as a composite key, as is the case in many source systems.

We will not go into the discussion and the many details of SQL Server performance (heap tables vs. clustered tables, indexes, pages etc.), as you can find entire blogs and web sites dedicated to this. For now we will simply conclude that it is considered best practice to have the smallest possible, single-field, integer-based primary key. And this is exactly what the DW_Id field in timeXtender is. It is an integer-based field, used as the single field in the physical primary key on any table in timeXtender. It is also an identity field, meaning that it is assigned a non-meaningful, incremental value automatically when a record is inserted into a table. And there we have it: a built-in, well-performing surrogate key on any table.
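
As a rough illustration, here is a hand-written sketch of a table with a DW_Id-style identity surrogate key and a fact join on it. The table names are hypothetical; in timeXtender the DW_Id column and primary key are created automatically.

    -- Roughly what the built-in DW_Id surrogate key amounts to: an identity column
    -- used as the single-field physical primary key (the DDL is only shown to
    -- illustrate the concept, and the table names are made up)
    CREATE TABLE dimProduct (
        DW_Id       int IDENTITY (1, 1) NOT NULL
            CONSTRAINT PK_dimProduct PRIMARY KEY CLUSTERED,
        ProductCode varchar(20) NOT NULL,   -- meaningful business key from the source
        ProductName varchar(50) NOT NULL
    );

    CREATE TABLE factInventory (
        DW_ProductKey int NOT NULL,         -- one narrow integer instead of a wide business key
        Quantity      int NOT NULL
    );

    -- Every fact-to-dimension join becomes a single integer comparison
    SELECT p.ProductName, SUM(f.Quantity) AS Quantity
    FROM   factInventory AS f
    JOIN   dimProduct AS p ON p.DW_Id = f.DW_ProductKey
    GROUP  BY p.ProductName;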

When not to use surrogate keys

I am aware that even suggesting that there are cases where a surrogate key is both unnecessary and unwanted is an open invitation to criticism. Nevertheless I am going to take the risk and give you a couple of examples

  • The maintainability factor

What is sometimes overlooked or ignored is that many data warehouses these days are built and maintained by people who are not full time data modelers or data warehouse developers. If the person(s) maintaining the data warehouse feel at home in the line of business systems feeding the data warehouse, it might be safer and more efficient to keep the business keys.

  • Performance and storage are the main drivers

Although it is most often true that implementing surrogate keys will increase join performance and reduce storage, it might not always be the case, or the difference might be insignificant to the overall performance or storage. If you, for example, consider a date dimension, both the bigint and datetime data types have a storage size of 8 bytes. If all your joins can be based on a single datetime field, what is the value of an integer-based surrogate key? And if you consider the price you have to pay when performing the lookup required to replace the datetime-based key in your fact tables with the surrogate key, what is the effect on overall performance?

  • The warehouse is only used to feed an OLAP database

If all you really use your data warehouse for is to feed an OLAP database, it might not be worth implementing surrogate keys. Best practice documents from Microsoft indicate that, to improve processing time and reduce the dimension size, you should consider using numeric keys for dimension keys when the number of members of a single attribute exceeds 1 million. I find it safe to assume that most businesses will have dimensions way smaller than this.

And finally, the how to video

Please refer to the following video, showing you both how to implement this on the data warehouse and how to use it in an OLAP dimension.


Before getting into the how-to details of the implementation, let's start by defining what early arriving facts are. An early arriving fact is what happens when a fact record arrives at the data warehouse before its full dimension context. In other words, when we look up the surrogate key in our dimension table using the fact table's natural key, the dimension member does not exist yet.

In many cases when we load our data warehouse, we would assume that dimensions and facts arrive at the same time, or at least we would expect that dimension members exist for the fact records being loaded. Although the reasons why we have to be able to handle early arriving facts are many, multiple source systems and/or different load frequencies seem to be the dominating factors. We might, for example, be loading our dimensions using a full load strategy with one load schedule, while using an incremental load strategy for our fact table with a close to real-time frequency.

For the sake of this post, consider a company selling products to customers using invoices to capture the sale. We can illustrate using the following simplified star schema

How to enable early arriving facts in timeXtender

In the following we will implement early arriving fact handling for the above scenario. Please be aware of the following default behaviour in timeXtender

  • Unless early arriving fact handling has been explicitly added to a fact table, a missing dimension key is assumed to be an error
  • The relationship settings determine whether an error is of severity warning or error (foreign key behavior)
  • Relationship settings are inherited from the project default, when a relationship is created

Steps 1-5 shown below are for a single dimension and should be repeated for every dimension.

Step 1 – Open the Add Related Records Dialog

On the dimension table, right-click and select the Add Related Records context menu option. This will open the dialog shown below, in which the remaining configuration will take place.

Step 2 – Table level parameters

  • In the Create Records from table drop down list, select the source table for the records to be created in this dimension table.
  • Set the Record Condition to Not Exists.
  • Set the Data Destination Table to the Raw instance of the table.

These parameters instruct timeXtender to create distinct keys in the raw instance of the dimension table when the dimension key from the fact table does not exist in the dimension table.
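
timeXtender generates the actual code for this, but a rough, hand-written approximation of the pattern could look like the following. The table names are hypothetical, with the _R suffix standing in for the raw table instance and the customer/invoice tables matching the example star schema above.

    -- Hand-written approximation of the pattern: insert a placeholder row into the raw
    -- customer dimension for every customer key on the invoice fact that is not there yet
    INSERT INTO dbo.DimCustomer_R (CustomerNo, CustomerName)
    SELECT DISTINCT f.CustomerNo,
           'Unknown (early arriving)'        -- default value for the remaining fields
    FROM   dbo.FactInvoice_R AS f
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dbo.DimCustomer_R AS d
                       WHERE  d.CustomerNo = f.CustomerNo);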

Step 3 – Field mapping

  • Map the natural key fields from the fact table to the corresponding fields in the dimension table
  • Map any additional attributes from the fact table to the corresponding fields on the dimension table
  • Define default values for remaining fields on the dimension table

Step 4 – Define the match condition

In the Conditions section, define the joins between the fact table and the dimension table. These are used with the record condition to determine if there is a match or not and what to do about it. These joins should be the same as when we perform the lookup of our surrogate key in the dimension table. In the example above the join is made on a single field, the customer key.

Step 5 – Confirm

Click the OK button to confirm and close the dialog.

As can be seen from the screenshot below, a new group, Table Transformations, appears in the timeXtender UI under the receiving dimension table.

Dimension processing

Since we created our missing dimension keys in the raw table instance, it is up to the SCD type 1 and 2 settings for the dimension table to determine what to do with those extra records. In most cases we simply replace the dummy/default/missing attributes on the dimension with the true values coming from the source (type 1 updates). As an alternative we can choose to create the missing dimension keys in the valid table instance, but then we assume full responsibility for the insert, e.g. making sure there are no violations of the primary key.
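
As a minimal sketch of that type 1 behavior, the overwrite of the placeholder values conceptually amounts to something like this (hypothetical names again, with _V standing in for the valid table instance; the real work is done by the generated dimension processing code).

    -- Conceptually, a later load overwrites the placeholder attributes with the real
    -- values from the source (a type 1 update)
    UPDATE d
    SET    d.CustomerName = s.CustomerName
    FROM   dbo.DimCustomer_V AS d
    JOIN   dbo.SourceCustomer AS s ON s.CustomerNo = d.CustomerNo
    WHERE  d.CustomerName <> s.CustomerName;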

References
Kimball talks about early arriving facts in his 2004 design tips.


The ability to track historically accurate values for dimension attributes is a fundamental requirement in ETL and data warehousing. This is commonly referred to as slowly changing dimensions (SCD), of which the most common variants are type 1 and type 2. If you’re not already familiar with the concept of SCD, the books, articles and design tips from the Kimball Group are highly recommended.

In short, the techniques used in handling changes are

Type 1: Overwrite the attribute
Type 2: Add a new dimension row

In a pure type 1 scenario, you will only have one instance of a record, with the attributes being overwritten as changes are brought in from the source systems.

If you’re not already using surrogate keys for your dimensions, now would be a good time to make that change. Once you enable SCD type 2 attributes, multiple instances of the same logical record will exist over time, which makes the business key unusable as the key for dimension members.

How to (manually) implement SCD on SQL Server

There are countless ways of implementing SCD on SQL Server, including T-SQL stored procedures, custom components and built-in SSIS components. The predominant implementations are both SSIS based, using either the built-in Slowly Changing Dimension Transformation or the Kimball Method SCD component. The latter is a huge improvement over the former, especially when it comes to performance. This post will not discuss how to configure SCD using these components, but as the documentation and videos at the links above will show, it is a bit complex and requires some experience in both SSIS and T-SQL.
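
To give a feel for what a hand-rolled T-SQL approach involves, here is a minimal sketch of type 2 handling for a single attribute. The table and column names are assumptions, the surrogate key is assumed to be an identity column that is left out of the insert, and a production implementation (or the code a tool generates for you) would handle much more, such as type 1 attributes, deletes and initial loads.

    -- Minimal type 2 handling for a single attribute (City), assuming a dimension with an
    -- identity surrogate key plus CustomerNo (business key), ValidFrom/ValidTo and IsCurrent
    DECLARE @Now datetime = GETDATE();

    -- 1. Expire the current version of members where a type 2 attribute has changed
    UPDATE d
    SET    d.ValidTo = @Now, d.IsCurrent = 0
    FROM   dbo.DimCustomer AS d
    JOIN   dbo.StageCustomer AS s ON s.CustomerNo = d.CustomerNo
    WHERE  d.IsCurrent = 1
      AND  d.City <> s.City;

    -- 2. Insert a new current version for changed members (and for brand new members)
    INSERT INTO dbo.DimCustomer (CustomerNo, CustomerName, City, ValidFrom, ValidTo, IsCurrent)
    SELECT s.CustomerNo, s.CustomerName, s.City, @Now, '9999-12-31', 1
    FROM   dbo.StageCustomer AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM   dbo.DimCustomer AS d
                       WHERE  d.CustomerNo = s.CustomerNo
                         AND  d.IsCurrent = 1);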

How to enable SCD type 1 and 2 in timeXtender

The default behavior for any table in timeXtender is what we call snapshot load. The table is truncated and fully loaded from the source(s) at each execution, thus no attempt is made to track historical values. But don’t worry, it will take you less than 5 minutes to enable type 1 and/or type 2 changes to be tracked.

To handle type 1 and 2 changes do the following

  1. Enable Slowly Changing Dimensions under Advanced Settings for the table
  2. Define the business (natural) key
  3. Choose type 1 attributes
  4. Choose type 2 attributes
SCD Table Settings

The configuration of the natural key and SCD attributes is done in a single, easy to use dialog

SCD Attribute Settings
And that’s all you need to do to enable SCD type 1 and 2 on a table in timeXtender. Hit the deploy button and the code to handle the actual data will be generated on SQL Server.

