Feeds:
Posts
Comments

Archive for the ‘Video’ Category

In this post I will show you how to use prototyping to demonstrate business requirements and get rapid and actionable feedback. How you got the business requirements to demonstrate, is not important for the purpose of this post, but assuming that you have read some of my earlier blog posts, it should be clear that I recommend taking an agile approach.

The goals for the prototype

  • Demonstrate business requirements
  • Deliver fast, using minimum resources
  • Minimize rework

I am going to build a prototype using the dimensional model approach as per best practice. Although not strictly necessary for the purpose of a prototype, I am even using surrogate keys, as I will most likely reuse the prototype for the permanent implementation after one or more very short iterations.

The four easy steps

  • Build the data warehouse star schema
  • Enter sample data in the model
  • Build the dimensional model in OLAP
  • Build a test front end in Excel using Pivot Tables

Step 1 – Build the data warehouse star schema

In a dimensional model, the business requirements can be illustrated as this star schema (also known as a star join schema). For those unfamiliar with dimensional modeling, the yellow table is called a fact table. It is the center of a star schema and represents the numeric measurements of an event and has keys (links) to the dimensions that describe the event. The blue tables are called dimension tables. In this example we can describe an event: A sale and describe it in the context of customer, product sold and the point in time at which the event took place.

Remembering that the purpose is to demonstrate the business requirements as fast as possible and with a minimum of rework, I will implement the dimensional model above with manually entered sample data in the data warehouse database. When I build the dimensional model in the timeXtender meta layer it will look like this. Please note the red fields at the bottom of the fact table. These fields are lookup fields that are used to translate the natural key joins between my dimension and fact tables into surrogate based joins.

Step 2 – Enter sample data in the model

Using the custom data feature in timeXtender, I will enter just enough data to be able to demonstrate the model. As a side note, you would be amazed at the difference it makes to your business stakeholders, if you use recognizable data compared to just more or less random values.

At this point in time, the model including the manually entered demo data only exists in the timeXtender metal model in memory. If I use the deploy and execute features at this point, timeXtender wll generate all necessary SQL code in the physical data warehouse database, load it with demo data and save the project definition to the timeXtender project repository database. The processed fact table will look like this.

  • If you do not need a multidimensional representation for your data (an OLAP cube), the data are at this point already accessible from your front end of choice.

Step 3 – Build the dimensional model in OLAP (Analysis Services)

I will use a drag and drop approach to create a very simple OLAP representation of the star schema. This will include standard dimensions, standard measures and will look like this in timeXtender

Step 4 – Build a test front end in Excel

I will use the most common of all BI front end tools to build a view on the data that I have just built. A Pivot Table is an excellent way to get a feel of the dimensions and measures of a cube.

What is next?

The Excel workbook I just created can now be delivered to my business stakeholders. They can then use this to verify the business requirements and provide instant feedback. The business stakeholders might come back with extensions or modifications to the dimensional model after seeing the first result. This might lead me down one or more additional iterations of the above steps, adjusting the model according to more detailed requirements. This is all fine and exactly why prototyping is such an efficient way to interact between data warehouse developers and stakeholders. The time I spend at this early point in the process will save me from late and costly rework.

Once the model has been verified I can continue to identify data sources needed to load actual data into the dimensional model just built. I will most likely need to make minor changes to the dimensional model on the data warehouse, such as different data types for non system fields. Except for these minor changes, the entire dimensional model including the cube can be reused and put into production after proper testing with real life data.

 

Video link

My colleague Thomas Lørup Duun has made a video on this topic available on YouTube. The video is actually only 15 minutes in real time, so the header in this post is a bit conservative :-)

Read Full Post »

In this post I will talk about surrogate keys and how to use them for dimension keys in a data warehouse. As one of the key concepts of data warehousing, it is often the center of heated discussions between “true believers” of the Kimball recommendations and what we can call a more pragmatic approach.

If you do not really care about all the nitty gritty details and just want to learn how to implement surrogate keys, please feel free to dive straight into the how to video.

Why use a surrogate key?

We will start by defining a surrogate key as a single field, integer based non-meaningful key. A surrogate key is a brilliant way to handle some of the challenges you will face in data warehousing. Some of these are

It is hard to imagine a SCD type 2 dimension without surrogate keys. In a type 2 dimension, multiple instances of a dimension member will exist over time. A surrogate key is the way to identify a specific instance of a member in this context.

  • Simplification of a composite business key (wide, multiple fields)

The simplification of dealing with a single field vs. a composite key to identify a single record in a table should be obvious.

  • Shielding from changes in meaningful business keys

Using surrogate keys is an effective way to shield the data model of the data warehouse against changes in the source system. One cause of this could be a replacement of a line of business system. With a surrogate based data warehouse, the entire data model would be unaffected by such a change.

  • Join optimization

If you are able to base every join between fact and dimension tables on a single integer value, you will experience a significant query performance over equivalent joins made on multiple fields, especially if they’re non integer types.

Performance Impact

Following the definition of a surrogate key as a single field, integer based non-meaningful key that will improve join performance, it will outperform a business key implemented as either non-numeric data types or even worse, composite keys as they are implemented in many source systems.

We will not go into the discussion and many details of SQL Server performance (heap tables vs. clustered tables, indexes, pages etc.), as you can find entire blogs and web sites dedicates to this. For now we will simply conclude, that it’s considered best practice to have the smallest possible, single field, integer based primary key. And this is exactly what the DW_Id field in timeXtender is. It is a integer based field, used as the single field in the physical primary key on any table in timeXtender. It is also an identity field, meaning that is is assigned a non-meaningful, incremental value automatically when a record is being inserted in a table. And there we have it, a built-in well performing surrogate key on any table.

When not to use surrogate keys

I am aware that even suggesting that there can be cases, when a surrogate key is both unnecessary and unwanted is an open invitation to criticism. Nevertheless I’m going to take the risk and give you a couple of examples

  • The maintainability factor

What is sometimes overlooked or ignored, is that many data warehouses these days are built- and maintained by people who are not full time data modelers or data warehouse developers. If the person(s) maintaing the data warehouse feel at home in the line of business systems feeding the data warehouse, it might be safer and more efficient to keep the business keys.

  • Performance and storage is the main driver

Although it is most often true, that implementing surrogate keys will increase join performance and reduce storage, it might not always be the case. Or it might be insignificant to the overall performance or storage. If you i.e. consider a date dimension, both bigint and datetime data types has a storage size of 8 bytes. If all your joins can be based on a single datetime based field, what is the value of the integer based surrogate keys? And if you consider the price you have to pay, when performing the lookup required to replace the datetime based key in your fact tables with the surrogate key, what is the affect on overall performance?

  • The warehouse is only used to feed a OLAP database

If all you really use your data warehouse for is to feed a OLAP database, it might not be worth implementing surrogate keys. Best practice documents from Microsoft indicates, that to improve processing time and reduce the dimension size, you should consider using numeric keys for dimension keys when the number of members of a single attribute exceeds 1 million. I find it safe to assume that most businesses will have dimensions way smaller than this.

And finally, the how to video

Please refer to the following video, showing you both how to implement this on the data warehouse and how to use it in a OLAP dimension.

Read Full Post »

%d bloggers like this: