My last post touched the sensitive topic of Data Vault vs. Star Schemas and, as expected, it was like sticking my hands into a hornets' nest. It is clearly a topic that releases some strong emotions. The traffic on my blog went crazy (probably aided by Dan Linstedt's post on his blog) and I am expecting a call from WordPress, trying to sell me an upgrade to a dedicated server :-)
As one of the comments on the last post stated, there seems to be agreement that star schemas are best for reporting. If the Data Vault is to be considered as the backend Data Warehouse, I guess we would then be discussing the work done in between the data sources and the dimensional model.
A couple of the key words from the discussion were repeatable, automatable and generatable, along with words like auditable, flexibility and adaptability. I do not know many organizations that would not like to have all of the above from everything they do.
Assumption: A dimensional model can be automated too
I will try to dig a little deeper into understanding Data Vault and the value it adds over a pure Kimball star schema dimensional model. The assumption I am making here is that Data Vault is not seen as the only way to automate the generation of a dimensional model.
At timeXtender we promote an agile approach (methodology) and automation to speed up implementation. The latter is what our software does. It is a metadata-driven design and build tool, centered around a dimensional model including staging, data cleansing and optionally also cubes and much more.
It seems that there is also agreement on automation and agile, which also aims at delivering fast and frequent to the business.
Data Vault: You never have to go back to the source
This is a very frequent statement from the Data Vault community.
I am going to assume that behind that statement are a number of assumptions: that the entire source model was discovered in the first attempt, that it does not include any new sources, that the sources have not been modified since last time and, perhaps more importantly, that a change in the reporting model is not related to a process change in the source system(s).
Data Vault: You can change your mind without impact on the data model
This is also sometimes presented as: Data Vault is the only forgiving data model, the one that allows you to make mistakes and change your mind. One of the common examples used for this is a change of keys in a Kimball Type 2 history-enabled dimension. If you have such a dimension based on a Data Vault, you can change the keys and simply recreate the dimension with a historically correct result.
I completely agree that this is one of the challenges in a Kimball model, as it is difficult to recreate history. Especially because the Data Warehouse is often the only place this history is tracked. This is probably also the one scenario where I see Data Vault modeling adding true value to dimensional modeling.
But again I am assuming that there are some of the same assumptions behind it as above. I have also dealt with changes like that countless times, both in my time as a customer and from the vendor side of the ecosystem. In my experience, a change like that almost always happens concurrently with a change of a business process that involves the way the source is used, and very often also with a change or extension of the source system, such as adding new fields or tables.
Where do the hours go in Data Vault?
What I am trying to get at with the above is that, in my humble opinion, there are some major and understated assumptions behind some of the key benefits of the Data Vault model.
So what I am wondering about is this: with the level of automation and the ease with which “round trips between the Data Vault and the star schema” go, where do the hours go in Data Vault?
A word on comments
I really appreciate the discussion my last post caused, and have of course also read the posts that Dan published on his blog afterwards. Once again I welcome your thoughts and comments, but I would appreciate keeping it in a mature tone. If you cannot comment in a non-hostile, non-demeaning way, I will not waste my time responding.
Hi Thomas,
To answer your question with respect to time spent: in DV a lot of time (compared to other methods) is spent on selecting the business keys. They form the heart of the DV.
What are people in the business using to identify an object like a person, invoice, order etc.?
Is the business key used in all relevant systems? Will new systems also use this business key?
This will make it very easy to integrate information from different systems including old systems with history. An example …
Information about an employee is stored in several locations: Active Directory, payroll, timesheets etc. If you are able to uniquely identify the business key you will typically end up with one hub table, hubEmployee, and three satellite tables: satEmployeeADDetails, satEmployeePayrollDetails and satEmployeeTimeSheetDetails.
Now you are able to easily integrate this information by joining these tables. And when you change your Payroll system you’ll add a new satellite table and add this one to the join.
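A minimal sketch of that join, just to make it concrete (column names are illustrative, not taken from an actual model):

-- Illustrative only: reconstruct an integrated view of an employee by joining the hub to its satellites
SELECT h.hubEmployeeKey,
       ad.ADDetails,
       pr.PayrollDetails,
       ts.TimeSheetDetails
FROM hubEmployee h
LEFT JOIN satEmployeeADDetails ad ON ad.hubEmployeeKey = h.hubEmployeeKey
LEFT JOIN satEmployeePayrollDetails pr ON pr.hubEmployeeKey = h.hubEmployeeKey
LEFT JOIN satEmployeeTimeSheetDetails ts ON ts.hubEmployeeKey = h.hubEmployeeKey

In a real Data Vault the satellites also carry load dates, so the join would additionally select the current (or point-in-time) satellite rows.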
I subscribe to the response made by Marco. Finding the right business keys that are hopefully unique across the enterprise, is where your time really goes. Creating the links (associations) between these business keys is the second part, especially when trying to find the “unit of work”.
When business keys are not unique across the enterprise for the same hub, you either end up having two or more hubs (when generating things from the source systems) connected with a “same as” link. Or, just one hub, with two relations to a “same as” link (I prefer this).
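As a rough sketch of the second option (table and column names are purely illustrative, not canonical DV):

-- One hub holding the business keys from all sources, plus a "same as" link with two relationships back to it
CREATE TABLE hubCustomer (
    hubCustomerKey int NOT NULL PRIMARY KEY,
    BusinessKey nvarchar(128) NOT NULL,
    RecordSource nvarchar(50) NOT NULL,
    LoadDate datetime NOT NULL
)

CREATE TABLE linkSameAsCustomer (
    linkSameAsCustomerKey int NOT NULL PRIMARY KEY,
    hubCustomerKeyMaster int NOT NULL REFERENCES hubCustomer (hubCustomerKey),
    hubCustomerKeyDuplicate int NOT NULL REFERENCES hubCustomer (hubCustomerKey),
    LoadDate datetime NOT NULL
)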
But this integration issue is also present in other methodologies such as dimensional modeling. You also have a natural key there, or business key if you want to call it like that. Same issue, but possibly harder to solve when also applying business rules at the same time.
Data Vault is no holy grail, but it does help you to divide and conquer some of the (integration) issues that we encounter in data warehousing in general, no matter which methodology or modeling technique you apply.
Having a construct as described above, with one hub and three or more satellites depending on the source system that delivers it, gives you an opportunity to deliver different (and often valid) versions of the truth, without having to pack those all together in a single dimension. Yes, this can be done in a staging area as well. But if you would model that staging area according to the Data Vault methodology, it will just make it a bit easier compared to 3NF or other modeling techniques.
Sander:
Great elaboration, thanks. A question on business keys:
In my experience, there often is NO business key that has the global properties that the warehouse requires. I simply cannot make that assumption while I model. Sometimes you even find that the “global business key” is actually the sum of multiple source system keys – for example sales and inventory EAN numbers. Yes, you can sell something that wasn’t in the inventory.
Have you considered a data model where you simply generate a new “warehouse only” key and call that the master key? You then map every other source system to that key instead of relying on the business to find you that key.
The way I model this in the Kimball schema is to have a Type1 dimension which contains the “entity master key” (which I generate as part of the dimension bus load). Off that type1 table, I then hang (in a 1-n relationship) any mappings required for each source system. The sources always map to my master key. Any source data that is supplied to me that maps to another key, I translate to maps that hit the master key instead. I suppose the Vault would call that a hub/sat implementation, because it is two tables with 1-n relationship to each other. But I think that these key maps have distinct properties that require them to be their own “table type”.
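In rough DDL terms the construct is something like this (generic, hypothetical names – the real tables obviously carry more columns):

-- Type 1 dimension owning the warehouse-generated master key (hypothetical names)
CREATE TABLE Dim.Entity_T1 (
    ID int IDENTITY NOT NULL PRIMARY KEY
    -- plus whatever type 1 attributes the business needs
)

-- One key map per source system, hanging off the dimension in a 1-n relationship
CREATE TABLE Map.Entity_BySystemA (
    SourceKey nvarchar(128) NOT NULL PRIMARY KEY,
    DimEntity_ID int NOT NULL REFERENCES Dim.Entity_T1 (ID)
)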
By always letting the warehouse generate the master key, I can proceed with my data modelling and prototyping and ignore (or delegate) the issue until the business is ready to decide. Often, they don’t even know they have a global key problem until I start showing them the final data model. The beauty of having the warehouse control the master key is that you don’t really care whether the business key changes or not – the join keys always remain the same.
Since I never really expose the business key to the end user (except as a join column), it does not matter from their perspective that the warehouse is controlling it.
Ideally, a master data repository would have such a master key for me – but I have yet to find one that handles this correctly. Have you seen any implementations that do?
Hi Thomas,
A late reply, given the very interesting elaboration that has already taken place between you, Marco and Martijn.
No, I haven’t seen a master data repository yet that handles it correctly.
Regarding the creation of a master key in the data warehouse and mapping all sources to that key: this is something that can be done in DV as well. The business key used for a hub does not need to be one that is globally defined. To be honest, I have never come across an organisation that has that – maybe you have.
When loading the hub, you just insert the “business keys” from all the sources you have defined. If I understand your map table construct correctly, you do the same thing. Different from the map table construct, however, is that all those business keys are not mapped yet. I.e. you could have key “1” from system A and key “2” from system B that are in fact the same. In DV, you can “solve” that using a “same as” link table hanging off that hub (with two relationships). To me, it seems this is more or less the same construct as your map table.
Now don’t get me wrong: when I read the rest of your contributions, I cannot argue that what you do is better or worse than DV. It is a correct approach. DV is also a correct, but different, approach. I am not here to discuss better or worse – it is not important – but I do find it a very interesting discussion, and the more solutions you see, the better you can put them in the correct perspective and choose which one suits best in a particular situation.
However, DV is also put forward as a means for business to discover discrepancies. With the hub and same-as link described above, this can be done. Rules for loading the same-as link would be the same as you need to apply (if I understand it all correctly) for determining that key 1 = key 2 from system A and B respectively.
Your approach can do this as well, but it may be a little more hidden in the map table (which I think is normally not disclosed to an end user) construct or even in the ETL than in the DV approach. Please help me understand how you handle this.
Regards,
Sander
Hi Sander
Sorry, I did not see your reply until now. You ask some great questions, let me try to make it clearer how to handle these situations in the map/T1Dim setup I am using.
But first of all, a few agreements. Indeed, I have yet to come across a large org where there is a single master key defined. Because of this, my data model simply assumes that there isn’t one (I don’t even bother to look before I start modelling).
The “same as” (or “multi key” as I call it in my taxonomy) is handled in the mapping table by having the two identifying business keys point to the same entity in the T1 table. The query to find duplicates (“same as” links in your case, but without adding a new table) is trivial, taking the generic form:
SELECT DimEntity_ID, COUNT(DISTINCT SourceKey)
FROM Map.Entity_BySystemA
GROUP BY DimEntity_ID
HAVING COUNT(DISTINCT SourceKey) > 1
With very similar constructs, you can locate Dolly keys and Zombie keys and show these to the user. Note that you can also join two maps together to provide a direct source-to-source mapping table (and the other way around, as I illustrate on my blog).
Now, you correctly point out that I could get into a situation where I map two source keys to two different entities, even if they represent the same key. This is fundamentally a timing issue, as the correction will depend on when the error is discovered. When the error is found, I have a stored procedure that merges the two T1 keys into the same key. This requires deleting a row in the dimension and correcting all maps that point to that row. The process happens to be automatable, since I can simply crawl the foreign key constraints. An additional advantage of this method is that the impact on the fact tables is minimal and very easy to analyse and auto-correct (making fact restatements faster and automated, a major goal of my architecture). There is one potential data loss here: if I wanted to store that I USED to think the two keys were different entities, I would have to add a WasError bit column to the map – but I don’t make that my default design, it only confuses the end user. Note that the source extracts and fact ETL code are all unaffected by these changes – they simply assume the maps contain what is needed to properly control keys.
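A bare-bones version of that merge, with hypothetical names and parameters (@KeepID survives, @DropID is removed) – the real procedure finds the affected tables by crawling the foreign key constraints:

-- Hypothetical sketch: repoint maps (and facts) at the surviving key, then remove the duplicate dimension row
BEGIN TRANSACTION

UPDATE Map.Entity_BySystemA
SET DimEntity_ID = @KeepID
WHERE DimEntity_ID = @DropID

-- ...repeated for every map and fact table found by walking the foreign key constraints...

DELETE FROM Dim.Entity_T1
WHERE ID = @DropID

COMMIT TRANSACTION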
Your other question is what I do with unmapped keys. Here, you have several options, depending on what the sources are telling you and what data you have available to work with as you iterate over the warehouse design.
1) Nominate one or more systems to be the “master key” system. Whenever a new key is found there, map and add it 1-1. This essentially collapses the model into something that looks a lot like a hub/sat. However, note that multiple systems can have 1-1 maps (as illustrated in the inventory/sales example).
2) Receive maps from external sources. This becomes interesting if you already have a master data management system. You can now translate the often hopeless data models they use there into the map structure instead.
3) Don’t finish the map. You can add the source key to the map but leave the entity key dangling. It now becomes a user interaction problem, and you can show the data steward the missing maps. I used this successfully on a system where they had four different product masters – but where there was a lot of “intrinsic knowledge” in the user base about how to map between keys. Yes, I will allow privileged users write access to the maps, but only the maps.
4) Business rules: Use some rule to discover the map. Examples could be fuzzy lookups of city names, communication with external sources for EAN numbers etc…
I hope that helps explain a bit. As I said, I DO agree that a structure like the maps is required in a warehouse to track keys. But I don’t see a reason to take this approach to the extreme and model everything else in the DW like that.
Thomas,
If you sell something that isn’t in inventory, chances are you are also not (correctly) tracking the items that leave inventory. And you are probably losing a lot of money. Instead of pursuing building a data warehouse, management should be working on fixing this problem.
Many companies, at a closer look, are just a group of islands (sometimes at war with their neighbors). Having some sort of common business vocabulary can mitigate the risks involved. Reporting on misuse of and anomalies in business keys can have great value.
Of course management can decide to ignore these anomalies and you end up using same-as or master-key constructs. This is exactly the problem I have with master data management systems: they are too forgiving with ‘errors’ in source systems.
I agree Marco – there is a business problem if you look at it purely from a data perspective.
Yet, there are perfectly valid reasons people do this. Things don’t always get tracked in the inventory before they arrive in stores. This is common, and stores are not going to stop selling a pallet of valuable goods just because the guys down in inventory can’t keep up. The problem is further complicated when duplicates are detected and merged in one source but not another. The model needs to handle this.
What I am claiming is that you can model the bus in such a way that all these issues are not a problem. In other words: there IS in general no business key that has the properties a hub requires, and you don’t need to assume there is. This may be “wrong” from a data perspective, but who are we to tell a company how to run their business :-)
You are right Thomas, with some creativity you can solve a lot of issues within a dimensional modeling approach. I personally dislike this Data Vault versus Kimball debate.
Both can happily live together because dimensional modeling is all about the front-room: publishing the organization’s data assets effectively supporting decision making. Whereas Data Vault is all about the back-room: efficiently collecting, integrating and preserving data from the source systems.
And sometimes a ‘simpler’ approach than Data Vault is better. I have written about potential criteria in http://blog.in2bi.com/business-intelligence/four-approaches-to-data-warehousing/
Hi Marco
Great discussion, thanks for following up. I read through your blog, good treatment. I would add that the Start/End dates you recommend for history tracking in staging are not always needed – sales transactions are a good example, they just have a “Now Date”.
Here is my concern about your thoughts. You write: “The power of the Data Vault comes from the integration on business keys.” From the discussion on this blog, it seems that we are zooming in on this “integration” of business keys as the main value proposition of the vault.
However, my claim is that the hub/sat structure, especially one where you keep the business key in the hub, is fundamentally the WRONG data model for handling this integration problem.
I blogged about the different source key pathologies you generally have to handle here:
http://blog.kejser.org/2011/07/30/an-overview-of-source-key-pathologies/
My stance is that these pathologies MUST be handled by a consistent modelling approach and that all large warehouses need this approach. In this sense, I am similar to Vault: I think there is a small, intermediate area needed. I think both the Kimball and Vault people agree that something has to be done to handle source key mismatches.
To summarise my concerns about the hub/sat approach to the problem of “integration” or “key mapping”, as I prefer to call it:
1) It is not always possible to identify the business key.
2) A single business key from one source may not even be complete, in the sense that it will not cover all entities we may see (as per my example, and also from inferred members from other systems)
3) If you put the business key in the hub, you will not be able to handle “Dolly keys”, “Zombie keys” and “duplicate keys” while tracking history on the hub table.
Now, you can solve 3 inside the Vault by putting the keys in the satellites. This is structurally very similar to what I recommend with my mapping approach here:
http://blog.kejser.org/2011/08/04/transforming-source-keys-to-real-keys-part-1-introducing-map-tables/
The above pattern is both very strict (a desirable property it seems) and also able to handle all pathologies of keys.
But from a Vault perspective the hub/sat pattern has some problems with key pathologies:
1) What exactly do we mean by saying that Vault has a strict modelling approach if:
2) What exactly IS the hub? Is there one row per entity in there? Do we need special hub modelling trick to handle the EAN situation?
3) In order to handle ALL the key pathologies the satellite can no longer be strictly 1-n to the hub. (as per my blog above)
4) There seems to be another “type” of satellite now, the one that tracks history on things that are not keys. This satellite is for history tracking of non key attributes.
Ad 2) If there is one row per entity in the hub, isn’t this just what we call a Type 1 dimension table? And why don’t we just add all the Type 1 columns to the hub while we are at it? In fact, why did we bother calling it a Hub in the first place?
Ad 3) We could solve the m-n problem with a link table. But now we have THREE different ways to handle the “integration” problem of keys inside our decreasingly strict data model method.
Ad 4) Those “other” satellites can only point back at the hub if we have solved the key “integration” problem (if not, how will we know which hub key to point at? Create a new hub for the other source system?). If those other satellites are there to track attributes that are not keys, what exactly is the difference between them and Type 2 dimension tables? Why didn’t we just call them that in the first place?
In other words: I am arguing that the Vault does not really solve the integration problem for all cases that need to be solved.
What remains is a claim that it solves a scale problem – but so far, I don’t think a convincing argument has been put forth that shows exactly how DV creates more parallelism. That is a discussion for the other thread though.
“At timeXtender we promote an agile approach (methodology) and automation to speed up implementation. The latter is what our software does. It is a metadata-driven design and build tool, centered around a dimensional model including staging, data cleansing and optionally also cubes and much more.”
Your tool sounds very cool.
I’d be curious to see if you would like to run it on top of a Data Vault model (perhaps even generate a Data Vault model) and then a star-schema from it.
“I am going to assume that behind that statement are a number of assumptions: that the entire source model was discovered in the first attempt, that it does not include any new sources, that the sources have not been modified since last time and, perhaps more importantly, that a change in the reporting model is not related to a process change in the source system(s).”
The Data Vault lets you model the source incrementally. You never have to go back to the source “for the same data” would be a more accurate statement. You never have to go back and redo the data model or the loading code of the elements from the same source systems that have been done (in 99% of cases).
The entire source model DOES NOT need to be discovered to begin a data vault model and it can adapt/accommodate additional tables/columns etc without changing what was built. It’s designed to be added to.
“So what I am wondering about is this: with the level of automation and the ease with which “round trips between the Data Vault and the star schema” go, where do the hours go in Data Vault?”
This is a good question and the root cause of some misunderstandings.
The Data Vault modeling CAN take time. Determining the business keys for the hubs CAN take time. The loading code between the sources and the DV can (and should) be generated and is usually trivial. If the loading code from the DV to stars can be generated, then they should be.
In my opinion it’s a slightly more complex situation than just creating the data sets. After the DV is modeled and running (loading) assuming you have enough for creating a complete star, your round trips are between the DV and the stars.
If and when new source tables, systems and data are added to the DV, it WILL potentially impact the front-end as well. Say there is new data in a new satellite table on a hub and you need to pull this into an SCD-II dimension, it will potentially require column additions and load process modifications to the code that loads the dimension.
If it is automated like with your tool, it can potentially make it much simpler.
Sanjay, I actually wrote a DV load tool similar to timeXtender. Yes, I have written code to auto-load a Vault. I can therefore answer your question:
“I’d be curious to see if you would like to run it on top of a Data Vault model (perhaps even generate a Data Vault model) and then a star-schema from it.”
You could, but why would you? I think this is what Thomas is trying to understand here.
@Thomas,
I don’t think DV is a lot different. Basically I use the same patterns in DV, but I start out at the DV end.
1. In DV I can change the mapping retrospectively while at the same time maintaining the current mapping, as long as you do it in bitemporal sats (which would be mandatory for DV given your example).
We can also use similar mapping constructs in DV, but they look slightly different from basic DV types (keysats and keyreflinks).
2. in DV you also create a new business key (this is what you are doing, but you hide the key and call it a surrogate, but without a business key you cannot have a surrogate key).
3. In DV land your case would be an edge situation that might be implemented like that, but it is not mainstream (but it does happen)
4. In DV land we can build incrementally. Knowing the (integration) value of business keys beforehand is good but not in all situations required or possible
5. with DV we *can* build (generic) MDM solutions (although they would look slightly different from basic DV used for EDW’s.)
6. DV assumes active feedback loops on DQ, especially on business keys, which is good (I think it is, btw). The bigger the business key DQ gap is, the more ‘exotic’ your DV becomes (your solution IMO could easily be transformed into a fairly exotic DV).
7. In DV we keep a close tab on the source of all data. We do extra work to make sure we don’t mix ‘derived’ and ‘sourced’ data.
8. The pattern of creating a “superhub” is universal and quite usable and can be implemented in DV, Anchor Modeling and prob more methods.
9. Once we are in bi-temporal land we should discard pure Kimball solutions. We need DV or similar temporally oriented methods to support our (bi-temporal?) star schemas.
@Martin:
It seems to me that your argument boils down to: DV can do this too!
What is DV, if you really boil it down:
1) It’s tables that have a “good” key (hubs)
2) It’s tables that have 1-n relationships to other tables (sats)
3) It’s tables that implement m-n relationships to other tables (links)
Now, it used to be that 2 could not have a foreign key to 1 – is that still the case?
… Since we have been modelling OLTP systems like that since the 70s (using E/R diagrams), it is not at all surprising that you can model these concepts in DV too (although with even MORE tables and joins if you don’t allow sat –> hub relations). But that does not answer the still elusive question about what is so fantastic about it.
As I mentioned in the other threads, one of my gripes with DV is that it uses these obscure names (hub/sat/link) for some VERY generic concepts – instead of realising that there are distinct, well-defined, different TYPES of tables (more than those three), each with their own taxonomy and modelling technique.
To enumerate the ones in my Kimball model, we have:
– Facts (and they come in variants, journalled, append only, snapshots etc)
– Bridge tables
– Type 2 dimensions
– Type 1 dimension
– Key maps
– Meta tables (many variants)
– Temporary staging tables
– Historical record keeping tables
– and more that I might have forgotten…
Each of those tables has a “strictness” to it, but they have distinctly different behaviours. I don’t see how collapsing all those well-defined categories adds ANYTHING to the modelling technique that isn’t already there.
@Thomas,
DV should be able to model anything from ER/OLTP/dimensional using semantically equivalent transformations; else, what would be the point of having a generic modeling method? (The same goes for Kimball.) As for “better”: DV is an EDW-optimized version of what I call “Anchor style modeling”, and some varieties of it have been in use in OLTP land since the 70s. DV also has a cookbook going with its modeling style (just like Kimball), focusing on data management instead of data delivery.
1) It’s tables that have “a” (we try to do good, but “a” key will do) key (hubs)
2) It’s tables that have 1-n relationships to other tables (sats)
Temporal tracking entities
3) It’s tables that implement m-n relationships to other tables (links) (composite keys, associations and references).
Now, it used to be that 2 could not have a foreign key to 1 (or 2) – is that still the case?
Yes and no: “sats cannot have a direct load (time) enforced dependency” would be more correct. DV FKs are direct join paths, not logical relationships (DV is a physically optimized model type, not a logical one like most “OLTP style” DV lookalikes). Implied/logical or other relationships are still possible in a sat. A lot of this stems from using surrogate keys.
The full DV taxonomy is not as simple as just hubs/links/sats (there is more), just like Kimball is not just simple dimensions and fact tables.
As for naming them: I use more abstract names internally to better reflect their behavior. There are other names for these kinds of concepts. For basic communication I use the extended DV entity type names. For exact communication I use ORM fact type classification schemes, but there are few who understand those. Hubs and links are called anchors in Anchor Modelling; I like the term, but Anchor Modelling is even less well known. I realize there is some issue here, so still some work to do ;) Mainly the question arises whether we should classify our tables per method used, because each method might treat similar-looking constructs in a different way. E.g. IBM has a DV lookalike with its own terminology (anchors and leaves) and slightly different (and dangerous!) transformation logic. Take your pick. At least DV terms aren’t copyrighted ;)
You get more tables in DV, but this stems from solving some issues around loading and load/lookup dependencies, source lineage, temporal storage etc. It’s a choice about how you manage your data. That said, it depends on your transformation skills how well it works and how many tables you end up with (same as with Kimball).
From a pure Kimball perspective, DV is a method to unify key mappings, historical record keeping and the handling of other complex transformations outside of your star schema (where they would be a nuisance and probably not very appropriate), in “one” model that is adapted for handling (complex) temporal issues. I find this convenient and practical, although it is sometimes a challenge to get everything in.
HTH
Hi Martin
That is very helpful, thanks a lot for digging into this.
I am familiar with anchor modelling too, so I think we are speaking the same language here. My concern about anchor models, and I think you allude to it, is that they split the entire model into a huge number of tables, making it hard to query (but I agree, very flexible) and sending performance into a nose dive. On the other side of the spectrum, dimensional models store “the entire entity as we currently history track it” – which makes queries a lot simpler (but risks the destruction of history) – with performance being one of the main design goals of that model.
It seems to me that Vault tries to sit in the space between those two, with the satellites being able to track more than one column, yet not going all the way to purist anchor modelling. But once you have “atomized” the input data into a Vault normalised form, you have already said your goodbyes to performance.
In the absence of a good temporal database system, I think that anchor models DO make a lot of sense in the OLTP part of the enterprise, because OLTP systems add data in small chunks at a time and query for small chunks too. The join price to reconstruct an entity is thus easier to overcome.
But here is my quarrel with DV as a warehouse component: you DON’T always observe the source “as it happens” and you can’t expect to do incremental changes only. You observe it at extraction time – which is not always event driven, and sometimes you just have to handle a huge (multi-billion row) batch of data as efficiently as possible. Using an anchor-based model makes this throughput goal a LOT harder to achieve.
There is also the issue of bi-temporal source tracking. As you extract a source you get an excellent opportunity to tag it with the right auditing, metadata and timestamps required to achieve the bi-temporal representation if you need it. Keeping the data in the source format, instead of Vault, also allows you to reconstruct any mart at any point in time EFFICIENTLY, by simply rolling back your ETL code to a previous version and reloading the mart to a staging server. You could claim you can do this from a Vault too, but what value have you added by atomising the source first? Even fully audited and bi-temporal data is worthless if you can’t query it – and the ability to restore a source, in its ORIGINAL source model, to any point in time preserves the work that people have ALREADY gone through to make that source queryable. Why throw away that information and hub/sat/link it, only to pay the join costs one more time when you present it in a star schema? And while you may be able to reconstruct a mart at any point in time from the Vault, how do you go the OTHER way and reconstruct the source if that is what is needed? I suppose you COULD reconstruct it from the satellites. But by temporally tracking the source “as it is” you don’t have that problem.
Now, there is the issue about schema stability. Anchor modelling is seeking to address this by making the model “Add only”. I think looking at the history of database design allows us to see this in a new light:
There is the very odd pattern that we have seen throughout the history of databases: the fear of ALTER TABLE statements! This fear does have some grounding in reality and makes some sense for OLTP system – which can suffer downtime when you alter table. Oracle have invested significantly in making these things online operations though. But even then, you have to balance this fear with the performance hit you take if you seek to avoid it – hardware is not free. When taken to the extreme, we see this fear drive people into designs like these:
CREATE TABLE Table (id, name)
CREATE TABLE Column (id, table_id, name)
CREATE TABLE Values (column_id, value)
… with the creator proudly announcing that their latest “discovery” means that now, finally, we never have to change a table in the database again (and effectively moving the entire data model and transactional semantics into the code instead). The problem with this approach, as I am sure you know, is that it shows an immense ignorance about how a modern computer works and what it means to run a system cost efficiently.
However, for data warehouses, the fear is even more overstated, because these engines specialise in exactly the sort of massive data movement that ALTER TABLE statements can trigger. Often, you don’t even need the same uptime for a warehouse as you do for the OLTP system (sometimes you do, but typically only for a fraction of the system). In many ways, I see the Vault as a design pattern very similar to the “world in three tables” pattern. Let me put up an ironic example to show you what I mean. I think I have discovered a design optimisation on Vault. In fact, why don’t you make it more flexible like this:
CREATE TABLE Hub(id INT, entity_name VARCHAR)
CREATE TABLE HubKeys (id INT, hub_id INT, key VARIANT)
CREATE TABLE Sat (hub_key_id INT, values XML, from DATE, to DATE)
That way, you can bi-temporally version the entire model without having to EVER add a hub or a sat again. This is VERY strict indeed, even more strict than the existing Vault model. You can auto generate the ETL too and you can keep all the business data in XML and make the transformation rules self describing at the point of storage. In fact, with the above solution, you don’t even need to auto generate the data model, it can be scripted in three lines! And XML is just such a super efficient way to join data together if you just use the latest DOM and SAX technologies – much more scalable than say, column stores and other popular technologies that are obviously created for less scalable models than this genius design.
I assume we can agree that the above design, for all its amazing advantages, is crazy, right? The argument is similar to the difference between using runtime reflection in programming languages vs. doing the same with templates.
Let me make it clear what I am NOT saying: I am NOT saying that self describing, schema less database systems are always wrong. In fact, these systems, when properly implemented, make a lot of sense when you can live with a loosely coherent, fast evolving, small changes at a time, “OLTP style” system. There is a good reason that Key/Value stores have gotten a revival.
But I AM saying that super generic models are not a very efficient way to clean, curate and deliver data that needs to appear to have “one version of the truth” and where the data needs to MOVE fast and in large chunks at time. In other words, what we traditionally call data warehousing.
My previous example included 4 tables related to employees. Suppose after ample discussions with the customer we found that the primary email address is a good candidate for the business key. Then we would in DV get:
hubEmployee (hubEmployeeKey int PK, AuditID int, PrimaryEmailAddress nvarchar(128))
satEmployeeADDetails (hubEmployeeKey int PK, StartDate datetime PK, EndDate datetime, AuditID int, ADDetails…)
satEmployeeTimeSheetDetails (hubEmployeeKey int PK, StartDate datetime PK, EndDate datetime, AuditID int, TimeSheetDetails…)
satEmployeePayrollDetails (hubEmployeeKey int PK, StartDate datetime PK, EndDate datetime, AuditID int, PayrollDetails…)
satEmployeeNewPayrollDetails (hubEmployeeKey int PK, StartDate datetime PK, EndDate datetime, AuditID int, NewPayrollDetails…)
Two remarks:
1. Performance: to get the data into a dimEmployee table we would have to join 5 tables, which adds overhead, and next perform additional business rules. But when you compare that to a historical staging area… you will also have to join tables there (4 instead of 5, and maybe use an additional key mapping table), but in DV you’ll be joining on the surrogate keys, and this will outperform the join with fewer tables on the nvarchar(128) column… Of course with the overhead in your loading routine where you’ll have to look up the surrogate keys!
2. People do make errors, and of course they will enter the primary email address wrongly, so you’ll end up with additional hub records. In DV we would create error marts, with a business rule like: IF address not in AD THEN add error to error mart. This will enable a good feedback loop, and in the next run this address will probably be corrected.
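A rough sketch of such a rule, using the tables above (the error mart table itself is just illustrative):

-- Illustrative: flag hub records that have no Active Directory satellite record
INSERT INTO errEmployeeNotInAD (hubEmployeeKey, PrimaryEmailAddress, AuditID, DetectedDate)
SELECT h.hubEmployeeKey, h.PrimaryEmailAddress, h.AuditID, GETDATE()
FROM hubEmployee h
WHERE NOT EXISTS (SELECT 1
                  FROM satEmployeeADDetails ad
                  WHERE ad.hubEmployeeKey = h.hubEmployeeKey)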
My conclusion:
The traditional business value of a data warehouse was based solely on the reporting side: better decisions etc. But after Enron, Ahold, Greece, some banks etc., it’s clear that data governance is getting more important. And thus companies are investing in DV-based data warehouses (at least in Holland :-)). As consultants and software makers we should add it to our toolkit. And of course we should stay critical, because DV is not the best solution in every scenario.
Marco, thanks for proposing an example. This really helps make the arguments concrete.
Here is my proposed model for the same problem (I am assuming the key is commonly email for now):
Dim.Employee_T1 (ID INT PK, type 1 attributes…)
Map.Employee_ByEmail (Email NVARCHAR, From DATE, To DATE, DimEmployee_ID INT FK, IsDeleted BIT) (PK: Email+From)
Something to note: I don’t really need the customer to continue to use email as the key. If they change their mind, I will create a new map instead (and keep the dimension exactly the same).
Source.Employee_AD (whatever structure source has, plus audit columns, metadata, From, To and additional Transactional From/to if needed)
Source.EmployeeTimeSheet (As above)
Source.PayrollDetails (As Above)
Source.NewPayrollDetails (as above)
… There may be more source tables, depending on structure of source.
Dim.Employee is now my final dimension and if I am not doing history tracking at the point of delivery, I can simply store my facts hooked up to that key and not have to worry about changes (which is where the big cost is, facts are large)
The day-to-day load proceeds like this:
1) Extract source stream (optionally, split into another stream for audit/bi temporal)
2) Do whatever business logic I have stored in the ETL tool (which is what I version) to give me a stream of records with all the columns I _currently_ want from that source. From here, everything is automated and meta modelled.
3) Join the relevant key to the Map to find the Dim.Employee key (assuming all sources use the Email map; there could be more maps, depending on source key layouts). Infer any members not found in the map and add those to the map and optionally (depending on master data rules) to the dimension. This can be done concurrently with any other source, as the inferred key generation can be made both idempotent and non-deadlock-prone. It can also be cached efficiently in the ETL by using an updatable hash table for the join.
4) Pipe the stream into a temp table or file. Run an UPDATE FROM statement on Dim.Employee (note: the join here is between only two tables, the transformed data and Dim.Employee – I don’t let the database do any complex joins; never fully trust an O(n!) search space!). The UPDATE can be done independently of other source loads. (Steps 3 and 4 are sketched below.)
5) IF audit and bi-temporal tracking is required, merge the source stream with the proper source table and insert as appropriate into the source table. Store updates (to-dates) in a temp table, then run an UPDATE between the temp table and the source table when done inserting, to patch up the valid-to dates (and to avoid deadlocks with the insert statement and stay in high speed, set based algebra). Note that this can be done concurrently with step 1 and restarted independently.
Note that the above never requires the database to join more than one table with another (multiple joins are handled solely in ETL code). This means I can benefit from any sorted properties of the source to run merges in linear time. It also means that any hard joins that must be optimised are done so ONCE, in the ETL code – I don’t rely on the optimiser of the database to do the right job on large table joins.
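To make steps 3 and 4 concrete, here is a stripped-down SQL sketch. All names are illustrative: #Transformed stands in for the landed stream (with Email, Department and an empty DimEmployee_ID column), the dimension key is assumed to be an IDENTITY, and in the real flow the map lookup happens inside the ETL tool:

-- 3) Infer new members: any email never seen before gets a dimension row and a map row
INSERT INTO Dim.Employee_T1 (Email)
SELECT DISTINCT t.Email
FROM #Transformed t
WHERE NOT EXISTS (SELECT 1 FROM Map.Employee_ByEmail m WHERE m.Email = t.Email)

INSERT INTO Map.Employee_ByEmail (Email, [From], DimEmployee_ID, IsDeleted)
SELECT d.Email, GETDATE(), d.ID, 0
FROM Dim.Employee_T1 d
WHERE NOT EXISTS (SELECT 1 FROM Map.Employee_ByEmail m WHERE m.Email = d.Email)

-- Resolve the warehouse key onto the stream (the ETL tool does this with a cached hash join)
UPDATE t
SET DimEmployee_ID = m.DimEmployee_ID
FROM #Transformed t
JOIN Map.Employee_ByEmail m ON m.Email = t.Email

-- 4) Two-table UPDATE FROM against the dimension (Department stands in for the type 1 columns)
UPDATE d
SET Department = t.Department
FROM Dim.Employee_T1 d
JOIN #Transformed t ON t.DimEmployee_ID = d.ID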
What about Type 2 dimensions? First of all, I hang those off the Type 1 table (sat style, if you wish to use Vault terminology) but I ONLY generate them when there is a business requirement to do so. Note the agile mantra here: “don’t model what you don’t need”. To create the Type 2 dimension (there could be more than one), I can use the same business logic as the Type 1 loader (reuse = good), but with the additional step of doing a from/to merging of the streams in step 4 above. The from/to dates I stamp on the Type 2 (and the columns I delta detect on) will of course depend on the current business requirements of how the business wishes to view its reality. If I want to go back in time to see what a dimension used to look like before a change in requirements, I can just rerun the ETL version that matches that date from the source archive. If the source changes structure, I either patch up the ETL by using my metadata to locate the relevant dependencies, or, if the change is large or not backdated, I spin off a new set of source tables (so I can still rerun old ETL versions IF I need to restore the picture of reality to a point in time). Note that this also allows me to split off old “audit only” source versions of the data to other servers and databases and only bring them out when the auditors need them. Don’t keep online what you rarely need.
There is an interesting challenge with the Type2 load here which there are several solutions to, depending on business requirements. What happens if two sources deliver data to the dimension at different times and we still need to be a real time system? We could just let each source generate a new row for the entity that changed. However, this may not be what the business requires. An UPDATE statement may be more appropriate – this is fundamentally a business question (do you care if the customer changed address two times per day from a reporting perspective or do you just want the “end of day” picture?)… Note that you can still audit both changes, since you are monitoring the source, though there is a real risk that you never SEE the change if you only extract the source on irregular intervals.
With regards to your conclusion: I agree that auditing will be much more important going forward. We just disagree that DV is the answer :-)
One thing to add: Dim.Employee_T1 of course contains all the columns relevant to the employee. Including (if the business needs it) the email (which is not a key in that table)
Hi Thomas,
Interesting approach, with a lot of similarities to DV (if you call the map table a hub). I like this idea of one key, but I have some doubts about loading the dimension and historical staging tables asynchronously.
My main concern, however, is the handling of type 2 changes, and especially what happens if you change your mind. Let’s assume that after one year of running this data warehouse the business decides that the attribute department in the payroll tables should be a type 2 attribute. How would you handle this?
Great question. Like this:
1) tag the ETL job in the version control system with descriptive header (so you can find it again if you need audit)
2) snapshot metadata (small, store in version control system). Tag in same build number as the old ETL
3) change metadata to reflect new type2 requirement
4) rerun the dimension ETL from the point in time you want the change to take effect (note: this can, but may NOT, be all of history – maybe only from the time of the “changed business mind”, a business decision) and load a new T2 dimension table (notice: this will FK-point to the same T1 dimension keys as the old T2 table)
Now for the fun part :-)
5) walk metadata dependency tree to find fact tables affected. Automated, easy
6) use the join of the old T2 with the new T2 table to either update or reload the facts (depending on the target DB engine). Note: you can reload the affected facts without going back to the source archive, because they indirectly share the same T1 key (through the old/new T2 relations that both have an FK to the T1 table). This means that fact changes can be done co-located on the mart nodes – no need to go back to any mega store of data in the “middle” between staging and mart to reflect the impact that hurts the most when making changes. Also, no need for complex join trees.
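In rough SQL, step 6 could take a form like this (every name here is hypothetical; the real statement is generated from the metadata):

-- Hypothetical sketch: repoint the fact at the new type 2 keys via the shared T1 key
UPDATE f
SET DimEmployee_T2_ID = newT2.ID
FROM Fact.Payroll f
JOIN Dim.Employee_T2_Old oldT2 ON oldT2.ID = f.DimEmployee_T2_ID
JOIN Dim.Employee_T2_New newT2 ON newT2.Employee_T1_ID = oldT2.Employee_T1_ID
    AND f.TransactionDate >= newT2.FromDate
    AND f.TransactionDate < ISNULL(newT2.ToDate, '9999-12-31')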
An interesting question is: should you throw away the old T2 table after this? You could, because the versioned ETL can rebuild it. But why bother – if it is small (as dimensions often are).
Marco, help me understand your concern about the asynchronous update of the dimension, please. Are you worried about locking semantics or something else?
Isn’t it wonderful how an example problem comparing Kimball and Vault sheds light on the details of the issue? Thanks for making the discussion concrete.
Hi Thomas,
My concerns regarding loading:
1. Additional staging: you’ll probably need a volatile staging area. In an HSA/DV you would only need this for delta detection (to spare the source systems).
2. Delta detection: it seems to be easier if you do the “complex” delta detection from source to HSA/DV, and next use data lineage to easily determine the deltas from HSA/DV to the data marts (instead of doing the complex variant twice).
3. Dependencies: Sometimes you’ll need data from a source in more than one dimension. (e.g. managers details in an organization dimension)
Hi Marco
To address your concerns:
1) This depends a little on how you load. In step 1 that I describe, it is often a good idea to land the staging tables before you stream them. Once they are merged into the archive, you can blow away the temp staging. Same idea as HSA/DV really – “get out of the source fast”.
2) I don’t see how that follows. I basically have two deltas: 1. the “archive delta” (trivial, automated) and 2. the “type 2 delta” (also automated). Both deltas are merges, and thus only involve one simple join. You have a delta going into the HSA/DV and another going out (to the mart), and you will also have to worry about recombining the sats/hubs into the dimension structure and incur the temporal join and denormalisation (I store the same knowledge in the ETL going out of the source – if the source has the right structure, I benefit directly).
3) We agree fully. And if you do need that, you can pull the second dimension out of the archive or the temp staging (if you have not blown it away). There is an important timing issue here: you may for example want one dimension to listen on the event bus (for real time) and another to do a bulk extract from the source – having two extracts in this case is an example of “separation of concerns” (don’t do real time if you don’t have to).
One general point I would like to raise with you is this: Are we in agreement that dimension source data tends to be small – and that because of this we can treat it differently than fact data?
The reason I ask is that my stance is that a warehouse model should support efficiency of fact load (co-location, low join rates, compression, change tolerance etc.) in the best way possible, because this is where the cost is incurred long term in a warehouse (running the system, scaling it). If this efficiency comes at a cost in “strictness”, or an efficiency loss in dimension load, I consider that a fair tradeoff. This is why I am advocating different treatments of different data structures and separation of concerns in loading patterns. Do you agree with this principle? Or would you rather take strictness over efficiency?
@Thomas Christensen:
First of all, thanks for starting this debate.
And a question:
Are these discussions useful to you, and the types of comments you were looking for?
@Thomas Kejser
To me this is the most informative discussion on this topic I have seen so far anywhere. The example problem really helps a lot over the more religious, conceptual parts of the discussion. I am really looking forward to the results of Dan’s initiative of a “proper comparison” that he started the other day on his site. If my post and this discussion helped bring that out, we got something good out of all this :-)
If I try to turn things a bit upside down and ask the question “what could not be solved if Data Vault was not around?”, my conclusion is still: very few things. I still find Data Vault to be way oversold, and it concerns me a bit that it is being positioned as THE enabler of agile DW/BI implementation. I also still find it interesting to see the aggressiveness with which Data Vault is being protected from “attacks”. You do not have to look further than Dan’s own blog to find statements like “if you are not a certified Data Vault Modeler and have 2-3 certified implementations behind you, we are not open for critical reviews”. At the same time they do not seem to put the same constraints on themselves when it comes to telling the world about all the advantages over Kimball or Inmon. Of course I do not know the full history of all the prominent members of the Data Vault community, but I seriously doubt that they all had, say, 2-3 “certified” Kimball implementations on their resume before they started telling the world about all the problems with that approach compared to Data Vault. But again, I am really looking forward to the results of Dan’s proper comparison white paper.
As regards Sanjay’s question about timeXtender and Data Vault: we do not have something we call hub, satellite etc. today, but our tool is actually flexible enough to build a Data Vault model. And an Inmon. And a Kimball, and probably most things in between. We are in business to sell software that helps the world build data warehouses better and faster. We are of course always looking at trends to see if we should include something new, but for now I see no reason to add any new features for enhanced support of Data Vault.
Thank you for contributing, you have been a major asset in this discussion. And good luck with the new phase in your career :-)
Best Regards
Thomas
Happy to hear the discussion is going the way you wanted, Thomas.
I did not realise that Dan has a new thread on his forum. Is there a way to copy the discussion we are having here to that forum for you?
The idea of auto generating the model (and most of the ETL) is, I think, very crucial to any large DW. I suppose what you are asking really is: What is the BEST model to auto generate?
With regards to the “attack” angle: I too find the aggression most curious. If this thing were so revolutionary, surely it could be explained in simpler terms (like I hope my load example has been) and would not require such a religious defence.
Not sure if it is fruitful to continue the debate on Dan’s forum if it will go the same way as the other thread. What do you think?
I am quite sure that regardless of where this discussion comes up, it will quickly go down the same path.
This discussion reminds me of a lot of the discussions we used to have about open source vs. proprietary software. After about 5 seconds it was down to Windows vs. Linux and SQL Server vs. MySQL. As a web developer I was also engaged in a lot of discussions about PHP vs. ASP vs. JSP, and those followed the same pattern. A lot of people just pass on arguments they have found among their peers in whatever forum, rather than arguments based on actual experience with both sides. If we assume we are right, we will look for confirmation. If we assume we are wrong, we will look for input and feedback.
I started this discussion thinking “Although I have only seen Data Vault take off in Holland and Belgium, is there something I am missing? Is there something unique in here that I am not seeing? Is it actually adding some value over modern Kimball implementations?” I was looking for input to extend my “Kimball horizon”. It started out as open minded questions but quickly became a religious discussion.
Thomas, I think you did a great thing posting here and asking the core questions – thank you again. I too find the discussion takes a religious angle, and just like with religion – when you try to get to the core of what people really believe in (through examples), suddenly everything goes VERY vague and people get defensive.
I would say that the discussion with the Vault people has made many of my own ideas much more concretely stated – so in that sense it has been useful. For example, I think the conversation Marco and I are having on this thread about loading and restating facts is very useful indeed. Marco, thanks for this and I hope we can continue this line of reasoning.
All this being said, Dan is a pretty smart guy, and if we could get the religious angle tuned down a bit (and people could stop playing so insulted whenever they feel cornered) – I think there might be an option to discover some form of synthesis and new ideas together. For example, it would be good to reach a common conclusion on how to properly handle business keys and surrogate keys – regardless of whether you use the terms hub/sat/link or my terms: dimensions and maps.
@thomasC/Marco
Will you two be able to represent the line of argument below between Marco and me about loading Kimball on Dan’s blog while I am away next week? To make sure the Kimball load is not misrepresented and we get a fair apples-to-apples comparison, with the purpose of getting to the truth of the matter.
You are neutral parties, whereas I worry both Dan and I have some strong views and old scores to settle that may skew perspective and boil the discussion.
Please do feel free to mail me if you want to discuss anything and I will do my best to asynchronously respond.
By the way, I posted a new entry that you might want to use as “ammunition” in the argument:
http://blog.kejser.org/2012/06/29/why-date-between-fromdate-and-todate-is-a-dangerous-join-criteria/
I have a busy week ahead, so not sure I will be able to contribute much to the discussion for some time.
@Thomas Kejser,
I don’t think strictness is the opposite of efficiency and in most cases you can have both.
I further don’t think efficient fact table loading is THE problem in most data warehouses. THE problem is the lack of agility and data quality.
As to the fact table loading: your system with the key-mapping table will outperform DV in the first and daily loads. But DV will outperform it when rebuilding fact tables. I’ll try to explain:
In DV you will do a lookup of the hub keys when you load the link table, which is potentially costly. And later, when loading the facts, you’ll again look up the dimension key. The latter is based upon the hub key and won’t cost that much (all ints). So you will typically do two lookups compared to one in the key-mapping table scenario.
So when you do a rebuild of (a part of) the fact tables, in DV you will only have to do the cheaper lookup. The extra lookups in DV can become a problem. But as you know, with the right hardware you can load 4 million rows a second… so you will need a rather large daily load to really get into trouble.
Marco
I partially follow your argument. It is indeed true that the DV requires one less join to “restate the fact” than the T2/T2 join I describe. But the join you are eliminating is a join between two dimension tables – barely measurable, as the join result can be cached, and you are still going to spend your time on the actual changing of the facts.
So it seems your tradeoff in DV here is:
– win: rid of the dim/dim join (which Kimball requires to restate the fact)
– lose: slower initial and daily loads (as you describe)
That does not seem to be a good tradeoff to me… Or am I misunderstanding?
Oh… Regarding the load speed: the current number is 16M/sec :-) I have improved it since that first record.
But that assumes you don’t do expensive stuff like BETWEEN joins and big/big table joins.
Check out my latest blog for an example of the dangers of temporal join trees.
A question on Links: let us say you had the TPC-H schema. How would you model ORDERS and LINEITEM in a DV? … And how would you then load the star schema from that model? … And finally, how would you restate the facts if you changed your mind about T2 on, say, the CUSTOMER table?
To translate the TPC-H (http://www.tpc.org/tpch/) schema to DV I would create:
linkOrders (linkOrderKey int PK, ORDERKEY int, hubCustomerKey int, AuditID int)
satOrder_ORDERSTATUS (linkOrderKey int PK, StartDate datetime PK, EndDate datetime, AuditID, ORDERSTATUS)
satOrder_Details (linkOrderKey int PK, StartDate datetime PK, EndDate datetime, AuditID, TOTALPRICE, ORDERDATE, ORDERPRIORITY, CLERK, SHIPPRIORITY, COMMENT)
linkOrderLines (linkOrderLineKey int PK, linkOrderKey int, linkSupplierPartKey int, LINENUMBER, AuditID int)
satOrderLine_Details (linkOrderLineKey int PK, StartDate datetime PK, EndDate datetime, AuditID, QUANTITY, EXTENDEDPRICE, DISCOUNT, TAX, RETURNFLAG, LINESTATUS, SHIPDATE, COMMITDATE, RECEIPTDATE, SHIPINSTRUCT, SHIPMODE, COMMENT)
For both ORDERS and LINEITEM I opted for the peg-legged construction (ORDERKEY and LINENUMBER aren’t linked to a hub). I created two satellites for the linkOrders table because I expect the ORDERSTATUS column to change more frequently.
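For reference, a minimal sketch of how the star-schema fact could then be loaded from this model (the shapes of factOrders and dimCustomer are my assumption, chosen to match the restatement query below):
-- Load the fact at order-line grain from the link and its satellite, looking up the type 2 customer dimension via the hub key
INSERT INTO factOrders (linkOrderLineKey, dimCustomerKey, StartDate, QUANTITY, EXTENDEDPRICE, DISCOUNT)
SELECT lOL.linkOrderLineKey, dC.dimCustomerKey, sOL.StartDate, sOL.QUANTITY, sOL.EXTENDEDPRICE, sOL.DISCOUNT
FROM linkOrderLines lOL
JOIN satOrderLine_Details sOL ON sOL.linkOrderLineKey = lOL.linkOrderLineKey
JOIN linkOrders lO ON lO.linkOrderKey = lOL.linkOrderKey
JOIN dimCustomer dC ON dC.hubCustomerKey = lO.hubCustomerKey
AND sOL.StartDate >= dC.StartDate
AND sOL.StartDate < ISNULL(dC.EndDate, '9999-12-31')
WHERE sOL.AuditID = @AuditID  -- limit to the rows of the current load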
To restate the facts the generic solution would look like:
UPDATE fO
SET dimCustomerKey = dc.dimCustomerKey
FROM factOrders fO
JOIN linkOrderLines lOL ON fO.linkOrderLineKey = lOL.linkOrderLineKey
JOIN linkOrders lO ON lOL.linkOrderKey = lO.linkOrderKey
JOIN dimCustomer dC ON lO.hubCustomerKey = dC.hubCustomerKey
AND fO.StartDate >= dC.StartDate
AND fO.StartDate < ISNULL(dC.EndDate, '9999-12-31')
comment too long, so continue…
Alternatively, to eliminate the big joins, you’ll use the old customer dimension like this:
UPDATE fO
SET dimCustomerKey = new.dimCustomerKey
FROM factOrders fO
JOIN OldDimCustomer old ON fO.dimCustomerKey = old.dimCustomerKey
JOIN dimCustomer new ON old.hubCustomerKey = new.hubCustomerKey
AND fO.StartDate >= new.StartDate
AND fO.StartDate < ISNULL(new.EndDate, '9999-12-31')
Marco, thanks for the example, this helps us get closer to the core.
This highlights EXACTLY what I think both ThomasC and I are asking: WHAT have you gained by modelling like this?
First of all, you have taken something that was badly structured for reporting (ORDER normalised to LINEITEM) and added even MORE joins to the data model. To construct the same in a Kimball model (with archived staging) I would simply have done this:
1) Archive the loaded data in the form it arrives in (i.e. I will have an ORDER and a LINEITEM table)
2) Lookup relevant dimension keys on both tables
3) Join the two tables at load time
4) Store the result as the fact table, keeping only the dimensionality I need and having ORDER_ID and LINENUMBER as degenerate dimensions.
The structure created in step 4 can be restated to any point in history using the optimised join you also describe (and the load technique I described earlier). Data is never lost and you can change your mind about it.
Furthermore, the fact table in step 4 can (if I want to optimise for storage) contain ONLY the columns the business needs – and if they want some of the ones from the archive – those can be added later (“don’t model what you don’t need”).
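To make steps 1–4 concrete, a minimal sketch (the archive table names and the single customer dimension are assumptions made for illustration):
-- Steps 1-4: ORDERS and LINEITEM sit in the archive as they arrived; at load time we look up the
-- dimension key, join the two tables and store only the columns the business needs,
-- keeping ORDER_ID and LINENUMBER as degenerate dimensions
INSERT INTO factOrderLines (dimCustomerKey, ORDER_ID, LINENUMBER, QUANTITY, EXTENDEDPRICE)
SELECT dC.dimCustomerKey, o.ORDERKEY, l.LINENUMBER, l.QUANTITY, l.EXTENDEDPRICE
FROM archiveOrders o
JOIN archiveLineItems l ON l.ORDERKEY = o.ORDERKEY
JOIN dimCustomer dC ON dC.CUSTKEY = o.CUSTKEY
AND o.ORDERDATE >= dC.StartDate
AND o.ORDERDATE < ISNULL(dC.EndDate, '9999-12-31');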
Correct me if I am wrong, but so far we have agreed that:
1) The Vault requires more joins for daily load
2) The joins required for restating the fact are the same. Your example of the optimised Vault version uses two dim/dim joins, and so does mine. I actually thought you would store and maintain a structure in the Vault that would allow you to restate the fact with just one join (even though that would only have been a minor optimisation in the Kimball model)
3) You have modelled something in addition to Kimball
… So we are back at the “why?” questions: Why make things so complicated? Why add another model to the architecture?
Yes, the TPC-H schema is not a good candidate for DV. You are probably better off using another approach, like daily truncating and rebuilding your data warehouse, or a dimensional modeling approach where you keep history in dimensions. DV only gets interesting if you have many potential changes in the data feeding the data warehouse, or many potential changes in user requirements, and many integration points.
When you use a dimensional modeling only approach you will probably find yourself periodically rebuilding your data warehouse and losing historical information. In DV you will also have to rebuild parts of your data warehouse schema, but because of the separation of the structural elements (business keys and the associations between the business keys) from the descriptive attributes, this is easier and historical information isn’t lost.
DV is not the only way you can do that, and others have come up with similar approaches. I think it’s great that DV has taken off in the Netherlands, because whenever I arrive at a new customer I’m quickly up to speed thanks to the common language being used.
Marco
I find that a REALLY interesting comment:
“Yes, the TPC-H schema is not a good candidate for DV. You are probably better off using another approach, like daily truncating and rebuilding your data warehouse, or a dimensional modeling approach where you keep history in dimensions.”
Perhaps we are zooming in on something here. Is it really the case that the Vault is only a good data model for modelling dimensions, and not large data sources? I would actually begin to partially buy into that argument – and as you have seen in the key topic we discussed – the structure I propose looks a lot like a Vault (with the exception of the assumptions we make about the business key).
With regards to this comment:
“When you use a dimensional modeling only approach you will probably find yourself periodically rebuilding your data warehouse and losing historical information.”
I thought we had moved beyond that… Have I, with my example loader, not been able to convince you that the dimensional model, with a historized staging area, does NOT lose history – it just doesn’t MODEL it?
You make a great point about the language of modeling. If everyone has a common vocabulary – things DO get a lot easier. However, I think we need to temper that with a “right tool for the right job” disclaimer: just like in computer science, different languages are made to serve different purposes. Perhaps DV is a “dimension only” language and just not suited to handling large fact data. On trivially sized tables being joined (dimensions, for example) optimiser problems are just more surmountable – so the pain is hidden.
In DV, hubs and their satellites are loaded first, and next the links and their satellites. After that you’ll load dimensions and facts. So DV doesn’t have the benefit your system has of loading the facts and the historical staging area in parallel. (In DV you load in parallel, but only within a group.) That, and the potential extra joins, is why, as I already argued, your system will outperform DV in the daily loading of facts. However, I doubt there are many systems with such a huge daily load that it will have a significant impact.
We could do a case study based on TPC-H and come up with suggestions for alternative approaches for large data sets.
That being said, there is a potential performance risk when loading facts from the related links and their satellites. If you join large link and satellite tables together, you will have to limit the joined sets to only the rows of the daily load. In the TPC-H example we have 5 ‘large’ tables that are joined (or merged in the dataflow) into one fact table. You should limit each link / satellite to only the rows of the daily load.
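For illustration, a sketch of what that limiting could look like, assuming the AuditID column identifies the rows belonging to a given load:
-- Restrict each large link / satellite to the rows of the current load before joining them together
SELECT lOL.linkOrderLineKey, lO.hubCustomerKey, sOL.QUANTITY, sOL.EXTENDEDPRICE
FROM linkOrderLines lOL
JOIN satOrderLine_Details sOL ON sOL.linkOrderLineKey = lOL.linkOrderLineKey
AND sOL.AuditID = @AuditID
JOIN linkOrders lO ON lO.linkOrderKey = lOL.linkOrderKey
WHERE lOL.AuditID = @AuditID;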
I am thinking the arguments for using a Vault in the first place are beginning to get rather thin – aren’t they?
It seems the ONLY advantage we have so far been able to find that a Vault has (as compared to a Kimball model with a historical staging area) is that most people in Holland understand the terminology. That is a bit like saying that Creationism is good in the US, because a lot of people there believe it is true.
To propose adding a new data model to a warehouse architecture, I think we need a bit more than that, don’t you Marco? :)
I have been rather busy lately, hence this late comment.
When I reread my earlier comments I realized I made a mistake in the TPC-H schema. In DV, when we have a link-to-link situation as in the example, best practice is to flatten these relationships. The new schema would be:
hubOrder (hubOrderKey int PK, ORDERKEY int, AuditID int)
linkOrder_Customer (linkOrder_CustomerKey int, hubOrderKey int, AuditID int)
satOrder_ORDERSTATUS (hubOrderKey int PK, StartDate datetime PK, EndDate datetime, AuditID, ORDERSTATUS)
satOrder_Details (hubOrderKey int PK, StartDate datetime PK, EndDate datetime, AuditID, TOTALPRICE, ORDERDATE, ORDERPRIORITY, CLERK, SHIPPRIORITY, COMMENT)
linkOrderLines (linkOrderLineKey int PK, hubOrderKey int, hubSupplierKey int, hubPartKey int, LINENUMBER, AuditID int)
satOrderLine_Details (linkOrderLineKey int PK, StartDate datetime PK, EndDate datetime, AuditID, QUANTITY, EXTENDEDPRICE, DISCOUNT, TAX, RETURNFLAG, LINESTATUS, SHIPDATE, COMMITDATE, RECEIPTDATE, SHIPINSTRUCT, SHIPMODE, COMMENT)
As you can see, we introduce some denormalization in the linkOrderLines table. This has two advantages:
1) better performance when loading the fact table, and
2) it eliminates the dependencies when loading link tables (link-to-link tables can only be loaded once all dependent links have been loaded)
Oops … I forgot the customer:
linkOrder_Customer (linkOrder_CustomerKey int, hubOrderKey int, hubCustomerKey int, AuditID int)
linkOrderLines (linkOrderLineKey int PK, hubOrderKey int, hubCustomerKey int, hubSupplierKey int, hubPartKey int, LINENUMBER, AuditID int)
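To illustrate advantage 1) above, a sketch of the fact load against the flattened link – with all the hub keys on one row, no link-to-link join is needed (the dimension names are my assumption, and the type 2 date predicates are left out for brevity):
INSERT INTO factOrderLines (linkOrderLineKey, dimCustomerKey, dimSupplierKey, dimPartKey, QUANTITY, EXTENDEDPRICE)
SELECT lOL.linkOrderLineKey, dC.dimCustomerKey, dS.dimSupplierKey, dP.dimPartKey, sOL.QUANTITY, sOL.EXTENDEDPRICE
FROM linkOrderLines lOL
JOIN satOrderLine_Details sOL ON sOL.linkOrderLineKey = lOL.linkOrderLineKey
JOIN dimCustomer dC ON dC.hubCustomerKey = lOL.hubCustomerKey
JOIN dimSupplier dS ON dS.hubSupplierKey = lOL.hubSupplierKey
JOIN dimPart dP ON dP.hubPartKey = lOL.hubPartKey
WHERE sOL.AuditID = @AuditID;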
Marco… I think that even with the denormalisation you describe, you are still making the model a lot more complicated than it has to be.
The great outstanding question still is: where is the value of going through this modelling exercise?
So far, we have seen that it introduces more joins, and that the handling of keys is very far from the “strictness” originally claimed to be a property of the model.
Sometimes, the source is normalised and it makes sense to store it in the staging area as such. But why re-normalise, only to denormalise it again later, if the data already arrives optimally in the staging area?
Thomas … We are now comparing DV to a Kimball++ method with a historical staging area and a key-map table. Both are valid options in scenarios with integration and data governance issues, but I guess your preference between the two approaches will depend largely on your experiences.
I imagine that you, as a SQLCAT member, were flown in by MSFT to their largest customer sites to tackle performance issues. And this is an area where Kimball++ has the advantage over DV, as we argued earlier.
As for modeling effort … I don’t see that much difference between the two approaches. In DV you will typically simply split source tables into a (hub or link) table with the structural elements and one or more tables with the descriptive information (sats). In Kimball++ you’ll have to model the key-map table and the staging tables with the temporal and metadata columns. And of course there are tools that will help you model for both approaches…
So let’s look at advantages of DV:
DV will enforce a self-describing model, and people will easily grasp which tables belong together and how they relate. In Kimball++ you would need additional, strict naming conventions for your key-map and staging tables.
In DV, when reloading the facts and dimensions, you can use your normal daily loading routines after truncating the target back to the relevant point in time. In Kimball++ you need some way to switch the load from the source tables to the historical staging tables.
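As a sketch of that reload pattern (the LoadDate column and the name of the daily routine are hypothetical):
-- Truncate the target back to the relevant point in time, then rerun the normal daily routine from there
DELETE FROM factOrders WHERE LoadDate >= @RestateFrom;
EXEC dbo.LoadFactOrdersDaily @FromDate = @RestateFrom;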
But of course the main advantage comes from the emphasis on the business keys, as I argued earlier. Well, you could of course say: there is probably no company-wide business key, so why bother with the effort of looking for one – let’s just opt for the source systems’ natural keys and let someone keep a record of the master key and how it links to the natural keys.
And don’t get me wrong: there is nothing wrong with this approach! However, when you start to look for this “good” business key, you will find interesting information that can really impact performance. You might for instance find that the owner of the master data for products (inventory) was a bad choice, and you could advise changing it to somewhere earlier in the value chain (e.g. purchasing).
Furthermore, by making the business key important for the business and by providing a feedback mechanism, people will start to adopt these business keys as part of the business vocabulary, which will lead to fewer errors and better alignment.
These are the pros and cons as I see them. So in the end it all boils down to personal experiences….
For the coming two weeks I won’t be able to follow this thread.
Hi Marco
It’s been a long time since I had a look here, and somehow I didn’t receive an email updating me that you had responded.
You say that DV enforces a self-describing model? What exactly do you mean by that? If we, for the sake of the argument, equate a hub with a dimension, then both describe an entity in the business. However, to get the full picture of the entity in the DV, I have to join several hubs, links and satellites together – I don’t see how that is self-describing?
I am afraid I also don’t follow your argument about good business keys impacting performance. Do you mean that if you ask the source system to change the key, then the source will run faster? I don’t see how DV has something unique going for it here – bad source keys are bad source keys; I assume you don’t need to create a hub table and model a DV to realise that?
With regards to reloads: I apologise if I didn’t explain this clearly. Obviously, the ETL load operates on staging tables, no matter if they come from an archive or a daily/hourly/per-second load (with some extreme examples requiring no staging at all). For dimensions, reloading is equivalent to loading (start at T0, run forward). For facts, you WANT to think differently about incremental and full loads – because the optimal index maintenance will depend on the data you load. Having a single process is not an advantage for any significantly sized warehouse.
For the reasons I quoted before, I don’t think the argument can be made that it boils down to personal experience. So far, we have still failed to find a SINGLE thing that DV has going for it that makes it better than Kimball++ models (nice name, by the way). I may have a personal preference for chocolate and you might like vanilla, and this we could not argue about. But if my personal preference was to use ADSL modems and yours was dark fiber lines – then you could rightfully argue that my choice (all other things being equal) is inferior to yours.