I am back in Belgium to deliver some training and do a bit of consultancy. Since that leaves me at yet another hotel room, I might as well share some thoughts on something I have noticed over the past year or so and which I also noticed during my last visit here.
Thus in continuation of my last post, I will share one more observation from the Data Warehouse automation event in Belgium: Data Vault is a hot topic in the Benelux region and was part of almost every presentation. This is a distinct contrast to what I experience during my travels to the rest of Europe, USA, Canada, South Africa, Iceland and India where it is hard to find a customer, BI consultant or even anyone at a BI conference, that has ever heard of Data Vault.
Data Vault, a secret revolution?
I was first introduced to Data Vault a couple of years ago, and have to admit that I did not really see the magic. And I even have to admit, that although I now see its place in some scenarios after intense reading, seeing presentations at various conferences etc., I still tend to see it as overhead that is adding little value in most cases.
But before you read on, let me state something that is apparently quite important when it comes to Data Vault. If you have not made the effort to become a certified Data Vault modeler, you know nothing. It seems like they are sharing some secrets during the training that makes everything clear.
Frequently I have heard people in the Data Vault community explain the lack of wider acceptance of Data Vault with people talking and writing less about Data Vault than of Kimball. And then Data Vault has only been around for 10-11 years, while Kimball has had many more years of speaking and marketing. I am personally not buying that explanation. If something as revolutionary to data warehousing as Data Vault is being positioned, has been around for that long without major acceptance, it might not be as revolutionary at all?
And when you engage in discussions, critical questions tend to be answered with the you are not certified… again, if you do not instantly agree with the conceptual one liner benefits. I have even found a discussion with Dan Linstedt, where he makes the same argument. He is pretty much saying, get certified or shot up. This behavior is of course not exactly encouraging anyone to write or talk about it, so I wonder why they wonder.
And with this long disclaimer-like introduction, I just want to make it clear that I am not a certified Data Vault modeler. This will of course leave me wide open for attacks from the Data Vault believers, since I am clearly not a big fan.
So what is Data Vault?
There is a really interesting discussion thread on the Kimball forum, where Dan Linstedt is personally engaged in a discussion about Data Vault vs. Dimensional Modeling. It is a pretty long thread with a lot of assumptions and technical details flying through the air. And by the way, Thomas Kejser who is really giving Linstedt a run for his money, is probably one of the most experienced, knowledgeable persons in databases, BI, DW and SQL Server specifically.
You can find a definition of Data Vault here, but I think this statement from the above discussion thread says it all
One point that is very difficult to argue over is end-user access. Why? Because the Data Vault is built to be a back-end data warehouse, that feeds star schemas, cubes, and other data modeled data marts on the front end. With this in mind, the Data Vault is tuned for high speed loading, AND high speed querying (but, of course – it is NOT tuned for ad-hoc querying) – it is tuned for specific processes (consistent, repeatable processes that use the same indexes and the same data sets over and over again) to produce the front-ends.
So to put it a bit simple, Data Vault is something you do in addition to the dimensional model. It sits in between data sources and the dimensional model, which makes it hard to consider it as anything else as overhead to the existing practices.
Data Vault, agile and speed of delivery
I can (and have) built and loaded Star Schemas from a Data Vault within 45 minutes of receiving a 2 page requirements document filled out by business users / customers.
This statement from Mr. Linstedt in the same thread as above, explains a lot when it comes to agile and speed of delivery. At the DW automation conference, the power of Data Vault was underlined by not delivering in days or weeks, but actually minutes. I guess in all fairness they should have added – we can deliver in minutes, after the Data Vault has been built and loaded.
There is actually nothing unique in being able to deliver that fast. Using timeXtender as my rapid design and deployment tool, I am frequently building and loading star schemas from the original data source(s), via a staging area, within 45 minutes of receiving requirements from the business.
And just to be very clear on the assumptions and prerequisites, I only need to have a working connection to the data source(s) and someone with source system domain knowledge, to help find the information.
What do you think?
I will let you know in the next post, the one situation where I think Data Vault modeling is adding value to dimensional modeling. In the mean time, please feel free to share your thoughts on Data Vault vs. dimensional modeling.
The war has been going on between Inmon and Kimball for years (it seems like Inmon is the only one still fighting). Now we have Inmon vs. Kimball vs. Data Vault. With Inmon’s kind words about Data Vault, it appears as it might even be Inmon and Data Vault in the red corner against Kimball in the blue corner?
But what do you think, is Data Vault all buzz and no real usage, or is it the magic bullet we have all been waiting for? Is it replacing Kimball or does it compliment Kimball? Please let me know what you think.
P.S. I do hope that someone from the Data Vault community finds this post via some search engine. If you do, it would be refreshing with some constructive information about the benefits of Data Vault, applied to real world scenarios with multiple data sources, disparate keys and what have we not, integrated into a data warehouse.
Interesting, good to hear your thoughts Thomas.
This lines up with a discussion I had about a month back with Dave Campbell, one of Microsoft’s SQL Server Technical Fellows. His opinion was that one of the future waves was the introduction of the ‘digital shoebox’, essentially being a dumping ground for as much data as the business can find. The intention is then to have analysts use this as a source for all sorts of interesting querying and investigation.
Of course, with ideas like nosql and big data floating about, the capability to do this is now possible, however I’m not entirely sure of business ability or appetite to embark on a project like this.
I appreciate your thoughtfulness you have given to this topic, and thank-you for posing it in such a way as to not attack, but rather introduce a much needed discussion.
Let me just say this to kick things off:
1) there is no “secret sauce” being taught at certification classes – everything about the Data Vault model is, as it should be – public and available. You can learn everything you need to know about _the modeling techniques_ from the book: Super Charge Your EDW – available on amazon.
2) What is being taught in my classes (as I am no longer with Genesee Academy), is the methodology, the implementation details.
3) Data Vault is FREE AND PUBLIC DOMAIN to learn, and yes – just like Kimball “sells you his books and courses” – I do to.
4) Data Vault has been around since 1990 – far longer than 10 or 11 years, however it’s only been made public since 2000. It took that long (10 years) of Research and Design to get the architecture and the methodology right.
5) It was originally built for big government agencies who had very difficult problems to solve, and where (sadly) Star Schemas as a back-end-data warehouse + the Kimball methodology really truly caused all kinds of problems. From scalability to flexibility, to auditability. In this government agency (which is still using the DV today), the DV model and methodology was the only one to allow them to solve their issues. Back in 1997 they were doing real-time and had “big data problems” when everyone else was just thinking about “strategic” based data warehouses, they were faced with what we now call: operational or real-time data warehouse.
6) this agency had all kinds of rules and regulations: stemming from CMMI Level 5 (which now has a 6th level), to six-sigma initiatives, and TQM at the business levels. The methodology used to build the Data Warehouse & the model, went under intense scrutiny. We also needed a repeatable and automatable solution.
Why? because we were forced to budget & account for our budget back then. The company required us to be within 90% or better tolerance between our estimates and our actuals (hours collected, and delivery timings for the entire project).
The company also had a policy that we were to use function points. Due to the methodology applied + the repeatable model (DV style), we were able to arrive at 97% for delivery, and within 93% accuracy of our estimates for our actuals. Function point counts were established to measure difficulty, ease of use, complexity, and assign estimate counts for person-hours per function point.
We had more requirements for replication. The entire EDW had to be seemlessly replicated to a classified or high security environment. Now here’s the kicker: the classified people wanted to “extend” the data model, BUT they also wanted to inherit any changes we made to the “public model” without affecting _anything_ that they had extended. Seemless integration of data model changes.
They also wanted to inherit all of our “public” ETL routines on the classified side, and be able to write their own ETL routines based on our patterns. Once again, the ETL should be impacted between 0 and 1% for changes to the data model, and furthermore, changes we made to the public model – should not impact the classified model, nor the classified ETL.
We had a lot of requirements, including “raw data” – specifically for auditability on both the public and the classified side. Today, many EDW’s are faced with Raw Data or loaded with Raw Data because of compliance and auditability.
Auditability is not the function of the model, rather it is the function of the methodology chosen, and can be achieved (similarly) with a fully persistent raw staging area – however, once you move past the raw-data checkbox, we had other requirements. Our requirement was: tie this data to the business processes for lean-initiative purposes. In order to meet this requirement, we first had to understand just how dirty our raw data was, this required separation of the business keys from all the descriptors.
Because at the end of the day, the business keys are the only data component that ties directly to business process models. What the business wanted was: if “we” the business optimize business processes, then we want “direct and immediate impacts/changes” made to the warehouse and its loading processes. Like a tree that has a bad branch – the business had a desire to “cut off a limb” or remove an unecessary business process.
But it goes deeper than that. They wanted to do this, AND have little to no impact to the existing EDW. The only way (again) to do this was to isolate the components by business key. Thus – formed the Hubs of the Data Vault.
And the story goes on from there, to links, and satellites for descriptive data, but unless you are interested, I will not bore you with further details.
All i will say is that this business was finally able to accomplish something in 1999 that it had tried & failed to do for 15-20 years before. Create a single integrated raw auditable ENTERPRISE data warehouse – from which they could launch and easily create star schemas.
One last thought: before you think “awww that’s easy! I can do that…” Let me just tell you this: we did our first delivery in 1997 and completed in 1999. They were 157,000 employees strong with 7 sectors of business and 53 different companies that we had to span. In the beginning we had 3 mainframe sources, by the end, we had over 75 mainframes, quite a few adabas systems, some peoplesoft, windchill and yes, even SAP to load from.
This *was* big data at the time. This *was* real-time and batch mixed together. This *was* a huge challenge, and we delivered successfully. Most of our ETL was originally written in cobol, shell scripts, perl, and SQL. Our BI platform was Andyne GQL (eventually Hummingbird, eventually changed to business objects). Our HTML web server was a DEC ALPHA, and our HTML web pages served 7000 shop floor workers with reports, and 5 to 15 second response times. The HTML on the web server was updated every 5 minutes for some reports – again through perl, FTP, and shell scripts.
So before you come back and tell me that you can do this in Kimball style warehousing, just know this: the business had tried & failed many times before with “Kimball Style” – they needed a different approach, a different methodology, and a different data model.
When I started this project with my team, they already had over 150+ data junkyards that were “kimball style” and failing.
Now don’t get me wrong: i’m not saying “Kimball Style” is wrong, and I’m not saying that Kimball Style can’t be put in place properly. Like all methodologies and models, this one has its place. What I am saying is this: It failed to meet the needs of this particular organization (over and over and over again).
(paraphrased) Einstein once said: to do the same thing over and over again and expect different results is idiotic.
The “magic” (if there is any) in the Data Vault lies in the combination of the methodology and the data model combined. Simply evaluating the data model for it’s merits will get you no-where.
Oh yes, another thought if I may: with the rise of “big and bigger data” today, NoSQL approaches, triple stores, KV stores (like Hadoop), document stores, and so on… Physical data models are becoming less and less important. Logical models are becoming more and more important – as are *methodologies and automation* for implementation.
So to finish it off, yes – the EDW model is built to be a back-end powerhouse for world-wide based corporations. No, it is not built for end-user consumption – which is why you still need a vital part of “data warehousing” – the star schema, to deliver data to the business in a sense that it makes “sense”.
Yes, you can build an EDW out of a RAW star schema, but for HUGE enterprises like the one above, the changes to the underlying dimensional models become killers. Which is why we see (often) Kimball Style Warehousing projects “shut down” and “restarted” over and over again. It’s part model, part methodology that causes this to happen.
Anyhow I apologize for being long-winded. I hope this helps a little bit in clearing the air.
Hello Dan
Thank you for the background of Data Vault and some of the thinking that went ahead. This provides interesting insight and helps put it into context.
But I am afraid that if you were selling me Data Vault I still would not buy it, simply because I am not getting the business value out of anything you are saying. In line with the comment from Jankees below, you are not making it easy to understand what business value Data Vault is adding.
While I totally agree that auditability is mission critical for some organizations, which includes a lot of government/public agencies, you hopefully agree that it is not equally critical for all organizations. I would even claim that most organizations cannot justify the level of overhead and costs involved, to provide the level of auditability you are aiming for. Not even in today’s environment with SOX and other regulations.
Admittedly I have not yet worked with organizations like Department of Defense, but my experience does include other public agencies and life sciences. Knowing that I am in great danger of generalizing too much, they also tend to have a preference for the waterfall approach and being a bit paranoid in their general behavior, even in areas where they are not required to be so by legislation. But even these organizations are not taking data warehousing to the extreme like CMMI level 5 as you mention.
My first experience with data warehousing was as a customer years ago. It was a classic example of a failed project in all aspects. We went months over the initial deadline, thousands of both external and internal hours and we simply did not meet the goals of the project and ROI was questionable. This was done according to Kimball best practices at the time for both dimensional modeling and the project lifecycle. The requirements gathering phase was lengthy, expensive and we wrote very detailed requirements. We had plenty of user involvement, solid executive sponsorship and plenty of founding of all kinds of resources. We did everything straight by the book, but still failed big time.
The main cause was that requirements changed along the way, and when we finally delivered we had not caught up with these changes although we really, really tried with all proper change control systems in place. So the business actually got exactly what they wrote that they wanted, which was sadly not what they actually needed. This they of course instantly realized, when they got what we had built.
This is what first inspired me to start writing tools to automate the process, specifically the first code generation components to aid in the trivial, but expensive and error-prone task of writing SQL for table definitions, load statement etc. when we needed to change the code as result of yet another change in the requirements. In 2005 this evolved into timeXtender, a rapid design and deployment tool that allows you to design, build, maintain, extend, and change the BI backend from source extract to Data Warehouses, Cubes and even tabular models. It comes with fully automated documentation, data lineage tracing, built-in version control, automated deployment across development, test, QA and production environments and so much more.
The point behind all this is that we did not stop doing Kimball simply because one project failed. I still believe that Kimball dimensional modeling is a very solid, scalable, adaptable and well proven model for incrementally built data warehouses regardless of company size or complexity. But you do not have to see much of my blog or speaking, to realize that I am a big fan of mashups. I do not believe in a single perfect methodology, model, tool or project. As you so rightfully say, the magic lies in the combination of data model and methodology.
When you add agile to the Kimball dimensional modeling, you get that powerful combination that addresses the flaws of pure Kimball. Agile allows you to deliver early and often, ensuring that you deliver according to business requirements while they are still current. It emphasizes a business value driven approach, which ensures that we are delivering the most valuable requirements first.
While Kimball modeling is incremental in its core, it is far from agile by nature and this is when you add the last missing piece in the combination: Automation. While nothing is preventing you from delivering incrementally with Kimball, without agile and automation there is a natural constraint on how fast you can complete a single iteration in the cycle.
Your comments have made things much clearer in understanding what you were thinking with the invention of Data Vault. I still do not understand the attempt to tie Data Vault to agile and speed of delivery. From a business value perspective, the organizations I know tend to consider the value delivered, when they have working views/reports/dashboards in their hands. There is little business value in the ability to deliver in minutes from a Data Vault, if it takes you weeks/months/years to build.
Perhaps you can shed some light on this last part. What is the typical implementation time for a Data Vault and how much is commonly built in a first iteration?
Hi Thomas,
You’ve given a CLASSIC example of the reason for the existence of the Data Vault as an underlying data source to a Kimball style mart. It enables going back and doing umpteen iterations on the front-end. It’s easier to do because you’ve taken what was disparate systems and used a Data Vault to integrate it into friendly integrated flexible structures.
Also because the business NEVER really knows what they want, they end up doing the classic requirements/signoff dance. Everytime you end up re-creating these structures for example while trying to deliver in a purely Kimball architecture, you’re losing time and valuable assets in term of information and data that could’ve been integrated.
The beauty of the Data Vault is speed-of-delivery because of the decoupling the business rules in the methodology. You NEVER have to go back to source, although you may end up going back to the Data Warehouse umpteen number of times.
I’ve seen Dan generate the entire Data Warehouse in minutes and I’m sure you’ve seen others do it too. There is beauty in its simplicity which is what makes it so powerful, flexible and scalable.
Agile is not just about deliver, it’s also about maintenance and it’s a 100 times easier to maintain a DV even with Kimball style dimensional marts from it as compared to a pure Kimball style based on a BUS architecture.
The initial data modeling can be time consuming depending on the sources, but the data integration is one of the most boring things you’ll see. You can pretty much generate ALL the DI code in to the DV and to an extent even out of it.
It enables true agility, no just to build, but also to maintain and extend.
Warm Regards,
Sanjay Pande
Hi Sanjay
You have missed my point completely.
Something was clearly wrong with the pure Kimball approach with respect to the long cycles. The combination of Kimball dimensional modeling + Agile + Automation is speeding up the implementation of Kimball without adding any extra steps.
Data Vault is adding an extra layer in between the sources and the dimensional model, no matter how you put.
You are clearly also making the assumption, that you will always get all data into the data vault in first attempt, since you will never have to touch the sources again. First of all, the system landscape is constantly changing just as user requirements, just as the existing data sources are also being upgraded, customized and extended.
Hi Thomas,
#1 first and foremost, please understand this: I do not, and never have “sold the Data Vault” – Just like you don’t “buy” 3rd normal form and you don’t “buy” star schema (as a data modeling technique).
So, for me to try to “sell you on a data modeling concept” would not be a good idea, and simply wouldn’t work anyhow. The point is this: IF you have pain (as many of my customers have), and you are frustrated with existing approaches, existing data models, and you are NOT getting anywhere then you need to change your architecture and methodology in order to overcome – otherwise you will not see different results. IF you have pain and are in this situation, then and only then will the Data Vault Model and Methodology hold enough appeal for you to give it a try in a proof of concept.
IF you are working things out just fine (through your existing tool set), then you probably will see no reason to even read the rest of my entry here. I understand this. I can’t justify giving you a cup of water and making you drink it – if you are not thirsty.
On my blog I did post a list of business values that make the Data Vault model & methodology an attractive option. Please let me know if the list is not clear enough, or if you would like elaboration on any of the points. I have repeatedly posted many lists, and discussed many business values over the years on my blog.
but in the interest of saving you and your readers time: i will again, attempt to list some of them here:
* repeatability
* scalability
* adaptability
* flexibility
* lower cost TCO *** when compared with 3NF data model as an EDW, and Dimensional model as an EDW *** particularly when using RDBMS technology infrastructure on an MPP platform. Arguably in NoSQL or alternative infrastructure environments, this whole proposition regarding data models is turned on its ear.
* pattern based definition
* automatable
* generatable
Regarding auditability, I respectfully reserve the right to whole heartedly disagree with you. Every single industry from Aerospace manufacturing, to banking to government, to raw materials, to electronics and so on – (meaning every customer I’ve been working in since 2001) has requested and demanded auditability from their EDW. We (unfortunately) will not see eye-to-eye on this point.
Regarding agility and the Data Vault…
Scott Ambler is the father of Agile Manifesto, and I’m happy to say, a good friend of mine. I am lucky to have him as a friend. However, he has reveiwed the methodology and the model, and endorsed it. Not just on the book, but professionally. Stating that it is so important to him that he keeps a copy of the book on his ipad to show to customers.
Anyhow, enough with the quoting of the experts. Agility is exactly that: speed of delivery (amongst many other things on HOW to be agile – on that point, we agree).
I am not here to say whether or not your product: i.e. TimeExtender is good or bad, or whether or not it’s agile or not, nor am I here to say that Dimensional modeling can’t be done properly. I’ve never made those claims, and never will.
Regarding the agility of the Data Vault it comes in two forms:
1) From the methodology. That said, can you use the methodology without the Data Vault model & still succeed? Yes. Absolutely. The methodology is a hybrid, a mashup consisting of parallel teams (spiral iterations, rad/jad iterations), and waterfall approach for major milestones. Somewhere, the “assembly” of an Enterprise data set must take place. Just like in an MPP system, there must be a single “controller” to assemble all the parallel pieces.
2) from the model. Now this piece really isn’t Agile, as much as it is an enabler for the team to BE agile. Agility springs from the people on the project, but if the data model is an inhibitor to forward motion, well then you and I have both been there, and do not need to return to that point.
The Data Vault Model in it’s pattern based design, allows agility to take place. Adding new source systems becomes easier and easier. Eventually changes to the Data Vault model stop – once the core-foundational business has been modeled. And the story goes on from there.
Regarding dimensional modeling:
There are very few Dimensional Models at the granular or raw data levels that I have seen in the market place or customer sites today that are surviving the massive data influx, let alone keeping up with the changes the businesses are demanding. There are a few out there which are GREAT and work well. and for those that were implemented properly, once again, I applaud your work – however in my experience, I see more failures than successes, more data junkyards than true EDW’s when it comes to using dimensional models *as a data warehouse*.
Furthermore, most industries are now trying to wrap their brains around Hadoop, Hive, NoSQL, KV stores, Triple Stores, Graph Stores, and so on… Again, and again – eventually the “data model” will not matter at the physical level, which makes parts of this conversation a moot point. What will matter (and become more and more important) is the LOGICAL model, and the Implementation Methodology.
And yes, we agree here too – there is no single silver bullet methodology, there never will be (and I’ve never claimed that the DV is a silver bullet either). After all, would you put a Ferrari engine in a VW bug and try to win a race against other Ferrari’s? I think not.
Regarding how “fast” we put this together, going back to the project i listed before…. I had a team of 3 people, the requirements phase took 3 weeks to get done, after that – (again in 1997) we worked for 3 months from start to finish, standard 10 hour IT “days”. 3 people, 3 months, hand-written ETL (as mentioned), full system, full deliverables, first phase in 90 days. This was 1997. The second phase of deliverables added 3 new source systems (mainframes, 50 new tables each), and was done in 35 days. The 3rd phase took 40 days, added 2 new source systems including Finance (huge).
And so on. Because of Cycle Time Reduction, Lean Initiatives and CMMI Level 5 engineering, we were tasked to “optimize IT business processes”. Our team HAD to be agile back before it became known by most corporations. Just to give you an idea, other IT teams were still delevering single reports in just over 45 to 50 days, where we could load, test, and deliver a single report in under 2 days. 45 minutes was the time it took to build a prototype and load it with sample data.
Also remember: we did NOT have the tools you have today with the fancy GUI’s and all the metadata control.
Regarding your comment: “little business value to deliver in minutes if it takes you weeks/months/years” – well, your partly right: if it takes more than 2 days, or 1 week – then YES you’ve missed the boat. We ALL have at that point.
I don’t debate that deliveries in months and years is not agile. I don’t work that way, and I certainly don’t think you do either.
Today, it takes maybe 2 weeks? As it should – with agility, as Steve Hitchman likes to say: there should be a deliverable every week, at the most every 2 weeks.
I have case after case where Raphael Klebenov of WhereScape has built, deployed, and completed Data Vault based warehouses rapidly (in weeks). I have case after case where I have built and deployed EDW based Data Vaults in weeks as well, and there are numerous more in the Netherlands. Just ask Ronald Damhof, or Tom Breur, or Centennium for their thoughts.
My typical implementation time will vary from everyone elses, that’s just how it goes. But in general, once the business has their act together and their requirements defined (which by the way is necessary for ALL EDW projects regardless of model or methodology choice), it takes me roughly 2 weeks today.
For further explanation on Agility & Data Vault, look up Kent Graziano, he’s got some great stories as well. Or Bruce McCartney in Calgary Canada, he actually has “actual hours” logged versus what it would have taken to build “the traditional way”.
Hope this helps,
Dan Linstedt
Hi Dan
Somehow the contents of my original comment was not posted, so I’ll try again.
I am not saying auditability is not important, what I was at least trying to say was that I have not met anyone, who are willing to take the extreme measures you are suggesting. You can achieve it much cheaper and with much less overhead in my experience.
It makes me happy that you at least agree that agile is (also) about delivering faster. And I have no problems believing that you can deliver the first iteration in two weeks. But it is a little funny that you should mention Tom Breur. He was actually saying on stage recently that agile is *not* about delivering faster, at the same time he was also saying that you deliver in minutes with Data Vault, not days or weeks. I hope you understand the confusion, when such a prominent advocate for Data Vault is stating things which I consider to be completely opposite to what you are saying.
When it comes to WhereScape, I know them quite well. I am absolutely sure that they can do Data Vault based warehouses in weeks, but I also know they can build star schema based warehouses without the added weight of Data Vault even faster. If that is no longer the case, the marketing department at timeXtender would like to know, as we would then need to update our competitor sheet :-)
As you can see above, the discussion is once again getting hostile so I am signing off. I would like to at least thank you for keeping a sober tone, despite that we are far from agreeing on the added value of Data Vault. I did actually do a decent amount of research before posting my thoughts and I appreciate the additional insight this discussion has provided.
Maybe I will run into you at a conference somewhere in the world, then we can pick up the discussion again.
Best Regards
Thomas
some thoughts:
EDW 3 tier architecture versus EDW 2 tier architecture
DV Methodology contains a 3 tier architecture.
Kimball Bus proposes a 2 tier architecture
If you really *want* a Kimball 2 Tier architecture, then don’t follow the DV methodology, BUT you can (if you want) use a DV Model instead of your persistent raw staging area with unconnected tables and highly replicated data.
Benefits stem from separating the source system from the business layers, which basically means: you can buffer the impact of changes to the source system, and absorb any changes made – including sun setting of old systems, or purchases of new systems. without re-engineering the EDW and without impacting ANY of the data delivery layers to the business. That *is* and has proven to be, real – solid, measurable business value.
JUST as you adapt and isolate the business from source systems with raw staging areas, so it shall be with Data Vault models. It is the same principle.
More business value: DV model is flexible enough to absorb BOTH real-time feeds and batch feeds at the same time! and even to the same table in certain circumstances. We never have to “create a ghost dimension record” and come back and update it later, as you would have to do in dimensional modeling. We preserve true history of the transactions *as they happen* in real-time.
The Data Vault is built on a principle of 99% inserts, the only reason for updates at all is because of the temporal aspects of end-dating a record, because most relational database engines simply can’t handle true temporality, or don’t yet support it, or have just released it – and it’s still “buggy” to some degree.
Hope this helps, there are plenty of cases where business value has been described on the web – and not just by me, but by consultants like Ronald Damhof, Tom Breur, Martijn Evers, Sanjay Pande, and the list goes on. Don’t take my word for it, please ask for their thoughts as well.
Thanks for the thoughtful insights.
Respectfully,
Dan Linstedt
Dan,
What are the 3 usp’s or differentiators for using DV?
“A building must have a thousand floors for the elevator pitch above.”
I apologize for being short winded. But a good solution or method must be summarized in 3 keywords or phrases stating purpose and added value.
JanKees
Jankees,
To give you differentiators I would like to know what you are comparing it against? Please reply, and I will do my best to re-iterate differentiators.
For general reference:
I’ve listed the overall benefits of the methodology and the model on http://danLinstedt.com
Thank-you kindly,
Dan Linstedt
Dan,
In the elevator there is no time for extensive comparison.
Why do we need these long answers to explain DV?
Auditability seems a USP. Although that can be solved in several other less expensive and less laborious models as well.
My question remains; “When does an organisation need DV?”
What word should trigger my fellow traveler in the elevator?
JanKees
Jankees, if you cannot voice a direct question, then I cannot provide you with a direct answer.
The other problem with elevator pitches is: they are meant to “sell something” As I’ve indicated, I never have and never will try to “sell you” or anyone else on Data Vault.
You have to have pain in your existing system to understand where The DV would fit.
But (sigh), I will try once again (as I always do) to re-iterate: I have clearly stated the benefits in several different short posts, along with other ones, on my web – site: http://DanLinstedt.com
And if you really must know, you can also look it up on Wikipedia, or go to the LinkedIn Discussion forums and ask the general population.
Please try to do some reading. If you have a specific question (not a general one) about the value of Data Vault as it pertains to solving a specific problem, then I will be able to answer you directly.
The problem (as I see it) with your question is this:
The SAME thing can be asked about 3nf as an EDW, as well as Dimensional Modeling as an EDW. And you know what? they will ALL list similar values.
But again, it’s not about the Data Model, it’s about the methodology of implementation + people where the true business value is gained.
So… again, I urge you to check out my bullet-list of value propositions stated on my web site. it is a highly condensed version of the values that can be achieved.
Cheers,
Dan L
No elevator pitch will ever convince you to “drink” from a water fountain, if in fact you are not thirsty.
Hi Dan
I hope you are still monitoring this thread. I saw your initiative over at your blog: http://danlinstedt.com/ to compare the Kimball and Vault. Great idea.
It would be great to continue the discussion. However, you blog does not allow Google Apps, Facebook or even WordPress.Com logins (and my WordPress.Org does not seem to work either).
Would it be possible for you to open your comment sections for these types of logins (there is a plug-in available to do this somewhere, I use it on my blog)… It would make it easier for people to continue the debate.
Cheers
Thomas
Hi Thomas,
For some reason, it has caught fire in that region of the world. I am in Canada and chose the Data Vault over other solutions for a fairly complex project. There are other organizations in Canada that use it too.
The magic is in the methodology and the patterns. Sometimes revolutionary concepts take time. Windows is still popular today and on pretty much all corporate desktops. There is no technical or logical reason for it to be as prevalent as it is. Sometimes the status quo is hard to shake. BI has suffered from this right from the beginning.
Certification is just a training course and testing. It doesn’t guarantee anyone to be competent in a Data Vault implementation. I’ve been through the certification and THERE IS NO SPECIAL SAUCE.
The one point of contention is of course the methodology. Most people look at the model and go … Hmm, maybe it’s a good idea, maybe not … what’s so great about it?
The models are fairly simple to understand as Hubs – business keys, Links – relationships and satellites – descriptive data.
The methodology can also be summed up very easily – Don’t put any soft business rules on the way into the Data Warehouse. Soft business rules are defined as anything that changes the meaning of the data on the way in. Also each and every process should be parallelizable in a nothing shared MPP style and it is encouraged to break everything down into small pieces so infinite scaling is “theoretically” possible (even with project management tasks).
The technology team doesn’t own the data, so they have no right to modify it.
That’s it really in a nut-shell. If you have any data integration experience, you should be able to understand and create pretty much all of this without going to a certification class.
Any decent architect or data modeler can get up to speed just by reading.
It has been proven in large and complex implementations with an extremely high success rate (A rate that industry analysts and both Kimball and Inmon should be jealous … and for good reason).
It doesn’t encourage user-query since it can get complicated, but with data virtualization solutions maturing, it may not be an issue. The jury is still out on this since there are no real business rules – just basic integration by business key in flexible structures.
Now for the BUS architecture or the dimensional Kimball stuff.
There are way too many issues building a DW this way (and I AM speaking from experience). I’ve yet to see one that doesn’t completely break down with change. Conformed dimensions and SCDs not only add to the complexity, but they add complexity by dependency. Every time you touch a conformed dimension for a modification, the impact gets cascaded. In the Data Vault, the impact gets isolated to the object your working with.
SCD type IIs have a tendency to grow and get complicated with new data.
Eventually, cost of maintenance and testing outweighs the cost of building something new. I’ve yet to see a dimensional model last more than a few years.
Now, where the Kimball model really shines is the front-end, more because of familiarity and tools that leverage the model and also because it’s easier for the business to understand.
An underlying Data Vault Data Warehouse will simplify your Kimball based front-end marts which do not have to be a complete BUS architecture DW on top of a DV DW. This is where the misconceptions come from.
Also, you do need to keep in mind that the DV does NOT carry any business rules, so any star-schemas that are delivered are in fact raw data star-schemas, so the generation in minutes and all are great in theory, but not so great in practice.
However, the underlying DV DW does give you flexibility in delivery of the marts. You can aggregate the mart at a higher grain without being concerned, you can avoid SCDs if you want or you can generate SCDs if you need to. Having the flexibility to do all of this very quickly gives you a LOT of power and flexibility.
I’m one of the people who has been harping on this point, so I’ll mention it. The DV extends your DW beyond what is traditional into other areas and it’s a great source for Data as a Service within an SOA architecture. As a side-effect it is naturally friendly to data mining and entity analytics.
The DV is also one of the easiest ways to integrate multiple Data Warehouses (or split them) when companies merge or split … which is reality. The data is easily tracked back to the source systems.
Whenever you do a dimensional schema, you’re going from the requirements to the solution. This is not a bad thing, but IT DOES force you to design for a single purpose. I DO NOT agree with that philosophy … or a Bottom-Up approach (Problem Definition to Solution)
With the old Inmon style it was the other way around which is also an issue because you have to build the farm before you can do anything. This is traditionally the Top-Down approach. (Big Engine to Solve Many Problems).
With the DV, you have a Top-Down architecture with a Bottom-Up implementation which is a best of both since you’re only interested in data within scope. A Data Vault project also starts with requirements. ALL data within scope is loaded.
It makes sense for Bill Inmon to endorse it because the model is superior and more flexible than a 3NF style DW and it still fits in to his architecture.
The DV model is perhaps one of the easiest to load, scale and maintain. It enables you to do with it a LOT MORE than any other style of DW architecture (as of today anyway).
Personally, I won’t EVER build a DW in an Inmon 3NF or a Kimball dimensional style again.
I hope I’ve shed some light on it.
Oh yes, and if anyone says, you need to be certified to understand the DV, they’re talking out of their ass. You can just get the book at Amazon and have the same level of knowledge in a single read. Some people still choose to get training and that is their prerogative. I have nothing against attending a training course.
Warm Regards,
Sanjay Pande
Please have a look at the e-learning site http://datavaultacademy.com, hosted by Genesee Academy, one of our partners.
It’s gives you all info (global and detailed) on the characteristics and added value of Data Vault. Not only the modelling but especially the methodology.
I personally like their USP of Data Vault:
-Enhanced Auditability
-True Enterprise Wide Scope
-Extreme Agility and Adaptability
But: it’s ONE of the alternatives, it never will solve all your DWH-challenges. Neither does 3NF or star schemes :-)
Erik Fransen
Centennium, NL
And yes, we use Data Vault at many of our (large) customers since 2007. Why? Because it gives you full auditability, supports integration nicely and gives you maximum agility at the back end. Never, ever during my last 16 years of running DWH-projects that was delivered by any 3NF or Kimball DWHouse.
DV does NOT help you in getting the data out to the user, but actually that is really a simple task now as all your data is at your fingertips in the DV. No stovepipes anymore. Generate your star scheme, use a business rule engine for decision making on top of the DV or just query it.
“You have missed my point completely.”
Ha! It is you who has appeared to miss my point completely.
“Something was clearly wrong with the pure Kimball approach with respect to the long cycles. The combination of Kimball dimensional modeling + Agile + Automation is speeding up the implementation of Kimball without adding any extra steps.”
It still ends up being re-generation. The DW immediate value is there with this approach, but it DOES NOT work long-term. In effect, your re-generating your DW EVERY TIME business asks you to change it. Granted, it’s technically feasible with the approach you’re using, it’s still not a good idea except for front-end. Your DW is also a single purpose entity that is incapable of integrating with other systems without complex re-engineering.
Add a new source to an existing model or even new data and you break it. While you can get away with re-generation of code in the dev cycles, you still mess up ALL the testing cycles.
The code generation also doesn’t take care of the exceptions and the architectural bandages that become necessary sometimes in a Kimball architecture. It has been and will continue to be a maintenance nightmare.
If you can enforce strict code generation policies, then you may be able to get away with it.
“Data Vault is adding an extra layer in between the sources and the dimensional model, no matter how you put.”
True, but the dimensional model in such a case needn’t carry the volumes or complexity otherwise necessary. It also provides added value of audit, re-generation, integration of new source without going back to the source. If the source changes, it is flexible enough to absorb the changes without changing a single line of existing data integration code and without touching the existing data model at least 99% of the time.
The extra layer as a buffer is actually a good thing. I don’t see any issues with it. In fact, I deem it necessary in a sustainable architecture. Now, if you want to argue about the merits of a 2-tier vs 3-tier architecture then we can get into that debate as well.
“You are clearly also making the assumption, that you will always get all data into the data vault in first attempt, since you will never have to touch the sources again. First of all, the system landscape is constantly changing just as user requirements, just as the existing data sources are also being upgraded, customized and extended.”
The Data Vault was designed from the ground up to withstand and absorb these changes. Sources being added, removed, migrated to other systems, absorbed and modified is a reality. The Data Vault is the ONLY model and methodology that takes this at face value and actually HAS been designed to be receptive to change without any data model or data integration code changes.
The change only affects small parts of the model, making maintenance and testing at least into the Data Warehouse an isolated incident. You can’t do this in either a Kimball or even an Inmon style DW.
It’s also the only DW architecture that lets you switch your routines from batch to real-time or real-time to batch with 0 data model changes.
Also what you define as “agile BI” will work much better (every time) when you pull data from the Data Vault than directly from the sources, because you’ve layered the work and have pre-integration already done. The cool thing is, the Data Vault is also very easy to generate. Modeling the data vault is not a piece of cake though and can take time. Comparatively, it’s easier than both 3NF DWs and Star-Schemas.
On a Business Intelligence project, programmer time is extremely expensive and it’s so misused on major and minor maintenance projects. The Data Vault reduces programmer time on maintenance projects substantially.
While there ARE in fact potentially 2 copies of the data (assuming you’re bringing everything into the marts), they’re in different structures and serve different purposes.
Now, if you want to actually debate instead of harping on points that I’ve actually countered and defeated in 3 different comments, we can continue this conversation. Otherwise, it’s like talking to a brick wall. I think you’re smarter than that (I hope) and do expect the conversation to become a more interesting debate.
It apears once again, that is not possible to have a discussion about Data Vault without getting those hostile comments back, when someone does not instantly agree with your arguments.
You can switch of the auto pilot as I see no point in continuing in this tone.
Thomas,
Please don’t throw the baby out with the bathwater. I believe that you and I were having a non-hostile discussion. If you believe my posts to be otherwise, please feel free to let me know by emailing me.
And if you read carefully – it was not I who responded to you, it was a different individual.
Sincerely,
Dan Linstedt
As a BI professional I’m a big fan of correct terminology and so on. In this case, terms like “data warehouse” and “business intelligence” are thrown onto a nice large buzzword bonfire along with “better” and “faster”.
However, if you step back for a bit, a concept like DataVault, in the multi-dimensional modelling (Kimball) viewpoint is basically a huge versioned staging area. The additional requirements that Dan outlined in his lengthy answers seem to be related to the way that companies need to process, categorize, organize and handle their data, not so much the way that they report on it. For reporting it seems like everyone is in agreement that star schemas are best.
So I’m really sorry, but I don’t really see what the big fuss is about. I’ve built staging areas in the past that versioned and categorized the incoming data into snapshot and transactional data. Sure, perhaps if I was a big fan of paperwork I could have written a book about it and put a methodology on top of it. To be honest, I’m glad Dan picked up that part of the work. I’m certain that for particilar usecases (SOX/big banks/…) it can be very valuable.
And yes, when DV fans try to advocate (avoiding the word “sell” here) their methodology they use passive agressive language like “Kimball is wrong, dead wrong” but I’m sure it’s all done just to get a message across, not to insult or hurt feelings.
As a DI professional: yes DV adds complexity and risk to your ETL and as such I would recommend it only when their is a clear need for it. To know what those needs need might be it makes sense to at least take a look at the various methodologies out there.
Thanks for the nice blog Thomas!
Matt
“… a concept like DataVault, in the multi-dimensional modelling (Kimball) viewpoint is basically a huge versioned staging area.”
A Data Vault can be fit into a Persistent Staging Area in a Kimball style implementation. In effect you would have a DV model which has side-effects both positive and negative.
“I’m certain that for particular use cases (SOX/big banks/…) it can be very valuable.”
I’d say, based on personal experience the use cases are far more generic than specialized. Compliance etc required by Govt, Finance, Banks, Telcos will certainly benefit from it, but it’s more than that.
“And yes, when DV fans try to advocate (avoiding the word “sell” here) their methodology they use passive agressive language like “Kimball is wrong, dead wrong” but I’m sure it’s all done just to get a message across, not to insult or hurt feelings.”
Ha ha … That’s really my fault. I made him write that. I’m much more aggressive which would be obvious with my posts. The intent has never been malicious but more to get attention. It took me a while to get Dan to even agree on this one since he’s very mild and polite. It really did get mixed reactions although he did get some flak for it (My apologies to Dan for it).
“As a DI professional: yes DV adds complexity and risk to your ETL and as such I would recommend it only when their is a clear need for it.”
I completely disagree with this statement. The DV reduces the risk of your ETL substantially. First building the DI code into the DV is dead simple and it also helps distribute some of the integration work which reduces the complexity of the ETLs on the way out of the DV (Even without a DV, you do want to break the ETL down).
Because the DV is entirely pattern based including the DI code, it is heavily encouraged for the code to be generated.
@Matt “DV adds complexity and risk to your ETL”. Working with the DV methodology makes ETL actually very simple and basic (downstream from staging to DWH). From DWH to data marts you have your business views that are as complex as you want them to be. But this has nothing to do with DV.
I think Matt is right in his assessment that:
a) DV is not the best solution in all situations and
b) DV adds an amount of complexity and risk to the ETL
But I think tools like timeXtender should be able to hide this complexity and mitigate the risks involved.
As for a) I recently wrote a blog post comparing four approaches to data warehousing and argued for a choice based on aspects like:
o The expected total size of the solution
o The number of source systems and how the information in them overlap
o The average expected life-time of source systems: source system volatility
o Current user requirements
o Expected number of changes in user requirements: requirements volatility
@Matt & @Marco, from experience I can say that DV may add some risk to the warehouse since certain rules need to be followed to ensure the DV matches your source data, loading hubs before links, etc. However, it greatly simplifies your ETL, if you are allowing the database to do its job.
At my last job, our warehouse was Kimball. We were required to read one of the Kimball books after being hired. The ETL to load this was not easy to look at. Each loader looked different. If you worked on one, that didn’t necessarily give you any insight into any others. Of course, there were clusters of them that were similar, but it was still difficult to maintain.
When our data architect began talking about DV, I was skeptical. I could see that there were benefits, but it didn’t look like it was going to be worth the effort. After going to a DV training, I realized that this was going to be more than worth it. We simplified our loaders to the point where I wrote a Python script that could build our entire Pentaho code base. This built everything we needed from the ODS, through the DV and into the stars. Our table definitions and SQL logic were housed in Consensus.
If a new source was added into the existing mix, the source definition was added to Consensus, along with target columns, and the ETL would be generated automatically. Our ETL risk was almost entirely mitigated because every loader for a class of tables (hubs, link, sats, facts, dims) was structurally identical.
My logic was that since there is an added effort required to set up a DV “staging area” (see my previous post) and that you still would need to build star schemas based on actual BI requirements, there most certainly is an additional ETL effort to be made and you most certainly spend more time doing it.
Having the ability to copy tables 1 to 1 from your source systems to a DataVault schema seems like a lot of fun and it may be a great thing. However it doesn’t really get you any closer to bridging the gap between said source systems and your end users.
Finally, I’m a big fan of model-based data integration, ETL metadata injection and all that sort of thing. In Kettle we created specific operators for slowly changing and junk dimensions. I’m more than happy to consider any idea to make it easier (or even more generic) to populate DV schema.
Cheers,
Matt
Hi Matt, Erik,
Thank-you both for your candid and kind words. I have added another blog post this morning that walks through the DV model at some of its most core points – for the rest of the questions here around value and differentiators, so I hope that helps a little bit.
Anyhow, Matt, there seems to be a common misconception in the market place about a required staging area.
The staging area isn’t required at all – in fact the closer your project gets to real-time data absorption, the less and less you actually need to have that layer around. I have built several “Real-Time Data Warehouses” over the past several years, and only one needed a staging area because it was absorbing external data from other partners still sending it in batch.
Now, here’s the thing: even in a Kimball Style bus architecture, most people still use “staging areas” – and i’ve even heard from some that they are fully raw, and still 1:1 with the source systems – ie: some are using replicated tables (replication) as their staging areas. IF you have that around, then why not leverage it for the DV too?
And you are both right – the world is moving, and requiring IT to be more agile, in general, which means we have to find a way (any way we can) to help automate and generate as much of the grunt-work of building EDWs as possible.
Just a question, I’d be curious as to your thoughts.
Thanks,
Dan Linstedt
There IS additional work involved, however most of it can be generated because of the pattern based loading code and the strict methodology. It’s also beneficial in the long-term from a maintenance perspective because the model is designed to be flexible enough to adapt and absorb change WITHOUT touching existing code on the way in to the DV.
While, you’re still one step away from delivering the content/info to the users, the buffer really helps focus on bringing in only granular incremental components as and when you need them in and then focusing your time on delivering the desired data out. There are other side-effects (positive), I’ve already gone in to earlier.
I like Kettle because of the metadata driven architecture. It makes writing logic generators easier. I think Dan was working on getting RapidAce to spit out Kettle transformations/jobs from the data model. I’ve seen him output Informatica PowerCenter maps in bulk using it.
A LOT of DV folks are Kettle fans, so it may be a good idea.
A difficult discussion. My answer is quite simple: it’s not about good or wrong, it’s all about requirements and conditions, and choosing the best solution that fits.
Quite a discussion… ;-)
I have been building and extending Kimball data warehouses for over 10 years and did come across some of the pains that Dan is mentioning. Most of these issues arise because of the implementation of business rules between source and staging area, independent of which modeling technique was used. From this perspective, the DV methodology has a good point to move those rules further upstream, i.e. as close a possible to the delivery of the data marts (independent of modeling used).
However, I think even Kimball’s approach has evolved in such a way that moving those business rules upstream seem a good idea, so this is not related to DV only.
Using a star schema for delivering of the data to the end user with most reporting tools seems obvious. DV or 3NF will become too complex for an average user to understand and finds his way in (unless you have a very good abstraction layer in the reporting tool).
When you need to integrate data at some point (let’s call it the DWH) in order to deliver it in multiple versions of the “truth” (using a single version of the facts), then I think the DV modeling approach is a very good way to do that. It is flexible and easy to extend without touching existing structures or ETL processes.
A particular reason for doing so using the DV “in the middle”, is because getting to common dimensions can be very cumbersome. Finance doesn’t care about marketings viewpoint and vice versa. So why not skip that? But having an integration point will enable you to get there eventually. DV is a good modeling technique to do that.
But as always, I will choose the best combination of methodologies and modeling techniques based on the client needs and the context.
Disclaimer: I’m a certified datavault data modeler and the only religion I adhere to is Catholicism ;-)
Ps. Tom Breur mentioned in his presentation that being Agile is about being efficient. I subscribe to that. The fact that being efficient enables you to deliver faster is a nice side effect.
Interesting discussion, a colleague just pointed me to it. Thomas, thanks for the kind words. I apologise in advance for a long reply, there are a lot of points to address here.
Pleas allow me to to summarise what I hear people AGREE on here (both the Vaulters and the Kimballers):
1) In all but the most trivial implementations, there is a need for some form of historical tracking of the source, no matter how we choose to represent the final reporting model
2) Star schemas are the right way to represent the end user model
3) Auditing is often a requirement, especially in heavily regulated environments
4) ETL code can by and large be auto generated BOTH for a Data Vault model and for a Dimensional model. The “secret sauce” that makes ETL code auto generatable is a consistent naming convention and modelling approach. This is NOT unique to DV.
It seems we disagree what we will call a staging area and whether there is some requirement for a database (and data model) in between staging and the final star schema.
If I understand Dan correctly, he claims that DV addresses 1 and 3 above. However, Thomas (And I agree with that) says that a Kimball system with a staging area ALREADY addresses this, so there is no added business value of adopting a DV methods. Please allow me to dig into that argument a bit more.
First of all, historical tracking of the source is SUPER easy to do and is something that every ETL developer worth their salt have done since, well, pretty much the invention of data loading. I think we all agree that some form of from/to date stamping on source tracking tables is a good idea if tracking of history on the source is a requirement (which ISN’T always the case). In other words, nothing new, nothing revolutionary here.
Second, Auditing, or the the stamping of lineage information on records, is ALSO super easy to do and something ETL developers have several ways of addressing. Some of these include meta data frameworks that track chunks of loaded data for batch style warehouse. For real time data or multi sourced data, others simply stamp descriptive data directly into the rows loaded or in separate tables (potentially normalising the detailed audit information to avoid clouding the model). Is there ANY doubt how to DO this for anyone who has ever build a warehouse and of the different tradeoffs involved in using either the metadata approach or the “stamp a batch number” approach? Do we NEED a methodology to address this?
Because auditing and historical tracking is so well understood and easily implemented that we can consider it trivial, I think it is safe to say that DV has nothing new to tell us about that. Unless Dan has some silver bullet that is unique to DV and which miraculously addresses these problem (in which case, I encourage a debate about that). The fact that some organisation don’t “get” these problems and that Dan has helped them solve their issues (in the same way others on this thread has done with dimensional models) does not make DV the only solution to this common problem. I hope everyone can follow this line of reasoning so far.
Dan raises an interesting question about the need for intermediate layers between the source and the final, dimensional model. Quote:
“True, but the dimensional model in such a case needn’t carry the volumes or complexity otherwise necessary. It also provides added value of audit, re-generation, integration of new source without going back to the source. If the source changes, it is flexible enough to absorb the changes without changing a single line of existing data integration code and without touching the existing data model at least 99% of the time.”
Again, I don’t think anyone is in disagreement with the principle: that the dimensional model (which, may I remind you, we agree needs to exist) NEEDS some form of pre-storage area where data you DON’T yet need, but don’t want to throw away and want to audit, is stored and can be reloaded from without disturbing the source. The only thing we disagree on here is what we call an database that has these properties… I call it “Staging” and I don’t see a need to model it using any methodology except the obvious one: Grab a copy of the source, hit it with a little metadata audit column. If it is a large table, implement the (again, amazingly trivial) tracking of the raw source tables with a few from/to timestamps (or don’t, if the table is append only). Why invent new words for this “source tracking” and call it non intuitive things like links, hub, satellite? Again, this tracking/auditing can be automated and any Kimball implementation of significant scale will most likely have done so at an early point in the design cycle – just after prototyping. I have personally build meta frameworks that automate this approach. Interestingly, BigData systems lend themselves particularly well to this raw copy method (but you can use a vanilla relational database too)
Here is where I think the core of the argument is: What do we do with changing business LOGIC? Once we have a copy of the source and a desired, dimensional model destination – how do we deal with business users changing their mind about how to TRANSFORM the data? Assuming we have a Kimball style staging area with some basic history/auditing of the source that meets the current requirement of the business, where is the knowledge stored that is needed to transform from this representation of data to the star schemas we publish? The answer in Kimball terms is: in the ETL code! And as Kimball modellers, we know that understanding the mapping between the source and the star schema is where most of the time is spend in a data warehouse implementation. Storing this knowledge in the ETL code has some great benefits, because it is a LOT easier to version code than it is to version data. I actually AGREE with a trick Data Vault champions, namely there is some merit in normalising the dimensions on the “dimension bus” into a both a Type1 and a potentially multiple Type2 representations, because this makes it easier to load marts and assign surrogate keys. I suppose Dan would call this normalisation hub and satellites. I call it “type1 and type2 representations” of the same dimension and don’t claim I have invented anything new. Maybe this classifies me as a belonging to a “Kimball Hybrid School”.
I somehow get the idea that Dan claims that the above knowledge of transformation from the source representation to the star schema is somehow stored inside the Vault and not in the ETL code when you apply Vault modelling. Perhaps I am wrong and reading too much into Dan’s claims. I think it would be useful if Dan could shed more light on any complexity reduction in the ETL code that is the result of applying a Vault based approach to the intermediate storage area before the data mart.
Dan asks the essential question here:
“Now, here’s the thing: even in a Kimball Style bus architecture, most people still use “staging areas” – and i’ve even heard from some that they are fully raw, and still 1:1 with the source systems – ie: some are using replicated tables (replication) as their staging areas. IF you have that around, then why not leverage it for the DV too?”
And as I hope I made clear, the near 1-1 approach is exactly what I am advocating (though I think replication is a pretty poor way to do this because it introduces some nasty operational issues). I would then ask the question back: IF you have that 1-1 representation around already, what value does DV then add and how does it reduce the complexity of getting from the staging model to the Star Schema. I am particularly interested in how the link/hub/satellite joins required make ETL and particularly VERSIONING of business logic any easier.
“2) Star schemas are the right way to represent the end user model”
They are ONE way and currently popular. Whether it’s right or to what extent can be debated. They have certain strengths and weaknesses most of which are obvious.
“3) Auditing is often a requirement, especially in heavily regulated environments”
A fairly common requirement in any organization of size – usually required in telecom, finance, banking, insurance, healthcare etc (country regulation specific).
“It seems we disagree what we will call a staging area and whether there is some requirement for a database (and data model) in between staging and the final star schema.”
It is a mistake to go directly from source to star (even if you have a staging area to land the data) in most cases. The staging is just to pull data out quickly and to enable joins etc especially when you have disparate systems. Using the staging area for anything more leads to other complexities.
I dislike Dan calling the Data Warehouse (Data Vault) a staging area when in fact it’s a fully functional centralized integrated data repository. But, he does make a point when people point to a stupid persistent staging area, knock on a couple of time fields and say – I’m audit ready. Yes, you may be audit ready, but that is the ONLY purpose that area serves.
“Because auditing and historical tracking is so well understood and easily implemented that we can consider it trivial”
IT IS in fact trivial which is why in the Data Vault auditing is a side-effect by design. You DO NOT have to design for audit like having a persistent staging area that serves no other purpose.
“I think it would be useful if Dan could shed more light on any complexity reduction in the ETL code that is the result of applying a Vault based approach to the intermediate storage area before the data mart.”
Loading data into a Data Vault is trivial because of both design (consistent pattern based design) with the methodology (no “soft” business rules). ETL code is in fact broken down into many streams of tiny processes which can be run in parallel in an MPP style – which makes getting the data into the DW (layer 2 after staging for batch or layer 1 for RT).
This data is in fact INTEGRATED by hubs and links which is where the power lies. New data can be absorbed into the Data Vault DW WITHOUT touching or affecting an existing line of code or data. Change impacts are minimal – impossible in a conformed BUS architecture.
EVEN WHEN the source systems change whether during maintenance projects or replacement projects, the Data Vault can absorb this data into integrated structures without changing any existing loading code or data – but simply by adding on to it.
This integrated repository definitely enables very rapid “raw” star-schema generation and it is then much easier to apply or change business rules even within the ETL code to the marts because integration points have been set within the data.
It also has MANY more uses than audit or compliance which it does automatically.
“IF you have that 1-1 representation around already, what value does DV then add and how does it reduce the complexity of getting from the staging model to the Star Schema. I am particularly interested in how the link/hub/satellite joins required make ETL and particularly VERSIONING of business logic any easier.”
The DV provides a 1:1 “data only” representation (not structural – even though it’s trivial to rebuild source structures from a DV). This is completely different from a 1:1 staging area even if you make it persistent and or add time tracking.
It has a level of integration and capacity for data absorption that is unparalleled. It can scale like MPP because of the core data model and the ETL load process model of the data into the DV based DW.
Pulling data out is less complicated because you do not need to do integration AND business rules in the same load.
Because of the consistent patterns – ETL out in the raw form is also fairly trivial and you can still apply your business rule logic within the data integration flows or mappings to the star-schemas. Your mappings will naturally be simpler than before.
The Data Vault is a data asset you end up building that is flexible in structure but rigid in methodology and it can absorb almost anything you throw at it with ease. It discourages re-engineering from source, so you never have to rebuild, re-test from the same data from the source to the DV. You can just test any new code and test from the DV to the marts (even that is simplified).
Because you have a integrated repository with history, you can actually deliver smaller volume marts or aggregated marts if the need arises without worrying about both storage and delivery.
Everything in the Data Vault starts at the hubs. New satellites can easily be added in from the same source systems (or from different source systems to the same hubs). New systems can be integrated within both hubs and links.
It simplified the data extraction, integration and storage aspects substantially. Enough to make a dramatic impart to star-schema based mart generation.
There are MANY more plus points to storing data in the way the data vault is designed, and I’ve already gone into them. The main and most important point is NEVER having to re-engineer existing code or data in the DV based DW which is primarily a storage area (even though it has a plethora of useful applications).
Is the Data Vault applicable everywhere?
Nobody, not even Dan has claimed that. There are in fact situations where you may not need it. But, if there are a multitude of systems, large volumes of data, changing business rules, changing sources, merging/splitting companies (or depts within companies), then a DV is an invaluable asset to have.
The Data Vault does not have a DW “life-cycle”, but ONLY a data life-cycle. You can continuously load it for 10-15 years (or forever) and it will just keep becoming more and more valuable. You can add to it just like legos, You can merge or split satellites when their update frequency changes without losing any history.
The strengths show up in the integration and maintenance of a Data Vault DW and the simplification which happens a) because it’s splitting the work and b) pre-integration prior to mart loads.
I hope this clears up some of the confusion.
Even though I’ve done it all in many, many years (including mistakes), I’m never going back to a purist Kimball approach and will most likely be using a Data Vault as the preferred DW structure for pretty much all projects because of the risk reduction.
Hi Sanjay
Thanks for the replies. There are a lot of point, so let me address each one in turn:
“They are ONE way and currently popular. Whether it’s right or to what extent can be debated. They have certain strengths and weaknesses most of which are obvious.”
…Indeed there might be other ways. And we don’t yet know if general relativity allows faster than speed of light travel through wormholes – yet we do not let that lack of knowledge influence the design of the Boing Dreamliner.
“A fairly common requirement in any organization of size – usually required in telecom, finance, banking, insurance, healthcare etc (country regulation specific).”
…In other words: we agree, right?
“It is a mistake to go directly from source to star (even if you have a staging area to land the data) in most cases. The staging is just to pull data out quickly and to enable joins etc especially when you have disparate systems. Using the staging area for anything more leads to other complexities.”
… This is EXACTLY where I disagree. There are complexities with transforming data to something users can view. The question I think people are trying to ask here is: What does DV add that removes this complexity?
“I dislike Dan calling the Data Warehouse (Data Vault) a staging area when in fact it’s a fully functional centralized integrated data repository. But, he does make a point when people point to a stupid persistent staging area, knock on a couple of time fields and say – I’m audit ready. Yes, you may be audit ready, but that is the ONLY purpose that area serves.”
… I must have explained my point very poorly. I see the staging area is that place you tag data and where you can RELOAD data from if your transformation rules change. It is also a catalogues of which sources tables
“IT IS in fact trivial which is why in the Data Vault auditing is a side-effect by design. You DO NOT have to design for audit like having a persistent staging area that serves no other purpose.”
… Sanjay, I don’t know what straw man you are putting up and shooting down here. I think I speak for a few people on this discussion when I say that audit is not the only purpose of a staging area, it serves as a way to quickly rerun ETL code too (allowing new logic without reextracting from the source).
“Loading data into a Data Vault is trivial because of both design (consistent pattern based design) with the methodology (no “soft” business rules). ETL code is in fact broken down into many streams of tiny processes which can be run in parallel in an MPP style – which makes getting the data into the DW (layer 2 after staging for batch or layer 1 for RT).”
… Again, you are not saying anything that isn’t ALREADY addressed in a Kimball model. Having a consistent pattern based design is exactly what a Kimball model is and ETL code is in fact broken down into a highly parallelised set of process. I simply don’t see what a layer of super normalisation (which is what DV ends up looking like) does to make parallelism any more efficient. If you think super normalising makes ETL run faster – you are really misunderstanding parallelism.
“This data is in fact INTEGRATED by hubs and links which is where the power lies. New data can be absorbed into the Data Vault DW WITHOUT touching or affecting an existing line of code or data. Change impacts are minimal – impossible in a conformed BUS architecture.”
… Again, this claim about the BUS architecture is straw man with no relation to reality. Adding new data to a “dumb 1-1 staging” area does NOT impact the information bus. Yes, adding data to a star schema DOES change the schema (that is kind of the point of adding data). I think you need to shed more light on what you mean by “INTEGRATED” here.
“EVEN WHEN the source systems change whether during maintenance projects or replacement projects, the Data Vault can absorb this data into integrated structures without changing any existing loading code or data – but simply by adding on to it.”
… As can a 1-1 representation of the source if you wish it to do so. But often you don’t WANT that behaviour! You want to understand what adding a column to the source means for your historical data – should it be updated with new, derived data or can we simply ignore old records and extend the schema as needed? These are business decisions and making source changes transparent is often a very POOR idea. TRACKING source changes is a different story, of course we should do this – and again this is trivial to do in a 1-1 staging.
“This integrated repository definitely enables very rapid “raw” star-schema generation and it is then much easier to apply or change business rules even within the ETL code to the marts because integration points have been set within the data.”
… again, “integrated” is very vaguely defined here. Also, please explain to me how massively normalising data and STORING it bi-temporally makes it any easier to generate star schemas than simply rerunning the highly paralliseable ETL code you can generate out of a 1-1 copy of the source. The joins you have to go through to get stuff out of a DV are deadly to any optimiser, how do you work around that?
“It also has MANY more uses than audit or compliance which it does automatically.”
… Perhaps, this is what I think people fail to see. Might you name some example using an actual data flow from source to star?
“The DV provides a 1:1 “data only” representation (not structural – even though it’s trivial to rebuild source structures from a DV). This is completely different from a 1:1 staging area even if you make it persistent and or add time tracking.”
… I find the “data only” term here a bit unclear. Perhaps you can elaborate on what you mean by this and how this is completely different from a 1-1 representation?
“It has a level of integration and capacity for data absorption that is unparalleled. It can scale like MPP because of the core data model and the ETL load process model of the data into the DV based DW.”
… Maybe it can, though the normalisation generally lowers parallelism. Again, there is nothing new here, Star schemas loaded from 1-1 have the same properties and I am not sure if you are claiming they don’t?
“Pulling data out is less complicated because you do not need to do integration AND business rules in the same load.”
… If i understand right, you claim that the DV has store a “pre integrated” form of the data that has not yet been subject to business rules. Correct? Please tell me how this is different than the dimension bus which stored pre integrated dimension tables? I assume you still need some form master data management to actually make sense of sources that disagree with their key schemes and data, right?
“Because of the consistent patterns – ETL out in the raw form is also fairly trivial and you can still apply your business rule logic within the data integration flows or mappings to the star-schemas. Your mappings will naturally be simpler than before.”
… nothing new. Good ensign practice and something Kimball models have done for year already.
“The Data Vault is a data asset you end up building that is flexible in structure but rigid in methodology and it can absorb almost anything you throw at it with ease. It discourages re-engineering from source, so you never have to rebuild, re-test from the same data from the source to the DV. You can just test any new code and test from the DV to the marts (even that is simplified).”
… in the same way you can just test and add code on top of a 1-1 extract (we all know writing SELECT * is a poor idea and that naming columns is good, right?). However, I am not sure I WANT something that is too flexible in structure and rigid in methodology. In fact, I find that most HARD scale problems requite some non-rigid methodologies (the noSQL movement being a great example) and strong, fixed structures that make powerful assumptions about the data (take highly optimised compute clusters as a good example).
“Because you have a integrated repository with history, you can actually deliver smaller volume marts or aggregated marts if the need arises without worrying about both storage and delivery.”
… THAT is a very interesting claim. I am beginning to think there is something very deep about this “integrated” property of the Vault that allows you to aggregate data faster than the database is capable off? Could you elaborate with an example of how you use the “integrated” property to make a data mart smaller and more aggregated without having to worry about storage?
“Everything in the Data Vault starts at the hubs. New satellites can easily be added in from the same source systems (or from different source systems to the same hubs). New systems can be integrated within both hubs and links.”
… How exactly is this “integrated”? Because the satellite joins to the hub? Sounds more like a messy, normalised stovepipe to me. If this is what integration is, why bother modelling and storing the source in a new way – the source already HAS a model? In fact, that is exactly what I am championing: “Re-Modeling the source is a waste of time”.
“It simplified the data extraction, integration and storage aspects substantially. Enough to make a dramatic impart to star-schema based mart generation.”
… You simply need to supply an example of this to be believable. I think that is the “secret sauce” Thomas is looking for.
“Is the Data Vault applicable everywhere?”
… I don’t think that is what this debate is about (apologies in advance if any disagree). They way I understand the question being asked by Thomas is: “Is Data Vault applicable ANYWHERE?”
“The Data Vault does not have a DW “life-cycle”, but ONLY a data life-cycle. You can continuously load it for 10-15 years (or forever) and it will just keep becoming more and more valuable. You can add to it just like legos, You can merge or split satellites when their update frequency changes without losing any history.”
… Yeah, adding data to the same system tends to make it more valuable. No disagreement there. Changing the schema to accommodate those changes (without losing history) is also a fairly trivial task in a 1-1 staging area. I think distinguishing between “DW life-cycle” and “data life-cycle” only adds to the confusion – unless you define these terms more accurately.
In summary, I think there are two major issues here Sanjay:
1) You are putting up an odd straw man for dimensional systems (apparently, because you have failed implementing them in some case) and then you are shooting it down to promote DV. You claim there are all these issues with the Kimball approach that are somehow solved by DV. Yet many of the issues you claim are there, people just don’t recognise or they have found good solutions to them already. Understandably, this approach you are taking to the argument does not lead to a very constructive dialogue since it resembles the preaching of religion more than practise of science.
2) There seems to be some confusion about the word: “INTEGRATED”… I think the discussion would be advanced if you could define exactly what you mean by “INTEGRATE”. Maybe you could supply a real example that illustrates how this DV integration get data from the source to a star schema.
I think it would be useful to advance the argument by walking through some examples of real data models.
I think you’re misinterpreting what I’ve said, so perhaps … I’ve done a poor job explaining or something else. As you stated, lots of points so I’ll see what I can address.
“This is EXACTLY where I disagree. There are complexities with transforming data to something users can view. The question I think people are trying to ask here is: What does DV add that removes this complexity?”
This appears to be the root cause of the misunderstanding of the DV based on a lot of the comments. Dan has put out a ton of information in his videos and chapter downloads of his book. They explain this in detail. It would be difficult to go into details without showing you the different models to compare and contrast them.
“I simply don’t see what a layer of super normalisation (which is what DV ends up looking like) does to make parallelism any more efficient.”
This parallelism is for loading the DV. Staging is still 1:1 for batch loads even in a DV which is 3-tier. Staging loads will be no different and it’s a truncate load area. The DV loads are trivial, small and numerous allowing very simple MPP style parallelism.
“If you think super normalising makes ETL run faster – you are really misunderstanding parallelism.”
It has more to do with breaking the loads down into several pieces that can be run in parallel than normalizing.
“As can a 1-1 representation of the source if you wish it to do so.”
Yes, it can but in a very simplistic way only.
“You want to understand what adding a column to the source means for your historical data – should it be updated with new, derived data or can we simply ignore old records and extend the schema as needed?”
YES. Now you’re asking the right question. Yes, the DV does provide that understanding and traceability and the “what happened?”. The old data will continue to get loaded and you can extend the schema with new data without impacting old records or processes.
“These are business decisions and making source changes transparent is often a very POOR idea.”
I very strongly disagree with this statement. Maybe you want to explain what you really mean.
“TRACKING source changes is a different story, of course we should do this – and again this is trivial to do in a 1-1 staging.”
You can potentially add change tracking logic to your staging area, but this makes the staging slightly more complex than a simple truncate load. What do you do in an RT situation here?
I’m not saying there are no potential solutions. What I’m saying is more often than not, they’re much more inconsistent and varied than most folks claim – very often inelegant.
“The joins you have to go through to get stuff out of a DV are deadly to any optimiser, how do you work around that?”
How is this any different from the joins purely from source replicas in staging. Also the DV contains comparatively narrow tables and less redundant data because of the satellite splits. It’s a relatively lower data volume in the joins with relatively narrower tables. The split in the satellites is based on certain criterion and rate of change is one of them. It’s not normalized down to the column level like say an anchor model (even though you could consider an anchor model a special case of a DV data model).
A hub and it’s satellites is loosely equivalent to a dimension (and is often used to source a dimension load to the marts).
I really don’t see why this is an issue. You’re never really joining everything in a DV and getting it out in one fell swoop. You wouldn’t be doing that in any other architecture either. How and why is this different?
“I find the “data only” term here a bit unclear. Perhaps you can elaborate on what you mean by this and how this is completely different from a 1-1 representation?”
There is a 1:1 copy of the source data only at any point in time in the DV but the structure is NOT 1:1 as you would typically have in a staging area. The hubs are an collection of business keys across the sources and are append only. The satellites hang off the hubs and are descriptive and temporal data and look fairly similar to an SCD type II in structure. The links contain the relationship by key and the link satellites contain descriptive data.
There’s nothing stopping you from creating a single hub and a single satellite hanging off it and calling it a DV. The satellite split is recommended but optional in many cases (and there are forced splits in many cases all of which are well defined).
“If i understand right, you claim that the DV has store a “pre integrated” form of the data that has not yet been subject to business rules. Correct? Please tell me how this is different than the dimension bus which stored pre integrated dimension tables?”
Great question!
In a dimensional bus, the conformed dimension can have raw data but is usually integrated and you’re almost forced to have business rules on it (at the bare minimum precedence and super-ceding of different sources with the same data).
The hubs in a DV force a certain level of master data conformity in most cases where it’s known. Compared to a conformed dimension in a BUS, a hub and it’s satellites are much easier to extend without creating cascaded change impacts. It’s by design and separation while still maintaining the integration point which is the hub.
In a maintenance situation where the conformed dimension has to be extended because new sources dictated that the data does in fact belong to that dimension and requires change of the dimensional structure, the impact is much higher. In the DV, this is easily absorbed.
This gives you flexibility in pulling the data out in different situations with an integration point that reduces the work compared to doing it directly from source or “historical” staging.
“I assume you still need some form master data management to actually make sense of sources that disagree with their key schemes and data, right?”
While this is needed, and the DV model enables you to do a certain level of MDM, the methodology doesn’t allow you to “judge” the data per se. That is a “business decision” and it does give business the flexibility to change their minds (which they’ll do anyway).
The hubs are where the business keys align (when the business knows about this) and in situations where it’s not known, it provides ways to get around it without compromising the model (This is usually forced and not considered a “best practice”).
There are philosophical arguments we can get into here which I find serves no purpose and usually ends up in non-productive debates.
“In fact, I find that most HARD scale problems requite some non-rigid methodologies (the noSQL movement being a great example) and strong, fixed structures that make powerful assumptions about the data (take highly optimised compute clusters as a good example).”
You may have a point here. However, the noSQL movement is still struggling with structured data storage and retrieval. The problem they’ve solved is a “technical” one to an extent – They’re still working on many issues. The model even they’ve used is basically MPP albeit on commodity hardware (not new. Seen it in university apps over 10 years ago).
You’re also inferring that the data structures are inflexible in the Data Vault when it’s just the opposite. The structures are like little lego pieces made of putty that you can put together any way you want (and re-shape to an extent).
It’s the methodology that is inflexible not the model.
“THAT is a very interesting claim. I am beginning to think there is something very deep about this “integrated” property of the Vault that allows you to aggregate data faster than the database is capable off? Could you elaborate with an example of how you use the “integrated” property to make a data mart smaller and more aggregated without having to worry about storage?”
You’re reading way too much into what was actually a very simple statement. Because you have ALL the data at a granular level in the Data Vault with CDC, you can in fact get away with delivering aggregated facts or smaller volumes in the marts in many cases. If you need to pull ALL of the data into the marts, you can do that too. The flexibility is there. Yes, this can also be done with persistent staging (and I didn’t claim that you can’t). However, a persistent staging area serves no other purpose and doesn’t have ANY level of integration. It’s basically a source system copy with more data than the actual sources which can only lead to more complex data integration on the way in to the user-facing data layers.
“You are putting up an odd straw man for dimensional systems (apparently, because you have failed implementing them in some case)”
Did you just pull this statement out of a hat? You’re assuming I’ve failed in a certain dimensional architecture which is why I recommend the DV? Also numerous references to “straw man” arguments. Usually it would get me or anyone ticked off (I’m also guilty of doing this at times, but sometimes writing doesn’t clearly convey tone).
I NEVER claimed I know you or your skill level or knowledge. I’d appreciate if you extend me the same courtesy without being judgemental. Yes, it’s ok to retort and debate points. I’ve built and preached Kimball style solutions for years before being exposed to the Data Vault. Even Dan has very extensive Kimball design and build experience and so do many other DV practitioners.
“I think it would be useful to advance the argument by walking through some examples of real data models.”
Dan has done this on video, in his books, in many articles … in his training courses several times over.
The best places to learn more would be either Dan’s blog or at http://LearnDataVault.com
“Understandably, this approach you are taking to the argument does not lead to a very constructive dialogue since it resembles the preaching of religion more than practise of science.”
And why did it get to this point? There have been places in the discussion above where I’ve received responses like “I disagree” without a reasonable level of argument. I’ve tried to back it up with what I believe reasonably experienced architects and modelers would understand (It could very well be a flaw in the quality of my explanations as well).
“They way I understand the question being asked by Thomas is: Is Data Vault applicable ANYWHERE?”
Unlike Dan, I believe it’s applicable EVERYWHERE and consider it a superior architecture with more traceability and flexibility than any other architecture I’ve personally used. It has many positive side-effects and helps even dimensional models to be built faster out of it than straight from the sources (nothing revolutionary here … very often for complex dimensional loads you need to break up the logic anyway).
I’d rather have a valuable data repository that has multiple uses than bandage solutions. The DV is a better source for dimensional builds than the source tables or persistent staging tables.
There are other positive side-effects which go beyond the “traditional” Business Intelligence realm.
- It can absorb both batch and real-time data with no model changes because of the late binding data model.
- It’s a better source for DQ than the source systems because it contains integrated historical source data in raw form.
- It’s a better source for Entity/Identity analytics (which serve MDM and other applications) because of the same reason above.
- It’s a better source for DaaS and feedback loops back to the source systems in many situations.
- For certain data mining algorithms and tools that can mine in-db data, it’s a very good source because of the structures.
My start of using a DV was when I was in an extremely complex situation. The DV provided much more value than for just the BI project as a side-effect and helped simplify a lot of the BI needs as well.
Even if you don’t use a DV, I see you recognizing value in storing historical source data prior to loading your star-schemas. Why not take it up a notch and store it in more consistent friendly structures that are actually aligned by business key (or enable late alignment if not currently known). It’s actually argument enough to have a DV layer without going into detailed explanations.
Thomas,
I have a few questions for you to answer, before trying to discuss what the points are, I would like to understand (clearly) how you define some things, and how you (precisely) deal with the following situations. It’s not that I don’t have this knowledge, it is simply to see where our definitions differ or are the same. Please be as specific as you can, or we will not reach the answers of your original question: to compare and contrast the DV method, Systems Architecture, and Data Model with the Bus Systems Architecture, the Bus Methodology, and the Dimensional Data Model.
1) Please define your staging area. Does it contain a full set of history? If so, then does your “conformed dimensional star schema warehouse” also contain a full set of history?
2) Does your definition of staging area mean 1:1 structure for the source? or 1:1 data for the source? or both?
3) If your staging area is 1:1 structure of the source tables, then how do you bring in XML and unstructured / semi structured data sets?
4) If your staging area is 1:1 structure of the source, then does your staging area contain primary and foreign keys that are copies of the Referential Integrity of the source system?
5) Please define how your staging data is integrated (if at all). As Kimball defines it in his book as a heterogeneous environment ranging from flat-files to relational, to multitude of infrastructural disparate data landing zones.
6) Please define your term: Data Warehouse. Does it mean “historical, non-volatile, integrated, and subject oriented?” Or does it mean simply: historical and base-fact driven? Or does it mean “a logical union of the data marts” as Dr. Kimball would define it?
7) Please define your term Real-Time. What do you mean by Real-Time? What are the data arrival rates? Seconds? Sub-seconds? milliseconds? varied?
8) Please walk through a specific case of how Dr Kimballs solution would handle real-time data feeds (meaning millisecond burst rates of 5,000 to 10,000 transactions per second) – and then please define what happens in the data marts once the “late arriving data” gets there on a batch schedule. What happens in your dimensional world for this case?
9) Somehow parallelism and performance crept in to this discussion. Please define how you run parallel processes through business rules directly in to your “conformed dimension based logical data warehouse”. Please explain the step-by-step process by which a set of business driven processes load in parallel to even a single star.
10) You said Sanjay doesn’t understand parallelism. Please give an example of what kind of parallelism you are specifically referring to.
11) What is the absolute largest “logical conformed dimensional data warehouse” (as defined by Dr. Kimball) you have worked on? I am referring to the following specific metrics: a) raw data size, b) amount of data arriving on regular batch streams, c) amount of data arriving on real-time streams at the same time, d) amount of data being queried WHILE the data warehouse is under load (during a batch load). I don’t mean: just simple stagnant dead data size sitting in the “warehouse”
12) Please outline the specific steps to adding a hierarchy to a dimension (changing the grain of the sequence number that is used in fact table(s). Please give us some idea as to exactly what the process is for you to build that change in to the star schema.
13) IF you have introduced referential integrity in to your relationally based staging area – how do you handle parallel loading and huge volumes? How do you handle real time to the staging area?
14) IF you have real-time data feeds, do you stop at the staging area at all? Or do you load directly to the “logical dimensional data warehouse?”
There are probably 50 more questions that I should ask here, but reality says I don’t have time, and you probably don’t have time to answer them all. Again, I am in full understanding of how these things happen, and what the answers should be. My intent for asking these questions is to see where your definitions do not align with mine – so that we may begin to accurately discuss the differences in a comparison and contrasting method.
Which I believe was your original question.
Now, if you choose not to answer any of these questions going forward – then there clearly you are simply defending your “solution” without justification either. And at that point, I would see no further use to this conversation as well.
Thank-you kindly, looking forward to your answers.
Dan Linstedt
Hi Thomas
Thank you for contributing to this discussion. And you are spot on with my intentions of my question, is DataVault applicable ANYWHERE?
Your comments also hits exactly what I think is the problem with Data Vault. There are a lot of claims, promises and “advantages over everything else” that are very difficult to disagree with. But I still find it very difficult to understand HOW as it never gets to actual examples.
Dan, I think Thomas has a couple of very valid questions to Sanjay’s statements. To answer questions with questions is something we all know from politicians trying to answer something too specific. And to me this is exactly what you are doing here. “We do not want to answer your questions, but please answer all our counter questions. And if you do not, you are just defending your solution”.
Data Vault is the new kid on the block in Data Warehousing. And when you come along with this “advantages over everything else” approach it is being positioned with, you should expect to be able to answer simple how to questions.
Data Vault is making a lot of promises that are difficult to disagree with. But the next natural question after “do you agree with the´advantages?” is always “how?”.
With the “show me your biggest warehouse…” question, I cannot help wonder if the mega warehouses is the only place Data Vault is applicable? Then at least we know it is applicable SOMEWHERE.
“But I still find it very difficult to understand HOW as it never gets to actual examples.”
Dan gave away a complete data model at http://LearnDataVault.com and chapters of his book. Perhaps you should download it. Many, many people have. There are also videos on there.
“Your comments also hits exactly what I think is the problem with Data Vault. There are a lot of claims, promises and “advantages over everything else” that are very difficult to disagree with”
Why is it difficult to disagree? Is it because you agree or you’re not sure. Let’s reverse this and ask you what are the points you agree with?
“Data Vault is making a lot of promises that are difficult to disagree with. But the next natural question after “do you agree with the advantages?” is always how?”
The how is relatively easy to explain and has been done … several times
“With the “show me your biggest warehouse…” question, I cannot help wonder if the mega warehouses is the only place Data Vault is applicable? ”
Volume is just one problem and in my personal opinion less important than other things it solves in better and more elegant ways than prior methods even though scaling is addressed.
It solves integration and flexibility issues. It solves audit and compliance issues (If you have data in a DV, you’re compliance projects like SOX etc are already done just by pointing to it – assuming you’ve used the methodology). Most importantly it solves the longevity issues with existing constructs. The focus of the DV is integrated storage in flexible and accommodating structures. Any applications of data out from there (including data marts) are easier to build from it because of business key alignment. It dramatically reduces testing impact on maintenance projects because of the way it’s designed.
There are many things that can be borrowed from the DV and applied to other architectures to gain some of the benefits, but to gain ALL the benefits you need to go all the way.
If the question is why you would want to? Then we’re back to square one.
So you’ll have to answer which aspects of it you in fact agree with. I see a BIG difference in the way you’d approach a star-schema build to what Thomas would just inferring the comments (which may be a tad unfair on my part).
Interesting. I ask Thomas to respond to my questions, to validate his statements so we can align our thinking, and I get challenged yet again.
So Thomas, does this mean you won’t answer my questions? Or are afraid to justify your solution?
Again, it is no longer about me justifying the Data Vault – so much as it is:
IF we are to have a reasonable debate, then BOTH sides of the arguments must be presented. So far, I have seen nothing from Thomas regarding the justification of his position. All I see here are attacks on me and the Data Vault. Very difficult to have a good debate this way.
I never said “advantages over everything else” – I have never claimed that. I have never stated that the DV is the be-all-end-all.
Size has nothing to do with it. I was merely asking you the question – so that you can tell me what your experience is, and your background is with very large data warehouses.
You see: many of the “foundational cracks” in methodologies and architectures do not readily reveal themselves until you have to deal with both volume of data, and real time. My point is also this: I have seen very few “dimensionally based raw granular data warehouses” that meet the needs of the enterprise. And further, I have seen even fewer that have spanned and succeeded in meeting the enterprise needs at high volumes of scale with near-real time data feeds.
I am merely asking you to justify your side of the argument, again to define clearly what you are basing your premises on. It is (as I stated) difficult to have any sort of discussion or debate about the merits (or not) of something when only one side of the picture is being exposed.
I am asking you WHY you disagree with the promises. I am also asking you to present your definitions – so that we know “what table” we are really sitting at.
I’ve shown in the market place over and over again, and in classroom training – HOW the Data Vault can work. It is a proven implemented solution at many many customers. If you really want to know HOW it works, then please visit some of the customers where it is implemented successfully, and ask them point blank why they chose it.
I can lead a horse to water, but I cannot make him drink.
Just curious now, about why you chose to dodge my questions?
Dan Linstedt
Dan, I think there is some confusion about who is on the thread.
There are 2 x Thomas here. You asked me (Kejser) the questions (good ones) and I will be happy to address them once I am back on a non-mobile device.
In the meantime, I encourage Sanjay and the other Thomas to focus on the issue at hand, namely the HOW not the “who” or the “what”.
And Dan, I think we have already seen your examples (I have even taken a course and implemented a full loading framework that adds data to a Vault) – the issue at hand here seems to be that even then, it unclear how the claimed benefits are realized.
I hope this thread lead to more clarity about what Vault really is.
Hi Dan
Answers to your questions below. THanks of numbering them, makes it easier to follow up on the discussion
1) Please define your staging area. Does it contain a full set of history? If so, then does your “conformed dimensional star schema warehouse” also contain a full set of history?
… It does if history is needed. If the source guarantees to store history and is willing to bear the cost of re-extraction in case of business logic change, I will not bother. If it doesn’t give these guarantees (it rarely does, as I am sure you know) – staging is simply all of history. Sometimes, I may chose to split the old data into different servers to save on cost – so staging does not have to be a single, physical storage area.
2) Does your definition of staging area mean 1:1 structure for the source? or 1:1 data for the source? or both?
… Both. And if the sources changes structure, it is handled as per below.
3) If your staging area is 1:1 structure of the source tables, then how do you bring in XML and unstructured / semi structured data sets?
… in the same way that the source handles it, using database intrinsic functionality (depending on the DB engine) or simply as files. Here, I am quite excited about using the new BigData solutions like HADOOP to handle this (since database generally do a pretty poor job at storing semistructured data, no matter how you model it)
4) If your staging area is 1:1 structure of the source, then does your staging area contain primary and foreign keys that are copies of the Referential Integrity of the source system?
… I do hold the key columns themselves, but I do not enforce the integrity. The reason is that I don’t trust a source system to actually maintain that integrity. I see this as a problem with the Vault too, how do you link two hubs together where the referential integrity of the source is broken? Do you maintain strictly enforced referential integrity?
5) Please define how your staging data is integrated (if at all). As Kimball defines it in his book as a heterogeneous environment ranging from flat-files to relational, to multitude of infrastructural disparate data landing zones.
… Define what you mean by “integrated”. You can think of it as a “pile” of copies of the source data with a dictionary that describes what I actually HAVE in there (to make it easy to find again).
6) Please define your term: Data Warehouse. Does it mean “historical, non-volatile, integrated, and subject oriented?” Or does it mean simply: historical and base-fact driven? Or does it mean “a logical union of the data marts” as Dr. Kimball would define it?
… I am not a big fan of the term at all – since I think the industry has so many definitions as to make it nearly vacuous. But if I was to define it, it would be: The sum of all conformed marts, the staging area and the conformed dimension store. You may chose to include any master data/hierarhcy management systems in there too. Staging is very non-volatile (if the source is), the marts are quite volatile (as their shapes change with the business requirements)
7) Please define your term Real-Time. What do you mean by Real-Time? What are the data arrival rates? Seconds? Sub-seconds? milliseconds? varied?
… Depends on requirement. Sub second is doable from a pure data perspective. In the ms range, the rendering time of reports being to dominate the perceived runtime for the user and event driven systems become more viable. The most cost effective approach will vary with requirements. The cost here is in the load part of the solution. It is a lot cheaper to ETL bulk load a nightly batch than to pay the operational costs of keeping a real time load system constantly running. It essentially becomes a question of uptime requirements.
8) Please walk through a specific case of how Dr Kimballs solution would handle real-time data feeds (meaning millisecond burst rates of 5,000 to 10,000 transactions per second) – and then please define what happens in the data marts once the “late arriving data” gets there on a batch schedule. What happens in your dimensional world for this case?
… Good question. Let us make the tough assumption that the real time feed is required to be sub second. In that case, the only good solution to doing the transform from the source fast enough (without moving the join costs to the end user during report queries) is to use an event based system – Complex Event Process (CEP). The CEP engine keeps multiple bulk connections open to the fact table and delivers data as it arrives in there. Late/early arriving data is handled by first inferring keys (if the fact arrives early) and real time streaming those keys into the dimension followed by a fact load of the key (this can be done in the millisec range, the coordination of the event sequence I hope is clear, is trivial). The CEP engine maintains a hash table of all known dimension keys and updates this hash table when a new inferred key arrives. Typically, you would want the dimension table to have transactional semantics (with singleton inserts) whereas you want the fact table to be in bulk load mode. This gives you the consistency required for most situations. One of the reasons you want the dimension tables to have traditional transactional semantics is exactly to avoid duplicating early arriving keys (the “create key in dimension if it is not already there” transaction can easily be made idempotent).
Large chunks of data that arrive while the CEP is streaming are handled with traditional bulk load. Depending on the database engine, you of course have to be careful with ACID properties here. As Amazon has long ago realised, BASE semantics are often the way to do in these situations. Getting to the minute level realtime is essentially similar to a nightly batch system (though you now have to account for having enough CPU to both load and query at the same time)
I have successfully run systems that deliver 1M rows/sec in sub second systems like described above.
9) Somehow parallelism and performance crept in to this discussion. Please define how you run parallel processes through business rules directly in to your “conformed dimension based logical data warehouse”. Please explain the step-by-step process by which a set of business driven processes load in parallel to even a single star.
…You have some tables in your staging area. The small ones (typically, but not always, dimensions) are simply loaded into the marts directly, and can be done so independently of each other. As mentioned, the ability to infer a new key in a dimension is idempotent (if the “discovered” key has been created by another ETL job, it will just be returned, not created again). It should be reasonable clear that this makes all loads fully paralliseable since there are no dependencies between them. I assume you might worry about large tables and how to make the load of those parallel? Pretty much any database engine supports intra query parallelism, and for those that don’t, it can easily be faked like this:
Start multiple reader threads on the large table
Each thread decided to work on a subset of the input from staging (you can define this subset as partitioned by a column or simply from the physical offsets of rows)
Each thread does the transform needed on the subset of rows it needs
Each thread opens a bulk connection to the target table (typically a fact) and simply streams the data directly in there.
If you are unfortunate to have a database that does not allow more than one bulk stream per table, you can fake parallelism by hash partitioning the target table.
10) You said Sanjay doesn’t understand parallelism. Please give an example of what kind of parallelism you are specifically referring to.
… Sanjay made this comment: “This parallelism is for loading the DV. Staging is still 1:1 for batch loads even in a DV which is 3-tier. Staging loads will be no different and it’s a truncate load area. The DV loads are trivial, small and numerous allowing very simple MPP style parallelism.”, when describing why the super normalisation makes things more parallel. First of all, I am not even sure that response parses as a sentence. Second, if you make the claim that DV somehow adds parallelism that Dimensional models don’t have, you better have a REALLY good illustration of what you mean. I don’t think Sanjay has lifted the burden of evidence (which the other Thomas did allude to lies with the Vault people)
11) What is the absolute largest “logical conformed dimensional data warehouse” (as defined by Dr. Kimball) you have worked on? I am referring to the following specific metrics: a) raw data size, b) amount of data arriving on regular batch streams, c) amount of data arriving on real-time streams at the same time, d) amount of data being queried WHILE the data warehouse is under load (during a batch load). I don’t mean: just simple stagnant dead data size sitting in the “warehouse”
a) >1PB in a scale out system, 100TB in a scale up (one of the larger scale up DW in the world if I am not mistaken?)
b) 2TB overnight to be loaded. Intra day load speeds of up to 200GB in 20 minute time windows at 5M rows/sec
c) As per above, 1M rows/sec (risk data)
d) Not sure I understand the question. Likely because our definition of warehouse seems to misaligned as per previous question. But the number of concurrent queries were in the thousands, with some queries touching multiple TB for an answer.
12) Please outline the specific steps to adding a hierarchy to a dimension (changing the grain of the sequence number that is used in fact table(s). Please give us some idea as to exactly what the process is for you to build that change in to the star schema.
… This is a hard task, and why I don’t store grains higher than the lowest – which makes this problem go away. This is by the way one of the places I disagree with Kimball, I don’t think there is a need for any grain but the lowest possible. If you want something higher, that is what aggregates are for.
13) IF you have introduced referential integrity in to your relationally based staging area – how do you handle parallel loading and huge volumes? How do you handle real time to the staging area?
… I do not enforce referential integrity in the data marts, I let the ETL handle the checking (and potentially inferring of keys) that need to happen. The reason: Sources typically don’t deliver integrity, so unless I build some business logic (driven by conversations with the business users) – I expect to get garbage in and garbage out. As would, I assume DV, until the model cleans itself.
14) IF you have real-time data feeds, do you stop at the staging area at all? Or do you load directly to the “logical dimensional data warehouse?”
… That is a great question. It depends on what you mean by real time. If you are in the minute range, there are sometimes advantages to staging the data (it can make joining co-located). But if you are in the second range, delivery to the end result is typically the path forward. I/O can take too long, even on modern NAND.
Incidentally, I would ask you the question here: In the real time scenario, how to you avoid paying the cost of “stopping” the data in the DV? And if your claim is that he DV is fast to load – doesn’t that just move the total data latency (as measured from when the data is available in the source until it shows up in a user query) to the price of reconstructing the result through joins of hubs/links/satelittes (i.e how do you avoid making the user pay an cost, which I pay in the CET/ETL code, for these joins?)
Sanjay, answers to your questions and corrections to the straw men you keep putting up (I will point them out more clearly now, since it seems you do not agree that is what is happening here)
“You can potentially add change tracking logic to your staging area, but this makes the staging slightly more complex than a simple truncate load. What do you do in an RT situation here?”
… Who said that a simple truncate load is what staging is? That is one way to do it if you don’t care about history of the source. But didn’t we just call it a 1-1 of the source? Again, that staging has to be truncate/load is a misrepresentation of a Kimball approach.
“I’m not saying there are no potential solutions. What I’m saying is more often than not, they’re much more inconsistent and varied than most folks claim – very often inelegant.”
… Elegant is of course always a subjective matter. But I would challenge you to compare a 1-1 history with a Vault and then ask: “Is it more elegant to MODEL the source than to just track history of it?”
Thomas: “The joins you have to go through to get stuff out of a DV are deadly to any optimiser, how do you work around that?”
Sanjay: “How is this any different from the joins purely from source replicas in staging. Also the DV contains comparatively narrow tables and less redundant data because of the satellite splits. It’s a relatively lower data volume in the joins with relatively narrower tables. ”
… I think you are quite seriously misunderstanding where a relational engine spends its time when joining. The fact that your tables are narrow and many, makes the joining performance WORSE. Especially if your joins include criteria like: “ActiveDate BETWEEN FromDate AND ToDate” which wreck havoc with database statistics. As you are hopefully aware, the complexity of coming up with a good query plan for a large join tree is O(n!) – which means there IS no generally good solution to the many table join problem.
To answer your question on how this is any different in ETL: during ETL, I am in control of the sequence of joining and I can optimise my joins for best possible co-location and join strategy. If the source happens to have denormalized data, I will take advantage of this, not take the data apart again because my hub/sat/link strict model requires it. I also get the options of storing an intermedia join result in memory and reusing it for multiple loads (whereas the relational engine just discards the intermediate results after it is done with it). Of course, you can mimic the same behaviour by doing “table frogger” where you do a partial join, store the intermediate result in a table and then continue to expand the join tree (and good luck with that if you have a replicated, geo clustered database system that logs all changes in the database). If you are going down that path, you have to ask yourself the question: Why did you bother to normalise in the first place?
“A hub and it’s satellites is loosely equivalent to a dimension (and is often used to source a dimension load to the marts).”
… then why not just store it as a dimension and call it that? This is what the dimension bus is all about.
“The satellites hang off the hubs and are descriptive and temporal data and look fairly similar to an SCD type II in structure. The links contain the relationship by key and the link satellites contain descriptive data.”
…For clarification, let me propose a sample model. Let us say we have a business with customers who live in cities. We track history on both the customers location and on city data (say, the population of the city). How would you DV model this:
Hub_Cust –> Link Cust/City City
For reference, I would model it as:
DimCust (with City as denormalised column and if I get greedy, a foreign key to city)
DimCity (if used as a dim)
“The hubs in a DV force a certain level of master data conformity in most cases where it’s known. Compared to a conformed dimension in a BUS, a hub and it’s satellites are much easier to extend without creating cascaded change impacts. It’s by design and separation while still maintaining the integration point which is the hub.”
… Again, I think this “cascaded impact” is a straw man argument. its DIMENSIONS, they are small. There will be around 9B people on this planet and if you store all IP numbers in a dimension (not a good design admittedly, but bear with me), until recently, they fit in a 32 bit space. I don’t know about your database system, but the ones I work with handle a billion row update pretty fast.
“In a maintenance situation where the conformed dimension has to be extended because new sources dictated that the data does in fact belong to that dimension and requires change of the dimensional structure, the impact is much higher. In the DV, this is easily absorbed.”
… I see how this is easily absorbed in a model that has the property that it can absorbs anything you throw at it (which I agree that DV has). But that still does not answer the reduction of complexity question. You still have to think about what to do about those data marts with data that users actually use to VIEW and use the data. They don’t magically restructure themselves and solve this problem just because you added a new table. And how about the ETL that news to reload those marts? That doesn’t magically fix itself either (though you CAN meta model it to make changes really simple). There is also the business question of what the desired semantics should actually be, storing the data in a way that can represent any form (in the same way that a 1-1 copy has all the data the source ever had) does not make that go away.
“This gives you flexibility in pulling the data out in different situations with an integration point that reduces the work compared to doing it directly from source or “historical” staging.”
… Another straw man. I am pretty sure we agree that we have to get from the staging (even the truncate/load you mentioned) to SOME place, whether that be the Vault or directly to the mart. My argument is that you ALREADY know how to pull the data out of the source format, you did this when you staged it 1-1 and when you transformed it to a mart. You integration point IS the 1-1 staging.
“The hubs are where the business keys align (when the business knows about this) and in situations where it’s not known, it provides ways to get around it without compromising the model (This is usually forced and not considered a “best practice”).”
… I think we agree that there is a need for some place to keep track of how business keys map to each other. Some people would call that a master data repository or simply a part of the ETL process (with some lookup tables). It seems you call it a “hub” – but didn’t you just use that same word to refer to something that is the source of a dimension table? Why use the same name for two different concepts? Also, I believe the hub does not handle the situation where multiple keys map to the same entity (duplicates etc.) – would you store those in satellite? And if so, do you do so because it just happens to be a table that has 1-n relationships with another table in this modelling technique? If the goal is a strict model, perhaps it would be better to invent a new table that handles this type of key mapping to avoid confusing it with tables that handle dimension data.
“You’re also inferring that the data structures are inflexible in the Data Vault when it’s just the opposite. The structures are like little lego pieces made of putty that you can put together any way you want (and re-shape to an extent).”
…I may have put it too mildly. I am not inferring they are inflexible. Rather, I am inferring that what they represent is vacuous or perhaps superfluous. The example above of key mappings and dimension table sources being muddled up is, is I think a good example of how a too generic model is forced onto something that already has a stronger taxonomy. The structures you refer to that are like little lego pieces with the properties you assign them have been around for some time in database – we normally call them “tables”.
“Because you have ALL the data at a granular level in the Data Vault with CDC, you can in fact get away with delivering aggregated facts or smaller volumes in the marts in many cases. If you need to pull ALL of the data into the marts, you can do that too. ”
… To claim the “granular” benefit you must be assuming that the dimensional model does NOT store the data at the lowest possible grain. Very often, the “Grain” of a star schema (as mentioned above, I disagree that there is a real choice of this) is simply the source data, represented with new and more efficient keys for optimal joining (in few, wide tables, to make this more efficient in the engine).
“However, a persistent staging area serves no other purpose and doesn’t have ANY level of integration. It’s basically a source system copy with more data than the actual sources which can only lead to more complex data integration on the way in to the user-facing data layers.”
… Again, what is this straw man about the persistent staging having no other purpose and leading to complexity? In fact, you said so yourself that staging can be more than that: “You can potentially add change tracking logic to your staging area, but this makes the staging slightly more complex than a simple truncate load”….I think the confusion here is that you believe staging has to be a truncate/load or that if it isn’t, it becomes too complex. Would that be a fair way to represent your argument? We DO agree that you have to get from the source system to the user-facing data SOMEHOW right? and that this requires you to deal with the source data model? if I may venture to voice another of the big asks by the Kimball people here is: “How does putting the source into a hub/sat/link structure make this problem any easier than just dealing with the source itself?”… Now, I am just a humble database designer with 15 years in the industry and a few 100+TB warehouses in my pocket, but I from your teaching material, I don’t see how the Vault makes this source translation any clearer.
“And why did it get to this point? There have been places in the discussion above where I’ve received responses like “I disagree” without a reasonable level of argument. I’ve tried to back it up with what I believe reasonably experienced architects and modelers would understand”
… The other Thomas said it well: “To answer questions with questions is something we all know from politicians trying to answer something too specific. And to me this is exactly what you are doing here.”.. You are NOT answering the questions, but just throwing questions back. You can safely assume there are plenty reasonably experiences architects on this thread. Might there be something wrong with the argument for DV since it is not immediately obvious to us what is so brilliant about it? I believe understand the modelling technique, yet I don’t find it adds anything that isn’t already there. Obviously, there is some secret sauce missing here…
“I’d rather have a valuable data repository that has multiple uses than bandage solutions. ”
… is this another straw man, i.e. are you saying that Kimball solutions are bandage solutions? I think we all agree that an organisation that cannot agree on a common dimension bus, will only have “mart sprawl”. This is not really a Kimball problem, in the same way the “spreadsheet hell” is not an Excel problem. That being said, large organisation DO become a lot more agile when they allow individual divisions to build their own little “mart islands”. There is a long term cost associate with that as we probably all know. But last time I checked, a lot of companies just don’t think that far and are happy enough to play agile tactics, even if that means sacrificing the holy “one version of the truth”.
“Even if you don’t use a DV, I see you recognizing value in storing historical source data prior to loading your star-schemas. Why not take it up a notch and store it in more consistent friendly structures that are actually aligned by business key”
… I DO see the value in storing the source. Yet, I don’t see how inventing a new terminology of hub/sat/link and super normalising things makes my life any easier? In fact, I would rather store the data in the original, source form – because there is a fair chance I can have a conversation with the owner of the source about table structures he already know.
“Did you just pull this statement out of a hat?”
Sorry, I did not mean to insult you, my apologies. But I will point out that you DID say:
Sanjay: “Even though I’ve done it all in many, many years (including mistakes), I’m never going back to a purist Kimball approach and will most likely be using a Data Vault as the preferred DW structure for pretty much all projects because of the risk reduction.”
… Did you consider that perhaps you were simply doing Kimball wrong? If you see Kimball as requiring a truncate/load staging, I think it is clear to all why you would get in trouble with historical data and changing business requirements. Because when you needed the old data, it wouldn’t be there anymore.
“For clarification, let me propose a sample model. Let us say we have a business with customers who live in cities. We track history on both the customers location and on city data (say, the population of the city). How would you DV model this:
Hub_Cust –> Link Cust/City City
For reference, I would model it as:
DimCust (with City as denormalised column and if I get greedy, a foreign key to city)
DimCity (if used as a dim)”
If City is used as a HUB (which would be a DV recommendation in this scenario), you would have
HUB_Cust
SAT_Cust
HUB_City
SAT_City
LNK_Cust/City
LSAT_Cust/City (Optional)
Otherwise, you would have
HUB_Cust and SAT_Cust_City (The satellite may contain more than that).
“then why not just store it as a dimension and call it that? This is what the dimension bus is all about.”
You can do that, but when you need to extend DimCust (more likely than DimCity), then you have a situation on your hands where you need to re-model and re-populate from the source affecting the data integration routines from source to stage (trivial), stage to DimCust (less trivial).
The cascaded change impact to the dimension and to the star-model is reality. The potential change of grain is reality, the additional testing impact is reality. You HAVE to deal with it.
In the DV, we would tack on another satellite table to HUB_Cust and add a load which is trivial. This doesn’t free you from creating the new and improved DimCust, but it does simplify the join logic out because you have a single driving table.
Now, there IS an impact on the front-end user facing layers, but it gets lessened because of the change impacts being closer to the actual business view of the data. Testing the accuracy of the data in the DV is trivial which naturally reduces the burden on the accuracy in the user-facing layers.
“It seems you call it a “hub” – but didn’t you just use that same word to refer to something that is the source of a dimension table?”
A hub is a collection of business keys across the enterprise. The hub is a part of the Data Vault which is considered a Data Warehouse and can be used to source the Data Marts where you would have structures like Star-Schemas (at least according to the DV Architecture). A Hub and it’s satellites when combined look a lot like SCD-IIs. It IS a different concept which is why the hub which is located in the non-user facing DV is different from the Dimensions in the user-facing DMs.
“Also, I believe the hub does not handle the situation where multiple keys map to the same entity (duplicates etc.) – would you store those in satellite?”
Duplicates from the same source are not permitted in the hubs but their descriptions are permitted in the satellites. A way to handle them from multiple sources is by using same-as link tables. In a majority of cases the variance will be in the descriptive data which will end up in satellites. The Hub to satellite is by default 1:n. The architecture does permit multi-active rows in satellites as well if the situation requires it.
“… To claim the “granular” benefit you must be assuming that the dimensional model does NOT store the data at the lowest possible grain.”
This is a complete misrepresentation of what I said. In a Kimball style it is a “best practice” to store at the lowest possible grain because you do not have a choice. With an underlying DV, you do have a choice and can elect not to do so.
“Now, I am just a humble database designer with 15 years in the industry and a few 100+TB warehouses in my pocket, but I from your teaching material, I don’t see how the Vault makes this source translation any clearer.”
To be honest with you, just based on that statement and your experience, I have trouble understanding why you’re not just “getting it”.
“… is this another straw man, i.e. are you saying that Kimball solutions are bandage solutions?”
Not necessarily, but there are often situations that do arise during modifying Kimball architecture that will likely force you down that path. The chances are much higher is what my assertion is. In my opinion Persistent Staging is also a “bandage” solution instead of an integrated repository of data.
“In fact, I would rather store the data in the original, source form – because there is a fair chance I can have a conversation with the owner of the source about table structures he already know.”
That is of course your prerogative and choice. With the source copy as a 1:1 form you don’t have business key alignment and mapping across sources that a DV gives you. New sources adding to your 1:1 copy are potentially more complex to join when pulling the data to the user-facing layers. Very often this complexity requires multiple levels of staging and landing the data more than once before delivery to the user-facing layers (hence the reference to “bandages”).
“As you are hopefully aware, the complexity of coming up with a good query plan for a large join tree is O(n!) – which means there IS no generally good solution to the many table join problem.”
And I’m going to re-iterate. How is this different in your particular model? You do not really have fewer joins when you’re building you’re star-schema tables from source data. In most cases, you’ll be joining a bunch of tables to create the grain of the dimensions etc. It’s not something that automagically becomes easier or has less impact to the queries with a Kimball architecture.
We’re talking about data loads here (so sourcing to the ETL). Yes, there are many things that can be optimized during the ETL loads – some tool specific, some just good practices.
“I think we all agree that an organisation that cannot agree on a common dimension bus, will only have “mart sprawl”. This is not really a Kimball problem, in the same way the “spreadsheet hell” is not an Excel problem.”
Here’s exactly where the DV differs. It actually embraces this particular point as reality and something that occurs more commonly than not. Hence the focus on getting raw data into the structures it recommends. Hence the alignment by business keys using hubs, and the flexible accommodating structures for data absorption and the extensibility of the model.
“That being said, large organisation DO become a lot more agile when they allow individual divisions to build their own little “mart islands”. There is a long term cost associate with that as we probably all know.”
The Data Vault is cognizant of this and does recognize the need for this which is why the architecture lets you operate like this but still have some level of integration within the DV. As long as they are using an intermediate shared Data Vault area before going to the marts, the DV (DW) continues to get built up.
“But last time I checked, a lot of companies just don’t think that far and are happy enough to play agile tactics, even if that means sacrificing the holy “one version of the truth”.”
The Data Vault doesn’t support the “traditional” view of SSOT. A DV only supports a “single source of the state of data at a point in time”. Truth is subjective and judgemental and is a “business decision”. There in fact can be multiple contrasting interpretations of data sourced from a DV to marts for different business requirements that support different “truths” on the same data sets.
“If you see Kimball as requiring a truncate/load staging, I think it is clear to all why you would get in trouble with historical data and changing business requirements.”
I’ve done enough projects in this style to have done it in many ways including the truncate load style. Yes, I’ve used persistent staging areas. I’ve used staging areas for CDC. Yes, I’ve used time-stamped sources etc.
My definition of staging today is of course very different from yours.
I do have the guts to admit I’ve made mistakes. I’m not infallible. In terms of doing Kimball wrong, can you actually pin-point right and wrong in that particular architecture that go beyond the basics?
With a Data Vault I know what’s the right and wrong ways to do things because of the rigid methodology.
I’m a little tired of your misinterpretations of my statements to conveniently call them “straw man” arguments. However, I do hope this clarifies at least some of my reasoning more clearly.
Sanjay, I think we are getting closer to the core of the issue here.
Let us just assume that our “correct” Kimball approach stores data at the lowest possible grain in the fact from now. I think we both agree this is the way forward – or?
“You can do that, but when you need to extend DimCust (more likely than DimCity), then you have a situation on your hands where you need to re-model and re-populate from the source affecting the data integration routines from source to stage (trivial), stage to DimCust (less trivial).”
… Let me make sure i understand you right. You are NOT referring to the adding of columns of the dimension here are you? We both consider that trivial, right? Are you referring to a change in the underlying KEYS on the dimension? This is also only an issue if the dimension is stored ONLY as type2, not if you keep a type1 representation around too (in which case, the Type1 keys are unaffected by most changes). Of course, there is the extreme example where you completely change what the entity IS. This would require a reload and potentially changes in the loader. But, a hub would be similarly affected if you decided to change the business key that the hub centers around – agree? Score: Kimball 0:, DV: 0.
“Now, there IS an impact on the front-end user facing layers, but it gets lessened because of the change impacts being closer to the actual business view of the data. ”
I think you need to elaborate on this. How is the “business view” of data any more clear by splitting our theoretical Customer/City example into 5-6 normalised tables? My typical business users don’t “get” large join trees very well, do yours?. How does that lessen impact as compared to using a dimension to represent the same data?
“And I’m going to re-iterate. How is this different in your particular model? You do not really have fewer joins when you’re building you’re star-schema tables from source data. In most cases, you’ll be joining a bunch of tables to create the grain of the dimensions etc. It’s not something that automagically becomes easier or has less impact to the queries with a Kimball architecture.”
… I thought I explained that already with my example of storing intermediate results and controlling the join trees explicitly. But since you are asking how I have fewer joins, let me just reuse the example. In Kimball (using arrows the represent ETL, shame I can’t add illustrations here)
Source –> (do joins) –> DimCity/DimCust Bus store –> (Copy) –> Star schema
And in Vault (correct me if I am wrong please)
Source –> (do joins/re normalise if source is not Vault formed) –> 5-6 sat/hub/links –> (do big join tree to rebuild dimensions) –> Star Schema
… And there is the core of the question: What value does putting that intermediate level add? Because that last join is a pretty steep price to pay, especially in a big system.
“The Data Vault is cognizant of this and does recognize the need for this which is why the architecture lets you operate like this but still have some level of integration within the DV. As long as they are using an intermediate shared Data Vault area before going to the marts, the DV (DW) continues to get built up.”
… I think we all agree that DV can serve as a massive storage pile of data – ANY data model can. Furthermore, I don’t think anyone disagrees that it would be great to be able to convince big organisations to pile up all the data in one, centrally curated area. The question is: should we just catalogue it in the original source format (as Bigdata systems do) or go into some exercise of modelling it using a DV.
“The Data Vault doesn’t support the “traditional” view of SSOT. A DV only supports a “single source of the state of data at a point in time”
…. Again, we agree that having a single integration point is great in theory. But how do you consolidate that idea with a real time requirements that don’t have time to wait for data to “land” in a super normalised model before it is being served up to the user?
“This is a complete misrepresentation of what I said. In a Kimball style it is a “best practice” to store at the lowest possible grain because you do not have a choice. With an underlying DV, you do have a choice and can elect not to do so.”
Then I apologise. You have to make it clear what benefit it is that you claim here. If you want a higher grain in Kimball, I am sure you are familiar with aggregate tables and how easy those are to create. I don’t really see what choice it is the DV gives you here that Kimball does not already have
“In terms of doing Kimball wrong, can you actually pin-point right and wrong in that particular architecture that go beyond the basics?”
… I am not sure this question parses? What exactly are you asking? For me to point out where you misrepresent Kimball problems?
And finally Sanjay, just an observation. You use a sentences that trigger some reflexes in most Northern Europeans:
“With a Data Vault I know what’s the right and wrong ways to do things because of the rigid methodology.”
This insistence on “model dogma” is exactly what worries me about the DV. Flexible requirements, require flexible solutions – reality is rarely simple enough to fit an ideology.
As the other thread that Thomas has started has shown, we DO agree that some data structure is required to consolidate how business keys relate to each other. I would encourage a discussion on that thread about that particular problem.
Dan, I forgot to add something to your question14 above.
14) As I mentioned, to get into the second range, you generally have to hit the final, target model right away (unless you have a REALLY good way to avoid paying the join cost in the end user application, in which case I look forward to seeing that). Does this mean that I will not store the staging history while I load? No.
One of the beauties of ETL and CEP tools is that they allow multicast of data streams. I can stream data (delayed, if I need to) into the historical staging area while at the same time using the same data source and memory structures to hit the star schema that support real time users. The arrival of the data does not even have to synchronous with each other. This many-to-many source/destination semantics of ETL/CEP comes in quite handy in a lot of cases. It can for example also be used to archive staging data while at the same time loading into star schemas, saving I/O costs but avoiding intermediate layers.
[...] Thoughts on Data Vault vs. Star Schemas [...]
“Let us just assume that our “correct” Kimball approach stores data at the lowest possible grain in the fact from now. I think we both agree this is the way forward – or?”
I agree, for a more purist Kimball approach, you have to do this. You CAN do aggregate tables and other structures as well. With an underlying DV to your Kimball based user-facing area you DO NOT need to store the lowest grain in the marts (which of course will depend on business requirements). It is the same concept as the Kimball approach in terms of aggregating tables. The lowest grain is already available in the DV, so you can get away with delivering aggregate facts without creating the lowest grain fact in the DM.
“Source –> (do joins) –> DimCity/DimCust Bus store –> (Copy) –> Star schema
And in Vault (correct me if I am wrong please)
Source –> (do joins/re normalise if source is not Vault formed) –> 5-6 sat/hub/links –> (do big join tree to rebuild dimensions) –> Star Schema”
–
It’s more like this. DimCity would probably be fairly simple potentially a static dim. DimCust will probably be a complex conformed SCD Type-II and will look like:
source -> joins (assuming 5-6 tables)/lookups/cleansing/dedupes/business rules/precedence rules/transformations/CDC -> DimCust
The join above creates the grain of the dimension. You also have to determine which source table gets precedence in a duplicate situation across systems (and you have to potentially create workarounds for master data issues).
The source to DV is trivial and the most complex of them is a satellite load which is much simpler than a type-2 dimensional load. In a DV, the load to HUB_Cust and SAT_Cust assuming 5-6 source tables will be:
There will potentially be more than 1 HUB_Cust load process which is doing nothing else but inserting a business key that doesn’t already exist in the HUB.
There will be 1 or more satellite loads to 1 or more satellites on the hub. The satellite requires the hub surrogate key, so a join/lookup to the hub is needed. CDC for the satellite effectivity is needed, so a join to the satellite which is where it stops. If there are multiple source systems, the best practice is to store the data in separate satellites.
There are more processes that are much simpler and smaller and your joins (optionally lookups) are very simple.
From the DV out to the marts you do not have to do the joins that create the grain. The HUB already has the grain. With most ETL tools, you don’t even need to do joins for the dim loads and can just do lookups to the satellites for descriptive data. Depending on your needs sometimes you may have to do 2 or more passes.
Now, you still have the complex stuff like business rules etc when you’re going to the marts but you don’t have to do dedupes, grain creation or precedence rules because you’ve done that on the way in to the DV already.
The loads to DV typically are single table sourced (in most cases) and not multi-table joins. That is why there are loads of tiny processes. It’s very easy to pinpoint and re-run any points of failure. It is deliberately simple to ensure you can simply restart the process.
Now, with a DV you WILL have more tables and more load processes.
The data model and the determination of business keys IS time consuming.
The loading code is relatively easy and the structures enable a little simplification to the processes that deliver data out to the marts.
A few points of clarification if you don’t mind Sanjay:
1) You mention the “grain of the dimension”. Since we already agreed that we will store data at the lowest level, what do you mean by this term? Is this the set of columns we track changes on in a Type2 setup?
2) You add something to my, admittedly simplified, loader example. Namely these operations: lookups/cleansing/dedupes/business rules/precedence rules/transformations/CDC… I agree, this generally has to be done. My question is, where do you do that in the the Vault? In other words: how does making the example more complex (for both Dimension loader and Vault, unless you have a way to remove these steps) advance the argument of why Vault is superior?
3) I am a bit confused about the distinction between a “join” and a “lookup”. Perhaps just a terminology disagreement. Could you elaborate on what the difference between these two terms is for you please?
In your extended example, you also make the assumption that the dimension load will have to do 5-6 joins. Now, I am not sure where you get that number from. If we are extracting from a source system that is used for OLTP, I think it is reasonable assumption that the data model there looks something like (we could perhaps make this assumption just for arguments sake):
CustomerOLTP –> CityOLTP
Which lends itself to a very clean Kimball ETL process of 2 merge joins (one with the city dimension, one with the customer dimension). To bring the same model into Vault, you will now need to split the stream (to form the link) and bring in both the hub and the satellites to the join so you can insert in each of them.
You make this interesting observation:
“The loads to DV typically are single table sourced (in most cases) and not multi-table joins. That is why there are loads of tiny processes. It’s very easy to pinpoint and re-run any points of failure. It is deliberately simple to ensure you can simply restart the process.”
… Based on the above walkthrough of a load, would it be fair to say that while the loaders are mostly single table source, they will often have multi destinations (namely, the hub and sat, and potentially a link)? If so, how exactly does that make rerunning easier? My worry about restart ability is not really that I have multiple sources (thats pretty easy to handle, I can after all read them again) – my worry is that I insert wrong data in my target. The hardest part about restart is after all that you have to track what you INSERTED/UPDATED, not what you read, don’t you agree?
You point at a interesting tradeoff:
“Now, with a DV you WILL have more tables and more load processes.” (i.e. added complexity)
vs.
“The loading code is relatively easy and the structures enable a little simplification to the processes that deliver data out to the marts.” (i.e. simplification)
So far, that simplification in the second part is what seems to elude everyone here. Perhaps you can extend the example and illustrate just WHAT the Vault add to create this simplification. With the grand claims made by Data Vault modellers – I am still waiting for that moment where you can point to that simplification and say: “This is where the magic happens”. I hope we are getting closer to the point where you can say: Vault: 1 point and Kimball: 0 points. Staying in the boxing metaphor: so far it seems Vault is just running around in the boxing ring and beating its own chest.
And finally:
“The data model and the determination of business keys IS time consuming.”
I assume this refers to the “integrated” property that you claim Vault has, but Kimball doesn’t? I agree that handling business keys is complex. However, as I argue in the new thread that Thomas C started (thanks Thomas for continuing your work with this), I don’t think that hubs/satelittes/links are the right data model for this (either). Lets follow up in that thread on that particular argument please.
With the risk of opening myself up for a quick uppercut: I would point out that there is a VERY real problem with Type2 dimensions and fact tables during changing business requirements that I have described here:
http://blog.kejser.org/2011/10/22/why-surrogate-keys-are-not-good-keys/
… I acknowledge this issue, but I don’t think Vault is the answer. Perhaps we will have to walk through another example.
Clarifications for you Thomas K.
1) From a single OLTP app source, the grain is a non-issue in both the DV and in Star-Schemas. When you add multiple OLTP source systems, it comes into play. I was only taking a typical example (5-6 tables joins seems pretty typical to source a customer dim).
2) The complexities are all done in a single pass in the star-schema situation. Sometimes it’s so complex that the data has to be landed to temp tables and done again (IF this is done, it’s a bandage in my book). With the DV, you’re getting the data out and integrating business keys by hubs which are deduping the keys across systems. The satellite loads are very simple compared to a dimension load because they are typically sourced from single tables. These are doing your CDC for you. You also have the dimensional grain from the hub. The hubs with it’s satellites enable easier management of your dimensional data prior to it going into a complex dimension.
Now, admittedly this is not always a solved problem especially if you don’t know the key mappings. A “Same-As” link serves the key mapping if the knowledge comes later.
3) I picked the 5-6 out of my hat to make the example more realistic. I think you’ll agree it’s a reasonable number. The Dim City will probably be a static or type 1 dim so it’s too simple to compare.
The lookups and joins depend on infrastructure like ETL tool. A join is 2 or more tables joined via traditional SQL whereas a lookup is a hash table like structure in memory and can be cached, shared and/or persistent. A common feature of most ETL tools (Informatica PC has a lookup transformation and DataStage has a few different variants to this. Almost every other ETL tool I’ve seen has a similar construct).
“If so, how exactly does that make rerunning easier?”
It’s because of the load simplification. The process is doing so little that it can just be re-started on failure. The chances of inserting wrong data is lowered dramatically as well for the same reason. It CAN happen and cleanup is also fairly easy within the DV. The methodology also provides loading algorithms to ensure re-startability on failure and cleanup on the rare occurence of data corruption.
“Now, with a DV you WILL have more tables and more load processes.” (i.e. added complexity)”
This can potentially start a whole new debate (possibly an argument).
Simply having more processes is not more complex if the process are all simple and enable you to easily pin-point and rapidly fix points of failure. These are very simple processes that can be coded up rapidly or even generated. This is where load parallelism comes into play in an MPP style. You have loads of tiny running processes loading the data in parallel.
As stated above you’re getting business key maps via hubs (in most cases) and you’re getting the CDC via satellites.
“I assume this refers to the “integrated” property that you claim Vault has, but Kimball doesn’t?”
Neither I nor any other DV supporters said or claimed that Kimball doesn’t have integrated data. What I said was in a Kimball style load, you’re simply doing too much in a single pass with complex data integration code that is harder to maintain and extend. I doubt even the staunchest of Kimball supporters will disagree with me on this. Actually anyone (of any methodology) who has done a reasonable amount of work with ETL will agree, it’s a bad idea to put too much logic in a single pass even if it’s possible.
The DV lets you break this down into smaller pieces prior to a load. These pieces are much easier to manage (because they’re simple) while giving you the integration by hubs and links. It’s a model that lets you add new systems and data in and new data integration code without impacting existing code or data. The strength is data absorption (a point I believe you agree with).
Your disagreement stems from the fact that you don’t see value in the structures or the approach.
“I hope we are getting closer to the point where you can say: Vault: 1 point and Kimball: 0 points.”
I really don’t want to go into this in a competitive battle because even with a DV, we do very commonly use Kimball style star-schemas as a front-end user facing construct. Also, I have no reason to convince or convert anyone except for when I’m in an architect situation on a project.
Many points I have used which you deem to be “straw man” arguments have actually happened in real projects in real situations (No, I can’t talk about them, sorry).
“So far, that simplification in the second part is what seems to elude everyone here. Perhaps you can extend the example and illustrate just WHAT the Vault add to create this simplification.”
I thought I had already done that, but it’s done again in this comment above.
“I don’t think that hubs/satelittes/links are the right data model for this (either).”
Well, it’s been a “proven” model in hundreds of situations. So, the proof is in the pudding.
“With the risk of opening myself up for a quick uppercut …”
Why should it be antagonistic. What you state is a real business problem that exists and needs to be solved.
I read your article (thanks for posting it).
The problems with the Kimball models are the inflexibility to change. New source applications late into the project or in subsequent projects with the same data can cause havoc in a shared BUS architecture – which is why silos and temporary marts are quicker and easier to do and happen more often than we like.
Yes, you can get some functionality using your persistent staging but rebuilding the marts is more work than from a DV where key mapping and CDC has already been done for you. You don’t need to worry about audit or compliance either (which also appears to be a non-issue in your book because your persistent staging can take care of it).
While the DV may not completely solve the key problem that you state, at least it gives you options to capture and store that data at points in time.
In both my opinion and experience, it’s much easier to re-generate a data mart from a Data Vault than from source primarily because of the consistent storage patterns. The expense of splitting the key apart from the descriptors has a big payoff in model flexibility for absorption and in pattern consistency.
If it’s a complex dimension build with many attributes, chance are it requires a number of joins to get sourced. When you add a new source system that affects it, you’re facing a complex rebuild and test situation. An underlying DV can and does help here.
Now, I’m pretty much done with this discussion and I can see where and why things turned ugly in the comments. So many times you’ve taken my statements and turned them out of context either without understanding them or perhaps deliberately. The join parallelism is a perfect example. I didn’t talk about it at all. I talked about load parallelism and the DV’s inherent structure is perfect for MPP style loads. I merely said, you will probably encounter 5-6 table joins in a complex dimensional load which is fairly common (or perhaps you do not acknowledge this reality). If you replace this with the 5-6 joins you’ll encounter with a hub and it’s satellites, you’re not paying (or gaining) anything extra in the joins. Don’t know why we’re debating something as elementary as joins when there are harder problems (nice work publicly demeaning me for statements I didn’t even make. Yes I did talk about narrow tables with less redundant data, but I was comparing similar number of joins. You might want to go back and read what I said).
With a DV you do have alternatives within ETL where you can avoid joins altogether when you use lookups even when you’re doing your mart builds. Using these alternatives is not always a good idea (sometimes it is).
There is more to the Data Vault than the Hubs, Links and Sats which are basically just enabling a flexible data model. The methodology is an integral part of the equation. The ETL and loading code are all described and patterned and can be generated. You get integration at least by business key up to a point. You get CDC done for you before you enter the marts. The loads are very simple but numerous which enable easy restarting and scaling the loading processes. The hubs and it’s satellites enable a very easy dimension load source (with or without joins) and the integrated data sets let you build/rebuild fairly quickly to the marts. The smaller pieces are easier to maintain, manage, extend and debug.
Thanks for an entertaining discussion.
Hi Sanjay
Thanks for the clarification. Some of them do shed light on the issue, yet I am afraid you have still not made your argument as to why it makes sense to have an intermediate DV model between a historical staging area and a data mart. I am really sorry you feel that the discussion was done, I was just beginning to think we were getting somewhere.
First of all, let me point out that I find it very rude that you conveniently play the victim when I challenge your position, yet you have no problem dishing out some rather tough critique to people who respond to you. If you want to dish them out, it is only polite that you learn how to take them (or roll with the blows, if I may risk staying the boxing analogy that Thomas C, with his tongue in cheek post, introduced and I tried to continue with my uppercut sentence)
Second, structured antagonism is one of the methods in which science is advanced – also known as peer-review. I invite you to try antagonism, its often a pretty good way to get to the core truth of a matter..
Third, here is the argument form that I what made the original Kimball thread boil and is now boiling this discussion too: “Kimball has this challenge –> therefore Vault”… This is what I call a Straw man argument, perhaps my terminology is wrong – in which case I apologise, as it should now be perfectly clear that I am not a native English speaker. I would invite anyone still listening to openly say if they disagree with ANY of these statements:
- Transforming data in one fell swoop is a bad idea and “landing” an intermediate result can be a good idea
- If you want auditing or bi-temporal tracking – you typically need to handle this at some point in the architecture
- You need a way to map keys from different source systems together
- It is possible to auto generate most of the trivial, repeatable load code in a warehouse (as indeed, Thomas C’s time extender does)
- Agility in data warehouses is a key property that makes them successful
The position you have NOT convincingly argued for, is that it follows from any of these statements that we need a new modelling methodology. Which is what this thread is all about after all.
Back to your comments: I don’t disagree that you COULD be in a situation where the loading of our hypothetical Customer/City dimension requires many joins if the data comes from multiple source. What I AM say in that I don’t want to overly complicate the loader by normalising out the data in vault form to handle this. Sometimes, you DONT need those extra joins – and I don’t see why I would model my temporary landing spot after them.
And to address your comments:
“It’s because of the load simplification. The process is doing so little that it can just be re-started on failure. The chances of inserting wrong data is lowered dramatically as well for the same reason. ”
… I don’t think there is disagreement that having restartable loaders is a good idea. Although if you want to run a lot of these small loaders in parallel (and carefully dodging deadlocks) – small changes do not guarantee easy rollback. On this note, as I am sure you know, some ETL tools gives you the option to efficiently “pre land” the data in a format that is optimal for that particular job. For example, Integration Services has a “raw” format that is quite useful for this purpose. The question then is: Why use a data model for it?
Perhaps you can also shed some light on how landing the data in Vault is compatible with Dan’s claim: “DV model is flexible enough to absorb BOTH real-time feeds and batch feeds at the same time! and even to the same table in certain circumstances.”… Admittedly, this is Dan’s claim but not yours, the question is if you agree or not that DV is a good for real time DW?
You make these comment and if I understand you right, you claim that this forms an argument as to why DV the right path:
“The complexities are all done in a single pass in the star-schema situation. Sometimes it’s so complex that the data has to be landed to temp tables and done again (IF this is done, it’s a bandage in my book). With the DV, you’re getting the data out and integrating business keys by hubs which are deduping the keys across systems. The satellite loads are very simple compared to a dimension load because they are typically sourced from single tables. These are doing your CDC for you. You also have the dimensional grain from the hub. ”
… Let me get this straight then. You claim than when something is complex and require a place to land data – that is a bandage solution (I disagree)? Yet in the next sentence, you are proposing that very same intermediate landing area and call it DV. Are you really saying that DV is a “structured, common bandaging area”? If so, we are not THAT much in disagreement, I just don’t think it makes any sense to normalise anything or introduce new terminologies for that. CDC, can be done several places in the architecture, but an obvious choice (Because it only requires a single, high performance merge and is super easy to automate) is in a direct copy of the source system.
“Now, admittedly this is not always a solved problem especially if you don’t know the key mappings. A “Same-As” link serves the key mapping if the knowledge comes later.”
… See other thread for discussion about this and the solution I propose to this problem.
On your notes on parallelism. You have made these comments:
“This is where load parallelism comes into play in an MPP style. You have loads of tiny running processes loading the data in parallel.”
And:
“It has a level of integration and capacity for data absorption that is unparalleled. It can scale like MPP because of the core data model and the ETL load process model of the data into the DV based DW.”
First of all, load of tiny processes does not make something inherently more parallel – especially not if the processes are are already intra process parallelisable. SOMETIMES it does, but that depends on the startup cost of each process compared to the runtime of it. ETL processes tend to have rather high startup costs (in the hundreds of ms) – so having a lot of small ones does not really add any parallelism to the model. On top of this, if you have a lot of small processes competing for the same data structures – you have to be very careful to avoid deadlocks (for example, if you try to insert into a hub and populate a satellite at the same time)
Second, the “core data model” of DV is a highly normalised form (approaching 6NF anchor models) – and you say that because of this it scales into MPP. This makes no sense at all to me – am I reading your statement right? I am very curious to hear exactly where this claim about the model leading to better MPP parallelism is going.
and:
Thomas: “I assume this refers to the “integrated” property that you claim Vault has, but Kimball doesn’t?”
Sanjay: “Neither I nor any other DV supporters said or claimed that Kimball doesn’t have integrated data.”
Yet, let me quote you:
Sanjay: “This data is in fact INTEGRATED by hubs and links which is where the power lies. New data can be absorbed into the Data Vault DW WITHOUT touching or affecting an existing line of code or data. Change impacts are minimal – impossible in a conformed BUS architecture.”
I don’t know how to read this then. Do you claim that Kimball has the “integrated” property (which I am still not sure you have defined clearly) or not? Do you claim there is a difference between DV and Kimball here?
I’ve been away and just came across this … I found it weird.
“I am afraid you have still not made your argument as to why it makes sense to have an intermediate DV model between a historical staging area and a data mart.”
It doesn’t make sense. It’s redundant to have both historical staging and the DV. It was NEVER the debate (or did you think it was?)
The DV replaces historical staging as a DW and deems all front-end constructs as the Data Marts.
“First of all, let me point out that I find it very rude that you conveniently play the victim when I challenge your position, yet you have no problem dishing out some rather tough critique to people who respond to you.”
I have no issues with criticism and debate. Just don’t put words in my mouth and make incorrect interpretations on which you base arguments.
“Second, structured antagonism is one of the methods in which science is advanced – also known as peer-review.”
Regressed much more than advanced. Hardly anything good or creative has come out of “design by committee” teams. We have to live today with many abominations like the Java Programming language because of it.
As for the rest of it, you’ve continued to misinterpret, so I’ll just leave it.
Hi, in case there is still someone active in this long debate I wan t to add my naïve view on the subject. I never used DV and worked only on medium size Kimball like DW (DW = constellation of star schema).
In don’t see a war between both approach, star schema is indeed the presentation model that allow perormance and ease of use. The question is to know if a DV is better than the traditional staging area (or a Inmon EDW, but let’s remove this one).
Indeed a DV is more costly than a staging area (assuming it contains historical data as well) which can be produced by replication. It think that DV provide 2 advantages here :
1- Space. Having different satellites for differents update frequencies will take less space. keeping history of a table where only one column change in a type 2 style would take a lot more space even in a column store. As there is an overload for the structure of the DV itself it is unclear there is a real gain here, but it could be.
2- Standardization. If one source system is on SQL Server with one naming convention and another one in Oracle with another naming convention you can end up with two staging area (1 per type of database if you CDC software can’t handle different type of database). Even if only 1 staging, you still have different naming conventions and it start to be complicated.
3- Normalization of the source. DV could be a good way to handle denormalized source system, especially if the denormlization is not correct (two different value for the same thing because the source system is broken somewhere).
4- Heterogenity of the sources. If you go for the usual staging area, you probably keep logs outside the DB and use the ETL to use them for dimension/fact creation. DV will force you to put them with the rest. It’s clean.
Both DV and staging area can be real time with similar ease.
For the creation of the star schema, I read a lot that they can be just views. This would be a win for DV as usually the staging area is not on the same DBMS than the DW but correspond to the source system software, so no view is possible. Nevertheless, most of the time there will be a need to make them physical, i.e. going back to the old incremental ETL work. huge/huge join will be likely with a DV because of having many satellite (with huge >> RAM size, therefore sort/merge join or disk-based hash join). Therefore, sometimes it’s good sometimes it’s a bit worse because of the number of huge/huge joins.
In a column store MPP database, both the production->DV and DV-> star schema will be more costly. In such system parallelization is leveraged by default and a join is far more costly than the intra columns join of wide tables.
Overall, I think DV can help to clean the staging area and put it really inside the BI scope where staging is already maintaned by IT (as a way to disallow you to put load on the production systems which is smart). If the production side is a mess with a lot of change and heterogeneity then a DV can isolate you from that and work on sound ground. If you trust IT to keep archiving all that happen in production systems and give you something federated and stable, then the DV is probably too costly.
It is not the first time in the data warehouse world that we trade performance and initial cost for having something more clean in the long run. I would say than ETL is a bit of that too.