TDWI Blog

Philip Russom, Ph.D., is senior director of TDWI Research for data management and is a well-known figure in data warehousing, integration, and quality, having published over 550 research reports, magazine articles, opinion columns, and speeches over a 20-year period. Before joining TDWI in 2005, Russom was an industry analyst covering data management at Forrester Research and Giga Information Group. He also ran his own business as an independent industry analyst and consultant, was a contributing editor with leading IT magazines, and was a product manager at database vendors. His Ph.D. is from Yale. You can reach him by email (prussom@tdwi.org), on Twitter (twitter.com/prussom), and on LinkedIn (linkedin.com/in/philiprussom).


Hadoop for the Enterprise: An Overview in 25 Tweets

By Philip Russom, Research Director for Data Management, TDWI

To help you better understand Hadoop’s evolution into mainstream enterprise usage—and why you should care—I’d like to share with you the series of 25 tweets I recently issued on the topic. I think you’ll find the tweets interesting, because they provide an overview of enterprise Hadoop and its best practices in a form that’s compact, yet amazingly comprehensive.

Every tweet I wrote was a short sound bite or stat bite drawn from my recent TDWI report Hadoop for the Enterprise. Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think that all of you already know them or can figure them out. Even so, I deleted a few tiny URLs, hashtags, and repetitive phrases. I issued the tweets in groups, on related topics; so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.

Introduction to Hadoop for the Enterprise
1. #Hadoop is expanding into more industries, use cases & enterprise breadth. More in #TDWI Webinar Apr. 14 Noon ET http://bit.ly/1F9d2iy
2. #Hadoop for the Enterprise tech drivers: scalability, low cost, & many data types.
3. #Hadoop for the Enterprise biz drivers: #analytics, data exploration, value from #BigData.

Hadoop Adoption is Up
4. #TDWI SURVEY SEZ: #Hadoop adoption accelerating. Production clusters up 60% in 2 yrs.
5. #TDWI SURVEY SEZ: Half of respondents have #Hadoop clusters in development, coming online in 12 months.
6. #TDWI SURVEY SEZ: 60% of users surveyed will have #Hadoop in production by 2016.

Benefits and Barriers
7. #TDWI SURVEY SEZ: 89% surveyed say #Hadoop is opportunity for biz/tech #innovation.
8. #TDWI SURVEY SEZ: #Hadoop’s benefits: improve #analytics, #EDW, scalability, exotic data.
9. #TDWI SURVEY SEZ: #Hadoop’s barriers: weak skills, biz case, security, open source tools.

Organizational Issues with Enterprise Hadoop
10. As #Hadoop goes enterprise scope, ownership, staffing, dev methods & economics shift.
11. #Hadoop clusters are becoming central, shared IT infrastructure in mainstream firms.
12. #TDWI SURVEY SEZ: Common #Hadoop job titles are: #DataScientist, architect, analyst, developer.
13. #TDWI SURVEY SEZ: Firms train employees in #Hadoop cuz they can’t find or afford folks to hire.

The Many Use Cases for Enterprise Hadoop
14. #TDWI SURVEY SEZ: Leading future #Hadoop uses: ent data hubs, archives, misc BI/DW.
15. #TDWI SURVEY SEZ: Half of respondents will add #DataQuality & #MDM for #Hadoop data.
16. #TDWI SURVEY SEZ: Established #Hadoop practice extends a #DataWarehouse (46%).
17. #TDWI SURVEY SEZ: Data lakes (36%) & enterprise data hubs (28%) are new practices for #Hadoop.
18. #TDWI SURVEY SEZ: Archiving on #Hadoop is upcoming for new (36%) & old (19%) data.
19. #TDWI SURVEY SEZ: #Hadoop for content mgt (17%) & operational ent apps (11%) are new.

Hadoop’s Roles in Enterprise Data Strategies and Architectures
20. #TDWI SURVEY SEZ: 66% feel #Hadoop is important to their enterprise data strategy.
21. #TDWI SURVEY SEZ: #Hadoop is becoming key to multi-platform #DataWarehouse environments (DWEs).
22. #TDWI SURVEY SEZ: a third of #Hadoop clusters are off premises, on cloud, SaaS, managed provider. Surprising!

Hadoop Development Details
23. #Hadoop cluster size scales down to dept use (8 nodes) or up to enterprise (1000 nodes).
24. #TDWI SURVEY SEZ: #Hadoop clusters per enterprise = 10 on average, with median at 4.
25. #TDWI SURVEY SEZ: 58% of #Hadoop dev done w/mix of hand-coding & hi-level tools. 23% coded only.

Want to learn more about Hadoop for the Enterprise?

For a more detailed discussion—in a traditional publication!—get the TDWI Best Practices Report Hadoop for the Enterprise, which is available in a PDF via a free download.

You can also register for and replay my TDWI Webinar, where I present the findings of Hadoop for the Enterprise.

Posted by Philip Russom, Ph.D. on April 27, 2015


Q&A RE: Hadoop for the Enterprise

Attendees of a recent TDWI Webinar asked excellent questions.

By Philip Russom, TDWI Research Director for Data Management

Recently, on April 14, I broadcast a TDWI Webinar in which I presented some of the findings from my new TDWI report on "Hadoop for the Enterprise." You can download a free copy of the report in a PDF, and you can replay the Webinar. With each link, you may need to scroll down to find what you want. If you’re new to Hadoop, you may wish to first read the 2013 TDWI Best Practices Report Integrating Hadoop into Business Intelligence and Data Warehousing.

Attendees of the Webinar posed several very good questions about various issues around Hadoop. Please allow me to share a few attendee questions and the answers I sent them via e-mail:

What is a Hadoop cluster? And why would an organization need more than one?

The Wikipedia article on “Computer Cluster” is a good general description of all clustered server pools. The article doesn’t mention Hadoop, but Hadoop’s clustering strategy is in line with the article, except that Hadoop can run on heterogeneous servers, whereas the article recommends that all servers be identical. The point of any cluster is to get scalable and high-performance computational power, but at a relatively low cost because of commodity-priced hardware.

An organization may need more than one Hadoop cluster, due to departmental funding and sponsorship (which is common with analytic applications) or other organizational dynamics. As I pointed out in the Webinar, as users decide on a strategy for Hadoop on an enterprise scale, they tend to abandon the departmental focus in favor of central IT providing Hadoop as a shared enterprise asset (as IT often does with corporate networks, racks of servers, and storage subsystems).

You don't need big data to take advantage of Hadoop?

That’s correct. I’ve found many user organizations with a small Hadoop implementation (8 nodes seems common) used as the data layer under a departmental analytic application or analytics sandbox of some sort. Hadoop makes sense when the department has exotic data (perhaps in lots of files), which Hadoop excels with. Use cases include sentiment analytics with schema-free human language text or supplier analytics with multi-structured XML or JSON files.
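To make that departmental use case a bit more concrete, here is a minimal sketch (plain Python, not tied to any particular Hadoop tooling) of the kind of lightweight sentiment scoring a small team might prototype over a folder of schema-free JSON feedback files. The directory layout, the "comment" field, and the word lists are all hypothetical.

```python
import json
from pathlib import Path

# Tiny illustrative lexicon; a real project would use a proper sentiment model.
POSITIVE = {"good", "great", "love", "fast", "helpful"}
NEGATIVE = {"bad", "slow", "broken", "late", "rude"}

def score(text: str) -> int:
    """Return a crude sentiment score: +1 per positive word, -1 per negative word."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def score_feedback(directory: str) -> float:
    """Scan a folder of JSON feedback files and report an average sentiment score."""
    total, count = 0, 0
    for path in Path(directory).glob("*.json"):
        record = json.loads(path.read_text())      # one feedback record per file (assumed layout)
        total += score(record.get("comment", ""))  # "comment" field name is hypothetical
        count += 1
    return total / count if count else 0.0

if __name__ == "__main__":
    print("average sentiment:", score_feedback("./feedback"))
```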

Note that, in the examples, the data volumes are modest, but it’s still “big data” in the sense that it’s not the usual structured and relational data. For many users dealing with big data (whether on Hadoop or elsewhere), the value proposition is that big data is new and different, and therefore offers new insights and more complete views of customers. Even when big data is truly big (tens of terabytes or more), users don’t have much trouble managing it; hence, big data is not a scalability crisis, as some people have claimed.

Hadoop has a well-deserved reputation for scaling up linearly. But these examples show that Hadoop also scales down successfully.

Do companies transfer master data into Hadoop to support analytics in a real-time or batch data replication process?

Yes, but that’s still rather rare today. In fact, only 10% of survey respondents who have Hadoop in production today are doing master data management (MDM) on Hadoop. But 45% anticipate doing so within three years. Data quality is in a similar position, with 11% doing it today versus 55% in the future. Personally, I’ve seen it take a while to ramp up all the data management best practices when a new data platform appears. That seems to be the case with Hadoop. But the proliferation of Hadoop into more of the enterprise is driving up requirements for data management best practices, too.

Let’s now focus on your question. Modern MDM architectures typically support a mix of operational and analytic purposes; they do the same on Hadoop. 

Today, Hadoop is strong on volume but weak on real-time operation, so MDM (and other operations) on Hadoop are usually batch oriented. Given strong Hadoop-related projects like Storm and Spark, real-time data operations should become more practical soon.

Can we get a use case for Hadoop and MDM?

As I mentioned in the Webinar, MDM on Hadoop is pretty rare today, but survey results show it will soon be far more common, along with similar practices like data quality.

There are many ways to architect an MDM solution, but many are built atop or around some kind of hub, which includes a database or operational data store (ODS) plus appropriate interfaces in and out of the hub. At TDWI, we’ve seen a number of organizations start migrating subsets of enterprise data to Hadoop, and simply modeled databases and ODSs seem to migrate to Hadoop successfully. The straightforward tabular structures of these (unlike complex warehouse dimensions) usually fit well with Hive tables or HBase in the Hadoop environment. With the so-called enterprise data hub on Hadoop gaining in popularity, we should expect to see more migrations like this in coming years.

A lot of MDM master databases (or systems of record) have very wide records, because they’re also used to compile the “complete view” of customers and other enterprise entities. I’ve heard conflicting opinions from Hadoop users; some think Hive tables are best for wide records, while others swear HBase is best. I hear similar debates involving query mechanisms, including HiveQL, Pig, Drill, and Impala. If you contemplate similar tasks, I recommend you take a known ODS to Hadoop and test on both Hive and HBase, with a variety of query approaches.
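If you plan to run that kind of side-by-side test, a reasonable first step is simply drafting the two table definitions. The sketch below (Python, standard library only) composes a plain Hive external table and an HBase-backed Hive table for a hypothetical customer ODS; the table and column names, HDFS path, and HBase column-family mapping are illustrative, and the statements would be submitted through whichever Hive client you use (beeline, JDBC, and so on).

```python
# Sketch: draft DDL for testing the same ODS on Hive vs. HBase.
# Table name, columns, paths, and column-family mapping are hypothetical.

COLUMNS = ["customer_id STRING", "name STRING", "segment STRING", "lifetime_value DOUBLE"]

hive_native = f"""
CREATE EXTERNAL TABLE customer_ods_hive (
  {', '.join(COLUMNS)}
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
STORED AS TEXTFILE
LOCATION '/data/ods/customer';
"""

# Hive table backed by HBase via the HBase storage handler;
# ':key' maps the first column to the HBase row key, the rest to one column family.
hbase_backed = f"""
CREATE TABLE customer_ods_hbase (
  {', '.join(COLUMNS)}
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,cf:name,cf:segment,cf:lifetime_value"
);
"""

if __name__ == "__main__":
    # Print the drafts; submit them with your Hive client of choice.
    print(hive_native)
    print(hbase_backed)
```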

Can HBase replace a classic data warehouse, and can it compete from a performance side?

If you have a “classic” data warehouse, then I’ll assume it is designed for dimensional models, optimized for complex queries, and supported by a rich metadata layer with auditing capabilities. HBase today is not particularly good with any of those, so it makes an unlikely replacement.

Even so, some pieces of the warehouse environment do well on HBase. For example, many warehouses include a number of operational data stores (ODSs). These may be physically managed in the warehouse’s core database instance, or they may be running on standalone hardware servers and database instances. Either way, I’ve interviewed users who’ve migrated these pieces to HBase—or Hive or both. They say it’s an easy migration, tweaking on the new platform is minimal, and performance is fine, as long as batch processing is all you need. Furthermore, moving these pieces to Hadoop frees up capacity on the warehouse, so it can grow into more data and use cases that truly must reside in the core warehouse platform. Or, if the migrated ODSs were on standalone platforms, then Hadoop seems to work as a consolidation strategy.

There has been less talk [about] making Hadoop transaction oriented, i.e., ACID compliant. Is there any trend or survey outcome?

To be honest, I haven’t looked into transaction processing on Hadoop, although I’ve heard that some people in both open source and vendor communities are working on it.

Why would I be so remiss? Because the leading use cases I see today don’t require transaction processing and hence the four ACID properties. That includes extensions of data warehousing and data integration, plus a wide range of analytics. Upcoming use cases—data archiving and content management—don’t involve transaction processing either. Furthermore, if you want open source software, the other NoSQL database management systems are strong on transaction processing (as are older open source databases), so you may wish to look into those.

I’m sorry to cop out on you with a non-answer. But at least you can see that transaction processing on Hadoop is a low priority for those of us excited about doing data warehouse, data integration, reporting, and analytics on Hadoop.

Posted by Philip Russom, Ph.D. on April 15, 2015


Successful Application and Data Migrations and Consolidations

Minimizing Risk with the Best Practices for Data Management
By Philip Russom, TDWI Research Director for Data Management

I recently broadcast a really interesting Webinar with Rob Myers – a technical delivery manager at Informatica – talking about the many critical success factors in projects that migrate or consolidate applications and data. Long story short, we concluded that the many risks and problems associated with migrations and consolidations can be minimized or avoided by following best practices in data management and other IT disciplines. Please allow me to share some of the points Rob and I discussed:

There are many business and technology reasons for migrating and consolidating applications and data.
  • Mergers and Acquisitions (M&As) – Two firms involved in an M&A don’t just merge companies; they also merge applications and data, since these are required for operating the modern business in a unified manner. For example, cross-selling between the customer bases of the two firms is a common business goal in a merger, and this is best done with merged and consolidated customer data.
  • Reorganizations (reorgs) – Some reorgs restructure departments and business units, which in turn can require the restructuring of applications and data. 
  • Redundant Applications – For example, many firms have multiple applications for customer relationship management (CRM) and sales force automation (SFA), as the result of M&As or departmental IT budgets. These are common targets for migration and consolidation, because they work against valuable business goals, such as the single view of the customer and multi-channel customer marketing. In these cases, it’s best to migrate required data, archive the rest of the data, and retire legacy or redundant applications.
  • Technology Modernization – These range from upgrades of packaged applications and database management systems to replacing old platforms with new ones.
  • All the above, repeatedly – In other words, data or app migrations and consolidations are not one-off projects. New projects pop up regularly, so users are better off in the long run, if they staff, tool, and develop these projects with the future in mind.
Migration and consolidation projects affect more than applications and data:
  • Business Processes – The purpose of enterprise software is to automate business processes, to give the organization greater efficiency, speed, accuracy, customer service, and so on. Hence, migrating software is tantamount to migrating business processes, and a successful project executes without disrupting business processes.
  • Users of applications and data – These vary from individual people to whole departments and sometimes beyond the enterprise to customers and partners. A successful project defines steps for switching over users without disrupting their work.
Application or data migrations and consolidations are inherently risky. This is due to their large size and complexity, the numerous processes and people affected, the cost of the technology, and, even greater, the cost of failing to serve the business on time and on budget. If you succeed, you’re a hero or heroine. If you fail, the ramifications are dire for you personally and for the organization you work for.

Succeed with app/data migrations and consolidations. Success comes from combining the best practices of data management, solution development, and project management. Here are some of the critical success factors Rob and I discussed in the Webinar:
  • Go into the project with your eyes wide open – Realize there’s no simple “forklift” of data, logic, and users from one system to the next, because application logic and data structures often need substantial improvements to be fit for a new purpose on a new platform. Communicate the inherent complexities and risks, in a factual and positive manner, without sounding like a “naysayer.”
  • Create a multi-phased plan for the project – Avoid a risky “big bang” approach by breaking the project into manageable steps. Pre-plan by exploring and profiling data extensively (a minimal profiling sketch follows this list). Follow a develop-test-deploy methodology. Coordinate with multi-phased plans from outside your data team, including those for applications, process, and people migration. Expect that old and new platforms must run concurrently for a while, as data, processes, and users are migrated in orderly groups.
  • Use vendor tools – Programming (or hand coding) is inherently non-productive as the primary development method for either applications or data management solutions. Furthermore, vendor tools enable functions that are key to migrations, such as data profiling, develop-test-deploy methods, full-featured interfaces to all sources and targets, collaboration for multi-functional teams, repeatability across multiple projects, and so on.
  • Template-ize your project and staff for repeatability – In many organizations, migrations and consolidations recur regularly. Put extra work into projects, so their components are easily reused, thereby assuring consistent data standards, better governance, and productivity boosts over time.
  • Staff each migration or consolidation project with diverse people – Be sure that multiple IT disciplines are represented, especially those for apps, data, and hardware. You also need line-of-business staff to coordinate processes and people. Consider staff augmentation via consultants and system integrators.
  • Build a data management competency center or similar team structure – From one center, you can staff data migrations and consolidations, as well as related work for data warehousing, integration, quality, database administration, and so on.
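As a concrete illustration of the pre-planning bullet above, here is a minimal profiling sketch in Python (standard library only) that reports per-column null counts, distinct-value counts, and example values for a flat-file extract. The file name and columns are hypothetical, and a real project would more likely use a vendor profiling tool, as argued above.

```python
import csv
from collections import Counter, defaultdict

def profile(csv_path: str, max_examples: int = 3) -> None:
    """Quick pre-migration profile of a source extract: row count,
    per-column null count, distinct-value count, and a few example values."""
    nulls = Counter()
    distinct = defaultdict(set)
    rows = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            rows += 1
            for col, val in row.items():
                if val is None or val.strip() == "":
                    nulls[col] += 1
                else:
                    distinct[col].add(val)
    for col in distinct.keys() | nulls.keys():
        examples = list(distinct[col])[:max_examples]
        print(f"{col}: nulls={nulls[col]}/{rows}, "
              f"distinct={len(distinct[col])}, examples={examples}")

if __name__ == "__main__":
    profile("customer_extract.csv")  # file name is hypothetical
```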
If you’d like to hear more of my discussion with Informatica’s Rob Myers, please replay the Webinar from the Informatica archive.

Posted by Philip Russom, Ph.D. on March 11, 2015


Great Data for Great Analytics

Evolving Best Practices for Data Management

By Philip Russom, TDWI Research Director for Data Management

I recently broadcast a really interesting Webinar with David Lyle, a vice president of product strategy at Informatica Corporation. David and I had a “fireside chat” where we discussed one of the most pressing questions in data management today, namely: How can we prepare great data for great analytics, while still leveraging older best practices in data management? Please allow me to summarize our discussion.

Both old and new requirements are driving organizations toward analytics. David and I started the Webinar by talking about prominent trends:

  • Wringing value from big data: The consensus today says that advanced analytics is the primary path to business value from big data and other types of new data, such as data from sensors, devices, machinery, logs, and social media.
  • Getting more value from traditional enterprise data: Analytics continues to reveal customer segments, sales opportunities, and threats for risk, fraud, and security.
  • Competing on analytics: The modern business is run by the numbers, not just gut feel, to study markets, refine differentiation, and identify competitive advantages.

The rise of analytics is a bit confusing for some data people. As experienced data professionals do more work with advanced forms of analytics (enabled by data mining, clustering, text mining, statistical analysis, etc.), they can’t help but notice that the requirements for preparing analytic data are similar to, but different from, those of their other projects, such as ETL for a data warehouse that feeds standard reports.

Analytics and reporting are two different practices. In the Webinar, David and I talked about how the two involve pretty much the same data management practices, but in different orders and priorities:

  • Reporting is mostly about entities and facts you know well, represented by highly polished data that you know well. Squeaky clean report data demands elaborate data processing (for ETL, quality, metadata, master data, and so on). This is especially true of reports that demand numeric precision (about financials or inventory) or will be published outside the organization (regulatory or partner reports).
  • Advanced analytics, in general, enables the discovery of facts you didn’t know, based on the exploration and analysis of data that’s probably new to you. Preparing raw source data for analytics is relatively simple, though often at high levels of scale. With big data and other new data, preparation may be as simple as collocating large data sets on Hadoop or another platform suited to data exploration. When using modern tools, users can further prepare the data as they explore it, by profiling, modeling, aggregating, and standardizing data on the fly (a small sketch of that style follows this list).
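As a small illustration of that prepare-as-you-explore style, the sketch below (Python, with hypothetical field names and values) standardizes keys and types and aggregates spend on the fly, rather than in an up-front ETL job.

```python
from collections import defaultdict

# Raw, loosely structured records as they might land in an exploration area;
# field names and values are hypothetical.
raw_events = [
    {"cust": " 0042 ", "channel": "Web",   "amount": "19.99"},
    {"cust": "42",     "channel": "web",   "amount": "5.00"},
    {"cust": "0007",   "channel": "STORE", "amount": "100.10"},
]

def standardize(rec: dict) -> dict:
    """Standardize keys and types on the fly, rather than in an up-front ETL job."""
    return {
        "customer_id": rec["cust"].strip().lstrip("0") or "0",
        "channel": rec["channel"].strip().lower(),
        "amount": float(rec["amount"]),
    }

def aggregate(records) -> dict:
    """Aggregate spend per (customer, channel) while exploring."""
    totals = defaultdict(float)
    for rec in map(standardize, records):
        totals[(rec["customer_id"], rec["channel"])] += rec["amount"]
    return dict(totals)

if __name__ == "__main__":
    print(aggregate(raw_events))
```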

Operationalizing analytics brings reporting and analysis together in a unified process. For example, once an epiphany is discovered through analytics (e.g., the root cause of a new form of customer churn), that discovery should become a repeatable BI deliverable (e.g., metrics and KPIs that enable managers to track the new form of churn in dashboards). In these situations, the best practices of data management apply to a lesser degree (perhaps on the fly) during the early analytic steps of the process, but then are applied fully during the operationalization steps.
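For instance, once the new churn pattern is understood, the repeatable deliverable might be no more than a small, governed calculation run every month. The sketch below (Python, with hypothetical records and field names) shows that kind of operationalized metric.

```python
from datetime import date

# Hypothetical records produced once the analytic discovery has been operationalized:
# each customer is flagged with whether they churned via the newly discovered pattern.
customers = [
    {"id": 1, "churned_via_new_pattern": True,  "month": date(2015, 1, 1)},
    {"id": 2, "churned_via_new_pattern": False, "month": date(2015, 1, 1)},
    {"id": 3, "churned_via_new_pattern": True,  "month": date(2015, 1, 1)},
]

def churn_kpi(records, month) -> float:
    """Repeatable KPI: share of customers lost to the newly discovered churn pattern."""
    in_month = [r for r in records if r["month"] == month]
    churned = sum(r["churned_via_new_pattern"] for r in in_month)
    return churned / len(in_month) if in_month else 0.0

if __name__ == "__main__":
    print(f"new-pattern churn rate: {churn_kpi(customers, date(2015, 1, 1)):.1%}")
```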

Architectural ramifications ensue from the growing diversity of data and workloads for analytics, reporting, multi-structured data, real time, and so on. For example, modern data warehouse environments (DWEs) include multiple tools and data platforms, from traditional relational databases to appliances and columnar databases to Hadoop and other NoSQL platforms. Some are on premises and others are on clouds. On the downside, this results in high complexity, with data strewn across multiple platforms. On the upside, users get great data for great analytics by moving data to a platform within the DWE that’s optimized for a particular data type, analytic workload, price point, or data management best practice.

For example, a number of data architecture use cases have emerged successfully in recent years, largely to assure great data for great analytics:

  • Leveraging new data warehouse platform types gives analytics the high performance it needs. Toward this end, TDWI has seen many users successfully adopt new platforms based on appliances, columnar data stores, and a variety of in-memory functions.
  • Offloading data and its processing to Hadoop frees up capacity on EDWs. And it also gives unstructured and multi-structured data types a platform that is better suited to their management and processing, all at a favorable cost point.
  • Virtualizing data assets yields greater agility and simpler data management. Multi-platform data architectures too often entail a lot of data movement among the platforms. But this can be mitigated by federated and virtual data management practices, as well as by emerging practices for data lakes and enterprise data hubs.

If you’d like to hear more of my discussion with Informatica’s David Lyle, please replay the Webinar from the Informatica archive.

Posted by Philip Russom, Ph.D. on February 2, 2015


Q&A RE: Data Warehouse Architecture Issues

Attendees of a recent TDWI Webinar asked excellent questions.
By Philip Russom, TDWI Research Director for Data Management

Recently, on Tuesday, April 15, 2014, I broadcast a TDWI Webinar in which I presented some of the findings from my new TDWI report, Evolving Data Warehouse Architectures in the Age of Big Data. You can download a free copy of the report in a PDF file, and you can replay the Webinar.

Attendees of the Webinar posed several very good questions about various issues in data warehouse architecture. Please allow me to share a few of the attendees’ questions and the answers I sent them via e-mail:

Q. As we update our data warehouse from more reporting to more analytics functions, should we design a brand new data warehouse architecture, or improve from the existing one?

If the existing data warehouse and its architecture fulfill business requirements and technical performance requirements (for speed and scale), then you should try to build out the existing architecture. For that to work, your existing vendor platform under the warehouse must perform well with multiple mixed workloads, including analytic workloads; ask your vendor representative for customer references who’ve succeeded with mixed workloads. Also, building up data sets for advanced analytics typically means loading large data volumes into the warehouse, which may cost more money with some licenses; again, ask your vendor if there are such ramifications under your current license. 

If your current core warehouse platform cannot support mixed workloads with high performance (or adding analytic data costs too much money), you may decide to manage and process large data sets for advanced analytics on a separate standalone platform that integrates with your warehouse. But in that case, you still keep your existing data warehouse and most of its data structures intact, just making slight changes for better integration with the new additional platform(s) for advanced analytics.

Q. Given the lack of integration across this multi-platform [data warehouse] environment, how do we avoid the need to replicate DW transactional sources into the big data platforms, as transactions are required in mining?

Good question, and there are a number of issues here. First, a well-designed multi-platform environment won’t suffer a “lack of integration.” TDWI’s definition of “logical data warehouse” is that the logical design specifies integration schemes (not just data models) across physically distinct platforms, whether that integration takes a data model approach (as in shared or conformed dimensions, etc.) or a data integration approach (as in jobs for ETL, replication, etc.) or both. Second, I take your point that replicating data more than needed can lead to a variety of problems, as data gets out of sync and loses integrity. A good architecture can minimize replication, and sometimes eliminate it. Third, for decades, users have faced the same decision you’re looking at: do we store, manage, and analytically process our rich, valuable collection of transactional data in the warehouse proper or on a standalone but integrated platform, such as the usual operational data store (ODS)?

For years, a solution I’ve seen users successfully adopt is to deploy a homegrown ODS that they’ve designed and optimized for transactions. The ODS is on a standalone platform that’s integrated with the core warehouse (plus other ODSs, marts, etc.), running on a relational DBMS atop commodity-priced hardware. Note that the upcoming trend is toward ODSs atop Hadoop (but only if the data volumes are massive). The idea is to manage transactional data on a standalone platform that’s much cheaper than the DW, where the relentless sorting, updating, and processing of that data won’t degrade warehouse performance. Yet the ODS is easily reached from all tools, as well as through data federation and virtualization, which minimizes the replication of transactional data.

If you give the ODS the capacity it needs to persist multiple sort orders and data subsets, then copying data outside the ODS is further reduced. Also, if you use data mining tools that can work on data “in situ” (i.e., in the ODS’s relational database) without moving data to the tool, then that also reduces copying and moving transactional data.

Q. The need for data warehouses is never going to go away. But isn’t the separation between "operations" and "analytics" starting to blur? In other words, the future isn't DWE; it's a "data environment" that does both.

Operational BI is all about getting operational data into BI faster and more frequently, while also embedding BI functions in operational applications and their processes as well. Operational BI is a very popular practice. It has been for years, and will get even more popular, as organizations adjust their BI efforts to bring them closer to real time (to be more competitive, customer conscious, efficient, etc.). The widespread existence of operational BI corroborates that the line between operations and BI is already quite blurred and will become even more so.

In another trend, many organizations are purposefully evolving toward a more or less loosely unified data environment for most enterprise data. I say “more or less” and “loosely” because early adopters are quick to say that the architecture is not 100 percent of the enterprise and integration is spotty, on an “as needed” basis. As one architect joked, “it’s more archaeology than architecture, because the work usually consists of imposing a logical architecture over mature, preexisting systems.” For early adopters, it makes sense to architect data globally, when customer data and some other data domains are pervasively shared across multiple applications, departments, and processes. It also makes sense in firms where business processes ramble across multiple business units and IT systems. Obviously, there’s an infinitude of resulting enterprise data architectures.

The data warehouse environment (DWE) I’m describing is a local microcosm of such a broad and loosely unified multi-platform data architecture. However, in some organizations today, the data warehouse and similar data platforms are just a few among many other data platforms, integrated on an enterprise scale. But those organizations are as yet the minority, although we at TDWI expect that approach to be the norm for IT-intense organizations within five years. TDWI’s Vegas conference has been devoted to issues in enterprise-scale data architecture for years, and will continue to be. You might consider attending next February.

Q. Can you point us to white papers on the difference between reporting and analytics [and how that affects DW architecture]?

You can read my blog on the subject. Or you could read the new report on evolving data warehouse architectures, because I adapted material from the blog to become a section in the report, starting on page 24.

Q. What’s the role, or is there a role, for variants like an ODS in the new world [of data warehouse architectures]? Is it part of the real-time world?

Historically, some of the first standalone systems in a multi-platform data warehouse (going back to the mid-1990s) were ODSs deployed on their own hardware servers with their own DBMS instances. These are still with us, and will continue to be with us, as data warehouse environments evolve into even more platforms used at once. An ODS can be designed and optimized by users for a wide range of data domains and uses (including real-time data), but I’m currently seeing a lot of users deploying ODSs for various types of big data and other data earmarked for advanced analytics.

Q. Saying Inmon vs. Kimball is no longer relevant is like saying Newton is no longer relevant in the world of physics today. It's still important, maybe not as fundamental as 1–2 decades ago.

For decades, Newton practiced alchemy in his copious spare time, because he was convinced that changing lead to gold was possible. Our heroes aren’t always 100 percent right.

Concerning Inmon and Kimball, see the top of page 7 in the report. Also please read the User Story on that same page. “No longer relevant” is your phrase, not mine. In my view, Inmon and Kimball’s innovations are as relevant as ever, and are still being applied daily. And they just keep giving: Inmon has recently extended our understanding of unstructured data and Kimball is currently working on new best practices for Hadoop.

It’s the users who’ve changed. Instead of arguing about which to choose, users choose to apply Inmon and Kimball techniques (and others, too) in the same extended warehouse environment. And that’s a wise choice on their part, since hybrids and diversity seem to be winning strategies for a growing number of user organizations and their diversified DW architectures nowadays.

Q. Some organizations consider Hadoop a replacement for their current DW appliance. How is this possible?

As I said in the Webinar, I’ve only found two organizations that took out a data warehouse and put Hadoop in its place. While that corroborates that a replacement is possible, it’s not likely, nor is it a compelling trend.

Instead of replacement, we at TDWI see far more users augmenting their data warehouse environment with the Hadoop Distributed File System (HDFS), plus related Hadoop tools, especially MapReduce, Hive, HBase, and Pig. In short, HDFS handles things that relational warehouses are not designed for, such as unstructured data, algorithmic analytics, millions of files, and petabyte-size data sets. But the relational warehouse is still best for the structured and multidimensional data that goes into standard reports, performance management, and set-based analytics (typically OLAP or SQL-based analytics).

Another possibility is that Hive atop MapReduce and HDFS makes a highly scalable “row store” type of database. Sometimes you don’t need a full-featured (and expensive) relational DBMS, and hence a row store will do just fine. For example, many of the ODSs found today in data warehouse environments are candidates for migration to Hadoop. That includes ODSs that manage large “archives” (I use the word loosely) of transactional data and other operational data that’s persisted and kept long-term for advanced analytics that just need simple tabular structures. Most standalone ODSs of that description today run on mature DBMSs, but could run almost as well (for less money) on Hadoop.
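To make the row-store-on-Hadoop idea concrete, here is a sketch of the sort of DDL a migrated archive ODS might use: a Hive external table over delimited transaction files, partitioned by load date so each day’s files can be registered cheaply. The table and column names, HDFS paths, and partition scheme are all hypothetical, and the statements would be submitted through your Hive client of choice.

```python
# Sketch: partitioned Hive "row store" for an archived transactional ODS.
# Table name, columns, HDFS paths, and partition scheme are hypothetical.

create_stmt = """
CREATE EXTERNAL TABLE txn_archive (
  txn_id STRING,
  account_id STRING,
  txn_type STRING,
  amount DOUBLE
)
PARTITIONED BY (load_date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
STORED AS TEXTFILE
LOCATION '/archive/txn';
"""

def add_partition(load_date: str) -> str:
    """Register one day's files as a new partition of the archive table."""
    return (f"ALTER TABLE txn_archive ADD IF NOT EXISTS "
            f"PARTITION (load_date='{load_date}') "
            f"LOCATION '/archive/txn/load_date={load_date}';")

if __name__ == "__main__":
    print(create_stmt)
    print(add_partition("2015-04-30"))  # submit via your Hive client of choice
```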

Finally, let’s remember that not all organizations need a data warehouse, as represented by 15 percent of survey respondents.

Q. Can you recommend any sample success stories on how to integrate Hadoop or similar big data into an existing data warehouse [environment]?

Yes, many real-world use cases and user stories are discussed in the 2013 TDWI report Integrating Hadoop into Business Intelligence and Data Warehousing.

Posted by Philip Russom, Ph.D. on April 30, 2014


Evolving Data Warehouse Architectures: An Overview in 35 Tweets

By Philip Russom
Research Director for Data Management, TDWI

To help you better understand the ongoing evolution of data warehouse architectures and why you should care, I’d like to share with you the series of 35 tweets I recently issued on the topic. I think you’ll find the tweets interesting because they provide an overview of evolving data warehouse architectures and their best practices in a form that’s compact, yet amazingly comprehensive.

Every tweet I wrote was a short sound bite or stat bite drawn from my recent TDWI report Evolving Data Warehouse Architectures in the Age of Big Data. Many of the tweets focus on a statistic cited in the report, while other tweets are definitions stated in the report.

I left in the arcane acronyms, abbreviations, and incomplete sentences typical of tweets, because I think that all of you already know them or can figure them out. Even so, I deleted a few tiny URLs, hashtags, and repetitive phrases. I issued the tweets in groups, on related topics; so I’ve added some headings to this blog to show that organization. Otherwise, these are raw tweets.

Basic Components of the Average Data Warehouse Architecture

  1. Most DW Arch’s have 4 layers: logical, physical, hardware topology, data standards.
  2. DW logical architecture is mostly about data models, entity models & relationships.
  3. DW logical arch also defines standards for data models, dev practices, interfaces, etc.
  4. DW physical architecture is mostly a plan for data deployment on servers.
  5. DW physical arch also defines topology for hardware & software servers plus interfaces.

Users’ Views of Architectural Components

  6. #TDWI SURVEY SEZ: Data standards & rules are highest priority (71%) of #EDW architecture.
  7. #TDWI SURVEY SEZ: Logical design (66%) is the starting point of an #EDW architecture.
  8. #TDWI SURVEY SEZ: Physical plan (56%) locates logical pieces in an #EDW architecture.
  9. #TDWI SURVEY SEZ: Only 12% have #EDW that’s “collection of data & platforms without a plan.”
  10. #TDWI SURVEY SEZ: Only 12% feel Inmon vs Kimball argument is priority for #EDW architecture.

The Evolution of Data Warehouse Architectures

  11. #TDWI SURVEY SEZ: 79% say their #DataWarehouse has an architecture.
  12. #TDWI SURVEY SEZ: #EDW arch is evolving dramatically (22%), moderately (54%) or slightly (22%)
  13. #TDWI SURVEY SEZ: Driving #EDW arch evolution: #Analytics 57%, #BigData 56%, #RealTime 41%.
  14. #TDWI SURVEY SEZ: Driving #EDW arch evolution: BizPerfMgt 38%, OLAP 30%, UnstrucData 25%.
  15. #TDWI SURVEY SEZ: Driving #EDW arch evolution: competition 45%, compliance 29%, dep’ts 29%.

The Importance of Data Warehouse Architectures

  16. #TDWI SURVEY SEZ: Architecture extremely (79%) or moderately (19%) important to #EDW success.
  17. #TDWI SURVEY SEZ: #EDW Architecture is an opportunity (84%), not a problem (16%).

Benefits and Barriers for Data Warehouse Architecture

  18. #TDWI SURVEY SEZ: Stuff that benefits from #DWarch: #analytics, biz value, data breadth.
  19. #TDWI SURVEY SEZ: Barriers to #DWarch success: skills gap, sponsorship, #DataMgt, funding.

Multi-Platform Data Warehouse Environments

  20. #EDWarch trend: more standalone platforms: #analytics DBMSs, columnar, appliances, #Hadoop, etc.
  21. As #EDW workloads get more diverse, so do types of standalone data platforms in #EDW environment.
  22. As types and numbers of data platforms grow in DW environs, architecture gets ever more distributed.
  23. Distributed #EDWarch is good&bad: provides workload optimized platforms. But may spawn data silos.
  24. Logical layer of #EDWarch more important than ever to unite big design across multi data platforms.

Single-Platform versus Multi-Platform DW Architectures

  25. #TDWI SURVEY SEZ: Totally pure #EDWarchs are rare. Only 15% have central monolithic #EDW.
  26. #TDWI SURVEY SEZ: Hybrid #EDWarchs are most common today = central #EDW + a few other data platforms (37%).
  27. #TDWI SURVEY SEZ: 2nd most common Hybrid #EDWarch = central #EDW + many other data platforms (16%).
  28. #TDWI SURVEY SEZ: Sometimes #EDW plays small role in #EDWarch compared to workload platforms (15%).
  29. #TDWI SURVEY SEZ: Some organizations (15%) have many workload-specific data platforms, but no true DW.

Big Data’s Influence on Evolving DW Architectures

  30. #TDWI SURVEY SEZ: 41% will extend existing core #EDW to handle #BigData.
  31. #TDWI SURVEY SEZ: 25% will deploy new data platforms to handle #BigData.
  32. #TDWI SURVEY SEZ: 23% have no strategy for their #EDW’s architecture, though they need one.
  33. #TDWI SURVEY SEZ: Only 6% feel they don’t need a strategy for their #EDW’s architecture.

Reports and Analytics have Different DW Architecture Needs

  34. Many users preserve #EDW for reporting, BizPerfMgt & OLAP, but take #analytics data elsewhere.
  35. Data prep for reports differs from same for #analytics. So, many users prep data on separate platforms.

Want to learn more about evolving data warehouse architectures?

For a more detailed discussion—in a traditional publication!—get the TDWI Best Practices Report, titled Evolving Data Warehouse Architectures in the Age of Big Data, which is available in a PDF file via a free download.

You can also register for and replay my TDWI Webinar, where I present the findings of the TDWI report Evolving Data Warehouse Architectures in the Age of Big Data.

Posted by Philip Russom, Ph.D. on April 15, 2014