TDWI | Training & Research | Business Intelligence, Analytics, Big Data, Data Warehousing


TDWI Blog

TDWI Blog: Data 360

From Hollywood to Hadoop

I transcended time and space earlier this week when I attended Hadoop World in New York City.

It started Monday evening. After taking a high-speed train from Boston, I emerged from the bowels of Penn Station into the bright lights and bustling streets of Midtown Manhattan. The pavement was wet from a passing rain and lightning pulsed in the distant sky, framed by the city’s cavernous skyscrapers. I felt like I had entered a Hollywood set for an apocalyptic movie. But that was just the beginning.

Invigorated by the city’s pulsing energy, I decided to walk 15 blocks to my hotel. Halfway there, the winds picked up, the muted lightning roared to life, and rain scoured the streets in endless waves. I ducked under a large hotel canopy just in time to see hail the size of shooter marbles pelt everything in sight. After 15 minutes, the deluge subsided. But by the time I reached my hotel, I was soggy and stunned.

Welcome Aboard!

The next morning, as I listened to the proceedings from Hadoop World, I realized that the prior night’s surreal weather was a fitting prelude to the conference, at least for me. Hadoop World was a confab for programmers, almost 1,000 of them. As a data guy, I felt like I had been transported to a parallel universe where the people looked and acted the same but spoke a completely different language. But what I did understand, I liked.

With Hadoop, it seems that the application community finally discovered data and its potential to make businesses smarter. “Hadoop is a high value analytics engine for today’s businesses,” said Mike Olson, during his kickoff keynote. Mike is CEO and Founder of Cloudera, an open source provider of Hadoop software and services and host of the event. Following Olson on the stage was Tim O’Reilly, founder of O’Reilly Media, a long-time high-tech luminary and open source proponent. He said, "We are the beginning of an amazing world of data-driven applications. It's up to us to shape the world."

It was wonderful to see the developer community discover data in all its glory. To my fellow developers, I say, “Welcome aboard!” We’re all on the same page now.

Fathoming Hadoop

Hadoop is one of the first attempts by the developer community to get their arms around data in a way that conforms to their skills, knowledge, and culture. From a data guy’s perspective, Hadoop is clunky, slow, and woefully immature. But it does have advantages. As a result, it’s already popping up in corporate data environments as a complement to analytical databases. For example, some leading-edge companies are using Hadoop to process and store large volumes of clickstream and sensor data that they then feed into analytical databases for query processing.

So what is Hadoop? It might be easier to say what it's not.

· Hadoop is not a database; it’s a distributed file system (Hadoop Distributed File System or HDFS) that scales linearly across commodity servers. It is also a programming model (MapReduce) that enables developers to build applications in virtually any language they want and run them in parallel across large clusters.

· Hadoop is not a transactional system; it’s a batch-oriented system that runs hand-crafted MapReduce programs. You are not going to run iterative queries in Hadoop.

· Hadoop does not support random data access; it reads and writes all data sequentially, which makes it torturously slow for tactical updates and queries and for mixed-workload applications.
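
To make the programming model in the bullets above concrete, here is a minimal, single-process sketch of the map, shuffle, and reduce phases in Python. It is not Hadoop code: the function names (`map_words`, `shuffle`, `reduce_counts`, `run_job`) are made up for illustration, and a real cluster would run the map and reduce steps in parallel across many servers and read its input from HDFS.

```python
from collections import defaultdict

def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce phase: collapse the values for one key into a single result."""
    return key, sum(values)

def run_job(lines):
    """Run the three phases in order over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    sample = ["big data big clusters", "data in data out"]
    print(run_job(sample))
```

Note that the job reads every input line once, from start to finish, and only produces an answer at the end. That is the batch-oriented, sequential behavior described above, and it is why this model is a poor fit for quick lookups or single-row updates.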

Today, Hadoop shines as an infinitely scalable data processing environment for handling huge volumes of data that would be prohibitively expensive to store and analyze in a traditional relational database or even a data warehousing appliance. Hadoop lets companies capture and store all their data—structured, semi-structured, and unstructured—without having to archive or summarize the data. Consequently, some companies, such as Comscore and CBS Interactive, use Hadoop as a massive staging area to capture, store, and prepare large volumes of data for delivery to downstream analytic structures.
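
The staging pattern described above can be sketched in a few lines. This is an illustration under loud assumptions: the log format, the `prepare` and `load` functions, and the `page_views` table are hypothetical, plain Python stands in for the Hadoop job, and an in-memory SQLite database stands in for the downstream analytical database.

```python
import sqlite3
from collections import Counter

# Hypothetical raw clickstream lines: "timestamp user_id url"
RAW_LOG = [
    "2010-10-12T09:00:01 u1 /home",
    "2010-10-12T09:00:05 u2 /home",
    "2010-10-12T09:00:09 u1 /pricing",
    "malformed line",
]

def prepare(lines):
    """Batch step (stand-in for the Hadoop job): parse raw lines, skip
    anything malformed, and summarize page views per URL."""
    views = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # raw data is kept as captured; bad rows are skipped here
        views[parts[2]] += 1
    return sorted(views.items())

def load(rows):
    """Load step: deliver the prepared summary to the analytical store."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE page_views (url TEXT PRIMARY KEY, views INTEGER)")
    db.executemany("INSERT INTO page_views VALUES (?, ?)", rows)
    db.commit()
    return db

if __name__ == "__main__":
    db = load(prepare(RAW_LOG))
    # Interactive queries run against the database, not the raw files.
    print(db.execute(
        "SELECT url, views FROM page_views ORDER BY views DESC").fetchall())
```

The division of labor is the point: the heavy, sequential pass over raw data happens once in the batch step, and the fast, ad hoc queries run against the much smaller prepared table.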

The main advantages of Hadoop are:

1. Open Source. The software is free. And free is good compared to spending millions of dollars on a relational database to handle tens of terabytes to petabytes of data (if it can). You can download individual components from the Apache Software Foundation, or purchase a “distribution” from third-party providers, such as Cloudera or IBM. A distribution is a package of Hadoop-related applications that are tested to ensure compatibility and stability and delivered with support and professional services on a subscription basis.

2. Linear Scalability. Hadoop is an MPP system that runs on commodity servers. It scales linearly as you add more servers. It has minimal overhead compared to relational databases, so it offers superior scalability.

3. Streaming. Hadoop is a file system that does not require a specialized schema or normalization to capture and store data, or a special language to access it. Therefore, Hadoop makes it possible to perform high-speed reads and writes. In addition, a new application called Flume lets Hadoop consume streaming event data. In other words, it’s easy to get large volumes of data in and out of Hadoop.

4. Unstructured data. Because of its schema-less design, Hadoop and MapReduce work well on any type of data. MapReduce interprets data at run time based on the keys and values defined in the MapReduce program. Thus, a developer can design the program to work against structured, semi-structured, or even unstructured data, such as images or text.

5. Minimal Administration. Hadoop automatically handles node failures, making it easy to administer large clusters of machines and write parallelized programs that run against the cluster.
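
Point 4 can be illustrated with a small sketch. The same raw lines are stored once, with no schema, and two different map functions decide at run time what the keys and values are. Everything here is hypothetical: the record layout, the function names, and the tiny local `run` helper that stands in for a MapReduce job.

```python
from collections import defaultdict

# Raw lines stored as-is; no schema was declared when they were written.
RAW = [
    "sensor-7,2010-10-12,21.5",
    "sensor-7,2010-10-13,22.5",
    "sensor-9,2010-10-12,19.0",
]

def map_by_sensor(line):
    """Interpret a line as (sensor id -> temperature reading)."""
    sensor, _date, temp = line.split(",")
    return sensor, float(temp)

def map_by_date(line):
    """Interpret the very same line as (date -> temperature reading)."""
    _sensor, date, temp = line.split(",")
    return date, float(temp)

def run(lines, mapper):
    """Group mapped values by key and average them (a stand-in reducer)."""
    groups = defaultdict(list)
    for line in lines:
        key, value = mapper(line)
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

if __name__ == "__main__":
    print(run(RAW, map_by_sensor))
    print(run(RAW, map_by_date))
```

Neither interpretation required changing how the data was stored. In a relational database, by contrast, the column layout would have been fixed when the table was created.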

The Future of Hadoop

We are in the early days of Hadoop. There is a tremendous amount of excitement and energy around the initiative. The open source community is innovating quickly and bringing to market new capabilities that make Hadoop more database-like and a better partner in corporate data centers. For example, the community has introduced Hive, a SQL-like language that generates MapReduce programs under the covers and makes Hadoop appear more like a relational engine. It has also released Pig, a dataflow language that makes it easier to create MapReduce transformation logic than writing low-level Java.
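
The idea behind Hive, a declarative query turned into MapReduce steps, can be sketched as follows. This is a conceptual toy, not Hive's actual compiler or API: `compile_group_by` and `run_local` are hypothetical helpers, and the query they mimic is shown in a comment.

```python
from collections import defaultdict

def compile_group_by(group_col, sum_col):
    """Turn a declarative 'SUM(sum_col) GROUP BY group_col' request into
    a (mapper, reducer) pair, roughly as a SQL-on-MapReduce layer might."""
    def mapper(row):
        return row[group_col], row[sum_col]

    def reducer(key, values):
        return key, sum(values)

    return mapper, reducer

def run_local(rows, mapper, reducer):
    """Single-process stand-in for a MapReduce run: map, group, reduce."""
    groups = defaultdict(list)
    for row in rows:
        key, value = mapper(row)
        groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

if __name__ == "__main__":
    clicks = [
        {"site": "news", "bytes": 120},
        {"site": "shop", "bytes": 300},
        {"site": "news", "bytes": 80},
    ]
    # Mimics: SELECT site, SUM(bytes) FROM clicks GROUP BY site
    mapper, reducer = compile_group_by("site", "bytes")
    print(run_local(clicks, mapper, reducer))
```

The person asking the question only names the columns; the map and reduce functions are generated for them. That is the sense in which such a layer makes Hadoop feel more like a relational engine.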

Conversely, some BI vendors are adopting elements of Hadoop. For example, database vendors, such as Aster Data and Greenplum, have added support for MapReduce. And many relational database and ETL vendors, such as Pentaho and Talend, have implemented or announced bidirectional interfaces for moving data in and out of Hadoop. In addition, BI vendors, led by Datameer, are working on JDBC interfaces to Hadoop so users can execute reports and queries against Hadoop from the confines of their favorite BI tool. Expect a slew of announcements this year from the likes of MicroStrategy, SAP BusinessObjects, IBM Cognos, and others supporting Hadoop.

It's clear that we’ve entered the era of big data analytics. And frameworks, such as Hadoop, are helping to advance our ability to generate valuable insights from large volumes of data and new data types. Just as exciting, the developer and data communities are converging to address large-scale data issues. And while our language and approaches may differ, it won’t be long before we all sing the same tune with the same words.

Posted on October 15, 2010
