4 Technical Blogs on Apache Spark, Hadoop and Data Lakes From Pentaho’s CTO

Here’s a quick summary of four of our favorite blogs from our Chief Geek, also known as the Lord of the Ones and Zeroes, James Dixon.

The integration we launched last year lets Pentaho Data Integration orchestrate Spark jobs, so that Spark can be coordinated with the rest of your data architecture. Like Hadoop, Spark has come a long way since it was created as a scalable in-memory solution for one data scientist. Since then, Spark has gained the ability to answer SQL queries, some support for multi-user concurrency, and the ability to run computations against streaming data using micro-batches. Spark itself has no storage layer, so it makes sense to run Spark on YARN: HDFS stores the data, and Spark serves as the processing engine on the data nodes.
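To make the micro-batch idea concrete, here is a minimal sketch in plain Python. It does not use Spark's API; the function names `micro_batches` and `count_per_batch`, the event shape, and the interval are all assumptions made up for illustration. It only shows the general pattern: slice a stream into small fixed-width time windows, then run an ordinary batch computation on each slice.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical event shape: (timestamp in seconds, key)
Event = Tuple[float, str]


def micro_batches(events: Iterable[Event], interval: float) -> Dict[int, List[Event]]:
    """Group events into fixed-width time windows (micro-batches)."""
    batches: Dict[int, List[Event]] = defaultdict(list)
    for ts, key in events:
        # Events whose timestamps fall in the same window share a batch id.
        batches[int(ts // interval)].append((ts, key))
    return dict(batches)


def count_per_batch(events: Iterable[Event], interval: float) -> Dict[int, Dict[str, int]]:
    """Run the same small batch job (a count per key) on every micro-batch."""
    result: Dict[int, Dict[str, int]] = {}
    for batch_id, batch in sorted(micro_batches(events, interval).items()):
        counts: Dict[str, int] = defaultdict(int)
        for _, key in batch:
            counts[key] += 1
        result[batch_id] = dict(counts)
    return result


if __name__ == "__main__":
    sample = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "a"), (2.7, "a")]
    for batch_id, counts in count_per_batch(sample, interval=1.0).items():
        print(batch_id, counts)
```

A real streaming engine receives events continuously and processes each window as it closes; this sketch groups a finished list only to keep the example short.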

  • It’s not just about big data, it’s a whole new platform with new capabilities – so expect some differences.
  • Don’t get rid of your data warehouse. No database has managed to replace every other database, and none ever will, because the variety of use cases is too large.
  • Think about your data supply chain. Integration matters.
  • It’s complicated. If you want to put it in production, it’ll be a production.

This is an awesome summary of how having a data lake makes it easier to answer business questions about things that change over time, even while many business systems are based on point-in-time (or “state”) information.
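As a rough illustration of the difference between state and change history, the sketch below keeps every change to a record rather than only its latest value, then rebuilds the state as of any past day. The `Change` record, the `state_as_of` function, the integer "day" field, and the sample sales-lead data are all hypothetical; this is one simple way to model the idea, not a description of how any particular data lake stores data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Change:
    """One recorded change to a record (hypothetical schema)."""
    day: int         # day on which the change happened
    record_id: str   # which record changed
    status: str      # the record's new status


def state_as_of(log: List[Change], day: int) -> Dict[str, str]:
    """Rebuild each record's status as it stood at the end of the given day."""
    state: Dict[str, str] = {}
    for change in sorted(log, key=lambda c: c.day):
        if change.day > day:
            break  # later changes had not happened yet
        state[change.record_id] = change.status
    return state


if __name__ == "__main__":
    log = [
        Change(1, "lead-1", "new"),
        Change(2, "lead-2", "new"),
        Change(3, "lead-1", "qualified"),
        Change(5, "lead-1", "won"),
    ]
    print(state_as_of(log, 4))  # what the operational system would have shown on day 4
    print(state_as_of(log, 5))  # what it shows now
```

An operational system that stores only the current status can answer the second question but not the first; keeping the change log is what makes questions about the past answerable.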

This is a great summary of the top articles and debates about data lakes, with links to relevant videos and more. The post covers how people use the term “data lake” and some of the back-and-forth commentary on the concept. Note that Dan Woods at Forbes was one of the earliest people to discuss the idea of the data lake. Lastly, the definitions of data lakes, and the stories people tell about them, vary widely.
