Tag Archives: Google Cloud

Hadoop Developer Training by Matt

100% PowerPoint free!  Learn at your own pace with 10 hours of video training and 15+ hours of in-depth exercises created by Matt Pouttu-Clarke and available exclusively through Harvard Innovation Labs on the Experfy Training Platform.

  • Learn to develop industrial strength MapReduce applications hands-on with tricks of the trade you will find nowhere else.
  • Apply advanced concepts such as Monte Carlo Simulations, Intelligent Hashing, Push Predicates, and Partition Pruning.
  • Learn to produce truly reusable User Defined Functions (UDFs) which stand the test of time and work seamlessly with multiple Hadoop tools and distributions.
  • Learn the latest industry best practices on how to utilize Hadoop ecosystem tools such as Hive, Pig, Flume, Sqoop, and Oozie in an Enterprise context.

Click here for more info!

Apache Beam vs Apache Spark comparison

Google recently released a detailed comparison of the programming models of Apache Beam vs. Apache Spark. FYI: Apache Beam used to be called Cloud DataFlow before it was open sourced by Google:

https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison1

Beam vs Spark
Spark requires more code than Beam for the same tasks

Here’s a link to the academic paper by Google describing the theory underpinning the Apache Beam execution model:

http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

When combined with Apache Spark’s severe tech resourcing issues caused by mandatory Scala dependencies, it seems that Apache Beam has all the bases covered to become the de facto streaming analytic API.  The cool thing is that by using Apache Beam you can switch run time engines between Google Cloud, Apache Spark, and Apache Flink.  A generic streaming API like Beam also opens up the market for others to provide better and faster run times as drop-in replacements.  Google is the perfect stakeholder because they are playing the cloud angle and don’t seem to be interested in supporting on-site deployments.  Hats off Google, and may the best Apache Beam run time win!