Category Archives: Google Cloud

C is the new Java

In my previous post, "Game Over for VMs, what's next?", I made the case that the underlying concerns that gave rise to virtual machine technology no longer exist.  Container technologies such as Docker have made all VMs, including the Java VM, unnecessary.  This post documents an experiment quantifying the cost of using VM technology for a specific, non-trivial use case.

Experimental Design

Design principles:

  1. A non-trivial use case with real world applications.
  2. Terabytes or even petabytes of data readily available for testing.
  3. A reputable third party has developed and open sourced functionally identical code that runs in both native and VM environments.

The design I settled upon was to test regex processing using Google’s RE2 library against the Common Crawl data set.  Google has open sourced RE2 in both its native (C++) and Java forms.

Results

I developed a code base, written in both C and Java, to extract potential phone numbers from the raw text format provided by Common Crawl.  The goal was to extract anything that looked like a phone number from the crawl data and to output the original matching line along with a standardized phone number.  The intention was to use this program as a high-performance upstream filter to “boil the ocean” and surface every potential instance of a phone number; later downstream processing could then determine whether the candidates were valid, in-service numbers, but that was not the job of this stage.  A sketch of this kind of filter appears below.
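For illustration, here is a minimal Java sketch of such a filter built on Google’s RE2/J library (com.google.re2j).  The regex and the digits-only standardization are simplifying assumptions for the sketch, not the actual getphone implementation:

import com.google.re2j.Matcher;
import com.google.re2j.Pattern;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class GetPhoneSketch {
    // Loose pattern: anything resembling a 10-digit phone number with
    // optional punctuation. Deliberately over-matches; downstream
    // processing is responsible for validation.
    private static final Pattern PHONE =
            Pattern.compile("\\(?\\d{3}\\)?[-. ]?\\d{3}[-. ]?\\d{4}");

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            Matcher m = PHONE.matcher(line);
            while (m.find()) {
                // "Standardize" by stripping everything but digits.
                String digits = m.group().replaceAll("[^0-9]", "");
                System.out.println(digits + "\t" + line);
            }
        }
    }
}

A native version would follow the same read/match/emit loop against the C++ RE2 API.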

The following command was used to execute the C test:

cat ~/Downloads/CC-MAIN-20160524002110-00000-ip-10-185-217-139.ec2.internal.warc.wet | time ./getphone-c/Release/getphone-c > getphone-c.txt

The following command was used to execute the Java test:

cat ~/Downloads/CC-MAIN-20160524002110-00000-ip-10-185-217-139.ec2.internal.warc.wet | time java -jar getphone-java/target/getphone-main.jar > getphone-java.txt

Both tests were executed on an uncompressed 412 MiB crawl file, each preceded by an identical warmup run.  Both tests saturated a single core throughout.

Throughput and memory, native vs. VM:

Environment    Throughput (MiB/sec)    Memory (KiB)
native         380.6                   714
VM             20.4                    65536
% savings      94.6%                   98.9%

Conclusions

This test assumes that Google puts reasonable effort into tuning both the native and Java versions of RE2.  For this use case the native implementation delivered nearly 19 times the throughput of the VM implementation (380.6 vs. 20.4 MiB/sec) in roughly 1% of the memory, which translates directly into hardware ROI: the same work gets done on a small fraction of the compute and memory spend.

Hadoop Developer Training by Matt

100% PowerPoint free!  Learn at your own pace with 10 hours of video training and 15+ hours of in-depth exercises created by Matt Pouttu-Clarke and available exclusively through Harvard Innovation Labs on the Experfy Training Platform.

  • Learn to develop industrial strength MapReduce applications hands-on with tricks of the trade you will find nowhere else.
  • Apply advanced concepts such as Monte Carlo Simulations, Intelligent Hashing, Push Predicates, and Partition Pruning.
  • Learn to produce truly reusable User Defined Functions (UDFs) which stand the test of time and work seamlessly with multiple Hadoop tools and distributions.
  • Learn the latest industry best practices on how to utilize Hadoop ecosystem tools such as Hive, Pig, Flume, Sqoop, and Oozie in an Enterprise context.

Click here for more info!

Hyperdimensionality and Big Data

In this video training, Matt explains how hyperdimensional reasoning implicitly plays a part in all big data analyses and how today’s analytics and deep learning can utilize hyperdimensionality to improve accuracy and reduce algorithmic blind spots.

Watch on YouTube:

Apache Beam vs Apache Spark comparison

Google recently released a detailed comparison of the programming models of Apache Beam vs. Apache Spark. FYI: Apache Beam used to be called Cloud Dataflow before it was open sourced by Google:

https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

[Figure: Beam vs Spark. Spark requires more code than Beam for the same tasks.]

Here’s a link to the academic paper by Google describing the theory underpinning the Apache Beam execution model:

http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

When combined with Apache Spark’s severe tech-resourcing issues caused by its mandatory Scala dependencies, it seems that Apache Beam has all the bases covered to become the de facto streaming analytics API.  The cool thing is that by using Apache Beam you can switch runtime engines between Google Cloud, Apache Spark, and Apache Flink, as sketched below.  A generic streaming API like Beam also opens up the market for others to provide better and faster runtimes as drop-in replacements.  Google is the perfect stakeholder because they are playing the cloud angle and don’t seem to be interested in supporting on-site deployments.  Hats off to Google, and may the best Apache Beam runtime win!
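To make the engine swap concrete, here is a minimal Beam pipeline sketch in Java.  The pipeline itself is a stand-in (the bucket paths and the filter are illustrative, not code from the comparison above); the point is that nothing in it names an execution engine:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;

public class GrepPipeline {
    public static void main(String[] args) {
        // The engine is chosen at launch via --runner=DataflowRunner,
        // SparkRunner, or FlinkRunner; the pipeline code does not change.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.read().from("gs://example-bucket/input/*.txt"))   // illustrative path
         .apply(Filter.by((String line) -> line.contains("phone")))
         .apply(TextIO.write().to("gs://example-bucket/output/matches"));

        p.run().waitUntilFinish();
    }
}

Launching the same jar with a different --runner flag (plus whatever engine-specific options that runner needs) executes the identical pipeline on the other engine.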