Category Archives: Cloud Computing

Location independent computing providing capacity on demand, automated scaling, and automated deployment for applications.

C is the new Java

In my previous post Game Over for VMs, what’s next?  I made the case that the underlying concerns that gave rise to virtual machine technology no longer exist.  Container technologies such as Docker have made all VMs, including the Java VM, unnecessary.  This post documents an experiment defining the cost of using VM technologies for a specific and non-trivial use case.

Experimental Design

Design principles:

  1. A non-trivial use case with real world applications.
  2. Terabytes or even petabytes of data readily available for testing.
  3. A reputable third party has developed and open sourced functionally identical code which runs in both a native and a VM environment.

The design I settled upon was to test regex processing using Google’s RE2 library against the Common Crawl data set.  Google has open sourced RE2 in both it’s C and Java forms.

Results

I developed a code base to extract potential phone numbers from the raw text format provided by Common Crawl written in both C and Java.  The goal was to extract anything that looked like a phone number from the crawl data and provide the original matching line along with a standardized phone number as output.  The intention was to use this program as a high performance upstream filter to “boil the ocean” and provide every potential instance of a phone number.  Later downstream processing could then be used to validate and identify if the phone numbers were valid and actual, but it was not the intention of this process to do so.

The following command was used to execute the C test:

cat ~/Downloads/CC-MAIN-20160524002110-00000-ip-10-185-217-139.ec2.internal.warc.wet | time ./getphone-c/Release/getphone-c > getphone-c.txt

The following command was used to execute the Java test:

cat ~/Downloads/CC-MAIN-20160524002110-00000-ip-10-185-217-139.ec2.internal.warc.wet | time java -jar getphone-java/target/getphone-main.jar > getphone-java.txt

Both tests were executed on an uncompressed 412 MiB crawl file with an identical preceding warmup test.  Both tests saturated a single core throughout the test.

Throughput native vs VM

Environment Throughput (Mib/sec) Memory (KiB)
native 380.6 714
VM 20.4 65536
% savings 94.6% 98.9%

Conclusions

This test assumes that Google puts reasonable effort into tuning both the C and Java versions of RE2.  The test shows that for this use case the native implementation produces 20 times the hardware ROI of the VM implementation.

Game Over for VMs, what’s next?

That includes the Java VM.  Yes, you heard it.  I’ve been writing Java since 1996 and in 2016 I can officially say that all the reasons I supported Java for all these years no longer apply.  I accurately predicted the rise of Java when the technology was literally a laughing stock, and I have stuck with it for very good reasons until now.  What’s Changed?  And more importantly: What’s Next?

FYI: this post relates to mission critical enterprise software, not desktop software

What’s Changed?

  • Back in the day, there were legions of different processors, endian issues, and not even agreement on how big a “byte” was.  Now only three viable von Neumann architectures exist:  Intel, Intel, and Intel.  The von Neumann architecture itself is dead too.  We’ll get to that.
  • In yesteryear, networks were slow so you had to move the code around to avoid moving the data across the network.  Java Applets and Hadoop are good examples of this.  Java was ideal for this because of platform independence, dynamic class loading, and dynamic compilation to native code.  Now it is actually the software which is slowing the networks today, not the other way around.  We’ll get to that.
  • In the old days, operating systems vied for superiority, spreading FUD as they went.  No one knew who would win (nail-biter).  Now there are only three operating systems vying for dominance: and they are all flavors of Linux.

What’s Next?

Spinning up a Linux container has literally almost no overhead, and yet has enterprise class resource management, security, and robustness.  The industry currently focuses on microservices as a design pattern for client side applications, however this pattern applies equally to server-side applications as well.  New Linux flavors like CoreOS and Alpine build on this concept where everything except the kernel is a microservice operating in a container.  This allows very high levels of performance, security, and efficiency in the kernel that all the other services rely upon.  These new server-side microservice platforms provide all the enterprise class deployment, management, monitoring, security, and interoperability that the Java Platform delivered 21 years ago, without the need for a virtual machine of any kind.  Server-side microservices provide both resource isolation and maximum performance at the same time: at a level which neither the Java VM nor any VM can come close to matching.  And what would be the language of choice for implementing these world changing server-side microservices?

The OS itself is implemented in C, so naturally any server-side microservice not implemented in C will have a very hard time competing with those who are.  Note that the container model completely eliminates the normal native package management hell associated with C, even to the point where an “Apple Store” for containers was recently announced by Docker.

Marketplaces like the Docker Store allow purchasing an entire server cluster pre-configured with server-side microservices of your choice on any cloud platform or even in a local bare metal data center.  The same solution also solves the cloud vendor lock-in that many companies have been struggling with.  Like I said: GAME OVER

On a final note: von Neumann microprocessors no longer fit Moore’s Law and price / performance ratios been degrading for some time now.  The data volumes and low latency requirements of the Internet-of-Things will soon place unbearable pressures on the von Neumann microprocessor model.  Programmable hardware such as FPGA have traditionally required learning different languages and complete software re-write to take advantage of programmable processor architecture.  Seymour Cray founded a company 20 years ago with this exact realization, and the graphic below says it all.  If you want to leverage the power of the Saturn FPGA cartridge, you’ll be writing your code in C, thank you very much 🙂

saturn-i-comparisons-041013-2-e1422647383272

 

 

Hadoop Developer Training by Matt

100% PowerPoint free!  Learn at your own pace with 10 hours of video training and 15+ hours of in-depth exercises created by Matt Pouttu-Clarke and available exclusively through Harvard Innovation Labs on the Experfy Training Platform.

  • Learn to develop industrial strength MapReduce applications hands-on with tricks of the trade you will find nowhere else.
  • Apply advanced concepts such as Monte Carlo Simulations, Intelligent Hashing, Push Predicates, and Partition Pruning.
  • Learn to produce truly reusable User Defined Functions (UDFs) which stand the test of time and work seamlessly with multiple Hadoop tools and distributions.
  • Learn the latest industry best practices on how to utilize Hadoop ecosystem tools such as Hive, Pig, Flume, Sqoop, and Oozie in an Enterprise context.

Click here for more info!

Hyperdimensionality and Big Data

In this video training, Matt explains how hyperdimentional reasoning implicitly plays a part in all big data analyses and how today’s analytics and deep learning can utilize hyperdimensionality to improve accuracy and reduce algorithmic blind spots.

Watch on YouTube:

Oracle SWiS DAX APIs: instant classic

The Software In Silicon Data Analytic Accelerator (SWiS DAX) APIs released by Oracle this week signify a sea change for big data and fast data analytic processing.  SWiS-DAXNatively accelerated common analytic functions usable in C, Python, and Java have already shown a 6x lift for a Spark cube building applicationApache Flink and Apache Drill completely eclipse Spark performance so it will be very interesting to see upcoming benchmarks of these higher performing frameworks on SWiS DAX.  There is nothing to keep any vendor or group from bench marking with these APIs as they will work with any C, Python, or Java application.

I’m also looking forward to testing performance of SWiS DAX on non-partitionable data sets in a big memory SMP architecture as well.  The easy problems are partitionable, and true data discovery should allow any-to-any relations without injecting a-priori partitioning assumptions.

It seems that Oracle’s long standing commitment to developing Sun’s Sparc processors is about to pay off in a very big way for big data and fast data analysts.

The sound of one hand clapping?

bubble-burstTo me this weekend wasn’t the Panthers vs. Broncos match-up for Super Bowl 50, or when we found out that Bernie Sanders won the New Hampshire primary.  Although both of these were hoooge: it WAS when these parallel but significant facts emerged:

  1. Google makes it’s historical first open source contribution to the Apache Foundation in the form of Apache Beam
  2. Apache Beam supports three run time engines: Google Cloud, Apache Spark, and Apache Flink
  3. Independent reproducible academic research shows that Apache Flink handily out-performs Apache Spark on in-memory terasort workload
  4. Google releases a rigorous point-by-point comparison showing that Apache Beam’s programming model requires less code than Apache Spark to do the same tasks

So for whoever drank the Spark cool-aid let me translate: you write more code to do things more slowly *and* now have the privilege of competing head-to-head with Google.

This is what’s called a bubble, folks.

Please VC funders: end this quickly it’s more painless that way.  And don’t put Spark on your resume because then people might notice the cool-aid stains.

Apache Beam vs Apache Spark comparison

Google recently released a detailed comparison of the programming models of Apache Beam vs. Apache Spark. FYI: Apache Beam used to be called Cloud DataFlow before it was open sourced by Google:

https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison1

Beam vs Spark
Spark requires more code than Beam for the same tasks

Here’s a link to the academic paper by Google describing the theory underpinning the Apache Beam execution model:

http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

When combined with Apache Spark’s severe tech resourcing issues caused by mandatory Scala dependencies, it seems that Apache Beam has all the bases covered to become the de facto streaming analytic API.  The cool thing is that by using Apache Beam you can switch run time engines between Google Cloud, Apache Spark, and Apache Flink.  A generic streaming API like Beam also opens up the market for others to provide better and faster run times as drop-in replacements.  Google is the perfect stakeholder because they are playing the cloud angle and don’t seem to be interested in supporting on-site deployments.  Hats off Google, and may the best Apache Beam run time win!

Feel the Beam! Google Casts a new Light on Big Data

using-lasers-to-preserve-mt-rushmoreApache Beam from Google finally provides robust unification of batch and real-time Big Data.  This framework replaced MapReduce, FlumeJava, and Millwheel at Google.  Major big data vendors already contributed Apache Beam execution engines for both Flink and Spark, before Beam even officially hit incubation.  Anyone else seeing the future of Big Data in a new light?  I know I am…

Academic underpinning: http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

Google’s comparison of Apache Beam vs. Apache Spark: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison

Streaming Feature Extraction for Unions with statzall

unions-cSupporting unions for fields with multiple types makes for more robust and automated feature extraction.  For example, “account numbers” may contain business relevant strings or spaces due to different data stewards or external data providers.

Rather than transforming all numbers to String, statzall takes the opposite approach and packs Strings into doubles using the open source Unibit Encoding.  This allows extremely efficient feature extraction of basic data science primitives using existing hardened APIs such as CERN Colt.  With single-threaded performance of 8 mm / sec automated feature extraction on *all* measures becomes possible.

In addition, statzall provides support for one-click deployment to your Hadoop YARN cluster using CDAP.  Or if you use cloud, you can literally set up a fully automated Internet-scale feature extraction cluster in 5 minutes using the Coopr Cloud.

Titan: Cassandra vs. Hazelcast persistence benchmark

10 node load test comparison using Amazon EC2 SSD-based instances.  1 billion vertices and 1 billion edges processed for each test run.  Used the titan-loadtest project to run each test.

Method

Experiment maximizes data locality by co-locating load generation, Titan graph database, and Cassandra/Hazelcast within the same JVM instance while partitioning data across a cluster. Exploration of methods for tuning garbage collection, Titan, and Cassandra for the peer computing use case.

The following components were utilized during the experiment:

Technology Version
RHEL x64 HVM AMI 6.4
Oracle JDK x64 1.7_45
Apache Cassandra 1.2.9
Hazelcast 3.1.1
Titan 0.3.2

Each test iteration has 6 read ratio phases starting with 0% reads (100% writes) all the way up to 90% reads and 10% writes.  For all tests, the persistence implementation executes in the same JVM as Titan to avoid unnecessary context switching and serialization overhead.  Tests were conducted using an Amazon placement group to ensure instances resided on the same subnet.  The storage was formatted with 4K blocks and used the noop scheduler to improve latency.

For each phase, new vertices were added with one edge linking back to a previous vertex.  No tests of update or delete were conducted.

Please see the titan-loadtest project above for all Cassandra and Hazelcast settings and configurations used in the test.

Results

Titan 10 node Summary
Titan 10 node Summary

Please note: the results are listed in rates of thousands of vertices per second and include the creation of an edge as well as a vertex. Also, the Hazelcast SSD x1 results used a custom flash storage module for Hazelcast developed privately so those results are not replicable without that module installed.

Conclusions

Hazelcast performed better than Cassandra for all tests and demonstrated one order of magnitude better performance on reads.  Surprisingly, Hazelcast slightly outperformed Cassandra for writes as well.