Why I tried Apache Spark, and moved on

I tried Apache Spark, and moved on.  Here’s why:

Resourcing Issues

Apache Spark is written in Scala, which causes severe resourcing issues for customers because of the additional technical skills it demands:

  1. Scala ranks #30 with a 0.5% rating on the TIOBE index, while Java ranks #1 with 21.5%, a 43-fold gap: http://www.tiobe.com/index.php/content/paperinfo/tpci/index.html
  2. The introduction of native functional programming constructs (lambda expressions) into the Java language with Java 8 (release 1.8) practically eliminates the business case for Scala altogether: https://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html
  3. Scala works best with the IntelliJ IDEA IDE, which has licensing costs and is extremely unlikely to replace free Eclipse tooling at any large company
  4. Scala competes in a crowd of strong contenders and chases a moving target: Java gained 5 percentage points of market share between 2015 and 2016. To put this in perspective, Scala has less market share than Lisp
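
To make point 2 concrete, here is a minimal sketch of the functional style that Java 8 allows, using only the standard library. The class name `WordCount` and its `count` method are hypothetical names chosen for illustration; nothing here comes from Spark or Scala.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical example class, invented for this illustration.
class WordCount {
    // Counts word occurrences with Java 8 lambdas and the Stream API.
    // A TreeMap is used so the result iterates in sorted key order.
    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // Split each line into lower-cased words.
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                // Drop the empty token produced by leading whitespace.
                .filter(word -> !word.isEmpty())
                // Group identical words and count each group.
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("to be or not to be", "be quick")));
    }
}
```

The pipeline of `flatMap`, `filter`, and `collect` reads much like the equivalent Scala collection code, which is the substance of the argument that plain Java now covers this ground.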

Consistency and Integrity Issues

Trying to get Spark to meet rigorous standards of data consistency and integrity proves difficult. Apache Spark’s design originates from companies that consider data consistency and data integrity secondary concerns, while most industries consider them primary concerns. For example, achieving exactly-once semantics from Spark Streaming requires numerous workarounds and hacks: http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/
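
To give a feel for the kind of workaround involved, here is a minimal sketch of one common approach to exactly-once results: commit each result together with the source offset in a single atomic step, and skip any record whose offset was already committed. This is an illustration under stated assumptions, not Spark's API: the class `IdempotentSink` is hypothetical, it models a single partition with strictly increasing offsets, and a lock stands in for a real database transaction.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory sink, standing in for a transactional store.
// Assumes one partition whose offsets strictly increase.
class IdempotentSink {
    private final Map<String, Long> totals = new HashMap<>();
    private long lastCommittedOffset = -1L;

    // Applies a record only if its offset is beyond the last committed one.
    // The total and the offset change under one lock, which plays the role
    // of a single transaction in a real store.
    synchronized boolean apply(long offset, String key, long amount) {
        if (offset <= lastCommittedOffset) {
            return false; // replayed record: already applied, so skip it
        }
        totals.merge(key, amount, Long::sum);
        lastCommittedOffset = offset;
        return true;
    }

    // Returns the running total for a key, or 0 if the key was never seen.
    synchronized long total(String key) {
        return totals.getOrDefault(key, 0L);
    }

    public static void main(String[] args) {
        IdempotentSink sink = new IdempotentSink();
        sink.apply(0, "orders", 5);
        sink.apply(1, "orders", 7);
        sink.apply(1, "orders", 7); // redelivery after a simulated failure
        System.out.println(sink.total("orders"));
    }
}
```

The point of the sketch is that the application, not the framework, ends up owning the offset bookkeeping, which is exactly the sort of extra work I am complaining about.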

Dependency Hell with a Vengeance

Apache Spark (and Scala) pull in a huge number of transitive dependencies compared to alternative technologies. Programmers must master all of those dependencies in order to master Spark. No wonder very few true Spark experts exist in the market today.

What’s the Alternative to Spark?

For the real-time in-memory processing use case: data grids, once the purview of blue-chip commercial vendors, now have very strong open source competition. Primary contenders include Apache Ignite and Hazelcast.

For the fast SQL analytics (OLAP) use case: Apache Drill provides similar performance to Spark SQL with a much simpler, more efficient, and more robust footprint. Apache Kylin from eBay looks set to become a major OLAP player very quickly, although I have not used it myself.

For the stream processing use case: Apache Beam from Google looks likely to become the de facto streaming workhorse, unseating both Apache Flink and Spark Streaming. Major big data vendors had already contributed Apache Beam execution engines for both Flink and Spark before Beam even officially hit incubation.

If you try these alternative technologies and compare them to Spark, I’m sure you’ll agree that Spark isn’t worth the headache.