The Software in Silicon Data Analytics Accelerator (SWiS DAX) APIs released by Oracle this week signify a sea change for big data and fast data analytic processing. Natively accelerated common analytic functions, callable from C, Python, and Java, have already shown a 6x lift for a Spark cube-building application. Apache Flink and Apache Drill already eclipse Spark performance on many workloads, so it will be very interesting to see upcoming benchmarks of these higher-performing frameworks on SWiS DAX. Nothing keeps any vendor or group from benchmarking with these APIs, as they work with any C, Python, or Java application.
I’m also looking forward to testing the performance of SWiS DAX on non-partitionable data sets in a big-memory SMP architecture. The easy problems are partitionable; true data discovery should allow any-to-any relations without injecting a priori partitioning assumptions.
It seems that Oracle’s long-standing commitment to developing Sun’s SPARC processors is about to pay off in a very big way for big data and fast data analysts.
Scala works best with the IntelliJ IDEA IDE, which carries licensing costs and is extremely unlikely to replace free Eclipse tooling at any large company.
Scala is among a crowd of strong contenders and faces a moving target: Java gained 5% in market share between 2015 and 2016. To put this in perspective, Scala has less market share than Lisp.
Consistency and Integrity Issues
Getting Spark to meet rigorous standards of data consistency and integrity proves difficult. Apache Spark’s design originates from companies that consider data consistency and data integrity secondary concerns, while most industries consider them primary concerns. For example, moving beyond at-most-once and at-least-once semantics to exactly-once delivery in Spark Streaming requires numerous workarounds and hacks: http://blog.cloudera.com/blog/2015/03/exactly-once-spark-streaming-from-apache-kafka/
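The heart of those workarounds is an idempotent sink: commit each micro-batch’s output together with its Kafka offsets in a single atomic transaction, so a replayed batch can be detected and dropped. Here is a minimal, hypothetical sketch of that pattern in plain JDBC; the table names and method are illustrative, not Spark or Kafka APIs.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class TransactionalSink {
    // Write one micro-batch and advance the stored offset atomically
    public static void writeBatch(String jdbcUrl, String topic, int partition,
                                  long fromOffset, long untilOffset,
                                  Iterable<String> records) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
            conn.setAutoCommit(false); // one transaction per micro-batch
            try (PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO results(value) VALUES (?)");
                 PreparedStatement advance = conn.prepareStatement(
                     "UPDATE offsets SET until_offset = ? " +
                     "WHERE topic = ? AND part = ? AND until_offset = ?")) {
                for (String record : records) {
                    insert.setString(1, record);
                    insert.addBatch();
                }
                insert.executeBatch();
                advance.setLong(1, untilOffset);
                advance.setString(2, topic);
                advance.setInt(3, partition);
                advance.setLong(4, fromOffset); // matches only if not yet applied
                if (advance.executeUpdate() != 1) {
                    conn.rollback(); // replayed batch: discard it
                    return;
                }
                conn.commit();
            }
        }
    }
}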
Dependency Hell with a Vengeance
Apache Spark (and Scala) pull in a huge number of transitive dependencies compared to alternative technologies. Programmers must master all of those dependencies in order to master Spark. No wonder very few true Spark experts exist in the market today.
What’s the Alternative to Spark?
For the real-time in-memory processing use case: data grids, once the purview of blue-chip commercial vendors, now have very strong open source competition. Primary contenders include Apache Ignite and Hazelcast.
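To give a taste of how little ceremony the open source grids require, here is a minimal Hazelcast sketch, assuming only the hazelcast jar on the classpath; Apache Ignite’s cache API follows much the same shape.

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class GridDemo {
    public static void main(String[] args) {
        // Start (or join) a cluster node in-process
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        // A distributed, partitioned map shared across the cluster
        IMap<String, Double> prices = hz.getMap("prices");
        prices.put("ORCL", 41.50);
        System.out.println(prices.get("ORCL"));

        hz.shutdown();
    }
}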
For the fast SQL analytics (OLAP) use case: Apache Drill provides performance similar to Spark SQL with a much simpler, more efficient, and more robust footprint. Apache Kylin from eBay looks set to become a major OLAP player very quickly, although I have not used it myself.
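Drill speaks standard JDBC, so kicking the tires takes little more than the driver jar. A minimal sketch against an embedded Drillbit, querying the sample file Drill ships on its classpath (connection string per the Drill documentation):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillDemo {
    public static void main(String[] args) throws Exception {
        // zk=local starts an embedded Drillbit; no cluster required
        try (Connection conn =
                 DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT full_name FROM cp.`employee.json` LIMIT 5")) {
            while (rs.next()) {
                System.out.println(rs.getString("full_name"));
            }
        }
    }
}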
For the stream processing use case: Apache Beam from Google looks likely to become the de facto streaming workhorse, unseating both Apache Flink and Spark Streaming. Major big data vendors have already contributed Apache Beam execution engines for both Flink and Spark, before Beam even officially entered incubation.
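Beam’s appeal is that one pipeline runs unchanged on Flink, Spark, Google Cloud Dataflow, or the local DirectRunner; the runner is selected through pipeline options at launch, not in the code. A minimal word-count-style sketch with the Beam Java SDK (file paths are placeholders):

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamDemo {
    public static void main(String[] args) {
        // The runner is chosen by these options, not by the pipeline code
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.read().from("input.txt"))      // placeholder path
         .apply(Count.perElement())                   // KV<line, occurrences>
         .apply(MapElements.into(TypeDescriptors.strings())
                .via(kv -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("counts"));         // placeholder path

        p.run().waitUntilFinish();
    }
}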
If you try these alternative technologies, and compare to Spark, I’m sure you’ll agree that Spark isn’t worth the headache.
Supporting unions for fields with multiple types makes for more robust and automated feature extraction. For example, “account numbers” may contain business-relevant strings or spaces due to different data stewards or external data providers.
Rather than transforming all numbers to Strings, statzall takes the opposite approach and packs Strings into doubles using the open source Unibit Encoding. This allows extremely efficient feature extraction of basic data science primitives using existing hardened APIs such as CERN Colt. With single-threaded performance of 8 million values per second, automated feature extraction on *all* measures becomes possible.
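The Unibit Encoding itself is statzall’s own; purely to illustrate the idea of carrying short strings inside a 64-bit double, here is a hypothetical sketch that packs up to 8 ASCII bytes into a double’s bit pattern. A production encoding must also avoid the bit patterns Java reserves for NaN and infinity, which this toy version ignores.

import java.nio.charset.StandardCharsets;

public class StringPacking {
    // Pack up to 8 ASCII bytes into the 64 bits of a double (toy version)
    static double pack(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.US_ASCII);
        if (bytes.length > 8) throw new IllegalArgumentException("max 8 bytes");
        long bits = 0L;
        for (byte b : bytes) {
            bits = (bits << 8) | (b & 0xFFL);
        }
        return Double.longBitsToDouble(bits);
    }

    // Recover the original string from the double’s raw bits
    static String unpack(double d) {
        long bits = Double.doubleToRawLongBits(d);
        StringBuilder sb = new StringBuilder();
        while (bits != 0) {
            sb.insert(0, (char) (bits & 0xFF));
            bits >>>= 8;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(unpack(pack("AB 123"))); // prints AB 123
    }
}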
In addition, statzall provides one-click deployment to your Hadoop YARN cluster using CDAP. Or, if you use the cloud, you can literally set up a fully automated Internet-scale feature extraction cluster in 5 minutes using Coopr Cloud.
Dependency collisions furnish plenty of headaches and late nights while implementing the Lambda Architecture. Java class-loading containers tend toward complexity, and many frameworks such as Hadoop, JCascalog, and Storm bring tons of transitive dependencies but no dependency isolation for third-party Lambda code deployed in the same JVM. So, currently, to benefit from running in the same JVM, all the transitive dependencies come along for the ride. This impacts:
Portability: Lambda logic shared in the batch layer and the speed layer must cope with at least two different sets of dependencies.
Manageability: framework version upgrades may cause hard failures or other incompatibilities in Lambda logic due to changing dependency versions.
Performance: solving the problem by forcing JVM isolation incurs overhead from IPC or network chatter. Likewise, utilizing scripting or dynamic languages merely hides the problem rather than solving it, while imposing additional performance penalties.
Complexity: developing Lambdas requires knowledge and accommodation of all the dependencies of all frameworks targeted for deployment.
Many developers and architects rightfully shy away from deploying bloated dependency-container frameworks such as Spring or OSGi inside a big data framework. Big data and fast data provide enough complexity already. So let’s look at how we can use a simple pattern, along with basic core Java libraries, to avoid “Dependency Hell”.
Signs you may be encountering this problem include these types of exceptions occurring after deployment:
java.lang.NoSuchMethodError: when a method your code depends on no longer exists
java.lang.ClassNotFoundException: when a class your code uses cannot be found in the deployed class path
java.lang.IllegalAccessException: when conflicting versions of an API mark methods or fields private
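When one of these surfaces, a quick diagnostic is to ask the JVM which jar a suspect class was actually loaded from. The class below is only a placeholder; note that getCodeSource() returns null for classes from the bootstrap loader.

// Prints the jar that actually supplied the class at runtime
System.out.println(
    org.apache.commons.math3.stat.regression.SimpleRegression.class
        .getProtectionDomain().getCodeSource());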
Please reference my GitHub project for a complete working implementation along with unit tests.
The Java ServiceLoader framework introduced in JDK 1.6 provides a way to dynamically bind an implementation class to an interface while specifying an alternate class loader. Loading a Lambda implementation with ServiceLoader forces all code called within the context of the Lambda to use the same class loader. This allows creating a service which supports parent-first, child-first, or child-only (hermetic) class loading. In this case, the hermetic loader prevents any possible dependency collisions. To create a hermetic loader, we simply utilize the built-in URLClassLoader in Java. Our implementation jar can reside on the local file system, on a web server, or in HDFS: anywhere a URL can point to. For the parent class loader, we specify the Java bootstrap class loader. So we can implement a hermetic class-loading pattern in one line of code:
// urls points at the implementation jar(s); Map.class.getClassLoader()
// returns null, i.e. the bootstrap loader, so only core JDK classes are shared
ServiceLoader<Map> loader = ServiceLoader.load(Map.class,
    new URLClassLoader(urls, Map.class.getClassLoader()));
Note that we intentionally avoid calling ClassLoader.getSystemClassLoader() in order to prevent the calling context (such as Hadoop or Hive) from polluting the Lambda class path. Core packages such as java.lang and java.util use the bootstrap class loader, which carries only the core Java dependencies shipped as part of the JDK. The diagram above shows how the LambdaBoot framework fits within a big data framework such as Hadoop.
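Putting it together, here is a sketch of loading and invoking a hermetic Lambda, assuming the implementation jar registers its Map subclass in META-INF/services/java.util.Map; the jar path is hypothetical.

import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.ServiceLoader;

public class LambdaLoader {
    public static void main(String[] args) throws Exception {
        // The jar can live on local disk, a web server, or HDFS
        URL[] urls = { new URL("file:///opt/lambdas/regression-lambda.jar") };

        // Bootstrap parent keeps the Hadoop/Hive class path
        // out of the Lambda entirely
        ServiceLoader<Map> loader = ServiceLoader.load(Map.class,
            new URLClassLoader(urls, Map.class.getClassLoader()));

        for (Map lambda : loader) {
            // Interact with the Lambda purely through java.util.Map
            System.out.println(lambda.get("metadata.name")); // illustrative key
        }
    }
}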
Mapping the World
In the example above, we use a Map interface to interact with the Lambda. This allows us to avoid having a separate jar containing a Service Provider Interface (SPI). Instead, we can subclass a Map implementation and provide any behavior desired by intercepting get and put calls for specific keys. By optionally returning a Future from a get or put call on the Map, we also get asynchronous interaction if desired. In addition, the non-sensitive Map keys can provide metadata facilities. In the example of linear regression implemented as a Map, we use a conflicting version of Apache Commons Math not yet supported by Hadoop to calculate the regression.
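A hypothetical sketch of that regression Lambda: a HashMap subclass that intercepts put to accumulate data points and get to expose the fitted coefficients, using Commons Math’s SimpleRegression. The key names are illustrative.

import java.util.HashMap;
import org.apache.commons.math3.stat.regression.SimpleRegression;

// Callers see only java.util.Map; the conflicting Commons Math
// version stays hidden behind the hermetic class loader
public class RegressionLambda extends HashMap<String, Object> {
    private final SimpleRegression regression = new SimpleRegression();

    @Override
    public Object put(String key, Object value) {
        if ("point".equals(key)) {           // value is a double[] {x, y}
            double[] xy = (double[]) value;
            regression.addData(xy[0], xy[1]);
        }
        return super.put(key, value);
    }

    @Override
    public Object get(Object key) {
        if ("slope".equals(key))     return regression.getSlope();
        if ("intercept".equals(key)) return regression.getIntercept();
        return super.get(key);
    }
}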
Using a very simple pattern, we can deploy hermetic Lambdas to any Java environment without fear of dependency collisions. The ServiceLoader also acts as a service registry, allowing us to browse metadata about available Lambdas, dynamically load new Lambdas, or update existing ones.