
Hermetic Lambdas: a Solution for Big Data Dependency Collisions

Dependency collisions furnish plenty of headaches and late nights while implementing the Lambda Architecture.  Java class loading containers tend toward complexity, and frameworks such as Hadoop, JCascalog, and Storm pull in scores of transitive dependencies while providing no dependency isolation for third-party Lambda code deployed in the same JVM.  So today, to gain the benefits of running in the same JVM, all of those transitive dependencies come along for the ride.  This impacts:

  • Portability: Lambda logic shared in the batch layer and the speed layer must cope with at least two different sets of dependencies.
  • Manageability: framework version upgrades may cause hard failures or other incompatibilities in Lambda logic due to changing dependency versions.
  • Performance: solving the problem by forcing JVM isolation incurs overhead from IPC or network chatter.  Likewise, utilizing scripting or dynamic languages merely hides the problem rather than solving it, while imposing additional performance penalties.
  • Complexity: developing Lambdas requires knowledge and accommodation of all the dependencies of all frameworks targeted for deployment.

Many developers and architects rightfully shy away from deploying bloated dependency-container frameworks such as Spring or OSGi inside of a big data framework.  Big data and fast data provide enough complexity already.  So let's look at how we can use a simple pattern, along with basic core Java libraries, to avoid "Dependency Hell".

Signs you may be encountering this problem include these types of exceptions occurring after deployment:

  • java.lang.NoSuchMethodError: when a method your code depends on no longer exists
  • java.lang.ClassNotFoundException: when a class your code uses cannot be found in the deployed class path
  • java.lang.IllegalAccessException: when conflicting versions of an API mark methods or fields private

Please see my GitHub project for a complete working implementation along with unit tests.


[Figure: Lambda Execution Model, showing how the Lambda executes within the big data framework in an isolated dependency context]

Hermetic Classloading

The Java ServiceLoader framework, introduced in JDK 1.6, provides a way to dynamically bind an implementation class to an interface while specifying an alternate class loader.  Loading a Lambda implementation with ServiceLoader forces all code called within the context of the Lambda to use the same class loader.  This allows creating a service which supports parent-first class loading, child-first class loading, or child-only (or hermetic) class loading.  In this case, the hermetic loader prevents any possible dependency collisions.  To create a hermetic loader, we simply use the built-in URLClassLoader in Java.  Our implementation jar can reside on the local file system, on a web server, or in HDFS: anywhere a URL can point to.  For the parent class loader, we specify the Java bootstrap class loader.  So we can implement a hermetic class loading pattern in one line of code:

// Map.class is loaded by the bootstrap loader, so getClassLoader() returns
// null, making the bootstrap loader the parent: child-only (hermetic) loading.
ServiceLoader<Map> loader = ServiceLoader.load(Map.class,
    new URLClassLoader(urls, Map.class.getClassLoader()));

Note that we intentionally avoid calling ClassLoader.getSystemClassLoader() in order to prevent the calling context (such as Hadoop or Hive) from polluting the Lambda class path.  Core packages such as java.lang and java.util use the bootstrap class loader, which carries only the core Java dependencies shipped as part of the JDK.  The diagram above shows how the LambdaBoot framework fits within a big data framework such as Hadoop.
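To make the pattern concrete, here is a minimal, self-contained sketch; the jar URL is a hypothetical placeholder, and the jar would need a META-INF/services/java.util.Map entry naming the implementation class for ServiceLoader to discover it:

import java.net.URL;
import java.net.URLClassLoader;
import java.util.Map;
import java.util.ServiceLoader;

public class HermeticLoaderExample {
    public static void main(String[] args) throws Exception {
        // The implementation jar can live anywhere a URL can point to:
        // local file system, a web server, or HDFS (placeholder path shown).
        URL[] urls = { new URL("file:///opt/lambdas/regression-lambda.jar") };

        // Child-only (hermetic) loading: the bootstrap loader is the parent.
        ServiceLoader<Map> loader = ServiceLoader.load(Map.class,
                new URLClassLoader(urls, Map.class.getClassLoader()));

        for (Map lambda : loader) {
            System.out.println("Loaded Lambda: " + lambda.getClass().getName());
        }
    }
}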

Mapping the World

In the example above, we use a Map interface to interact with the Lambda.  This allows us to avoid having a separate jar containing a Service Provider Interface (SPI).  Instead, we can implement the Map interface (or subclass an existing implementation) and provide any behavior desired by intercepting get and put calls on specific keys.  By optionally returning a Future from a get or put call on the Map, we get asynchronous interaction as well if desired.  In addition, non-sensitive Map keys can provide metadata facilities.  In the example of linear regression implemented as a Map, we use a conflicting version of Apache Commons Math not yet supported by Hadoop to calculate the regression.
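Here is a sketch of what such a Map-based Lambda might look like.  The class, key names, and packaging are illustrative rather than taken from the project; only the linear regression idea and the use of Commons Math 3 follow the text above:

import java.util.HashMap;

import org.apache.commons.math3.stat.regression.SimpleRegression;

// Hypothetical Lambda: linear regression exposed through the Map interface,
// intercepting put and get on well-known keys instead of a separate SPI jar.
public class RegressionLambda extends HashMap<String, Object> {
    private final SimpleRegression regression = new SimpleRegression();

    @Override
    public Object put(String key, Object value) {
        if ("observation".equals(key)) {
            double[] xy = (double[]) value;       // an (x, y) data point
            regression.addData(xy[0], xy[1]);
        }
        return super.put(key, value);
    }

    @Override
    public Object get(Object key) {
        if ("slope".equals(key)) {
            return regression.getSlope();          // computed on demand
        }
        if ("description".equals(key)) {
            return "simple linear regression";     // metadata key
        }
        return super.get(key);
    }
}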

[Figure: Lambda Deployment Model, showing how the LambdaBoot framework works with deployment tools such as Jenkins, Puppet, and Git]

Implications

Using a very simple pattern, we can deploy Hermetic Lambdas to any Java environment without fear of dependency collisions.  The ServiceLoader also acts as a service registry, allowing us to browse metadata about available Lambdas, dynamically load new Lambdas, or update existing Lambdas.
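Continuing the earlier sketch, browsing and refreshing the registry might look like this (the "description" metadata key is a placeholder):

// Browse metadata for every Lambda the registry can see.
for (Map lambda : loader) {
    System.out.println(lambda.getClass().getName()
            + " -> " + lambda.get("description"));
}

// After new jars become visible to the class loader, clear the
// provider cache so the next iteration picks up new Lambdas.
loader.reload();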

Titan: Cassandra vs. Hazelcast persistence benchmark

A 10 node load test comparison using Amazon EC2 SSD-based instances, with 1 billion vertices and 1 billion edges processed for each test run.  Each test was run using the titan-loadtest project.

Method

The experiment maximizes data locality by co-locating load generation, the Titan graph database, and Cassandra/Hazelcast within the same JVM instance while partitioning data across a cluster.  It also explores methods for tuning garbage collection, Titan, and Cassandra for the peer computing use case.

The following components were utilized during the experiment:

Technology          Version
RHEL x64 HVM AMI    6.4
Oracle JDK x64      1.7_45
Apache Cassandra    1.2.9
Hazelcast           3.1.1
Titan               0.3.2

Each test iteration has six read-ratio phases, starting with 0% reads (100% writes) and working all the way up to 90% reads and 10% writes.  For all tests, the persistence implementation executes in the same JVM as Titan to avoid unnecessary context switching and serialization overhead.  Tests were conducted within an Amazon placement group to ensure instances resided on the same subnet.  Storage was formatted with 4K blocks and used the noop scheduler to improve latency.

For each phase, new vertices were added with one edge linking back to a previous vertex.  No tests of update or delete were conducted.
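For context, the write path of each phase boils down to a few Blueprints calls against Titan.  This is an illustrative sketch rather than the actual titan-loadtest code; the storage directory, edge label, and batch size are placeholders:

import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.Vertex;

public class WritePhaseSketch {
    public static void main(String[] args) {
        // Local store, for illustration only.
        TitanGraph graph = TitanFactory.open("/tmp/titan-sketch");

        Vertex previous = graph.addVertex(null);
        for (int i = 1; i < 1000; i++) {
            Vertex current = graph.addVertex(null);
            // Every new vertex links back to a previous vertex with one edge.
            graph.addEdge(null, current, previous, "linksTo");
            previous = current;
        }
        graph.commit();   // flush the transaction to the storage backend
        graph.shutdown();
    }
}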

Please see the titan-loadtest project above for all Cassandra and Hazelcast settings and configurations used in the test.

Results

[Chart: Titan 10 node summary]

Please note: the results are listed as rates of thousands of vertices per second, and each operation includes the creation of an edge as well as a vertex. Also, the Hazelcast SSD x1 results used a privately developed custom flash storage module for Hazelcast, so those results are not replicable without that module installed.

Conclusions

Hazelcast performed better than Cassandra for all tests and demonstrated one order of magnitude better performance on reads.  Surprisingly, Hazelcast slightly outperformed Cassandra for writes as well.

The Elephant and the Rhino

If we use JavaScript for Code-on-Demand, couldn't we also use it for Inverse Code-on-Demand with Hadoop?  It's not as hard to find people who know JavaScript as it is to find people who know Groovy or Python, so there's less of a learning curve for everyone involved if we use the most popular scripting language for our embedded business rules.  Plus, if we build them right, we could use the same business rule across our entire ultra-high-scale web presence: from the browser to Elastic MapReduce.

Mozilla Rhino doesn't have a lot of marketing behind it, but it's been around.  Feature-rich, mature, and embeddable: Mozilla's JavaScript engine looks like a winner.  We could inject input splits and emit results with ease.

Plus if you thought the Bull in the china shop was fun, wait until you see the Elephant and the Rhino!

Make the Elephant and the Rhino your partner, and utterly destroy a china shop near you!  China Shops don’t scale anyway…

Tips for Implementing Rhino in Hadoop

Make sure to use the Java scripting API's compilation option in the setup method of your mapper or reducer, so each script compiles once per task rather than once per record.  Please see this article on how to compile scripts.  This dramatically reduces the CPU requirements and execution time for scripts.
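Here is a sketch of that compile-once pattern using the javax.script API (on JDK 7 the default "JavaScript" engine is Rhino); the rule body and binding names are placeholders:

import java.io.IOException;

import javax.script.Compilable;
import javax.script.CompiledScript;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;
import javax.script.SimpleBindings;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RhinoMapper extends Mapper<LongWritable, Text, Text, Text> {
    private CompiledScript rule;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        try {
            // Compile once per task, not once per record: on JDK 7 the
            // "JavaScript" engine is Mozilla Rhino.
            ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
            rule = ((Compilable) engine).compile("input.toUpperCase()"); // placeholder rule
        } catch (ScriptException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            SimpleBindings bindings = new SimpleBindings();
            bindings.put("input", value.toString()); // inject the record from the input split
            Object result = rule.eval(bindings);     // run the compiled business rule
            context.write(new Text(key.toString()), new Text(String.valueOf(result)));
        } catch (ScriptException e) {
            throw new IOException(e);
        }
    }
}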

Inverse REST

The principles of REST allow HTTP to scale to Internet user volumes, and code-on-demand provides one of the building blocks of REST.  Code-on-demand allows rich content in the browser via JavaScript or browser plug-ins, and the technique has matured so much that even the most sophisticated applications require minimal server interaction.

In short, code-on-demand enables scaling across users by moving the code (logic) to the consumer, instead of requiring the consumer to make a remote request to the code.

It follows logically that any successful approach to scaling across big data requires inverting REST and executing the code as close to the data as possible, rather than trying to move big data to the code.

Likewise, the web architecture scales across users by utilizing caching to provide data locality for shared data.  Scaling across big data also requires data locality for reference data. We must move our small data close to our big data to scale efficiently.
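In Hadoop terms, this is exactly what the distributed cache provides.  A minimal sketch using the Hadoop 2 mapreduce API (the job name and reference file path are placeholders):

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReferenceDataSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inverse-rest-sketch");
        // Ship small reference data to every task node, so lookups run
        // next to the data blocks instead of across the network.
        job.addCacheFile(new URI("hdfs:///ref/currency-rates.csv"));
    }
}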

The major flaw designed into many current big data applications is the failure to utilize the inverse of REST: SOA and integration vendors sell customers on the idea that big data problems can be solved by moving all data through a layer of middleware, whether that be a J2EE application server, a service bus, or an integration tool.  I have literally spent years of my career trying to tune and optimize middleware solutions for big data.  That's why I can say definitively that the middleware concept does very well at selling a lot of consulting hours, software licenses, and hardware.  What it does not do is scale to big data.

You could code all the big data logic in stored procedures, assuming a willingness to embed business logic into a closed system, and assuming that a database will scale to your data volumes.  Database vendors are only beginning to utilize Inverse REST: evaluating filters, transformations, and lookups in the storage pipeline is a new (or non-existent) feature in most DBMS systems.  Yet another opportunity for vendor lock-in.

Hadoop MapReduce provides an open-system implementation of Inverse REST.

Regardless of who wins the battle between RDBMS and MapReduce, one thing is certain: anyone not leveraging the principles of Inverse REST will be left in the dust.

Google, Yahoo, Facebook, StumbleUpon, and others have already hit the wall, and it’s only a matter of time before we all do.