Top 10 Things I Love About Cascading + Hadoop on the Cloud
My recent experience with MapReduce on the cloud really makes me happy I never got away from the tech… here’s why:
- No fixed infrastructure investments means that performance optimizations and tuning directly affect the bottom line.
- I no longer have to estimate hardware needs for a system that hasn’t been built yet.
- Everything is open source, so I don’t have to deal with software vendors. I can fix things myself if I have to, lean on the community for help, and fall back on paid third-party support if needed.
- I leverage the same software stack used by millions of other cloud developers. No vaporware survives here…
- Hadoop’s completely open MapReduce implementation allows cool solutions like using an embedded Java database to look up dimension data.
- Everything is testable using automated unit tests in the local JVM, leading to code robustness and quality far exceeding those of any other data processing stack I’ve worked on. Being able to debug the entire Hadoop technology stack in the local JVM is priceless.
- Any Java developer can pick up Cascading in a couple of months and be productive.
- Cascading joins offer more flexibility than SQL joins. For example, Cascading’s OuterJoin tolerates missing data on either side of the join condition.
- Everything is streaming and INSERT-only, so data processing best practices that are merely optional in SQL-based systems become self-evident and mandatory in MapReduce. For example, a true INSERT-only database model is fairly rare in the Oracle world; in MapReduce, an INSERT-only model is the only option available.
- I can draw any Cascading data processing design as a simple flow chart, and there is a direct correlation between the design and the finished product every time.
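To make the join point concrete: in Cascading you express an outer join by assembling pipes with a CoGroup and a joiner, and every grouping key from either side survives, with nulls filling the missing side. The sketch below is not Cascading code; it is a minimal, self-contained stdlib illustration of those outer-join semantics, with hypothetical field names (`orders`, `customers`) chosen for the example.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class OuterJoinSketch {
    // Hypothetical helper: outer-join two key->value maps.
    // Every key present on either side appears in the result;
    // the missing side is represented as null, mirroring what an
    // outer join over grouped tuples produces.
    static Map<String, String[]> outerJoin(Map<String, String> lhs,
                                           Map<String, String> rhs) {
        Set<String> keys = new TreeSet<>(lhs.keySet());
        keys.addAll(rhs.keySet());
        Map<String, String[]> out = new TreeMap<>();
        for (String k : keys) {
            out.put(k, new String[] { lhs.get(k), rhs.get(k) }); // null when absent
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> orders    = Map.of("c1", "order-42", "c2", "order-7");
        Map<String, String> customers = Map.of("c2", "Alice",    "c3", "Bob");
        for (Map.Entry<String, String[]> e : outerJoin(orders, customers).entrySet())
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        // c1 -> [order-42, null]
        // c2 -> [order-7, Alice]
        // c3 -> [null, Bob]
    }
}
```

A SQL FULL OUTER JOIN gives you the same result set, but in Cascading the joiner is just a pluggable strategy on the pipe assembly, which is where the extra flexibility comes from.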
To top it all off, Cascading is a great path for entry-level college grads to get into big data processing. Admittedly, you need to know a lot to keep an entire distributed data processing system running day in and day out. But you only need the fundamentals of any JVM-based language (Scala, Jython, JRuby, Groovy, Clojure, etc.) to learn how to process big data with Cascading. This is the first time in a very long while that an employer of mine has talked about bringing on interns or new grads.