100% PowerPoint free! Learn at your own pace with 10 hours of video training and 15+ hours of in-depth exercises created by Matt Pouttu-Clarke and available exclusively through Harvard Innovation Labs on the Experfy Training Platform.
Learn to develop industrial-strength MapReduce applications hands-on, with tricks of the trade you will find nowhere else.
Apply advanced concepts such as Monte Carlo Simulations, Intelligent Hashing, Predicate Pushdown, and Partition Pruning.
Learn to produce truly reusable User Defined Functions (UDFs) which stand the test of time and work seamlessly with multiple Hadoop tools and distributions.
Learn the latest industry best practices on how to utilize Hadoop ecosystem tools such as Hive, Pig, Flume, Sqoop, and Oozie in an Enterprise context.
In this video training, Matt explains how hyperdimensional reasoning implicitly plays a part in all big data analyses and how today’s analytics and deep learning can utilize hyperdimensionality to improve accuracy and reduce algorithmic blind spots.
The Software In Silicon Data Analytic Accelerator (SWiS DAX) APIs released by Oracle this week signify a sea change for big data and fast data analytic processing. Natively accelerated common analytic functions, usable from C, Python, and Java, have already shown a 6x lift for a Spark cube-building application. Apache Flink and Apache Drill completely eclipse Spark performance, so it will be very interesting to see upcoming benchmarks of these higher-performing frameworks on SWiS DAX. There is nothing to keep any vendor or group from benchmarking with these APIs, as they work with any C, Python, or Java application.
I’m also looking forward to testing the performance of SWiS DAX on non-partitionable data sets in a big-memory SMP architecture. The easy problems are partitionable; true data discovery should allow any-to-any relations without injecting a priori partitioning assumptions.
It seems that Oracle’s long-standing commitment to developing Sun’s SPARC processors is about to pay off in a very big way for big data and fast data analysts.
To me, this weekend wasn’t about the Panthers vs. Broncos match-up in Super Bowl 50, or about Bernie Sanders winning the New Hampshire primary. Both of those were hoooge, but the real story was the emergence of these parallel yet significant facts:
Google makes its historic first open source contribution to the Apache Foundation in the form of Apache Beam
When combined with Apache Spark’s severe tech-resourcing issues caused by mandatory Scala dependencies, it seems that Apache Beam has all the bases covered to become the de facto streaming analytics API. The cool thing is that with Apache Beam you can switch runtime engines between Google Cloud Dataflow, Apache Spark, and Apache Flink. A generic streaming API like Beam also opens up the market for others to provide better and faster runtimes as drop-in replacements. Google is the perfect stakeholder because they are playing the cloud angle and don’t seem interested in supporting on-site deployments. Hats off to Google, and may the best Apache Beam runtime win!
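To make the runtime switching concrete, here is a minimal word count sketch against the Beam Java SDK (written to the 2.x API surface, which stabilized after this post; file paths are placeholders). The same code runs on Spark, Flink, or Google Cloud Dataflow depending on a single `--runner` flag passed at launch:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class PortableWordCount {
    public static void main(String[] args) {
        // The execution engine is picked at launch time, e.g.:
        //   --runner=SparkRunner, --runner=FlinkRunner, or --runner=DataflowRunner
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply("ReadLines", TextIO.read().from("input.txt"))      // placeholder path
         .apply("CountWords", Count.perElement())
         .apply("Format", MapElements
             .into(TypeDescriptors.strings())
             .via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue()))
         .apply("Write", TextIO.write().to("counts"));             // placeholder path

        p.run().waitUntilFinish();
    }
}
```

Nothing in the pipeline body references a specific engine; that is the whole point of the drop-in replacement market described above.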
Apache Beam from Google finally provides robust unification of batch and real-time big data processing. This framework replaced MapReduce, FlumeJava, and MillWheel at Google. Major big data vendors had already contributed Apache Beam execution engines for both Flink and Spark before Beam even officially entered incubation. Anyone else seeing the future of Big Data in a new light? I know I am…
Supporting unions for fields with multiple types makes for more robust and automated feature extraction. For example, “account numbers” may contain business-relevant strings or spaces due to different data stewards or external data providers.
Rather than transforming all numbers to String, statzall takes the opposite approach and packs Strings into doubles using the open source Unibit Encoding. This allows extremely efficient feature extraction of basic data science primitives using existing hardened APIs such as CERN Colt. With single-threaded performance of 8 million values per second, automated feature extraction on *all* measures becomes possible.
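The details of Unibit Encoding are out of scope here, but the general trick can be illustrated with NaN-boxing: stashing a short ASCII string in the unused payload bits of a quiet NaN so it can travel through double[] arrays and numeric APIs as an opaque token. This is a hypothetical sketch of the idea only, not the actual Unibit Encoding:

```java
public class StringPacking {
    // Quiet-NaN prefix: exponent all ones plus the quiet bit (bit 51).
    private static final long QNAN = 0x7FF8000000000000L;
    private static final long PAYLOAD_MASK = 0x0007FFFFFFFFFFFFL; // low 51 bits

    // Pack up to 7 ASCII chars (7 bits each = 49 bits) into a NaN payload.
    static double pack(String s) {
        if (s.length() > 7) throw new IllegalArgumentException("max 7 ASCII chars");
        long bits = QNAN;
        for (int i = 0; i < s.length(); i++) {
            bits |= ((long) (s.charAt(i) & 0x7F)) << (i * 7);
        }
        return Double.longBitsToDouble(bits);
    }

    // Recover the string; stops at the first zero septet.
    static String unpack(double d) {
        // Raw bits are required: doubleToLongBits() canonicalizes NaNs
        // and would destroy the payload.
        long bits = Double.doubleToRawLongBits(d) & PAYLOAD_MASK;
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 7; i++) {
            int c = (int) ((bits >>> (i * 7)) & 0x7F);
            if (c == 0) break;
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        double boxed = pack("ACCT-7");                              // hypothetical value
        System.out.println(Double.isNaN(boxed) + " -> " + unpack(boxed)); // true -> ACCT-7
    }
}
```

One caveat of this style of encoding: any arithmetic performed on a boxed value canonicalizes the NaN and loses the payload, so packed strings must be carried through, not computed on.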
In addition, statzall supports one-click deployment to your Hadoop YARN cluster using CDAP. Or, if you run in the cloud, you can literally set up a fully automated Internet-scale feature extraction cluster in 5 minutes using Coopr Cloud.
A 10-node load test comparison using Amazon EC2 SSD-based instances. Each test run processed 1 billion vertices and 1 billion edges, and each test was driven by the titan-loadtest project.
The experiment maximizes data locality by co-locating load generation, the Titan graph database, and Cassandra/Hazelcast within the same JVM instance while partitioning data across the cluster. It also explores methods for tuning garbage collection, Titan, and Cassandra for the peer computing use case.
The following components were utilized during the experiment:
Each test iteration has six read-ratio phases, starting at 0% reads (100% writes) and ramping up to 90% reads and 10% writes. For all tests, the persistence implementation executes in the same JVM as Titan to avoid unnecessary context switching and serialization overhead. Tests were conducted within an Amazon placement group to ensure instances resided on the same subnet. Storage was formatted with 4K blocks and used the noop scheduler to improve latency.
For each phase, new vertices were added with one edge linking back to a previous vertex, as sketched below. No update or delete operations were tested.
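For reference, the write path per phase looks roughly like the following, assuming the Titan 0.5.x Blueprints API; the hostname, edge label, and loop bound are illustrative, and the real driver lives in the titan-loadtest project:

```java
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.Vertex;

public class ChainLoader {
    public static void main(String[] args) {
        // Cassandra backend co-located with the load generator, as described above;
        // "127.0.0.1" is illustrative.
        TitanGraph graph = TitanFactory.build()
                .set("storage.backend", "cassandrathrift")
                .set("storage.hostname", "127.0.0.1")
                .open();

        Vertex previous = graph.addVertex(null);
        for (int i = 0; i < 1_000; i++) {        // real runs load 1 billion vertices
            Vertex next = graph.addVertex(null);
            // Each new vertex links back to the previous one with a single edge.
            next.addEdge("linksTo", previous);
            previous = next;
        }
        graph.commit();
        graph.shutdown();
    }
}
```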
Please see the titan-loadtest project above for all Cassandra and Hazelcast settings and configurations used in the test.
Please note: the results are listed in rates of thousands of vertices per second and include the creation of an edge as well as a vertex. Also, the Hazelcast SSD x1 results used a custom flash storage module for Hazelcast developed privately, so those results are not replicable without that module installed.
Hazelcast performed better than Cassandra in all tests and demonstrated an order of magnitude better performance on reads. Surprisingly, Hazelcast also slightly outperformed Cassandra on writes.
Demonstration of efficient garbage collection on JVM heap sizes in excess of 128 gigabytes. Garbage collection behavior is analyzed via a minimax optimization strategy: the study measures maximum throughput versus minimum garbage collection pause to find the optimization “sweet spot” across experimental phases. Results are replicable via well-defined open source experimental methods executable on Amazon EC2 hardware.
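Throughput-versus-pause measurements of this kind can be sampled from inside the JVM with the standard management beans. A minimal sketch (not the study’s actual harness; the sampling interval is arbitrary):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcSampler {
    public static void main(String[] args) throws InterruptedException {
        long lastTime = 0, lastCount = 0;
        while (true) {
            long time = 0, count = 0;
            // Sum across collectors (e.g. the young- and old-generation beans).
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                time += gc.getCollectionTime();   // cumulative ms spent collecting
                count += gc.getCollectionCount(); // cumulative collection count
            }
            long cycles = count - lastCount;
            // Average cost per collection over the sampling interval. Note this is
            // accumulated collection time, not strictly stop-the-world pause time
            // for concurrent collectors.
            double avgMs = cycles == 0 ? 0 : (double) (time - lastTime) / cycles;
            System.out.printf("collections=%d avgCollectionMs=%.1f%n", cycles, avgMs);
            lastTime = time;
            lastCount = count;
            Thread.sleep(10_000); // 10-second sampling interval (arbitrary)
        }
    }
}
```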
Elastic MapReduce makes it so easy to spin up a cluster that it’s also easy to waste money on unused, partially used, or downright unauthorized clusters. Obviously, as a business, Amazon doesn’t put a whole lot of effort into keeping its customers from spending too much money. Amazon has an instance count limit for the entire account, but effectively managing these costs requires getting a lot more granular and providing more detailed information.
That’s why I created this program, which estimates charges for current and historic EMR clusters. It first obtains the normalized instance hours for all clusters running under the current credentials, then divides by the Normalized Compute Time provided in the Amazon EMR FAQ, and finally multiplies by the EMR Hourly Rate to get the charge for each current and historic job flow (cluster). Historic job flows come from the Amazon job flow history, which covers only the past 20 days or so.
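The retrieval side of that calculation can be sketched with the AWS SDK for Java; note this is a simplified illustration, not the actual program, and it uses the newer ListClusters API rather than the older job flow calls. The rate constant is a placeholder, so check the EMR FAQ for current pricing and the per-instance-size normalization factors:

```java
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder;
import com.amazonaws.services.elasticmapreduce.model.ClusterSummary;
import com.amazonaws.services.elasticmapreduce.model.ListClustersRequest;
import com.amazonaws.services.elasticmapreduce.model.ListClustersResult;

public class EmrChargeEstimator {
    // Placeholder: EMR price per normalized (m1.small-equivalent) hour.
    // The real calculation divides by the Normalized Compute Time per
    // instance size from the EMR FAQ before applying the hourly rate.
    private static final double EMR_HOURLY_RATE = 0.015;

    public static void main(String[] args) {
        AmazonElasticMapReduce emr = AmazonElasticMapReduceClientBuilder.defaultClient();
        String marker = null;
        do {
            // Page through all clusters visible to the current credentials.
            ListClustersResult result =
                    emr.listClusters(new ListClustersRequest().withMarker(marker));
            for (ClusterSummary c : result.getClusters()) {
                Integer hours = c.getNormalizedInstanceHours();
                int normalized = hours == null ? 0 : hours;
                double estimate = normalized * EMR_HOURLY_RATE;
                // Tab-delimited to stdout, keyed by cluster id, mirroring the
                // output format described below.
                System.out.printf("%s\t%s\t%d\t%.2f%n",
                        c.getId(), c.getName(), normalized, estimate);
            }
            marker = result.getMarker();
        } while (marker != null);
    }
}
```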
The job flow id is the primary key for this data set. Output is tab-delimited and streamed to stdout. The last column contains a complete dump of the job flow in JSON format. Here is some example output: