Supporting unions for fields that hold more than one type makes feature extraction more robust and easier to automate. For example, “account numbers” may contain business-relevant strings or spaces because they come from different data stewards or external data providers.
Rather than converting every number to a String, statzall takes the opposite approach and packs Strings into doubles using the open-source Unibit Encoding. This allows extremely efficient extraction of basic data science primitives using existing, hardened APIs such as CERN Colt. With single-threaded performance of 8 million per second (8 mm/sec), automated feature extraction on *all* measures becomes possible.
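To make the idea of packing Strings into doubles concrete, here is a minimal sketch. It is not the Unibit Encoding itself, whose details are not described here; it is a hypothetical scheme (the class `PackedField` and the methods `pack`, `unpack`, and `toDouble` are illustrative names, not part of statzall). It assumes ASCII strings of at most 7 characters and stores them as a whole number below 2^53, which a double represents exactly.

```java
public class PackedField {
    // Hypothetical layout: 3 bits of length + 7 chars * 7 bits = 52 bits,
    // which fits in the range of integers a double represents exactly.
    static final int MAX_CHARS = 7;

    /** Packs a short ASCII string into an integer-valued double. */
    static double pack(String s) {
        if (s.length() > MAX_CHARS) {
            throw new IllegalArgumentException("at most " + MAX_CHARS + " characters");
        }
        long bits = s.length(); // the low 3 bits hold the length
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > 0x7F) {
                throw new IllegalArgumentException("ASCII only in this sketch");
            }
            bits |= ((long) c) << (3 + 7 * i);
        }
        return (double) bits;
    }

    /** Reverses pack(): recovers the original string from the double. */
    static String unpack(double d) {
        long bits = (long) d;
        int len = (int) (bits & 0x7);
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) ((bits >>> (3 + 7 * i)) & 0x7F));
        }
        return sb.toString();
    }

    /** Union-style field: keep numbers as numbers, pack everything else. */
    static double toDouble(String raw) {
        try {
            return Double.parseDouble(raw);
        } catch (NumberFormatException e) {
            return pack(raw);
        }
    }

    public static void main(String[] args) {
        String[] accountNumbers = {"1002", "AB 12", "77.5"};
        for (String raw : accountNumbers) {
            System.out.println(raw + " -> " + toDouble(raw));
        }
        System.out.println(unpack(pack("AB 12")));
    }
}
```

In this sketch, packed strings and genuine numbers share the same numeric range, so a caller would have to track separately which values were packed. A real encoding would need to address that, along with longer and non-ASCII strings.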
In addition, statzall supports one-click deployment to your Hadoop YARN cluster using CDAP. If you run in the cloud, you can set up a fully automated, Internet-scale feature extraction cluster in 5 minutes using the Coopr Cloud.