The YARN Application Master provides all the raw data needed to accurately estimate the resources your big data application requires to meet its SLAs when deployed to production. By identifying crucial counters and deriving resource ratios per task and for the application as a whole, we can even infer run times on a larger production footprint from a smaller test environment.
All YARN frameworks provide similar counters; here we will use the popular Hadoop MapReduce framework as an example. The same values displayed on the web interface above are also available directly from the MapReduce API. The following counters drive the capacity plan:
| Counter | Description | Role in the capacity plan |
| --- | --- | --- |
| TOTAL_LAUNCHED_MAPS | Map tasks launched | Used as a divisor to obtain avg map metrics |
| TOTAL_LAUNCHED_REDUCES | Reduce tasks launched | Used as a divisor to obtain avg reduce metrics |
| MILLIS_MAPS | Total time spent by all maps (ms) | Used as a numerator to obtain avg map task time |
| MILLIS_REDUCES | Total time spent by all reduces (ms) | Used as a numerator to obtain avg reduce task time |
The following counters are tallied twice, once across all mappers and once across all reducers; it is important not to mix ratios across task types.
| Counter | Description | Role in the capacity plan |
| --- | --- | --- |
| CPU_MILLISECONDS | CPU time used (ms) | Used as a numerator to obtain avg task CPU |
| COMMITTED_HEAP_BYTES | RAM used (bytes) | Used as a numerator to obtain avg task RAM |
| FILE_READ_OPS | Read operations | Used as a numerator to obtain avg task read ops |
| FILE_WRITE_OPS | Write operations | Used as a numerator to obtain avg task write ops |
| FILE_BYTES_READ | Bytes read | Used as a numerator to obtain avg task read bytes |
| FILE_BYTES_WRITTEN | Bytes written | Used as a numerator to obtain avg task write bytes |
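As a sketch of pulling these values programmatically, the snippet below queries the MapReduce JobHistory Server REST API (`/ws/v1/history/mapreduce/jobs/{jobid}/counters`) and flattens the response into a counter-name lookup. The host name, port, and sample values are illustrative assumptions; adjust them for your cluster.

```python
import json
from urllib.request import urlopen

# Hypothetical JobHistory Server address; 19888 is the default web port.
# The path follows the MapReduce History Server REST API.
HISTORY_URL = "http://historyserver:19888/ws/v1/history/mapreduce/jobs/{job}/counters"

def flatten_counters(payload: dict) -> dict:
    """Index counters by name, keeping map, reduce, and total values separate."""
    out = {}
    for group in payload["jobCounters"]["counterGroup"]:
        for c in group["counter"]:
            out[c["name"]] = {
                "map": c["mapCounterValue"],
                "reduce": c["reduceCounterValue"],
                "total": c["totalCounterValue"],
            }
    return out

def fetch_counters(job_id: str) -> dict:
    """Fetch and flatten the counters for one job from the history server."""
    with urlopen(HISTORY_URL.format(job=job_id)) as resp:
        return flatten_counters(json.load(resp))

# Example payload in the documented response shape (values are illustrative):
sample = {"jobCounters": {"counterGroup": [
    {"counterGroupName": "org.apache.hadoop.mapreduce.JobCounter",
     "counter": [{"name": "TOTAL_LAUNCHED_MAPS",
                  "mapCounterValue": 0,
                  "reduceCounterValue": 0,
                  "totalCounterValue": 40}]}]}}

counters = flatten_counters(sample)
print(counters["TOTAL_LAUNCHED_MAPS"]["total"])  # 40
```

Keeping the map and reduce values separate at this stage makes it easy to honor the warning above about not mixing ratios across task types.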
The primary assumption when inferring between environments is that the data being operated on remains the same. If the input data differs between environments, the results may skew, especially for reducers.
Calculating Resource to Task Type Ratios
By calculating ratios, we can then scale the run time and other resources up and down depending on available task slots and quotas in the target environment.
| Ratio | Formula |
| --- | --- |
| Time spent per map (ms) | MILLIS_MAPS / TOTAL_LAUNCHED_MAPS |
| Time spent per reduce (ms) | MILLIS_REDUCES / TOTAL_LAUNCHED_REDUCES |
| CPU used per map (ms) | CPU_MILLISECONDS (for maps) / TOTAL_LAUNCHED_MAPS |
| CPU used per reduce (ms) | CPU_MILLISECONDS (for reduces) / TOTAL_LAUNCHED_REDUCES |
| RAM used per map (bytes) | COMMITTED_HEAP_BYTES (for maps) / TOTAL_LAUNCHED_MAPS |
| RAM used per reduce (bytes) | COMMITTED_HEAP_BYTES (for reduces) / TOTAL_LAUNCHED_REDUCES |
| Read operations per map | FILE_READ_OPS (for maps) / TOTAL_LAUNCHED_MAPS |
| Read operations per reduce | FILE_READ_OPS (for reduces) / TOTAL_LAUNCHED_REDUCES |
| Write operations per map | FILE_WRITE_OPS (for maps) / TOTAL_LAUNCHED_MAPS |
| Write operations per reduce | FILE_WRITE_OPS (for reduces) / TOTAL_LAUNCHED_REDUCES |
| Read bytes per map | FILE_BYTES_READ (for maps) / TOTAL_LAUNCHED_MAPS |
| Read bytes per reduce | FILE_BYTES_READ (for reduces) / TOTAL_LAUNCHED_REDUCES |
| Write bytes per map | FILE_BYTES_WRITTEN (for maps) / TOTAL_LAUNCHED_MAPS |
| Write bytes per reduce | FILE_BYTES_WRITTEN (for reduces) / TOTAL_LAUNCHED_REDUCES |
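With the counter values in hand, the ratios above reduce to a handful of divisions. A minimal sketch, with illustrative counter values; CPU_MILLISECONDS is shown already split into its map and reduce tallies, as the note above requires:

```python
# Illustrative counter values for one job run; names mirror the
# MapReduce counter names used in the tables above.
counters = {
    "TOTAL_LAUNCHED_MAPS": 40,
    "TOTAL_LAUNCHED_REDUCES": 8,
    "MILLIS_MAPS": 1_200_000,            # total map task time (ms)
    "MILLIS_REDUCES": 480_000,           # total reduce task time (ms)
    "CPU_MILLISECONDS_MAPS": 900_000,    # CPU time across all maps (ms)
    "CPU_MILLISECONDS_REDUCES": 400_000, # CPU time across all reduces (ms)
}

# Per-task ratios: each numerator is divided by the launch count
# for the matching task type only.
ratios = {
    "ms_per_map": counters["MILLIS_MAPS"] / counters["TOTAL_LAUNCHED_MAPS"],
    "ms_per_reduce": counters["MILLIS_REDUCES"] / counters["TOTAL_LAUNCHED_REDUCES"],
    "cpu_ms_per_map": counters["CPU_MILLISECONDS_MAPS"] / counters["TOTAL_LAUNCHED_MAPS"],
    "cpu_ms_per_reduce": counters["CPU_MILLISECONDS_REDUCES"] / counters["TOTAL_LAUNCHED_REDUCES"],
}

print(ratios["ms_per_map"])         # 30000.0 -> each map averages 30 s
print(ratios["cpu_ms_per_reduce"])  # 50000.0 -> each reduce averages 50 s of CPU
```

The remaining ratios (RAM, read/write operations, read/write bytes) follow the identical pattern, so they are omitted for brevity.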
We can now scale parallel task quotas and other resource quotas up and down to calculate their impact on the job in a particular environment. For example, wall clock time for the map phase ranges from all tasks running in parallel ( t = MILLIS_MAPS / TOTAL_LAUNCHED_MAPS ) all the way down to a single task running at a time ( t = MILLIS_MAPS ). The same reasoning applies to every other variable.

For resource constraints, the most severe restriction governs the cost to total run time. For example, if we enforce a quota restricting CPU time to CPU_MILLISECONDS * 0.5, then MILLIS_MAPS increases to MILLIS_MAPS / 0.5. This would occur if, for example, the maximum mappers per node were set to twice the number of cores. Resource to Task Type Ratios therefore come in handy for impact assessment and prediction under any conceivable environmental constraint.
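The scaling argument above can be sketched as a single estimator. This is a simplified model under stated assumptions: tasks are uniform, they schedule in full "waves" of parallel slots, and a CPU quota stretches every task proportionally. The function name and parameters are illustrative, not part of any MapReduce API.

```python
import math

def map_phase_wall_clock(millis_maps: float, total_maps: int,
                         parallel_slots: int, cpu_quota: float = 1.0) -> float:
    """Estimate map-phase wall clock time (ms).

    millis_maps:    MILLIS_MAPS counter (total task time across all maps)
    total_maps:     TOTAL_LAUNCHED_MAPS counter
    parallel_slots: map tasks the target environment runs at once
    cpu_quota:      fraction of nominal CPU each task receives; a quota
                    of 0.5 stretches every task, and thus the phase, by 2x
    """
    ms_per_map = millis_maps / total_maps           # avg time per map task
    waves = math.ceil(total_maps / parallel_slots)  # scheduling "waves"
    return ms_per_map * waves / cpu_quota

# 40 maps averaging 30 s each:
MILLIS_MAPS, TOTAL_MAPS = 1_200_000, 40
print(map_phase_wall_clock(MILLIS_MAPS, TOTAL_MAPS, parallel_slots=40))  # 30000.0   all parallel
print(map_phase_wall_clock(MILLIS_MAPS, TOTAL_MAPS, parallel_slots=1))   # 1200000.0 fully serial
print(map_phase_wall_clock(MILLIS_MAPS, TOTAL_MAPS, 40, cpu_quota=0.5))  # 60000.0   CPU halved
```

The two extremes match the bounds in the text ( t = MILLIS_MAPS / TOTAL_LAUNCHED_MAPS when fully parallel, t = MILLIS_MAPS when serial), and halving the CPU quota doubles the estimate, mirroring the MILLIS_MAPS / 0.5 example.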