Tag Archives: service level agreements

Capacity Planning with YARN

The YARN Application Master provides all the raw data to accurately estimate the resources required for your big data application to meet it’s SLAs when deployed to production.  By identifying crucial counters and deriving resource ratios by task and for the application as a whole we can even infer run times from a smaller test environment to a larger production footprint.

Example of Hadoop MapReduce application counters

All YARN frameworks provide similar counters, however we will be using the popular Hadoop MapReduce framework as an example.  We can also get the same values displayed on the web interface above directly from the MapReduce API.  The following counters drive the capacity plan:

Counter Description Usage
TOTAL_LAUNCHED_MAPS Map tasks launched Used as a divisor to obtain avg map metrics
TOTAL_LAUNCHED_REDUCES Reduce tasks launched Used as a divisor to obtain avg reduce metrics
MILLIS_MAPS Total time spent by all maps (ms) Used as a numerator to obtain avg map task time
MILLIS_REDUCES Total time spent by all reduces (ms) Used as a numerator to obtain avg reduce task time
The following counters calculate twice, once for all Mappers and once for all Reducers, it’s important not to mix ratios across task types.
CPU_MILLISECONDS CPU time used Used as a numerator to obtain avg task CPU
COMMITTED_HEAP_BYTES RAM used Used as a numerator to obtain avg task RAM
FILE_READ_OPS Read Operations Used as a numerator to obtain avg task read ops
FILE_WRITE_OPS Write Operations Used as a numerator to obtain avg task write ops
FILE_BYTES_READ Read Bytes Used as a numerator to obtain avg task read bytes
FILE_BYTES_WRITTEN Write Bytes Used as a numerator to obtain avg task write bytes

The primary assumption when inferring between environments is that the data being operated on remains the same.  If the input data differs between environments then results may skew, especially for reducers.

Calculating Resource to Task Type Ratios

By calculating ratios, we can then scale the run time and other resources up and down depending on available task slots and quotas in the target environment.

Ratio Method
Time spent per map (ms) MILLIS_MAPS / TOTAL_LAUNCHED_MAPS
CPU used per map (ms) CPU_MILLISECONDS (for maps) / TOTAL_LAUNCHED_MAPS
Read Operations per map FILE_READ_OPS (for maps) / TOTAL_LAUNCHED_MAPS
Read Operations per reduce FILE_READ_OPS (for reduces) / TOTAL_LAUNCHED_REDUCES
Write Operations per map FILE_WRITE_OPS (for maps) / TOTAL_LAUNCHED_MAPS
Write Operations per reduce FILE_WRITE_OPS (for reduces) / TOTAL_LAUNCHED_REDUCES
Read Bytes per map FILE_BYTES_READ (for maps) / TOTAL_LAUNCHED_MAPS
Read Bytes per reduce FILE_BYTES_READ (for reduces) / TOTAL_LAUNCHED_REDUCES
Write Bytes per map FILE_BYTES_WRITTEN (for maps) / TOTAL_LAUNCHED_MAPS
Write Bytes per reduce FILE_BYTES_WRITTEN (for reduces) / TOTAL_LAUNCHED_REDUCES

Capacity Scaling

We can now scale parallel task quotas and other resource quotas up and down to calculate the impact on the job for a particular environment. For example, wall clock time for the map phase can vary from all tasks running in parallel ( t = MILLIS_MAPS / TOTAL_LAUNCHED_MAPS ) all the way down to a single task running in parallel ( t = MILLIS_MAPS ). Similarly for all other variables.  For resource constraints, dividing by the most severe restriction governs the cost to total run time.  For example, if we enforce a quota restricting CPU time to CPU_MILLISECONDS * .5 then MILLIS_MAPS will be increased to MILLIS_MAPS / .5.  This would occur if for example the max mappers per node were increased to twice the number of cores.  Resource to Task Type Ratios come in handy for impact assessment and prediction based on any conceivable environmental constraint.

Building the Organic Cloud with Amazon APIs and Java

Well, now that we have a high availability and low latency design for our RESTful services, where can we find the weakest link in maintaining our service level agreements?  The answer often lies in the crufty, hidden, and often poorly tested “glue code” lying between (and below) the systems themselves.  This code rears it’s head when a critical notification fails to fire, or a cluster fails to scale automatically as designed. How do we get this code cleaned up, out in the open, and rock solid?  The answer, of course, is to apply agile principles to our infrastructure.

The field of system programming desperately needs an update into the Agile Age!  And why not?  Popular cloud management APIs provide complete command and control via Agile-enabled languages such as Java…  And many efforts are under way to make cloud APIs standardized and open source.  So let’s delve into some details about how we can transform shell script spaghetti into something more workable!

Automated Build and Deployment

System programs, like application programs, should follow the same automated build and deployment process as the rest of the application portfolio.

Automated Unit Test and Mocking

I long to see the day that system programs have automated unit test coverage metrics just like the rest of the civilized world. Let’s test that fail over logic in a nice safe place where it can’t ruin lives!  Think you can’t mock all failure scenarios?  Think again.  With 100% Java APIs available from many cloud providers (not to mention other test-friendly languages) we could impersonate any type of response from the cloud services. Friends don’t let friends code without testing.

Change Management Tied to Code Commits

Let’s all admit it: at one point or another we have all been tempted to cheat and “tweak” things in production.  Often it doesn’t seem possible that a particular change could cause a problem.  However, in some system programming environments “tweaking” has become a way of life because people assume there is no way to replicate the production environment.  In a cloud environment this is simply not the case.  All cloud instances follow a standardized set of options and configurations commanded by a testable language.  If we create a Java program or Python script to build and maintain the production environment, how trivial is it to re-create the same environment?  In this case tying all system program code commits directly to a Jira issue becomes possible, and complete end-to-end management of environment change also becomes possible.  All changes are code changes.  Wow, what a concept!

Testable Components

By building testable components we create layers of abstraction which hide and manage complexity.  We can now confidently build intelligent instances which dynamically configure themselves, hand in hand with the software which runs on them.  Testability makes this not only feasible, but in fact imperative as a competitive advantage.

Organic Cloud Systems

So what happens when we standardize an Agile process for system programs in the same way we do for application programs?  Repeatable results…  And we also largely take the burden of busy work off of infrastructure folks so they can focus on engineering the next generation of great systems rather than doing so much fire fighting at 3 am.  Not to say this solves all of those problems, but most repeatable IT problems have automatable solutions.  Most likely we can automate anything which we can put in a run sheet using cloud APIs.  We then have organic cloud systems which automatically configure themselves and adapt to environmental changes and failure scenarios without sacrificing reliability.