CouchDB by Example
The Apache pages provide some useful examples of how to use CouchDB. However, I found this cool Mu Dynamics simulator, which shows how to query CouchDB by example. The simulator quickly clarified for me how Map/Reduce query processing works on concrete problems. A much faster way to kick the tires than building and installing from scratch…
Scalability and Performance
The Apache wiki provides some pointers on CouchDB performance. Striping (RAID 0) seems to provide the best performance improvement in the benchmarks I’ve seen, especially in environments with many users and a lot of random reads (see below). CouchDB doesn’t support parallel query out of the box, but opening multiple sockets to the same server allows parallel access to data for a single client.
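A minimal sketch of that client-side parallelism, assuming a hypothetical `fetch_doc` function that stands in for an HTTP GET against a CouchDB document (no real server is contacted here):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_doc(doc_id):
    # Placeholder for a real HTTP GET, e.g. http://couch:5984/db/<doc_id>.
    return {"_id": doc_id}

def fetch_many(doc_ids, workers=4):
    # Each worker thread can hold its own socket to the same server, so a
    # single client gets parallel data access even though the server does
    # not parallelize an individual query.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_doc, doc_ids))
```

Because `pool.map` preserves input order, results come back aligned with the requested ids even though the fetches run concurrently.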
The map/reduce query capability produces materialized views with automatic incremental refresh: if the underlying data hasn’t changed, querying a view does not re-query the base data. By default, the first query after a committed change to the base data triggers the incremental refresh of the view (this is configurable).
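To make the view mechanics concrete, here is a toy sketch of how a map/reduce view turns documents into a materialized index. This is illustrative only, not CouchDB internals: the `map_fn`/`reduce_fn` names and the sample documents are my own, mimicking a CouchDB map function that emits key/value pairs and a `_count`-style reduce:

```python
def map_fn(doc):
    # Like a CouchDB map function: emit (key, value) pairs per document.
    if "tag" in doc:
        yield (doc["tag"], 1)

def reduce_fn(values):
    # Like the built-in _count / _sum reduce functions.
    return sum(values)

def build_view(docs):
    # Group emitted values by key, then reduce each group; CouchDB
    # persists this result and only re-runs it for changed documents.
    index = {}
    for doc in docs:
        for key, value in map_fn(doc):
            index.setdefault(key, []).append(value)
    return {key: reduce_fn(vals) for key, vals in index.items()}

docs = [
    {"_id": "1", "tag": "db"},
    {"_id": "2", "tag": "db"},
    {"_id": "3", "tag": "web"},
]
view = build_view(docs)  # {"db": 2, "web": 1}
```

The key point is that the reduced index is the stored artifact; repeat queries read it directly instead of rescanning `docs`.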
In terms of scalability, multi-master replication provides multiple copies of the data across machines and/or data centers. With the stateless RESTful interface, load balancing doesn’t get any easier… Sharding is not supported out of the box either, but solutions like Lounge and Pillow look promising for supporting larger data sets.
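As a rough illustration of what a sharding proxy in the Lounge/Pillow mold does, here is a hash-based routing sketch. The function name and scheme are hypothetical, not taken from either project:

```python
import hashlib

def shard_for(doc_id, num_shards):
    # Hash the document id and map it onto a fixed number of shards.
    # Any client computing this gets the same answer, so reads and
    # writes for a given document always route to the same node.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

A proxy sitting in front of several CouchDB nodes can use a function like this to fan a single logical database out across machines, while each node remains an ordinary standalone CouchDB.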
Check out this blog post on how to spin up a CouchDB cluster on Amazon EC2 using Puppet. CouchDB seems an ideal candidate for Elastic Load Balancing: the ability to automatically spin up instances as demand increases should really make it live up to its name!
Potential Use Cases With Hadoop
Using CouchDB to serve key/value reference data in an Inverse REST architecture seems an ideal use case. Since we could maintain the CouchDB data through a separate GUI, we could correct reference data or metadata dynamically while the job is running. No need to wait until a large job has crashed and burned to fix reference data problems. We would probably also want an LRU cache on the client side (in the Hadoop job) to minimize network hits.
Although it may seem counter-intuitive, leveraging the multi-layered REST caching and load balancing architecture to move small data close to big data could make a lot of sense, especially in a cloud environment designed to adapt dynamically to large spikes.