Since April I have been hacking Erlang at Campanja, a Stockholm-based startup offering a service to optimize search engine marketing. Simply put, Campanja helps people who advertise using Google AdWords to buy the kind of clicks that provide value to their business, at a price that makes sense for them.
Our system runs entirely distributed on top of Amazon's Elastic Compute Cloud, so keeping close tabs on our many Erlang nodes is essential to understanding what is and has been going on.
We have been using Graphite for quite a while to keep graphs of various metrics. Recently we picked up the excellent Riemann monitoring system. Since we want to instrument a lot of metrics across our system, it is important that doing so is straightforward and mostly automatic.
The Riak project has adopted Folsom for collecting metrics in its 1.2 version. They also contributed code for a new type of histogram sampling (slide_uniform), which keeps a constant number of data points for a given metric over a specific sliding window in time. This means one can, for example, send timing information for each database write to such a histogram metric, even if the writes happen at a high rate. Only a representative subset will be stored, in a constant amount of memory. Estatsd, which we previously used, has no such sampling. In a similar situation it therefore consumes increasing amounts of memory and, if you send data quickly enough as we did in some tests, keeps doing so until none is left and the virtual machine crashes.
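To make the idea of bounded sampling over a sliding window concrete, here is a rough sketch in Python. It is not Folsom's Erlang code: the class name, the per-second buckets and the parameters `window` and `size` are assumptions made for illustration. Each second gets its own reservoir of at most `size` values (filled with the classic reservoir sampling scheme), and buckets older than `window` seconds are dropped, so memory stays below `window * size` values no matter how fast data arrives.

```python
import random
import time


class SlidingUniformSample:
    """Illustrative model: a uniform reservoir per second, kept for `window` seconds."""

    def __init__(self, window=60, size=100, clock=time.time, rng=None):
        self.window = window          # how many seconds of history to keep
        self.size = size              # maximum number of values kept per second
        self.clock = clock            # injectable clock, handy for testing
        self.rng = rng or random.Random()
        self.buckets = {}             # second -> [values_seen, reservoir]

    def update(self, value):
        now = int(self.clock())
        self._trim(now)
        bucket = self.buckets.setdefault(now, [0, []])
        bucket[0] += 1
        seen, reservoir = bucket
        if len(reservoir) < self.size:
            # The reservoir is not full yet: keep every value.
            reservoir.append(value)
        else:
            # Reservoir sampling: the new value replaces a stored one
            # with probability size / seen.
            slot = self.rng.randrange(seen)
            if slot < self.size:
                reservoir[slot] = value

    def values(self):
        """Return all sampled values that are still inside the window."""
        self._trim(int(self.clock()))
        return [v for _, reservoir in self.buckets.values() for v in reservoir]

    def _trim(self, now):
        # Drop every bucket that has fallen out of the sliding window.
        cutoff = now - self.window
        for second in [s for s in self.buckets if s <= cutoff]:
            del self.buckets[second]
```

Feeding a million timings per second into such a structure still leaves at most `size` values per second in memory, which is the property that an unsampled collector lacks.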
On top of the sampling, Folsom provides excellent functions to compute statistical summaries of the collected data.
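As an illustration of what such a summary can look like, the following Python sketch reduces a list of sampled values to a handful of numbers. The function name, the key names and the nearest-rank percentile method are my own choices for this example, not Folsom's API or its exact definitions.

```python
import statistics


def summarize(values, percentiles=(75, 95, 99)):
    """Reduce sampled values to min, max, mean, median and a few percentiles."""
    ordered = sorted(values)
    if not ordered:
        return {}
    n = len(ordered)
    summary = {
        "min": ordered[0],
        "max": ordered[-1],
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
    }
    for p in percentiles:
        # Nearest-rank percentile: ceil(p * n / 100), using integer arithmetic.
        rank = max(1, (p * n + 99) // 100)
        summary["p%d" % p] = ordered[rank - 1]
    return summary
```

Because the summary is computed from the bounded sample rather than from every observation, its cost does not grow with the rate at which data is reported.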
We decided that we wanted to use all of the above together, so I implemented a new application called Folsomite, which we also released on GitHub. Folsomite periodically aggregates all Folsom metrics present, as well as a couple of VM statistics, and forwards them to both Graphite and Riemann. If we want instrumentation for a node, all we have to do is add Folsomite, and data for all of the metrics we capture via Folsom will automatically be available for graphing and monitoring.
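The forwarding step towards Graphite can be pictured with a small Python sketch. Graphite's plaintext protocol expects one line per data point in the form `path value timestamp`; everything else here (the function names, the `prefix` naming scheme, the example metric names) is hypothetical and not taken from Folsomite itself.

```python
import socket
import time


def graphite_lines(metrics, prefix, timestamp=None):
    """Format a dict of metric values as Graphite plaintext protocol lines."""
    ts = int(time.time() if timestamp is None else timestamp)
    return "".join(
        "%s.%s %s %d\n" % (prefix, name, value, ts)
        for name, value in sorted(metrics.items())
    )


def send_to_graphite(host, port, payload):
    """Send an already formatted payload to a Graphite plaintext listener."""
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(payload.encode("ascii"))
```

A periodic flush would then call `graphite_lines` with the current summaries and hand the result to `send_to_graphite`, for example once every few seconds per node.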
If you also care about details on what is going on in your Erlang nodes - and really, you should - give it a try yourself!