TL;DR

This is not a 15-minute tutorial on ELK; for that, please refer to the excellent article Complete ELK on AWS in 15 minutes. Instead, you'll see a common problem (having to grep more than one log on more than one server) and how we overcame it.


Finding things in distributed log files is a daunting task, not to mention that it is no fun to have 10-15 SSH consoles open, running grep all over the place.

That was the scenario we had at Bemobi a few months ago, and we're now in the process of eradicating it. But first, let's get a few points out of the way:

  • This is not a tutorial on how to configure the whole ELK stack
  • This is not a best practice to follow and YMMV

To end the madness of hunting for a few log lines we wanted, Diogo Pessoa and I began to study the ELK stack and a few other alternatives (Kafka, anyone?).

We have a few platforms that do event logging very well. These are generally transactional ones, which helps us have one big log line for the whole transaction instead of several lines for a single event (for more about logging, I advise you to read Jay Kreps' The Log: What every software engineer should know about real-time data's unifying abstraction). With that in mind, we opted to start our experiment with this type of log, which is easier to parse and easier to reason about. One example of such a log line is below:

2014-10-01 17:18:31,717 ERROR [CLASS-OF-APP] (HTTP-WORKER-POOL-NODE-ID) app[APPID] ...several other K[V]... responseCode[MAPPED_RESPONSE_CODE] responseTime[TIME_IN_MILIS] Reason: REASON_OF_ERROR  

The line shows some of the info stored in our log and how the data is structured (as a series of keys and values in the format KEY[VALUE]), making it easy to reason about and to query later. This log is written locally on dozens of machines for a particular service, so whenever we needed to find an error, a grep was issued across the servers.
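As a rough illustration of how easy this format is to parse, here is a minimal Python sketch; the sample line and field values are made up for the example, not taken from our production logs:

```python
import re

# Matches KEY[VALUE] pairs, e.g. app[APPID] or responseTime[350]
KV_PATTERN = re.compile(r'(\w+)\[([^\]]*)\]')

def parse_kv_line(line):
    """Return a dict with every KEY[VALUE] pair found in a log line."""
    return {key: value for key, value in KV_PATTERN.findall(line)}

# Hypothetical sample line, following the format shown above
sample = ("2014-10-01 17:18:31,717 ERROR [CLASS-OF-APP] (HTTP-WORKER-POOL-1) "
          "app[1234] responseCode[500] responseTime[350] Reason: timeout")

print(parse_kv_line(sample))
# {'app': '1234', 'responseCode': '500', 'responseTime': '350'}
```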

After choosing ELK (Elasticsearch, Logstash and Kibana), we went with a basic setup that forwarded the local logging straight to it. Disaster :(. The CPU usage of our platform went through the roof.

Next we turned to syslog-ng to ship the logs to RabbitMQ; the asynchronous messaging worked smoothly and helped relieve the load on the system. With that in mind, we now have this basic setup:

Quick edit: As @suyograo mentioned, you can use Kafka, Flume or anything else instead of RabbitMQ to ship things to Logstash.

ELK test architecture

ELK test setup
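To make the shipping idea concrete, here is a minimal sketch of the producer side, assuming a local RabbitMQ broker and a queue named logs; both the queue name and the use of the pika library are illustrative (in our case syslog-ng did this job, with Logstash consuming from the queue on the other end):

```python
import pika

# Illustrative only: in our setup syslog-ng published the log lines;
# this sketch just shows the asynchronous shipping idea with pika.
connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.queue_declare(queue='logs', durable=True)  # hypothetical queue name

def ship(line):
    """Publish a single log line to the queue instead of grepping it later on disk."""
    channel.basic_publish(
        exchange='',
        routing_key='logs',
        body=line,
        properties=pika.BasicProperties(delivery_mode=2),  # persist messages
    )

ship("2014-10-01 17:18:31,717 ERROR ... responseCode[500] responseTime[350]")
connection.close()
```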

This basic setup provided amazing results, and we're now able to visualize our service and build graphs like these:

Events per second for a service:
Evts per second

Errors for a specific client:
Errors per client

Calls / 5min interval for a specific client:
5min evt per client

So instead of sending a grep command to dozens of servers, we just issue a query in Kibana and use faceting and aggregation to get a chart or the data for a specific scenario.
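Under the hood, Kibana is just building Elasticsearch queries. As a hedged sketch, a query like the one below is roughly what the "calls per 5-minute interval for a client" chart boils down to; the index pattern, field names and time range are assumptions for the example, not our actual mapping:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])  # assumed local test instance

# Hypothetical index pattern and field names
query = {
    "query": {
        "bool": {
            "must": [
                {"term": {"app": "APPID"}},                     # a specific client
                {"range": {"@timestamp": {"gte": "now-1h"}}},   # recent events only
            ]
        }
    },
    "aggs": {
        "calls_per_5min": {
            "date_histogram": {"field": "@timestamp", "interval": "5m"}
        }
    },
    "size": 0,  # we only care about the aggregation buckets
}

result = es.search(index="logstash-*", body=query)
for bucket in result["aggregations"]["calls_per_5min"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```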

This is a huge win both for us on the dev/ops side and for the business, which can now answer common questions on its own, whereas before we would need to assign a team to do the grep associated with the search.

The solution is working well for us in this first tryout. Meanwhile, Kafka has taken off and is now the de facto standard for log shipping, and it will probably be tested in our environment at some point.

The ELK stack is great and works out of the box. If you find yourself in a scenario like the one I was in for a long time, please consider using it, as it's going to save you a lot of time.

Related Links: