Building a Log-Management & Analytics Solution for Your Startup
Background:
As described in an earlier post, I am working with an early-stage startup, so one of my responsibilities is to architect, build, and manage the cloud infrastructure for the company. Even though I have designed, built, and maintained cloud infrastructure in my previous roles, this one was really challenging and interesting, due in part to the fact that the organisation is a high-growth #traveltech startup and hence:
- The architecture landscape is still evolving,
- Performance criteria for the previous month look like the minimum acceptable criteria for the next (in terms of itineraries automated, ratings, mappings, etc.),
- The sheer volume of user growth,
- The addition of partner inventories, which increases capacity requirements by an order of magnitude,
And several others. Somewhere down the line, after the infrastructure, code pipeline, and CI are set up, you reach a point where managing logs (read: triggering interventions, analysis, storage, archival, retention) across several infrastructure clusters like development/testing, staging, and production becomes a bit too much to handle.
Enter Log Management & Analytics
I had worked my way up from a simple tail/multitail to Graylog aggregation of 18 server logs, including app servers, database servers, API endpoints, and everything in between. But, as my honoured former colleague Mr. Naveen Venkat (CPO of Zarget) used to mention in my days with Zarget, there are no “Go-To” persons in a start-up. You “Go-Figure” it out yourself!
There is definitely no “one size fits all” solution, and especially in a start-up environment, you are always running behind features, timelines, or customers (scope, timeline, or cost in the conventional PMI model).
So, after some due research to account for the recent advances in Logstash and Beats, I narrowed down the possible contenders that could power our little log management system. They are:
- ELK Stack
- Graylog
- Logstash
(I did not consider anything exotic, or anything that involves us paying (in the future) more than what we pay for it in the first year. So, some great tools like Splunk, Nagios, LogPacker, and LogRhythm were not considered.)
Evaluation Process:
I started experimenting with Graylog, due to familiarity with the tool, and configured it the best way I felt appropriate at that point in time. However, the collector I had used (Sidecar) had a major problem sending files over 255 KB when the interval was less than 5 seconds.
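To put that limit in context, a quick back-of-the-envelope check is to measure how many bytes a log file grows in a 5-second window. This is only a sketch I am adding for illustration; the path and the way the check is done are placeholders, not part of Sidecar or Graylog tooling.

```python
import os
import time

LOG_PATH = "/var/log/myapp/app.json.log"   # placeholder path for this sketch
WINDOW_SECONDS = 5                          # the interval that gave Sidecar trouble
SIZE_LIMIT = 255 * 1024                     # the 255 KB mark mentioned above

def bytes_written_in_window(path, window):
    """Return how many bytes a log file grew during `window` seconds."""
    start = os.path.getsize(path)
    time.sleep(window)
    return os.path.getsize(path) - start

if __name__ == "__main__":
    grown = bytes_written_in_window(LOG_PATH, WINDOW_SECONDS)
    if grown > SIZE_LIMIT:
        print(f"{grown} bytes in {WINDOW_SECONDS}s: above the {SIZE_LIMIT}-byte mark")
    else:
        print(f"{grown} bytes in {WINDOW_SECONDS}s: within the limit")
```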
One of the main use-cases for us is to ingest the actual JSON data from multiple sources. (We run a polynomial regression across multiple sources, and use the nth derivatives to do further business operations.) When the daily logs you need to export run upwards of 500 MB for one app (JSON logs), then add other application logs, web servers, load balancers, CI (Jenkins), database, Redis and … yes, you get the point?
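To make that regression step concrete, here is a minimal sketch of the kind of computation that sits downstream of the ingested JSON logs. It is illustrative only: the field names (`ts`, `value`), the polynomial degree, and the use of NumPy are assumptions for the example, not our actual pipeline.

```python
import json
import numpy as np

def fit_and_differentiate(json_lines, degree=3, n=2):
    """Fit a polynomial to (timestamp, value) points pulled from JSON log
    lines and return the n-th derivative evaluated at those timestamps.

    `json_lines` is an iterable of JSON strings; the field names 'ts' and
    'value' are hypothetical placeholders for this sketch.
    """
    records = [json.loads(line) for line in json_lines]
    ts = np.array([r["ts"] for r in records], dtype=float)
    values = np.array([r["value"] for r in records], dtype=float)

    # Fit a polynomial of the given degree to the time series.
    coeffs = np.polyfit(ts, values, degree)

    # Take the n-th derivative of the fitted polynomial and evaluate it.
    nth_deriv = np.polyder(np.poly1d(coeffs), n)
    return nth_deriv(ts)

# Example usage with a handful of fake log lines:
lines = [
    '{"ts": 1, "value": 10.0}',
    '{"ts": 2, "value": 14.5}',
    '{"ts": 3, "value": 21.0}',
    '{"ts": 4, "value": 30.5}',
    '{"ts": 5, "value": 43.0}',
    '{"ts": 6, "value": 59.5}',
]
print(fit_and_differentiate(lines, degree=3, n=2))
```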
Upon further investigation, the Sidecar collector was actually not the culprit. Our architecture had accounted for several things, but, by design, we used to hit momentary peaks in CPU utilisation during the “Merges”.
So, once the CPU hit the 100% mark, Sidecar started behaving very differently. But we ultimately fixed it with a patched version of Sidecar.
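For what it is worth, the symptom was easy to spot once we knew what to look for. Below is a minimal monitoring sketch, assuming `psutil` is installed; only the 100% CPU mark comes from the incident above, and the polling interval, sample count, and warning text are made up for illustration.

```python
import psutil

# Hypothetical thresholds for this sketch; only the "CPU pegged" condition
# corresponds to the incident described above.
CPU_THRESHOLD = 99.0      # percent
SAMPLE_INTERVAL = 1.0     # seconds between samples
SUSTAINED_SAMPLES = 5     # consecutive samples that count as a sustained peak

def watch_cpu_peaks():
    """Print a warning when CPU utilisation stays pegged for several samples,
    roughly the condition under which our collector started misbehaving."""
    consecutive = 0
    while True:
        usage = psutil.cpu_percent(interval=SAMPLE_INTERVAL)
        if usage >= CPU_THRESHOLD:
            consecutive += 1
            if consecutive >= SUSTAINED_SAMPLES:
                print(f"WARNING: CPU pegged at {usage:.1f}% for "
                      f"{consecutive} samples; log collector may stall")
        else:
            consecutive = 0

if __name__ == "__main__":
    watch_cpu_peaks()
```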