Building a Log-Management & Analytics Solution for Your Startup
Background:
As described in an earlier post, I run Engineering at an early-stage #traveltech #startup called Itilite, so one of my responsibilities is to architect, build, and manage the cloud infrastructure for the company. Even though I had designed, built, and maintained cloud infrastructure in my previous roles, this one was genuinely challenging and interesting, due in part to the fact that the organisation is a high-growth #traveltech startup, and hence:
- The architecture landscape is still evolving,
- Performance criteria for the previous month look like the minimum acceptable criteria for the next,
- The sheer volume of user growth, plus the growth of traffic per user,
- The addition of partner inventories, each of which increases capacity by an order of magnitude,
And several others. Somewhere down the line, after the infrastructure, code pipeline, and CI are set up, you reach a point where managing logs (read: triggering interventions, analysis, storage, archival, retention) across several infrastructure clusters, such as development/testing, staging, and production, becomes overwhelming.
Enter Log Management & Analytics
I had worked my way up from a simple tail/multitail to Graylog aggregation of 18 server logs, spanning app servers, database servers, API endpoints, and everything in between. But, as my honoured former colleague Mr. Naveen Venkat (CPO of Zarget) used to mention in my days with Zarget, there are no “go-to” persons in a start-up. You “go figure” it out yourself!
There is definitely no “one size fits all” solution, and especially in a start-up environment, you are always running behind features, timelines, or customers (scope, time, or cost in the conventional PMI model).
So, after some due research to account for the recent advances in Logstash and Beats, I narrowed down the possible contenders that could power our little log-management system:
- ELK Stack — build it from scratch, but keep full flexibility.
- Graylog — out-of-the-box functionality, but you may have to tune individual components to suit your needs.
- Fluentd — an entirely new log-management paradigm; interesting, and we explored it a bit.
(I did not consider anything exotic, or anything that would involve us paying (now or in the future) more than what we pay for it in the first year. So, some great tools like Splunk, Nagios, LogPacker, and LogRhythm were not considered.)
Evaluation Process:
I wrote an Ansible script to create a replica environment and pull in the necessary configurations, and used a previously written load-test job to simulate a typical work hour (a sketch of such a job follows). This same configuration was used for each of the frameworks/tools considered.
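Purely for illustration, here is a minimal sketch of what such a load-test job might look like, built on the Locust load-testing library. The endpoints, task weights, and wait times are placeholders, not our actual traffic profile:

```python
# Sketch of a "typical work hour" load-test job using Locust.
# Endpoints and weights are hypothetical stand-ins for real traffic.
from locust import HttpUser, task, between


class TypicalWorkHourUser(HttpUser):
    # Pause 1-5 seconds between requests, roughly mimicking real users.
    wait_time = between(1, 5)

    @task(3)  # searches happen ~3x as often as bookings
    def search_inventory(self):
        # Hypothetical search endpoint across partner inventories.
        self.client.get("/api/v1/search?src=BLR&dst=DEL")

    @task(1)
    def create_booking(self):
        # Hypothetical booking endpoint.
        self.client.post("/api/v1/bookings", json={"itinerary_id": "demo-123"})
```

Pointing this at each replica environment (for example, `locust -f workhour.py --host https://staging.example.com`) gives every candidate stack the same hour of simulated load.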
I started experimenting with Graylog, due to familiarity with the tool, and configured it the best way I felt appropriate at that point in time.
Slight setback:
However, the collector I had used (Sidecar with Filebeat) had a major problem sending files over 255 KB when the collection interval was under 5 seconds: the packets that were to be sent to Elasticsearch never made it, and the pile-up caused a major issue for application stability.
One of the main use cases for us is to ingest XML/JSON data from multiple sources. (We run a polynomial regression across the sources and use the nth derivatives for further business operations; a toy illustration follows.)
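To make that concrete, here is a toy sketch of the regression-and-derivative step. The data, polynomial degree, and variable names are all fabricated for illustration; only the shape of the computation matches what is described above:

```python
# Illustrative only: fit a polynomial to samples merged from multiple
# sources, then take its derivatives. All numbers here are made up.
import numpy as np

t = np.arange(10, dtype=float)                        # time steps
samples = np.array([100, 98, 97, 99, 104,
                    110, 115, 117, 116, 112], float)  # hypothetical values

coeffs = np.polyfit(t, samples, deg=3)  # fit a cubic across the sources
d1 = np.polyder(coeffs, m=1)            # first derivative: rate of change
d2 = np.polyder(coeffs, m=2)            # second derivative: acceleration

# Evaluate the derivatives at the latest time step.
print(np.polyval(d1, t[-1]), np.polyval(d2, t[-1]))
```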
When the daily logs you need to export run upwards of 5 GB for one app (JSON logs), then add multiple APIs, some microservice application logs, web servers, load balancers, CI (Jenkins), the database query log, the bin-log, Redis, and … yes, you get the point?
Upon further investigation, the Sidecar collector was actually not the culprit. Our architecture had accounted for several things, but by design we used to hit momentary peaks in CPU utilisation during the “merges”, and all of those were “NICE” loads! (in our defence)
So, once the CPU hit the 100% mark, Sidecar started behaving very differently. We ultimately fixed it with a patched version of Sidecar, and by shifting to NXLog.
The experiment with ELK was a different beast in itself, as provisioning and configuration took a lot more time than I was comfortable with, so I switched to the AWS “packaged service”: we deployed the Elasticsearch domain in AWS, fired up a couple of Kibana and Logstash instances, and connected them all (after what appeared to be forever). It worked like a charm, and I was able to get all the required information into Kibana. One downside is that you need to plan your Elasticsearch indices according to how your log sources will grow (a sketch of that planning is below); for us, that was impractical.
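For a sense of what that index planning involves, here is a hedged sketch of pre-creating one day’s index over the Elasticsearch REST API. The domain endpoint, shard counts, and field mapping are placeholders, and it assumes an ES 7+ mapping format and an access policy that permits unsigned requests:

```python
# Sketch: pre-creating a daily index on an Elasticsearch domain.
# The endpoint, shard counts, and fields are placeholders.
import requests

index = "app-logs-2019.06.01"  # one index per day, per naming plan
body = {
    "settings": {"number_of_shards": 2, "number_of_replicas": 1},
    "mappings": {
        "properties": {
            "timestamp": {"type": "date"},
            "source": {"type": "keyword"},
            "message": {"type": "text"},
        }
    },
}

resp = requests.put(f"https://my-es-domain.example.com/{index}", json=body)
resp.raise_for_status()
print(resp.json())
```

Multiply that by every log source, its expected growth, and shard-sizing rules of thumb, and the up-front planning overhead adds up quickly.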
Fluentd was an excellent platform for normalising your logs, but it, too, depended on Kibana/Elasticsearch as the analysis front-end.
So, finally, we settled on good old Graylog.
Advantages of Graylog
The tool fit perfectly into our workflow and evolving environment:
- Graylog is free & open-source software, so we won’t have to pay now or in the future.
- Its trigger actions and notifications complement our monitoring, just a bit deeper!
- With the error stack traces received from Graylog, engineers understand the context of an issue in the source code. This saves time and effort in debugging/troubleshooting and bug fixing.
- The tool has a powerful search syntax, so it is easy to find exactly what you are looking for, even if you have terabytes of log data. Search queries can be saved, and for really complex scenarios you can write an Elasticsearch query and save it in the dashboard as a function.
- Graylog offers archiving functionality, so everything older than 30 days can be moved to slow storage and re-imported into Graylog when the need arises (for example, when the dev team needs to investigate a certain event from the past).
- Java, Python & Ruby applications can be easily connected to Graylog, as there are out-of-the-box libraries for this (a minimal Python sketch follows).
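As an example of that last point, here is a minimal sketch of wiring a Python application to Graylog over GELF using the graypy library. The hostname and port are placeholders for your own Graylog GELF UDP input:

```python
# Minimal sketch: ship application logs to Graylog over GELF (UDP).
# "graylog.internal" and 12201 are placeholders for your GELF input.
import logging

import graypy

logger = logging.getLogger("itilite-app")  # hypothetical app logger name
logger.setLevel(logging.INFO)
logger.addHandler(graypy.GELFUDPHandler("graylog.internal", 12201))

# Fields passed via `extra` arrive in Graylog as searchable GELF fields.
logger.info("booking created", extra={"booking_id": "demo-123"})
```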
#logmanagement #analytics #startup #hustle #opensource #graylog #elk