Designing and developing applications is hard. Maintaining applications and ensuring they’re secure, available and work as intended? Even harder. Without sounding pessimistic, we know that eventually things may not go as planned - an application could crash, a server could go down, a bug could slip through our tests and negatively affect functionality, or worse, a data breach could occur. Auditing and monitoring are critical to preparing for these events. Establishing an audit trail and maintaining an audit log are important for a variety of development, operational and compliance-oriented reasons. They help DevOps teams troubleshoot availability and performance-related concerns, developers debug functional issues as they arise, security teams quickly detect and contain breaches, and compliance teams adhere to the minimum necessary standard set by market and regulatory demands.
Aptible Deploy comes with built-in support for easily aggregating your container, SSH session and HTTP(S) endpoint logs and routing them to your destinations of choice for record-keeping and future analysis, be it in popular external destinations like Datadog, SumoLogic and PaperTrail, or to a self-hosted ElasticSearch database.
Since 2014, Aptible log drains have been used by customers to send hundreds of millions of log lines to various destinations. While the majority of our customers were able to aggregate their logs without a hiccup, we also heard from a few who experienced issues when the volume of logs being generated was extremely high. These issues ranged from inconvenient delays in logs reaching their destinations to packet loss during periods of high throughput.
Such issues weren’t just frustrating for our customers when trying to debug their production issues, but also for our reliability engineers troubleshooting them across a wide variety of use cases and destinations.
So we decided to fix this.
Engineering a solution for the above problems involved accounting for the varying logging volume across different customer use cases, optimizing around throughput limitations of third-party destinations, and ensuring we could provide maintainable, scalable support on an ongoing basis. We had to take a close look at the engineering choices we made for past versions of Aptible log drains, and decide whether to build on them or re-architect from the ground up.
The log processing and delivery engine we used in previous versions was Logstash, an open-source ingest pipeline from Elastic. While we found Logstash’s parsing and output configuration options acceptable for our needs, over time we realized that not only was performance lacking, but Logstash was also opaque and offered limited monitoring metrics for the customer use cases we cared about. For example, the Aptible engineering team was limited to using CPU usage as the leading metric for monitoring the reliability of log drains. The problem is that high CPU usage isn't always a symptom of a drain not working. This led to false alarms, and when there really was a drain issue, we had insufficient data to dig in further and remediate it. As our customers grew with Aptible, this logging engine became harder to maintain, monitor and scale efficiently.
Based on this, we decided to move to an alternative and redesign our logging engine. During our exploration of alternatives to Logstash, we quickly found that FluentD, another open-source data collector, gave us everything our previous log collector provided and more, namely better performance and more robust metrics. Using FluentD, we’ve developed and released the next version of log drains.
“I've been watching our logs since the new version of log drains was released to our production environment, and the improvements have been substantial. I'm seeing no drift in log timestamps between when they were generated and when they were received by PaperTrail, which used to be off by about 30 minutes before this release. Everything I've seen has looked sequential as well. I can make an HTTP request to an app and see it come through PaperTrail within seconds. If log performance stays this way, it's a huge win!” - James Dempsey, Software Developer at Aidin
The log drains of all Aptible accounts have been updated to the latest version, requiring no additional setup from customers. Customers can expect the following from the latest version.
With this update, users should see a noticeable improvement in the reliability and speed of their log drains. Customers may experience minimal to no lag when generating and sending their logs, even at very high volumes, thanks to the work we put in to increase throughput in the new version of our drains.
“It used to take about 15 minutes for logs to show up in the Datadog UI, but now it takes just about a minute on average. Big improvement!” - Darryn Campbell, Principal Software Engineer at CarePort
By collecting data from FluentD and graphing the metrics that matter in Grafana, we’ve been able to set up alerts based on the number of logs waiting to be sent, the number of times customer drains retry sending logs, failed output writes to different destinations, and others. We believe these metrics allow our reliability engineers to quickly identify root causes as issues arise, be it on Aptible’s side or the customer's side, and remediate them more efficiently. Over time, we’ll evolve these metrics as we learn how our newest version of log drains performs in a wider variety of real-world scenarios. Depending on how well these metrics perform, we may also choose to expose them to customers to enable more proactive, self-service remediation of log drain issues.
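To make the idea of alerting on drain-level metrics rather than CPU usage more concrete, here is a minimal sketch in Python. It is not Aptible's implementation. It assumes a JSON payload shaped like the output of FluentD's monitor_agent plugin (fields such as buffer_queue_length and retry_count), and the drain names, thresholds and helper function are hypothetical. Field names should be checked against the FluentD version in use.
```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Thresholds:
    # Hypothetical limits, chosen only for illustration.
    max_queue_length: int = 100
    max_retry_count: int = 5


def find_unhealthy_outputs(payload: str, limits: Thresholds) -> List[str]:
    """Return one alert string per threshold exceeded by an output plugin."""
    alerts = []
    for plugin in json.loads(payload).get("plugins", []):
        # Only output plugins carry buffer and retry metrics we care about.
        if plugin.get("output_plugin") is not True:
            continue
        name = plugin.get("plugin_id", "unknown")
        # Metrics may be missing or null for unbuffered outputs; treat as 0.
        queue = plugin.get("buffer_queue_length") or 0
        retries = plugin.get("retry_count") or 0
        if queue > limits.max_queue_length:
            alerts.append(f"{name}: {queue} chunks waiting to be sent")
        if retries > limits.max_retry_count:
            alerts.append(f"{name}: {retries} retries")
    return alerts


# Assumed sample payload; the drain names are made up for this example.
SAMPLE = json.dumps({
    "plugins": [
        {"plugin_id": "healthy_drain", "output_plugin": True,
         "buffer_queue_length": 2, "retry_count": 0},
        {"plugin_id": "slow_drain", "output_plugin": True,
         "buffer_queue_length": 250, "retry_count": 12},
        {"plugin_id": "forward_input", "output_plugin": False},
    ]
})

if __name__ == "__main__":
    for alert in find_unhealthy_outputs(SAMPLE, Thresholds()):
        print(alert)
```
Unlike a CPU-based alert, a check like this points directly at the drain that is falling behind and says whether the symptom is a growing backlog or repeated retries.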
We’re excited to release the latest iteration of log drains to Aptible. Do let us know how it is working for you. If you have any questions or concerns, we’re always happy to help. If you want to understand what we’re working on currently, vote on features you care about or suggest new ideas for Aptible to work on, you can do so in our roadmap portal here.