The Aptible Update Webinar Series is a quarterly presentation that covers recent features and changes to the Enclave deployment platform and Gridiron security management products. These webinars feature technical sessions led by Aptible engineers, live demonstrations, customer examples, and Q&A with the Aptible team.
We hosted our Q1 Update Webinar on January 25, 2018. In it, we covered:
- Aptible Response to Meltdown and Spectre. An overview of the Meltdown and Spectre vulnerabilities, Enclave’s security architecture (emphasizing trust levels and isolation) and the team’s response to patching the exploits.
- Metric Drains. Metric Drains let you route metrics to the destination of your choice. This feature empowers you to do much more with your metrics, including alerting and troubleshooting performance issues.
- Additional Feature Updates. We discussed Managed HIDS, VPC Peers and VPN Tunnels, and more.
We provided a recap of this webinar on our blog.
Chas: Okay, are we about ready? Okay. Thanks everybody for joining. Welcome, I’m Chas Ballew, I’m CEO of Aptible, and thank you for attending today. As many of you know, this webinar is a part of our regular series, the Aptible Update Webinar series. We do this quarterly. It’s a presentation that covers recent features and changes to the Enclave deployment platform and Gridiron Security Management [00:00:30] products. Our goal at Aptible, the entire reason we get up in the morning, is to make the best tools for developers to build security into their architecture and organizations. Today’s going to be a bit of a special webinar. We’re going to start with a segment dedicated to Meltdown and Spectre, and our new web security advocate Elissa will be moderating a discussion with our engineers about those two vulnerabilities and what we’ve done to protect you, our customers, from them. Then after that we’ll hear from Thomas Orozco, our [00:01:00] Enclave lead, who will tell us about the new Metric Drains feature and a number of other releases for Enclave. So, a bit of logistics to start: use the Q&A tool in Zoom here to ask questions during the panel or during the rest of the Webinar. Elissa and Frank Macreery, our CTO, will moderate those and ensure that everyone’s questions get answered, either during the talks or at Q&A breaks or offline afterwards. We’re also recording the Webinar, as we’ve recorded previous ones. I will [00:01:30] make it available, post it on YouTube, and share a link to the recording, the slides we have, and the transcript with everyone. So, with that said, I’m going to introduce Elissa Shevinsky, our new web security advocate, for this panel.
Elissa: Thanks Chas. We know that Meltdown and Spectre is a topic that’s important to a lot of our customers, so we want to take some time to talk about how Aptible handled these vulnerabilities, [00:02:00] and how it impacts you. We’re speaking today with Frank and Thomas. 
Frank is CTO at Aptible, and Thomas is lead engineer for Enclave. Frank, Thomas, could you give us the TLDR on these vulnerabilities?
Yeah, sure thing, thanks Elissa. So I’m sure by now most of you have heard about Meltdown and Spectre. Part of the reason for the hype around these vulnerabilities is tied to the part of the computing infrastructure that these affect, [00:02:30] and the resulting very broad impact that they have. So, in particular, Meltdown and Spectre are both vulnerabilities that exploit a flaw that exists in basically every modern CPU architecture, and so pretty much anything running on a modern CPU, from, you know, cloud computing infrastructure, to SaaS software, to your local desktop and laptop workstations, is affected in some way by Meltdown and Spectre.
I think one thing that’s worth adding and noting about this is that, [00:03:00] from a practical standpoint-
Thomas, I think you’re muted real quick.
Thomas: I apologize. All right, so one thing that might be worth noting as well is the impact of these vulnerabilities. In and of themselves, they’re not the worst vulnerabilities that have existed in, say, [Linux 00:03:18], or even other operating systems and so on. What makes them unique, beyond their scope and the fact that they affect practically everything, is that both of them are very difficult [00:03:30] to detect if they are being exploited. If someone is exploiting something like Meltdown against you, it is very difficult to know about that. Besides this, Meltdown in particular is also very easy to exploit.
Thomas, what does it mean to exploit Meltdown?
Well, that’s a good question. So, what Meltdown lets you do, essentially, is … on a platform like Enclave, if you have a customer container running, it’s going to be running untrusted code, at least from our perspective, and the perspective [00:04:00] of the platform; it is running arbitrary code, really. What Meltdown lets you do is kind of, sort of, break out of the isolation by reading memory that you normally should not have access to. So imagine you have your app running, and then there’s the kernel memory, which is where essentially the rest of the operating system is running. Meltdown is gonna let an app potentially read into that kernel memory. Normally that would raise an error, but with Meltdown, you manage to kind of bypass that limitation and can read that memory, which may include things like, you know, if you have other processes on the same instance, [00:04:30] they’ll have their own environment variables; that’s an obvious target. Potentially there are disk caches as well. There are lots of things in kernel memory that are sensitive, and you get access to them with Meltdown. 
You know, in particular, this idea that Meltdown exploits privilege escalation paths makes it particularly relevant for cloud computing infrastructure, because whenever you’re running in the cloud, there are, just by definition, many layers of separation between you and bare metal hardware. So, for example, as [00:05:00] a customer of Enclave, on top of the bare metal you’ve got AWS running EC2 instances, on top of that you’ve got Aptible, another party, running Enclave, an orchestration platform where ultimately your own applications and databases run in Docker containers on top of that. So, you know, that’s part of the big threat and risk around vulnerabilities like Meltdown: there are just so many different ways and barriers across which you can exploit this vulnerability.
[00:05:30] So, can we defend against vulnerabilities like Meltdown and Spectre? And what are we doing here at Aptible to proactively protect our Enclave customers?
Yeah, sure thing. So, I mean, roughly speaking there are two, kind of, general pathways for exploiting a vulnerability like Meltdown. So, the first is that you can gain access to data run by your peers, so other customers whose app and database containers are hosted on the same instance as you. [00:06:00] The way that we protect against this architecturally on Enclave is we require that all sensitive workloads, so whether you’re hosting PHI or any other sensitive and regulated data, run on dedicated stacks.
Elissa: It’s isolated.
Frank: Yeah. So, it’s isolated, it’s dedicated. It means that, you know, the EC2 instances that dedicated stacks run on, the networks that they run on, all belong to just one customer and are not shared [00:06:30] with other Aptible customers. The second way that you can exploit a vulnerability like Meltdown is to attack the PaaS itself, so to attack Enclave. So, in this case, you know, as Thomas mentioned, there are environment variables, disk caches that may contain secrets. 
On our EC2 instances, some of these secrets are actually Aptible secrets, and so the way that we protect or mitigate the risk of accessing these secrets is by separating our riskiest kind of systems, those that run [00:07:00] untrusted customer code, from our most sensitive environment variables and other secrets that are required to administer Aptible.
So, the Aptible architecture is based on trust and isolation. Can you talk a bit about those principles and those decisions?
I think I can probably speak a bit to that. I think it’s important to think about the threat model under which we operate. As a PaaS, really … I think it’s really important to understand, whenever you’re operating something where you are running [00:07:30] apps for other people, like a hosting platform, that you kind of have to assume, from a security perspective, that your customer is not your friend. You have to really assume that they are being hostile to you, like really, all the time. For us, it’s fairly easy to see why. If you look at Enclave, you can have folks signing up and creating a new account. We don’t know who they are. Anyone on the internet can potentially open their own account on a shared environment. Even in dedicated environments, where we have, you know, further trust because we’ve talked to these customers and potentially [00:08:00] have an idea of who they are, any customer could get compromised; their app could get compromised because of a vulnerability in their app. At which stage potentially someone takes control of their app, that someone happens to be running untrusted code, and we’re back to the same place. There’s a container running on your infrastructure, and it’s trying to do something nasty to you. So, you have to always be thinking about this, and that’s driven a lot of the architectural decisions we’ve made with Enclave. 
Yeah, so we’ve actually put together a diagram that represents kind of how we think about threats, [00:08:30] assigns every single component of the Enclave infrastructure to a specific threat level, and then demonstrates how we isolate the riskiest components from the most sensitive and privileged data.
Yeah. In the diagram you can see … We are going to be starting at the right corner. You can see that we have these containers that are running, and as I explained, these are really … These are assumed to be hostile, and we expect them, again, really from a [inaudible 00:08:59] perspective, we [00:09:00] expect them to be constantly trying to break out. As a result, since these containers are running on EC2 instances, we have to really give these instances a very low level of trust. In fact, we don’t really trust these instances to … From an infrastructure perspective, we don’t trust them to do anything. We really have to minimize how much we trust them to be acting legitimately.
So, ultimately, we do need to run tasks on these instances. You know, we need to launch containers, and do [00:09:30] a number of other operations in order to facilitate releasing new app code or provisioning new databases. So the way we do this is that we have a separate job queue system that’s built on top of Redis, where we enqueue tasks telling these untrusted instances specifically what they should do. In particular, we don’t allow the instances to ask for work items or determine what they are supposed to do on their own. [00:10:00] So, it’s always some other system, in a more trusted, more isolated layer, that is placing tasks on a job queue for instances to run and making all of those decisions.
Exactly. The other system is something we call a … sorry. Exactly, and the other system is something we call a coordinator. Coordinators are running on separate instances. 
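The push-only queueing model described above can be illustrated with a small sketch. This is not Aptible's implementation: the queue here is an in-memory stand-in for the Redis-backed one mentioned in the talk, and the names `JobQueue`, `ALLOWED_TASKS`, and `run_worker_once` are hypothetical. The point it tries to show is only that the trusted side decides what runs where, while an untrusted instance can do nothing except drain its own queue.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical task names; the webinar does not list the real ones.
ALLOWED_TASKS = {"launch_container", "stop_container"}


@dataclass
class JobQueue:
    """In-memory stand-in for a Redis-backed queue, one queue per instance."""
    _queues: dict = field(default_factory=dict)

    def enqueue(self, instance_id, task, args):
        # Called only by the trusted coordinator, which decides what runs where.
        if task not in ALLOWED_TASKS:
            raise ValueError(f"unknown task: {task}")
        self._queues.setdefault(instance_id, deque()).append((task, args))

    def pop_for(self, instance_id):
        # An instance can only drain its own queue; it cannot request
        # new work or read another instance's tasks.
        queue = self._queues.get(instance_id)
        return queue.popleft() if queue else None


def run_worker_once(queue, instance_id):
    """Untrusted side: execute exactly what was queued, then report back."""
    job = queue.pop_for(instance_id)
    if job is None:
        return None
    task, args = job
    return {"instance": instance_id, "task": task, "args": args, "status": "done"}
```

In this sketch the result dictionary plays the role of the data the coordinator reads back, which is why the coordinator itself still has to treat that data as potentially hostile input.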
So the idea is wherever we have … this coordinator is going to be making calls to [inaudible 00:10:24] APIs, making calls to our own APIs, or just training new data. This coordinator runs on a separate set of instances. [00:10:30] We have a set of coordinators in each region that we’re located in, so in US East 1, US West 1, US central, and so on. These are running on their own particular instances, so no customer code is running next to them. So we’re able to trust them a little more, essentially. That said, it’s important to realize that whenever you’re really architecting for security, you have to not just look at, you know, who’s giving orders and what direction data flows in, but also, you know, what data is being processed. These coordinators put tasks onto the job queue, but they also then read results [00:11:00] from the same Redis instance. The coordinator will go back after asking for a container to be run. The coordinator will go back and check, “Hey, has this been done?” And if it was, what’s the ID of the container and everything. So as soon as you’re passing data like this, and it’s also encrypted, so you’re decrypting it, there’s always the risk that, you know, maybe there are other ways to subvert the process. So even these coordinators, if only for us … even though they’re some of our most critical infrastructure, they only have a very moderate level of trust. These coordinators don’t have anything resembling [00:11:30] long-term credentials.
Yeah. So those long-term credentials are coming from further up the stack, from what’s ultimately the most trusted component in our infrastructure, our API services. So specifically, I’m talking about our Auth API and our Ops API. So these run in a separate VPC, on a separate set of instances from the Regional Coordinators, and are the only place where long-term credentials are stored. 
In other words, credentials that are able to create [00:12:00] those narrowly-scoped, short-term ephemeral credentials for both AWS APIs and our own APIs. So we isolate these instances. We also separate the API servers that actually serve requests from the internet from those API workers that do the job of generating credentials and storing the long-term credentials that enable them to do that.
Yeah, speaking of that, that last point you mentioned, I think, is worth stressing a little bit: what we’re describing [00:12:30] right now, these layers of isolation, we’re talking about this really from the perspective of the escalation paths that leverage Meltdown. If you break out of a container, how do you try and go on to compromise Enclave? And more importantly, how do we prevent that from happening? Of course, on the other end of the spectrum, there’s also another major threat, which is that we have web APIs, and we have folks that potentially are making API calls constantly to those APIs. So, that’s something we also have to defend against. In that case we’ll have, you know, a similar pattern of isolation, which you had just mentioned. [inaudible 00:12:58] that are passing stuff [00:13:00] that people are sending to us from the internet, these don’t have long-term credentials. We have, again, these levels of isolation to make sure that anything sensitive is as remote as possible, and as separate as possible, from anything that is untrusted.
So, Enclave is really well architected for security, but I’m also aware that the Enclave team had to put in quite a lot of effort to mitigate Meltdown and Spectre. Can you talk about, you know, why that [00:13:30] is? Why you still had to put this work in, even though Enclave is so well architected?
Yeah, sure thing. 
So, I mean, what we’ve talked about so far is the architecture of Enclave and how that limits the extent to which an attacker is able to exploit something like Meltdown, limits the privilege they can gain, or the capabilities they can access as a result of that, but ultimately, in the case of Meltdown, or any other vulnerability, we want to prevent that sort of escalation [00:14:00] entirely. With Meltdown, that means taking a set of steps specific to Meltdown. You know, this involved kernel patching; other vulnerabilities have different steps. The steps for Meltdown were pretty extensive and time consuming. So we started our process on January 3, before any public announcement was made. We had been following along with the Infosec news community. There were rumors [00:14:30] that there was going to be a major vulnerability announced, but we, and really anybody else who was not within the embargo, did not know the exact nature of the vulnerability. So we posted a status update saying that we were planning to follow closely with any news, and that customers might expect some maintenance to be done that might involve restarting their apps and databases. So, the following day, the announcement was actually made, and so we began our patching process, starting [00:15:00] with the highest-risk set of instances that we run, which are our Shared-Tenancy instances. We’ve already said that we love our customers, but we treat all customer applications as potentially untrusted by nature, and Shared-Tenancy stacks are especially untrusted, because these can represent basically anybody who comes in off the internet and signs up to use Enclave. So we patched all app, [inaudible 00:15:28], in other words SSH, build, and database [00:15:30] instances in Shared-Tenancy stacks first, then we moved on to Dedicated-Tenancy stacks. 
So, patching app instances, build instances, SSH instances there, and then as a final step, we scheduled maintenance windows with customers in order to restart all of their databases and make sure that all instances in the fleet had been patched against Meltdown. We finished this process on January 9, about five days after we began.
And you patched your [00:16:00] own Linux kernel, yes?
Yeah, we had to do that. One thing that happened with this is that Meltdown turned out to be … the embargo was supposed to last for about another week, and there were all these rumors that Frank mentioned. As a result the embargo kind of broke down a little early. The announcement was supposed to happen a week afterwards. The main consequence of this was that there weren’t patches available. We use Ubuntu as a distribution, like many people do. Other distributions had the same problem, like Debian, [00:16:30] for example. What happened was that since the embargo broke down early, they didn’t actually have any updates available. So in the case of Meltdown, it’s important to realize that … As Frank mentioned, it’s a vulnerability in the CPU, but it can be mitigated at the kernel level, and as a result that’s why we need to apply patches, but since Ubuntu didn’t have the patches in the first place, we did have to roll out … we did have to rebuild our own kernel essentially, which we based off the … We used for the most part just the official kernels from upstream. So really, kind of, [00:17:00] I guess, Linus’s tree if you will. So that’s what we ended up using, so we had to kind of go through the process of upgrading from 4.4, which we were using before, to 4.14, which is the one that received the upstream patches essentially. That was probably a slightly ambitious project but …
Well, a little risky, right?
That’s true. It’s a pretty big upgrade, there’s always risk. 
We did everything we could to really mitigate [00:17:30] that risk by making sure that we validated that these changes were gonna work. We rebuilt the kernels, and still took a few hours to run our full integration testing against them, which is something we normally do on a nightly basis, but this time we did it out of schedule to confirm that this was going to work properly. Ultimately, the timing worked out pretty well. I think it’s important to realize that, as Frank mentioned, we had the patches. Everything was patched on January 9. The actual POCs for Meltdown … So, code someone could use to really exploit Meltdown against you, [00:18:00] where you just had to copy and paste it. You don’t need to understand anything about Meltdown, just run the code, right? This was released about 12 hours after we finished our patching. So being able to upgrade, validate, and then deploy everywhere, that’s what allowed us to get those fixes in place without having to wait for Ubuntu, and as a result, that’s what allowed us to finish in time. If we had had to wait, we would have been late.
Right.
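As a companion to the validation step described above, here is a minimal sketch of one way to check whether a running Linux kernel reports mitigations for these CPU vulnerabilities. It is not the process Aptible describes in the talk: it assumes a kernel that exposes the `/sys/devices/system/cpu/vulnerabilities` directory, and the helper names `read_mitigation_status` and `unmitigated` are made up for illustration.

```python
from pathlib import Path

# Kernels that expose this directory report one file per known CPU
# vulnerability, e.g. "meltdown" or "spectre_v2".
SYSFS_DIR = Path("/sys/devices/system/cpu/vulnerabilities")


def read_mitigation_status(base_dir=SYSFS_DIR):
    """Return {vulnerability_name: status_text}; empty if the directory is absent."""
    base = Path(base_dir)
    if not base.is_dir():
        return {}
    status = {}
    for entry in sorted(base.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


def unmitigated(status):
    """Names whose status text starts with 'Vulnerable'."""
    return sorted(name for name, text in status.items()
                  if text.startswith("Vulnerable"))


if __name__ == "__main__":
    current = read_mitigation_status()
    for name, text in current.items():
        print(f"{name}: {text}")
    print("unmitigated:", unmitigated(current))
```

A check like this could run as part of an integration suite after a kernel upgrade, so that a host booting an unpatched kernel is flagged before it receives workloads.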
Frank: Yeah. I mean, ultimately, we were balancing trade-offs here, right? There are a variety of things we could optimize for, between [00:18:30] security of the platform, speed of getting a secure mitigation out, stability of the Enclave platform, and avoiding introducing any kind of instability as a result of this upgrade. And then, you know, optimizing for the time it took. Ultimately, this became priority number one for Aptible as an organization. So, we chose to optimize for all of those things except time taken, which did mean investing the time to [00:19:00] patch our own Linux kernel, as Thomas mentioned, to not wait for the Ubuntu patches, and to also thoroughly test these against our full suite of integration tests, and still do so at a pace that got everything resolved before the first POCs were introduced. So, you know, we are very proud of the outcome, and we do think it reflects kind of our priorities as a company.
Yeah, we were patched in advance of the POCs. Can you talk just a little bit [00:19:30] more about that prioritization?
Yeah, sure. It goes to everything that we do as a company. I think that there’s actually a lot to learn here. Not only in the context of Meltdown, but there are takeaways that we learned from Meltdown, and from the approach that we took, that can be applied by any of our customers to any sort of security risk [00:20:00] out there. I think these kind of break down in a few ways. So, first, as we demonstrated in the threat model diagram that we presented earlier, you want to start your security process by identifying abstract threats. So, identifying where you’re vulnerable, and how you can architect to protect against those vulnerabilities. Your threats as a company may be very different from the threats [00:20:30] that we face at Aptible, but you still want to enumerate those. 
You want to figure out what services you depend on that are the biggest threats, and you want to figure out how you can build your architecture, or modify your architecture, in order to isolate and protect against those threats. Not all threats are abstract, obviously, so whenever a new big vulnerability like Meltdown comes out, you want to respond to it, see how it fits into your entire security model, [00:21:00] and evaluate where assumptions that you may have made in the past need to be amended, and what changes you should make as a result. Just to hammer it all home, this is a process that should be ongoing over time. So the diagram that we showed was not how Enclave looked back in 2013, right? So this is the accumulation of four and a half years of experience, and going through this process of reviewing [00:21:30] abstract and real threats, and constantly modifying our security controls over time to address those. Pretty much every company that’s operating in the cloud should be doing something similar to that.
Right. Security is an iterative process.
Yeah. No one’s perfect from day one. No one’s ever perfect, but it’s something where we can all improve.
Thomas: Yeah, so, to summarize, Enclave was already really well architected to protect against vulnerabilities like Meltdown [00:22:00] and Spectre, but when those vulnerabilities hit, the team still had to be very thoughtful and proactive in applying patches, and in some cases creative, as with the Linux kernel. Is that about right?
Yeah, absolutely. I think what it really boils down to is that architecting for, and mitigating, these vulnerabilities is what buys you the time to deploy a fix. You can’t have an approach where your security is only about patching things as fast as possible. You do have to do this to be secure. [00:22:30] When something comes out, you have to patch it or else you’re vulnerable, but that cannot be your only approach. You need to afford yourself the time, to make sure that should something come out, should something big happen, you are in a position where you orchestrate yourself to have the time to do something about it [inaudible 00:22:46].
Right. The secure architecture gives you a little bit of room, because you’re not as vulnerable as you would otherwise be.
Exactly. Yeah.
Yeah. So I think we have some questions from the audience.
Yeah. Let [00:23:00] me check some of these. One of the first questions we have is about what Frank mentioned. You mentioned that EC2 instances are isolated in cases where PHIs are not, and Sergio is asking, “Is this a feature that we as customers need to enable, or has that been done already for them?”
Yeah. This is just inherent to Enclave. So, if you look at any of your environments and stacks on the Aptible dashboard, you’ll see that there. It’s clearly indicated whether they’re Shared-Tenancy [00:23:30] or Dedicated-Tenancy. So, your role or obligation here is to ensure that any sensitive workloads that you’re operating, so if it’s PHI or other sensitive regulated data, are running in dedicated stacks. It will say dedicated and eligible for PHI. 
So if you have any further questions about that too, just reach out to us.
So another question we had was from Norman, who is asking, “ [00:24:00] How has Amazon dealt with the vulnerability at the CPU level, possibly with firmware updates, or what’s the plan?”
This one I think I can take. So what they’ve done, and we have said, is essentially what Frank was mentioning much earlier in the conversation, which is … Meltdown lets you cross isolation boundaries. Going from the container to the host, that’s one of them. From everybody else’s perspective, that’s one they don’t really care about; the one they care about is going from the guest, so from the VM that you have on EC2, to the hypervisor [inaudible 00:24:28]. They have the ability to live patch this, [00:24:30] and it’s fairly non-disruptive, I think, in most cases. We had a handful of instances that were running on older hardware on AWS that had to be restarted, but for the most part they deployed mitigation for Meltdown that way. I think Spectre is the one that’s … Spectre is more like, still evolving, I think, at this stage. We mentioned it’s less obvious to exploit, but it’s one that’s going to require more complex mitigation than Meltdown did. That is one where there are indeed some updates that are available at the CPU level to try [00:25:00] to mitigate the problem, but these updates … They aren’t so much about fixing the problem in the CPU; that’s something that hasn’t happened. If it was going to happen, it would have happened by now. It might happen in a few years, once Intel and everyone else has time to rethink how this works, but for now …
Once [Linus 00:25:18] curses at enough people about it?
Yeah. Right example. So, for now, what these microcode updates allow is that they give more options that the kernel can use to kind of tell the CPU, you know, “Hey, you know, try to be a little careful here. 
[00:25:30] Try to do things a little slower.” These updates … So yeah, some of them are already available, I believe, on EC2 hosts, but I see a lot of discussion as to whether that’s actually the right way to go about it. It sounded like a couple of weeks ago it was gonna be, yeah, you have to use the microcode updates. And as of a few days ago the entire conversation has changed, and now it sounds like this is really not what you want to be doing. So, it’s still evolving. I’m guessing there will be some impact from microcode updates. I think some of them will be leveraged, but it’s unlikely that this … It would [00:26:00] be great if it could be, but unfortunately, it won’t be fixed with just the microcode updates [inaudible 00:26:07].
We have time for, how about one more question?
Sure. We’ll answer the rest by text throughout the rest of this Webinar. I think another one we had was, “Before sensitive data, there’s production databases. Why were these patched last?” Frank, do you want to take that one?
Yeah, sure. The big consideration [00:26:30] here is the threat risk of what’s running on that instance, as we discussed. So in the case of database instances, those are not running untrusted customer code; they’re running database services that we determine the code for, and we provision those containers, and the customer doesn’t have control over what’s running there. That said, it’s not entirely impossible to, say, you know, run something that has an effect [00:27:00] on what’s running on the host from, say, you know, a Postgres query or MySQL query, or really any database query. So, there is a risk there, but it is less of a risk than, say, you know, running untrusted code directly from an Aptible SSH session, or from an app container. This is also part of the reason why we separate app containers on completely separate instances from database [00:27:30] containers, because those two have kind of differing relative classes of risk or threat. 
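The risk-based patch ordering discussed in this answer can be expressed as a simple sort. The following is only a sketch of the reasoning, not Aptible's tooling: the numeric risk ranks and the `InstanceGroup` type are illustrative assumptions chosen so that the resulting order matches the sequence described in the talk (Shared-Tenancy instances first, then Dedicated-Tenancy app, build, and SSH instances, then Dedicated-Tenancy databases).

```python
from dataclasses import dataclass

# Lower rank = patched earlier. The ranks are illustrative assumptions:
# shared tenancy may host anyone who signs up, and app/build/SSH instances
# run customer-supplied code directly, unlike database instances.
TENANCY_RANK = {"shared": 0, "dedicated": 1}
WORKLOAD_RANK = {"app": 0, "build": 0, "ssh": 0, "database": 1}


@dataclass(frozen=True)
class InstanceGroup:
    name: str
    tenancy: str   # "shared" or "dedicated"
    workload: str  # "app", "build", "ssh", or "database"


def patch_order(groups):
    """Sort instance groups so the riskiest ones are patched first."""
    return sorted(
        groups,
        key=lambda g: (TENANCY_RANK[g.tenancy], WORKLOAD_RANK[g.workload]),
    )


if __name__ == "__main__":
    fleet = [
        InstanceGroup("dedicated-db", "dedicated", "database"),
        InstanceGroup("shared-db", "shared", "database"),
        InstanceGroup("dedicated-app", "dedicated", "app"),
        InstanceGroup("shared-app", "shared", "app"),
    ]
    for group in patch_order(fleet):
        print(group.name)
```

Sorting on a `(tenancy, workload)` tuple keeps the policy in two small tables, so changing the priority of a class of instances means editing a rank rather than rewriting the rollout logic.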
Yeah, I think ultimately it’s about … You can copy the POC to an app, you can copy the POC to an SSH session. You can’t copy a POC to a Postgres database, but still, you could potentially exploit it.
Thomas: I think another thing also worth mentioning is that one of the reasons we gave our customers a little more time with databases is that this will create … like when we were restarting apps, Enclave is architected for that to be zero downtime. Many customers [00:28:00] never realized we were restarting their apps, right? When we have to restart databases there’s inevitably a little bit of downtime. So we also wanted to balance a little bit, and say if they’re less critical, let’s give the customer a few hours, maybe a day, to know, “Hey, this is coming. We’re going to be restarting this.” So, just so that they have a heads up about it. They can plan for it, and potentially communicate with their key customers about the coming downtime too.
I’ve learned a lot from this conversation. Hopefully some of our customers have too. We’re certainly [00:28:30] available to keep taking questions through the various support channels. Thomas, you’re going to present next on Metric Drains, yes?
Yeah, that is correct. We are gonna be moving on to a more traditional, I think, format for this Webinar, where we have indeed some features we will review. Metric Drains is the first one we want to talk about. So, Metric Drains. They have a … monitoring the performance of your containers, which you probably know from the name, I’m guessing. In any case, the way they [00:29:00] work is they’re functionally similar to Log Drains, except they work for metrics. So the way they work is we capture metrics on your containers about every 30 seconds, or twice a minute, and then we just centralize them, and route them to the destination of your choice. 
We do that every 15 seconds, which kind of gives you a maximum latency, between the metric being captured and you seeing it wherever you’re sending it, of 45 seconds, which is fairly good, and fairly close to real time, when you have to look [00:29:30] at it. What are the supported destinations? Right now we have three. Two of them being InfluxDB: it can be [inaudible 00:29:38] on Enclave. So, if you think about it, that’s somewhat similar to how you might be running [inaudible 00:29:43] in Grafana today. You can do the same thing with InfluxDB. Another tool is called Grafana. We’ll talk about Grafana in a bit, but it’s a visualization tool. You can also use InfluxDB that’s hosted somewhere else. InfluxData, the company behind InfluxDB, [00:30:00] also has a hosted offering, for example, so you could use that too. And finally, we also support Datadog. A lot of our customers use Datadog for APM, and for performance in general, so that’s why we let you enrich this by also adding metrics from Metric Drains. So speaking of these Metric Drains, it’s worth knowing what’s captured in them. So there are several things. First of all, for each of your containers we are going to be capturing, you know, is this container running, what’s the [00:30:30] CPU usage like? What’s the memory usage like? The memory usage is broken down by RSS and total memory, kind of like in the dashboard. We also give you the memory limits, if you had memory [inaudible 00:30:41]. You also get access to disk metrics. Things like disk I/O, disk usage, disk limits. Those matter only for databases, since they are not relevant for apps, which don’t actually have dedicated storage.
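The 45-second figure quoted above is the sum of the two intervals mentioned in the talk. The short sketch below spells out that arithmetic; the constant and function names are hypothetical, and the model (one capture interval plus one routing interval, with no network delay) is a simplifying assumption.

```python
# Intervals as quoted in the talk.
CAPTURE_INTERVAL_S = 30  # metrics are captured on containers about every 30 seconds
FLUSH_INTERVAL_S = 15    # captured metrics are routed to the destination every 15 seconds


def worst_case_latency_s(capture_interval_s, flush_interval_s):
    """Upper bound on the delay between a change and seeing it downstream.

    A change that occurs just after a capture waits up to one full capture
    interval to be sampled, and that sample then waits up to one full flush
    interval to be sent. Network and ingestion delays are ignored.
    """
    return capture_interval_s + flush_interval_s


if __name__ == "__main__":
    print(worst_case_latency_s(CAPTURE_INTERVAL_S, FLUSH_INTERVAL_S))
```

Under these assumptions the bound is 30 + 15 = 45 seconds, which matches the maximum latency stated in the talk.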
In any case, for InfluxDB and Datadog, the format of metrics [00:31:00] is slightly different. So you’d want to review the documentation for these, which also explains what the metrics mean, and some suggested use cases as well. So, how you might want to go about actually using these metrics, and doing something with them. So, speaking of what you can do with them. Really what it boils down to with Metric Drains is that these metrics exist. We’ve been collecting them for the dashboard for about 18 months, I think, since we shipped that feature. The goal of Metric Drains is to make it possible, and easier, [00:31:30] for you really to do more with these metrics. So one first use case, which is something we had a lot of requests about, and which we are happy to finally support, is retention. With Metric Drains you can retain metrics for as long as you want. You can retain them for years if you want, never actually evict them, whereas in the dashboard, for example, we give you 24 hours. You can also compare metrics across releases of your app. You restart your app, you resize the database. At one point you [00:32:00] see the database having a hard time, or performance [inaudible 00:32:04] you’re looking at Postgres, realizing you need more caches in your database. With Metric Drains, you can actually restart the database, and then compare, you know, how did that change when I made that change? Did the disk I/O go down due to the fact that we now leverage more caches? So, it’s really about helping you compare, so that you can really make your own decisions as to what you want to do with your metrics. Another use case for these: once you have the data, you’re probably going to find some [00:32:30] patterns. Some patterns are obvious. If you’re approaching the memory limits, bad things may happen. 
If you’re running a database, and you’re running out of disk, bad things will probably happen. So there are obvious cases where you might want to alert on things. Some of them, as you know, we already do. Like if you run out of disk, we’ll actually give you a heads up before you do. But in any case, you may want to be able to take ownership of this as well. And so with Metric Drains, since you get the metrics, you get to make that decision. You get to decide whether you’re alerting on disk usage or maybe CPU. [00:33:00] And finally the last use case for these is really about correlation. Since with Metric Drains you gain ownership of your own metrics, this allows you to incorporate them somewhere else. Incorporate them in dashboards that you may already have. Potentially if you’re using tools for APM, such as, as I mentioned earlier, Datadog. If you’re using these for app performance, like monitoring the number of transactions per second, or monitoring the number of users logging in, or active users, you can set up dashboards [00:33:30] where you potentially have correlations between how many events are happening in your app and CPU usage. That helps you better understand your applications, and get to the bottom of problems faster. Potentially drive new learnings, and you find out about new [inaudible 00:33:44] you might want to create. So again, you can enrich these metrics however you’d like, really. So before I wrap up on Metric Drains, there’s a few tips that I wanted to mention. The first one is we now support InfluxDB as a database on Enclave. You used to have [00:34:00] Postgres, Redis, MySQL, and Elasticsearch, [inaudible 00:34:06]. So InfluxDB joined that list now. So InfluxDB, I think, personally it’s something that we use for a number of [inaudible 00:34:13] ourselves. It’s a database that we use for our own metrics product, in fact. So we think it’s a great choice for Metric Drains. 
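The alerting use case described above (deciding for yourself when memory or disk usage is too close to its limit) can be sketched as a simple threshold check. This is only an illustration: the record type, field names, and the 90% default are assumptions, not the format Metric Drains actually emit, so check the documentation for the real field names before adapting it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemorySample:
    container: str   # hypothetical container label
    used_mb: float   # memory currently in use
    limit_mb: float  # memory limit; 0 means no limit was reported


def near_memory_limit(samples: List[MemorySample], ratio: float = 0.9) -> List[str]:
    """Return the containers whose memory use is at or above ratio * limit."""
    return [
        s.container
        for s in samples
        if s.limit_mb > 0 and s.used_mb / s.limit_mb >= ratio
    ]


samples = [
    MemorySample("web-1", used_mb=950, limit_mb=1024),
    MemorySample("db-1", used_mb=400, limit_mb=1024),
]
print(near_memory_limit(samples))
```

In practice this kind of rule would more likely live in the alerting layer of whatever destination receives the metrics; the sketch only shows the comparison being made.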
If you want to use InfluxDB for this, then you probably want to look at another tool that’s called Grafana. Grafana essentially provides you with visualization and alerting for your metrics. [00:34:30] So essentially it lets you have dashboards like the ones I showed you earlier in this presentation, which are from Grafana. It’s really easy to create dashboards that are usable, and to set up alerts for events that you want to know about. And it’s also very easy to deploy on Enclave, which is good, too, in this case. And so we have several tutorials about deploying Grafana, setting it up with the Metric Drain, and potentially looking at some suggested queries you may want to use. So that’s about it for Metric [00:35:00] Drains. Of course, if you have any questions about Metric Drains, we’ll be taking all questions at the end of the feature review for Enclave. So if you just post them now, we’ll get to them. As far as other things that are new on Enclave, we have a small list, really. We have Managed HIDS, which is generally available. So as I mentioned, yes, we talked about it on the last webinar. We were introducing it as a beta, and it’s now available for everyone. We also have VPC Peers and VPN Tunnels, and I’ll talk more about what these do, and why you can use them. The CLI [00:35:30] is a lot more usable than it used to be. It provides a lot more functionality you can use. Our databases in general require you to do less to get more out of them. And as I mentioned earlier, and I’m not going to go over it again, we do have InfluxDB as a database now. So let’s talk about HIDS. HIDS, for those that don’t know the acronym, is Host Intrusion Detection System. Managed HIDS, as the name implies, means that we manage it for you. So the way this works is, on a weekly basis, you get audit-ready [00:36:00] PDF and CSV reports. 
The goal is really, if you have compliance requirements for intrusion detection, Managed HIDS is going to be a good option for you to get that on a weekly basis without having to do any further work. And the good thing about this is that HIDS in general is pretty much either a requirement, or something that, inevitably, for all compliance frameworks, you will get credit for if you’re operating HIDS on your infrastructure. So that’s why we really strive to make it easy for you to do. [00:36:30] Just enable it. In fact, by default, for all Shared Tenancy Stacks you have Managed HIDS. You can access the reports, we provide them for free. For Dedicated Stacks, it’s not the case. The pricing is on the side, but for all stacks, you can get access to them. One thing I want to mention though is that regardless of whether you’re purchasing the reports or not, we do operate HIDS across all our infrastructure. So that about wraps it up for HIDS. The next features that we have worked on are VPC Peers and VPN [00:37:00] Tunnels. Both of them are now in the dashboard. So it’s probably worth talking about what these are, actually. Fundamentally, both of these features are about connecting your Aptible Stacks, so your apps, databases, SSH sessions, endpoints, all of this, to other networks. So in the case of VPC Peering, if you have your own VPC in AWS, which you may have if you’re using some managed service from AWS, like RDS, for example, or perhaps you’re using … [00:37:30] other examples could be like AWS Lambda. Or like if you’re using DynamoDB, and so on. Well, DynamoDB doesn’t have to run in a VPC, but if you’re using other products that have to run in a VPC, or even if you have your own instances on the side, with a peering connection what you can do essentially is connect your Aptible Enclave stack over to your own VPC. 
The benefit of that is, they’re essentially gonna [inaudible 00:37:52], so you’ll get traffic flowing through the peering connection, and all the internal endpoints that you have on Enclave will be able to connect to [00:38:00] your other VPC. You won’t have to expose any of these resources publicly; just these two pieces of infrastructure can talk to each other. So it’s very convenient. If you have your own VPC, you really probably wanna be doing VPC Peering. So that’s essentially the gist of it. The other upside of VPC Peering is that it requires no maintenance at all, and it’s free. So it’s fairly convenient. The downside, however, is that it only works for AWS. Peering connections are an AWS-level construct that lets you peer networks in AWS. So really, if you’re not in AWS, [00:38:30] then you can’t use a peering connection. And that’s where VPN Tunnels come in. VPN Tunnels fundamentally do pretty much the same thing. They let you take a piece of your network, and connect it to the rest of your network. So, for example, the first piece is your Aptible Stack, and the rest of your network might be an on-premises network, it might be … maybe you’re on Google Cloud Platform, or Azure, maybe AWS but you don’t want to use a peering connection, maybe you have a hospital partner that has a VPN you [00:39:00] can set up. In any of these cases really, you have some other network arbitrarily somewhere, and you want to connect to it. So VPN Tunnels let you do that over the public internet, and if you want to use that mind play. So that’s fully managed. We take care of setting up the tunnel, maintaining it, monitoring it as well, but it does require additional resources on our end. It requires that we operate a VPN gateway, and that we have all the service behind it. So VPN Tunnels on Enclave are indeed something we charge for. 
It’s $99 a month for each connection [00:39:30] that you have, but as I mentioned, it’s a lot more flexible than VPC Peering connections. In both of these cases, if you’re interested in these, the setup is through support. The reason that it’s through support is that there’s always some key exchange that has to happen for VPN. For VPC, there’s some information that we have to exchange as well, and potentially some follow-up steps that we’d like to be able to explain to you on your end. So for the setup, just contact our support, and we can set up this [00:40:00] connection for you. As soon as they’re set up, you’ll be able to see them in your dashboard. If you forget a little bit about which IP is being used, what networks [inaudible 00:40:07], that’s the new feature that we introduced this quarter. You can now see the connection details in the dashboard, whereas VPC Peers and VPN Tunnels themselves have existed for a few years now on Enclave. All right. That about wraps it up for tunneling and connections. The next set of changes that we released is about the CLI. So for the CLI, we have … the first one is [00:40:30] that we have JSON output now. So you can see the example here. If you set that APTIBLE_OUTPUT_FORMAT variable, you get output from the CLI that is going to be in JSON. So [inaudible 00:40:40] JSON. There’s two upsides for JSON. The first one is that … well, I mean, if you’re like reading this from a script, piping it into something, parsing it in a shell, JSON is of course more structured. The format is stable. So it’s going to be easier for you to use in a scripting context. The other [00:41:00] upside is that since this is designed to be consumed by machines, JSON also gives us the option of providing you with a lot more information. 
So, for example, if you’re running aptible apps, and listing your apps in an environment, normally you would just get a list of apps, but if you’re using JSON output, you also get the Git remote, you get the list of services, you get the scale of these services, you get a lot more information as well. The second change to the CLI, this one is a more minor change. It’s really about some [00:41:30] commands that have changed. So the db:create command has now changed. The upside here is that it now supports picking a version. So you now have the option … whenever you’re running db:create, you can choose, maybe I want to use a specific version of Redis. Something you could do via the dashboard, but now it’s in the CLI too. So if you have some automated processes, or potentially you’re rebuilding your entire environment on a weekly basis, and you’re doing this in staging, for example, and you want to make sure that you’re using the same versions that you’re using in production, [00:42:00] that’s something you can now script, whereas before, you would have had to do this via the dashboard. Speaking of this, you can also first use the db:versions command to get a list of versions that are available. You can of course also get all of this in JSON format too. All right. So speaking of databases and their versions, something else we did is to make our databases a little smarter overall. The first change is about MongoDB. MongoDB instances, when we restored them from [00:42:30] backup, used to behave kind of the way MongoDB behaves when you do that, which is it tries to join the existing replica set. The only problem is that the restored backup thinks it’s the original. So it joins the replica set, doesn’t find itself in it. So it’s very concerned, and it just kind of chooses not to do anything. So you can’t really use the restored instance immediately. 
You have to do some … there’s just some full [inaudible 00:42:54] configuration, essentially telling the new restored database, “Hey, you’re on your own now. So don’t try to rejoin the [00:43:00] existing set. Just be your own set.” That can be a little risky, and it also used to be manual. So what we changed here is that this is now fully automated. So if you have MongoDB, the good thing is you don’t need to think about this, you don’t need to care about this anymore. It’s just gonna happen for you, and you won’t have to do it. The second change for databases, which is probably further ranging, as in it’s gonna affect more people, is that our databases now optimize their configuration according to their container size. So our configurations are not particularly either unique to [00:43:30] Enclave or particularly aggressive; they really are largely the default recommended sizing for these various databases, whether it’s for Postgres, MySQL, and so on.
Thomas: The upside of this is that whenever you launch a database on Enclave now, if you choose a four gig database, then we’re gonna configure the database so that it behaves ideally on that four gig footprint. For example, in the case of Postgres, that’s gonna mean that we’re gonna try to use a little more memory for [inaudible 00:43:55] and everything, but still retain about … I think it’s 60% of its memory for caches, for [00:44:00] example. If you’re using MongoDB, we do similar configuration as well to make sure that your database operates, as much as possible, within its memory limits, and so on. So for you, the upside is, it makes it easier to experiment with new footprints. So if, say, you’re having performance problems, and you’re suspecting it’s just that your database is undersized, you can make it twice as big. Scale it up twice, scale it up four times. The upside is now you don’t have to second guess yourself. Either it’s performing better, [00:44:30] in which case it’s great, or it’s not, in which case you may scale further, but at least you don’t have to be wondering, “Oh, maybe I also need to reconfigure it myself. Maybe I also need to do some tuning afterwards.” Now we take care of that tuning for you, so that your database really always behaves optimally. That about wraps it up. I think we probably have some questions. Yeah. There’s a couple from the audience, Thomas. All right. So we’ll have a few questions. [00:45:00] I think the first one is, can I create a custom Metric Drain that calculates its own metric based on a query of a Redis job [inaudible 00:45:07], for example? So Metric Drains, they don’t arbitrarily … you can’t send your own data to a Metric Drain. However, there is something you can do for this case, and we actually do that for ourselves, so it’s something that we have very specific ideas on how you can do. 
Essentially for this case, what you probably want to do is just use whatever destination you have. So, for example, if you have InfluxDB, you can very easily [00:45:30] set up something that just … you just query Redis, like the number of keys in your queue, and then you can just push that data point all the way to InfluxDB. In other words, you don’t actually even need the Metric Drain for this, you just need to talk to InfluxDB directly. The Metric Drain is more about actually collecting these metrics, and draining them somewhere, which in this case, if it’s a custom metric, you don’t even need. That’s for [inaudible 00:45:51], you can just grab the metric, and send it to the database. All right. We have another question, which was from Sergio, who’s asking, [00:46:00] what would you recommend as the most turnkey solution for metrics? Personally, I think I would recommend trying out InfluxDB and Grafana. The reason for this is that … well, first off, we’re more familiar with these products, so that’s why we’re able to provide more … if you look at tutorials for Grafana, for example, we’re able to provide more query examples, and everything. I think generally speaking Grafana and InfluxDB are a little closer to the metal, I guess, a little more straightforward to get started with. If you’re on the [inaudible 00:46:26] ways to do it. [00:46:30] Datadog becomes very valuable when you have … if you’re already using it for something else. If you’re already pushing APM there, then you probably wanna be using that instead, because you’re gonna have some metrics over there. That being said, again, we don’t have as much experience with Datadog ourselves. The reason we support it is because we have a number of customers using it for APM. 
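The custom-metric approach described earlier in this answer (query Redis yourself, then write the data point straight to InfluxDB without a Metric Drain) might look roughly like the sketch below. It assumes an InfluxDB 1.x-style HTTP `/write` endpoint that accepts line protocol; the measurement name, tag name, URL, and database name are all made up for illustration, authentication is omitted, and the Redis query is left as a hard-coded number rather than a real client call.

```python
import urllib.request


def queue_depth_line(queue: str, depth: int, ts_ns: int) -> str:
    """Format one point in InfluxDB line protocol.

    The measurement and tag names are hypothetical. The queue name is
    assumed to contain no spaces, commas, or equals signs, so no escaping
    is done here.
    """
    return f"queue_depth,queue={queue} value={depth}i {ts_ns}"


def push(write_url: str, line: str) -> None:
    """POST one line to an InfluxDB 1.x-style /write endpoint."""
    req = urllib.request.Request(write_url, data=line.encode("utf-8"), method="POST")
    with urllib.request.urlopen(req, timeout=5) as resp:
        resp.read()  # urlopen raises HTTPError on 4xx/5xx responses


# The depth would come from your own Redis query; it is hard-coded here.
line = queue_depth_line("emails", 12, 1516900000000000000)
print(line)
# push("https://influx.example.invalid:8086/write?db=metrics", line)  # placeholder URL
```

Run on a schedule, a script like this gives a custom series that sits next to the container metrics in the same database, which is what makes the correlation dashboards mentioned earlier possible.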
I can’t say for sure, but ultimately, it’s probably worth reviewing both, and kind of forming your own opinion, but we do, as I mentioned, have a few more tutorials if you’re using [00:47:00] InfluxDB and Grafana. And I think we also have another question, this one coming from an anonymous attendee: can we use VPC Peering for geographical redundancy? So, yes, actually. As of a few months ago I think, [inaudible 00:47:15] really I think it does in fact support VPC Peering across regions now. The only thing you have to keep in mind about this is that, if you want to do this, it’s with two stacks on Enclave. So if you have some of your Enclave resources in US West, and some [00:47:30] of your Enclave resources in US East, you will need to have two stacks on Enclave. So there’ll be two isolated stacks. The stack is actually part of your base fee if you have a production Enclave plan. So you do have to keep in mind that there will be a base price for having that second stack. The VPC Peering connection will be free, but the thing you are connecting to, that’s where there will be some extra base fee that you have to pay. That’s it. One other thing I want to mention briefly about this is that, outside of this context, we also do have, [00:48:00] for everything that is backup related, some geographical redundancy already. So if you have a database in US East, your backups are in US West. That doesn’t mean that … you may want to go farther than this, and I think you might be right, but it’s something to keep in mind. [inaudible 00:48:19] that’s been taken care of by Enclave managing. And we have another question that just came in, so good timing on this one, Norman. So [00:48:30] you’re asking, it looks like the CLI apps command now returns the number of containers per app in the JSON, did I see that correctly? Is JSON not a default and is it live? 
Or is there a parameter where we choose a desired output format? So that’s a lot of questions, but they’re all good questions. The way it works is, indeed, the number of containers is now part of the aptible apps output. So you can see it here, you get a list of services, and for each of these services, you get the number of containers for that service, as well as the size. JSON is not the default output format. [00:49:00] Generally speaking, we try very hard never to break backward compatibility, and we do know that we have some customers that are currently using the text output, and are parsing this, and extracting data from it. So we did not want to change the default, or anything, but it is indeed live right now. To enable it, you have to set an environment variable, which is called APTIBLE_OUTPUT_FORMAT. You can see this at the top of the screen on the right. You set that to json, and this will give you JSON output. You don’t have to take a note of this. You can also just [00:49:30] look for JSON output. We probably have a changelog post about this, and documentation about it as well, if you want to find that exact string. But in any case, yes, you saw it correctly. Many customers have been asking about that kind of extra information in aptible apps. So that’s why we took the opportunity to include that in the JSON output. I believe that’s about it. It looks like we are done with the questions. All right. So, Chas, I think I will hand it back over to you [00:50:00] for the wrap up. Chas: All right. Fantastic. Thank you all so much for participating, and thanks to the audience, and everyone who’s watching along right now for coming along. The next update webinar will be on April 25 at the same time. You can register using the link here, and I believe I have a … my copy and paste isn’t working, otherwise I would drop the link into the chat here, but you can just hit the link to register. [00:50:30] That’s it. 
Thank you very much. We appreciate it. All right. Thank you.
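As a follow-up to the JSON output discussion in the Q&A above, here is a hedged sketch of how a script might consume that output to count containers per app. The JSON shape below is invented for illustration: the key names (`handle`, `git_remote`, `services`, `container_count`, `container_size`) are assumptions, not the documented format, so inspect the real output of the CLI before relying on any of them.

```python
import json

# Hypothetical sample of what JSON output for a list of apps might look like.
SAMPLE = """
[
  {"handle": "web",
   "git_remote": "git@example.invalid:my-env/web.git",
   "services": [
     {"service": "cmd", "container_count": 2, "container_size": 1024},
     {"service": "worker", "container_count": 1, "container_size": 512}
   ]}
]
"""


def containers_per_app(raw: str) -> dict:
    """Map each app handle to the total container count across its services."""
    apps = json.loads(raw)
    return {
        app["handle"]: sum(svc["container_count"] for svc in app["services"])
        for app in apps
    }


print(containers_per_app(SAMPLE))
```

In a real script, the raw string would come from running the CLI with the output format variable described in the talk set to json (for example through `subprocess.run` with a modified environment) instead of from an inline sample.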