Keeping Your Cloud Environment Secure and Optimized with Automated Remediation

In a previous role, I was responsible for managing the Cloud Custodian installation at a large enterprise. I had started working on the cloud team during the early days of this enterprise’s journey to AWS, so I learned a lot of lessons along the way. One of them was what happens when you give 6,000 developers access to AWS.

You end up with “garbage” everywhere: unutilized or untagged resources, non-compliant, unencrypted resources, and more. This costs the organization in wasted money, operational inefficiency, and the risk of breaches and non-compliance. And if you don’t have a centralized cloud governance and compliance tool, the different teams and divisions using the public cloud will set up their own tools and scripts to try to clean up the mess in their own domains.

Enter Cloud Custodian

This is where Cloud Custodian, an open source, governance-as-code tool, came to the rescue. Cloud Custodian offered a way to clean up our AWS accounts with real-time remediation and a simple domain-specific language (DSL). Initially, we demonstrated what Cloud Custodian could do in a few smaller divisions. Once we expanded execution across the entire enterprise, we were able to put the whole organization on the path to a well-managed cloud. Through this process, we quickly learned that Cloud Custodian’s real power came from real-time remediation.

Automated Remediation Enables Guardrails for Cloud at Scale   

In most large enterprises, a central team is responsible for the hygiene of the organization’s cloud accounts. However, these central teams are usually only capable of creating reports of problematic resources and passing those reports to the leads of the individual accounts. The account leads then need to track down who is responsible for remediating those resources. The work is handed off this way because the central team often doesn’t know who owns the problematic resources, either because of poor tagging or because it lacks intimate knowledge of what is running in those accounts. Out of an abundance of caution, the central team will typically not take action against a resource unless it is an egregious offense, such as a publicly available database. Otherwise, it acts only after it has exhausted all channels in trying to determine who is responsible for the resource and can prove it did its due diligence first.

This manual process takes a long time, and your cloud stays in a non-compliant state while it runs. Worse, you’ll never reach a compliant state, because developers will keep creating non-compliant resources. This is why it’s so important to have real-time automated remediation that acts as guardrails. For example, Cloud Custodian features event-based rules that execute on new resource creation or change. These guardrails can stop new non-compliant resources from making their way into your environment. With guardrails in place, your central team can work on cleaning up existing resources, knowing it won’t have to repeat the effort endlessly because new instances keep popping up.
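To make the idea of an event-based guardrail concrete, here is a minimal Python sketch. It is not Cloud Custodian's API or DSL; the ResourceEvent shape, the REQUIRED_TAGS set, and the remediate callback are all hypothetical names chosen for illustration. It only shows the general pattern: evaluate each creation or change event as it arrives and act immediately on anything that breaks a rule, instead of collecting violations into a report.

```python
from dataclasses import dataclass, field

# Hypothetical tagging rule: every new resource must carry these tags.
REQUIRED_TAGS = {"owner", "cost-center"}


@dataclass
class ResourceEvent:
    """Hypothetical, simplified shape of a 'resource created/changed' event."""
    event_name: str
    resource_id: str
    tags: dict = field(default_factory=dict)


def missing_tags(event):
    """Return the set of required tags the resource does not have."""
    return REQUIRED_TAGS - set(event.tags)


def guardrail(event, remediate):
    """Check one event at creation/change time.

    Returns True if the resource is compliant. Otherwise calls the
    caller-supplied remediate(resource_id, missing) and returns False.
    """
    missing = missing_tags(event)
    if not missing:
        return True
    remediate(event.resource_id, sorted(missing))
    return False


# Usage: record remediations in a list instead of calling a real cloud API.
actions = []
record = lambda resource_id, missing: actions.append((resource_id, missing))

guardrail(ResourceEvent("RunInstances", "i-001", {"owner": "team-a"}), record)
guardrail(ResourceEvent("RunInstances", "i-002",
                        {"owner": "team-b", "cost-center": "42"}), record)
print(actions)
```

Because the check runs per event, the untagged resource is handled at the moment it appears, and the central team never has to hunt for its owner later.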

Real-time automated remediation may sound scary, since a lot of people want control over whether a resource is removed. However, insisting on that control is counterproductive, because a person in the loop will always be a bottleneck and will cause more pain for developers.

For example, say one of your developers creates an unencrypted RDS database, and you have a compliance policy that says all data in your cloud must be encrypted at rest. You get an alert in your inbox saying the developer has created this unencrypted resource, and the next day you message the developer to let them know they need to add encryption to the database or you’ll delete it. The developer misses your email, and you delete the database two days later. The developer then sees that their database is gone. Maybe only at that point do they finally see the email warning that it was going to be deleted. What you didn’t know was that the developer had spent the last day loading test data into the database and was going to use it for a client demo later that week. You’ve now cost the developer two days of work because they missed an email and accidentally created an unencrypted database. What if you had a real-time policy in place that deleted any database created without encryption?


The developer would never have been able to access the new database, because Cloud Custodian would have issued the delete call before the database even came up. This might infuriate the developer, who might not know why the database isn’t being created. However, they would most likely reach out for help, check their email for a Cloud Custodian message, or just realize their configuration was missing the encryption attribute. Even though the developer could end up upset in both scenarios, I’m sure they would rather lose an hour figuring out why their database wasn’t being created than lose two days of work because their database was deleted. It also saves the company the cost of two days of lost developer productivity and eliminates the risk of having a non-compliant resource in its environment.
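The real-time scenario above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how Cloud Custodian is implemented: the event dictionaries imitate the general shape of a CloudTrail record for an RDS CreateDBInstance call (the field names eventName, requestParameters, dBInstanceIdentifier, and storageEncrypted are assumed here), and delete_db_instance is a hypothetical callback standing in for whatever actually deletes the database.

```python
def is_unencrypted_rds_create(event):
    """True if the event is an RDS create call without storage encryption.

    Assumes a CloudTrail-like dict; the field names are illustrative.
    """
    if event.get("eventName") != "CreateDBInstance":
        return False
    params = event.get("requestParameters") or {}
    # A missing flag is treated as "not encrypted" in this sketch.
    return not params.get("storageEncrypted", False)


def handle_event(event, delete_db_instance):
    """React to a single event as soon as it arrives.

    delete_db_instance is a caller-supplied (hypothetical) function that
    takes the database identifier. Returns the identifier that was
    deleted, or None if no action was needed.
    """
    if not is_unencrypted_rds_create(event):
        return None
    db_id = event["requestParameters"]["dBInstanceIdentifier"]
    delete_db_instance(db_id)
    return db_id


# Usage: collect identifiers in a list rather than calling a real API.
deleted = []
events = [
    {"eventName": "CreateDBInstance",
     "requestParameters": {"dBInstanceIdentifier": "demo-db",
                           "storageEncrypted": False}},
    {"eventName": "CreateDBInstance",
     "requestParameters": {"dBInstanceIdentifier": "prod-db",
                           "storageEncrypted": True}},
]
for e in events:
    handle_event(e, deleted.append)
print(deleted)
```

The decision is made from the creation event alone, so no one has to send a warning email or wait for a reply, and the developer finds out within minutes rather than after two days of work.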

Without real-time, automated remediation, keeping cloud accounts well managed and compliant is virtually impossible. This is particularly true when developers can access the cloud through native consoles and APIs. Native access has a lot of benefits, but it provides few guardrails, which means non-compliant resources can easily be deployed. This is why it’s critical not just to remediate non-compliant resources after they get into your environment, but to remove them before developers are able to use them.