Max Martynov and Kirill Evstigneev, Grid Dynamics

After initial migration to the cloud, companies often discover that their infrastructure costs are surprisingly high. No matter how good the initial planning and cost estimation process was, the final costs almost always come in above expectations.

On-demand provisioning of cloud resources can be used to save money, but initially it tends to increase infrastructure usage because resources are so easy and fast to provision. Companies shouldn’t be discouraged by that, and infrastructure teams shouldn’t use it as a reason to tighten security policies or take flexibility back from the engineering teams. There are ways to achieve both high flexibility and low cost, but they require experience, the right tooling, and small changes to the development process and company culture.

In this article, we present five strategies that we use to help companies reduce their cloud costs and effectively plan for cloud migration.

Lightweight CICD

In one of our recent articles we discussed how companies can migrate to microservices but often forget to refactor the release process. The monolithic release process can lead to bloated integration environments. Unfortunately, after being starved for test environments in the data center, teams often overcompensate when migrating to the cloud by provisioning too many environments. The ease with which it can be done in the cloud makes the situation even worse.

Unfortunately, a high number of non-production environments doesn’t even help with increasing speed to market. Instead, it can lead to a longer and more brittle release process, even if all parts of the process are automated.

If you notice that your non-production infrastructure costs are getting high, you may be able to reduce your total cloud costs by implementing a lightweight continuous delivery process. To implement it, the key changes would include:

-Shifting testing to the level of individual microservices or applications in isolation. If designed right, service-level testing can find the majority of defects. Proper implementation of stubs and test data helps ensure high test coverage.

-Reducing the number of integration testing environments, including functional integration, performance integration, user acceptance, and staging.

-Embracing service mesh and smart routing between applications and microservices. A service mesh allows multiple logical “environments” to exist safely within the perimeter of the production environment and allows services to be tested in “dark launch” mode directly in production.

-Onboarding modern continuous delivery tooling such as Harness.io to streamline the CICD pipeline, implement safe dark launches in the production environment, and enable controlled and monitored canary releases.
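As a rough illustration of the routing decision behind dark launches and canary releases, the sketch below picks a backend version per request. In a real service mesh this is normally expressed as declarative routing configuration rather than application code; the header name `x-dark-launch`, the version labels, and the function name are hypothetical and chosen only for this example.

```python
import hashlib


def choose_backend(headers: dict, user_id: str, canary_percent: int) -> str:
    """Pick a service version for one request (illustrative only)."""
    # Dark launch: only requests explicitly marked by test tooling
    # reach the not-yet-released version (hypothetical header name).
    if headers.get("x-dark-launch") == "true":
        return "dark"

    # Canary: hash the user id into a stable bucket in 0..99 so the same
    # user consistently lands on the same version between requests.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "canary" if bucket < canary_percent else "stable"
```

Hashing the user id rather than picking a random number keeps each user on one version for the duration of the rollout, which makes monitoring a canary easier to interpret.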

See our previous article that goes into more detail on the subject.

Application modernization: containers, serverless, and cloud-native stack

The lift and shift strategy of cloud migration is becoming less and less popular, but only a few companies choose to do deep application modernization and migrate their workloads to containers or serverless computing. Deploying applications directly on VMs is a viable approach, which can align with immutable infrastructure, infrastructure-as-code, and lightweight CICD requirements. For some applications, including many stateful components, it is the only reliable choice. However, VM-based deployment brings infrastructure overhead.

Containers improve resource (memory, CPU) utilization by approximately 30% compared to VM-based workloads because of denser packing and larger machines. Asynchronous jobs further improve efficiency by scavenging unused capacity.

The good news is that container platforms have matured significantly over the last few years. Major cloud providers offer Kubernetes as a service, for example Amazon EKS, Google GKE, and Azure AKS. With only rare exceptions, such as some packaged legacy applications or non-standard technology stacks, a Kubernetes-based platform can support most application workloads and satisfy enterprise requirements.

Whether to host stateful components such as databases, caches, and message queues in containers is still an open question, but even migrating only stateless applications will reduce infrastructure costs. If stateful components are not hosted in container platforms, cloud services such as Amazon RDS, Amazon DynamoDB, Amazon Kinesis, Google Cloud SQL, Google Spanner, Google Pub/Sub, Azure SQL, Azure CosmosDB, and many others can be used. We have recently published an article comparing a subset of cloud databases and EDWs.

More advanced modernization can include migration to serverless deployments with AWS Lambda, Google Cloud Functions, or Azure Functions. Modern cloud container runtimes like Google Cloud Run or AWS Fargate offer a middle ground between opinionated serverless platforms and regular Kubernetes infrastructure. Depending on the use case, they can also contribute to infrastructure cost savings. As an added benefit, using cloud services reduces the human costs associated with provisioning, configuration, and maintenance.

Reactive and proactive scalability

There are two types of scalability that companies can implement to improve the utilization of cloud resources and reduce cloud costs: reactive auto-scaling and predictive AI-based scaling. Reactive auto-scaling is the easiest to implement, but only works for stateless applications that don’t require long start-up and warm-up times. Since it is based on run-time metrics, it doesn’t handle sudden bursts of traffic well. In this case, either too many instances are provisioned when they are not needed, or new instances are provisioned too late and customers experience degraded performance. Applications that are configured for auto-scaling should be designed and implemented to start and warm up quickly.
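To make the reactive approach concrete, here is a minimal sketch of a proportional scaling rule similar in spirit to the one documented for the Kubernetes Horizontal Pod Autoscaler: the replica count is scaled by the ratio of the observed metric to its target. The function name, the bounds, and the sample numbers are assumptions for illustration only.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 20) -> int:
    """Scale the replica count by the ratio of observed to target metric."""
    if current_replicas <= 0 or target_metric <= 0:
        raise ValueError("current_replicas and target_metric must be positive")
    # Round up so the service is never left under-provisioned by the rule.
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds to avoid runaway scaling.
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% average CPU against a 60% target: scale out to 6.
print(desired_replicas(4, 90.0, 60.0))
```

Because the rule only reacts to metrics that have already been observed, new instances arrive after the load has risen, which is why fast start-up and warm-up matter so much for this type of scaling.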

Predictive scaling works for all types of applications including databases, other stateful components, and applications that take a long time to boot and warm up. Predictive scaling relies on AI and machine learning models that analyze past traffic, performance, and utilization and predict the infrastructure footprint required to handle upcoming surges or slowdowns in traffic.

In our past implementations, we found that most applications have well-defined daily, weekly, and annual usage patterns. This applies to both customer-facing and internal applications but works best for customer-facing applications due to natural fluctuations in how customers engage with companies. In more advanced cases, internal promotions and sales data can be used to predict future demand and traffic patterns.
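The sketch below shows the idea of exploiting a weekly usage pattern in its simplest form. It is a stand-in for a real machine learning model, not a description of one: it averages historical load per hour-of-week slot and converts the forecast into an instance count with some headroom. All names, the 20% headroom, and the capacity figures are assumptions for illustration.

```python
import math
from collections import defaultdict
from statistics import mean

HOURS_PER_WEEK = 168


def seasonal_forecast(history: list[float]) -> list[float]:
    """Average load for each hour-of-week slot.

    `history` holds hourly load samples; history[0] is assumed to fall
    in slot 0 (the first hour of the week).
    """
    slots = defaultdict(list)
    for i, load in enumerate(history):
        slots[i % HOURS_PER_WEEK].append(load)
    return [mean(slots[s]) if slots[s] else 0.0 for s in range(HOURS_PER_WEEK)]


def instances_needed(forecast_load: float,
                     capacity_per_instance: float,
                     headroom: float = 0.2,
                     min_instances: int = 1) -> int:
    """Instances required to serve the forecast load plus a safety margin."""
    required = forecast_load * (1 + headroom) / capacity_per_instance
    return max(min_instances, math.ceil(required))
```

A schedule built this way can be applied ahead of time, so slow-starting or stateful components are already warm when the expected surge arrives.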

A word of caution should be added about scalability, regarding both auto-scaling and predictive scaling. Most cloud providers provide discounts for stable continuous usage of CPU capacity or other cloud resources. If scalability can’t provide better savings than cloud discounts, it doesn’t have to be implemented.
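The trade-off can be checked with simple arithmetic before investing in scaling automation. The sketch below compares running a fixed fleet at a sustained-usage discount with paying the undiscounted rate for only the instance-hours actually used. The rates, the 30% discount, and the instance counts are hypothetical; real discount schemes vary by provider and commitment type.

```python
def steady_discounted_cost(peak_instances: int, hours: float,
                           hourly_rate: float, discount: float) -> float:
    """Cost of running the peak-sized fleet continuously at a discount."""
    return peak_instances * hours * hourly_rate * (1 - discount)


def autoscaled_cost(instance_hours: float, hourly_rate: float) -> float:
    """Cost of paying the full rate only for instance-hours actually used."""
    return instance_hours * hourly_rate


def scaling_is_worthwhile(peak_instances: int, hours: float, hourly_rate: float,
                          discount: float, instance_hours: float) -> bool:
    """True if scaling beats the discounted always-on fleet."""
    return autoscaled_cost(instance_hours, hourly_rate) < steady_discounted_cost(
        peak_instances, hours, hourly_rate, discount)


# Hypothetical month: 10 instances at peak, 720 hours, $0.10/hour, 30% discount.
# Always-on: 10 * 720 * 0.10 * 0.70 = $504.
# Scaling to an average of 8 instances costs 5760 * 0.10 = $576 (worse).
# Scaling to an average of 5 instances costs 3600 * 0.10 = $360 (better).
```

In this example scaling only pays off when average usage falls below roughly 70% of peak, which is the break-even point implied by a 30% discount.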

On-demand and low-priority workloads

To take advantage of both dynamic scalability and cloud discounts for continued usage of resources, a company can implement on-demand provisioning of low-priority workloads. Such workloads can include in-depth testing, batch analytics, reporting, etc. For example, even with lightweight CICD, a company would still need to perform service-level testing or integration testing, in test or production environments. The CICD process can be designed so that heavy testing runs during periods of low production traffic. For customer-facing applications, this often corresponds to the night time. Most cloud providers apply discounts for continued usage even when a VM is taken down and then reprovisioned with a different workload, so a company would not need to sacrifice flexibility in deployments and can reuse existing provisioning and deployment automation.

An important aspect of on-demand provisioning of environments is destroying them as soon as they are no longer needed. Our experience shows that engineers often forget to shut down environments when they don’t need them. To avoid reliance on people, we either implement shutdown as a part of the continuous delivery pipeline or implement an environment leasing system. In the latter case, each newly created on-demand environment gets a lease, and if the owner doesn’t explicitly renew the lease, the environment is destroyed when the lease expires. Separate monitoring processes and garbage collection of cloud resources are also often needed to ensure that every unused resource gets destroyed.
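A leasing system of this kind can be sketched in a few lines. The example below is a minimal in-memory illustration, not a description of any particular product: the class names, the 8-hour default lease, and the `destroy` callback (which would call the actual teardown automation) are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable


@dataclass
class Lease:
    env_id: str
    owner: str
    expires_at: datetime


class LeaseRegistry:
    """Tracks on-demand environments and destroys those with expired leases."""

    def __init__(self, default_ttl: timedelta = timedelta(hours=8)):
        self.default_ttl = default_ttl
        self.leases: dict[str, Lease] = {}

    def create(self, env_id: str, owner: str, now: datetime) -> Lease:
        lease = Lease(env_id, owner, now + self.default_ttl)
        self.leases[env_id] = lease
        return lease

    def renew(self, env_id: str, now: datetime) -> Lease:
        # The owner must renew explicitly; otherwise the lease runs out.
        lease = self.leases[env_id]
        lease.expires_at = now + self.default_ttl
        return lease

    def reap(self, now: datetime, destroy: Callable[[str], None]) -> list[str]:
        """Destroy every environment whose lease has expired."""
        expired = [l for l in self.leases.values() if l.expires_at <= now]
        for lease in expired:
            destroy(lease.env_id)  # hypothetical hook into teardown automation
            del self.leases[lease.env_id]
        return [l.env_id for l in expired]
```

Running `reap` on a schedule makes destruction the default outcome, so keeping an environment alive requires a deliberate action from its owner rather than the other way around.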

An additional cost-saving measure that we have used effectively in several client implementations is the use of deeply discounted cloud resources that are provided with limited SLA guarantees. Examples of such resources are spot (AWS) or preemptible (GCP) VM instances. They represent unused capacity that is a few times cheaper than regular VM instances. Such instances can be used for build-test automation and various batch jobs that are not sensitive to restarts.
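One common way to make a batch job insensitive to restarts is to checkpoint its progress so that a replacement instance can resume where the preempted one stopped. The sketch below illustrates the idea with a local JSON file; the function name and checkpoint format are assumptions, and a real job would keep the checkpoint in durable storage that outlives the instance.

```python
import json
import os
from typing import Callable, Sequence


def process_items(items: Sequence, checkpoint_path: str,
                  handle: Callable[[object], None]) -> int:
    """Process items in order, resuming after the last checkpointed index.

    Returns the number of items processed in this run.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]

    for i in range(start, len(items)):
        handle(items[i])
        # Write to a temporary file and rename it, so a preemption in the
        # middle of the write never leaves a corrupt checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_index": i + 1}, f)
        os.replace(tmp_path, checkpoint_path)

    return len(items) - start
```

If the instance is reclaimed between handling an item and saving the checkpoint, that item is processed again on restart, so `handle` should be idempotent.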

Monitoring 360

The famous maxim that you can’t manage what you can’t measure applies to cloud costs as well. When it comes to monitoring of cloud infrastructure, an obvious choice is to use cloud tools. To make the most out of cost monitoring, cloud resources have to be organized in the right way to be able to measure costs by:

-Department

-Team

-Application or microservice

-Environment

-Change

While the first points might be obvious, the last one might require additional clarification. In modern continuous delivery implementations, nearly every commit to the source code repository triggers a continuous integration and continuous delivery pipeline, which in turn provisions cloud infrastructure for test environments. This means that every change has an associated infrastructure cost, which should be measured and optimized. We have written more extensively about measuring change-level metrics and KPIs in the Continuous Delivery Blueprint book.

Multiple techniques exist to properly measure cloud infrastructure costs:

-Organizing cloud projects by departments, teams, or applications, and associating the cost and billing of such projects with department or team budgets.

-Tagging cloud resources with department, team, application, environment, or change tags.

-Using tools, including cloud cost analysis and optimization tools, or tools such as Harness.io, which provides continuous efficiency features to measure, report, and optimize infrastructure costs.
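Once resources are tagged consistently, cost attribution reduces to grouping billing line items by tag. The sketch below shows that aggregation on a simplified record format; the field names (`cost`, `tags`) and tag keys are hypothetical and do not match any provider's actual billing export schema.

```python
from collections import defaultdict


def cost_by_tag(records: list[dict], tag_key: str,
                untagged: str = "(untagged)") -> dict[str, float]:
    """Sum cost per value of the given tag; untagged spend gets its own bucket."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        tag_value = record.get("tags", {}).get(tag_key, untagged)
        totals[tag_value] += record["cost"]
    return dict(totals)


# Hypothetical billing line items.
records = [
    {"cost": 12.5, "tags": {"team": "search", "environment": "prod"}},
    {"cost": 7.5, "tags": {"team": "search", "environment": "test"}},
    {"cost": 3.0, "tags": {"team": "checkout", "environment": "prod"}},
    {"cost": 1.0, "tags": {}},
]
print(cost_by_tag(records, "team"))
```

Reporting the untagged bucket explicitly is useful in practice, because a growing share of untagged spend usually signals that provisioning automation is not applying tags consistently.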

With the proper cost monitoring and the right tooling, the company should be able to get a proper understanding of inefficiencies and apply one of the cost optimization techniques we have outlined above.

Conclusion

Cloud migration is a challenging endeavor for any organization. While it’s important to estimate cloud infrastructure costs in advance, companies shouldn’t be discouraged when they start getting higher invoices than originally expected. The first priority should be to get the applications running and avoid disruption to the business. The company can then use the strategies outlined above to optimize the cloud infrastructure footprint and reduce cloud costs. Grid Dynamics has helped numerous Fortune-1000 companies optimize cloud costs during and after the initial phases of cloud migration. Feel free to reach out to us if you have any questions or if you need help optimizing your cloud infrastructure footprint.

5 Strategies to Reduce Cloud Cost