TDWI Articles

Best Practices for Smarter Cloud Spending

Managing your cloud spending is easier than you might think. Here are several practical examples to help you get started.

Moving your data lake from on premises to the cloud is no easy feat, as our enterprise learned in late 2023. We migrated about 15PB of data, along with the associated jobs, reports, and analytical models, in one of the fastest migrations I’ve experienced. Along the way, we faced our fair share of challenges. They ranged from building the business justification and deciding which jobs to migrate first to selecting a framework and testing approach that would keep cloud costs from escalating. We had underestimated the effort the migration required, so these challenges caused misunderstandings and frustration among team members and stakeholders, and they led to burnout and fatigue for some of the team.

Despite the hurdles, we've learned a lot, especially about the importance of managing our cloud finances, a practice known as FinOps.

FinOps isn't just about saving money during the move. It's about making smart spending a habit that continues even after the migration is complete. It’s not a one-time task -- it’s now a part of our processes. That's why we've assembled this list of best practices to help other enterprises -- perhaps yours -- understand how to manage cloud spending better.

Collaborative Cost Management

We established a team of people from finance, data engineering, operations, infrastructure, and the business units to manage cloud costs and optimize spending. I like to call this a “FinOps hug.” The team worked together to analyze cloud spending (dips, spikes, and general trends) and identify areas for optimization. These ranged from steps as simple as using spot instances and optimizing data marts to limiting events and logs to what’s necessary for security compliance and monitoring. Together, the team implemented cost-saving measures that cut our initial cloud costs by nearly 50%. It took several months to reach a point where we considered the costs acceptable, but we remain committed to identifying further opportunities for optimization and improvement.

Life Cycle Cost Optimization

We implemented comprehensive retention policies and archiving strategies that account for the current legal and regulatory environment and our data processing requirements, and they delivered significant cost savings. We revisited the granularity of the data we ingest and process, our data usage patterns, and the financial profitability of data (weighing its business value against its storage costs).

Data within its retention period is kept in standard cloud storage. If a regulatory requirement or future use case requires the data to be archived, it is moved to a lower-cost storage tier via intelligent tiering, which saved us nearly $100K. Otherwise, data is purged once it exceeds the defined retention period.
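The retention rules above amount to a three-way decision. The Python below is a minimal sketch of that logic under stated assumptions, not our production implementation. The `Dataset` fields, dataset names, and retention periods are all hypothetical. In practice, rules like these are usually configured as storage lifecycle policies in the cloud provider rather than written as application code. The function simply makes the order of the checks explicit.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum


class Action(Enum):
    KEEP_STANDARD = "keep in standard storage"
    ARCHIVE = "move to lower-cost tier"
    PURGE = "purge"


@dataclass
class Dataset:
    name: str              # hypothetical dataset name
    created: date
    retention_days: int    # hypothetical retention period for this dataset
    must_archive: bool     # regulatory requirement or future use case


def lifecycle_action(ds: Dataset, today: date) -> Action:
    """Decide where a dataset belongs: standard storage, archive, or purge."""
    age = today - ds.created
    if age <= timedelta(days=ds.retention_days):
        return Action.KEEP_STANDARD   # still within its retention period
    if ds.must_archive:
        return Action.ARCHIVE         # past retention but must be kept
    return Action.PURGE               # past retention, no reason to keep


today = date(2024, 6, 1)
datasets = [
    Dataset("usage_daily", date(2024, 5, 1), 90, False),
    Dataset("billing_2022", date(2022, 1, 1), 365, True),
    Dataset("tmp_extract", date(2023, 1, 1), 30, False),
]
for ds in datasets:
    print(ds.name, "->", lifecycle_action(ds, today).value)
```

Checking retention first means nothing is archived or purged while it is still inside its retention window, regardless of the archive flag.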

We optimize costs across the cloud resource life cycle by autoscaling resources based on demand. Some autoscaling settings (such as thresholds, triggers, and policies) must be configured to match specific requirements. Autoscaling reduces costs during low-use periods, and we also schedule non-essential resources to shut down during off-peak hours. In addition, we use spot instances for cost savings, resource flexibility, and improved performance. This improved our operational efficiency and saved another $30K in the cloud.
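Here is a minimal sketch of the two mechanisms in the paragraph above: threshold-based step scaling and an off-peak shutdown window. The thresholds, node limits, and the 22:00 to 06:00 window are hypothetical placeholders. Managed autoscaling services expose their own configuration for thresholds, triggers, and policies, and this code does not model any particular provider's API.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class ScalingPolicy:
    # Hypothetical thresholds; real values depend on the workload.
    scale_out_cpu: float = 75.0   # add a node above this average CPU %
    scale_in_cpu: float = 30.0    # remove a node below this average CPU %
    min_nodes: int = 2
    max_nodes: int = 20


def desired_nodes(current: int, avg_cpu: float, policy: ScalingPolicy) -> int:
    """Step scaling: move one node at a time, clamped to [min_nodes, max_nodes]."""
    if avg_cpu > policy.scale_out_cpu:
        current += 1
    elif avg_cpu < policy.scale_in_cpu:
        current -= 1
    return max(policy.min_nodes, min(policy.max_nodes, current))


def is_off_peak(now: time, start: time = time(22, 0), end: time = time(6, 0)) -> bool:
    """True if `now` falls in the off-peak window, which may wrap past midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def non_essential_should_run(now: time) -> bool:
    # Non-essential resources (e.g., dev clusters) are shut down off-peak.
    return not is_off_peak(now)


policy = ScalingPolicy()
print(desired_nodes(4, 82.0, policy))            # busy: scale out to 5
print(desired_nodes(2, 10.0, policy))            # idle, but already at the minimum
print(non_essential_should_run(time(23, 30)))    # inside the off-peak window
```

The minimum node count keeps scale-in from removing capacity that essential workloads still need.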

Usage Visibility

We’ve set up monitoring dashboards that track our cloud usage and spending on a weekly and monthly basis. The underlying data is no more than two days old, so we can detect and address usage spikes before they turn into overages or unforeseen expenses. Notifications alert us to spikes or anomalies relative to historical trends, and we investigate as soon as we receive them. Low data latency is critical because it gives us timely visibility, enables early detection of issues, and keeps us proactive. Ultimately, these dashboards keep us informed about our usage and help us prevent unexpected costs.
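One simple way to flag spikes against historical trends is a trailing z-score. Each day's cost is compared with the mean and standard deviation of the preceding days. The sketch below assumes daily cost totals are already available, for example exported from the dashboard's data source. The 14-day window, the threshold of 3, and the standard-deviation floor are hypothetical tuning choices, not the settings behind our notifications.

```python
from statistics import mean, stdev


def detect_spikes(daily_costs: list[float], window: int = 14,
                  z_threshold: float = 3.0, min_sigma: float = 1.0) -> list[int]:
    """Return indices of days whose cost exceeds the trailing `window`-day
    mean by more than `z_threshold` standard deviations."""
    alerts = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        mu = mean(history)
        # Floor the standard deviation so a perfectly flat history
        # does not cause a division by zero.
        sigma = max(stdev(history), min_sigma)
        if (daily_costs[i] - mu) / sigma > z_threshold:
            alerts.append(i)
    return alerts


# Hypothetical daily costs: steady spend with one spike on day 20
costs = [100.0] * 20 + [180.0] + [100.0] * 5
print(detect_spikes(costs))
```

Only upward deviations trigger an alert, because overspending is the risk the notifications guard against. Drops in spend can still be reviewed on the dashboards.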

Data-Driven Decision-Making

This is where we use data from the dashboards to generate insights that guide our spending decisions. We analyze historical usage data to identify trends and patterns. These inform our decisions about resource allocation and optimization so that we get the most value from our investment.

For instance, if we notice that some resources are rarely used, we find out why. We might resize them, or decommission them entirely if they’re no longer needed. If costs rise compared to the previous month, we also find out why. The increase could be normal, or we might need to change how we use resources to stay within budget. In one classic example, we discovered development environments that had shown no activity for two consecutive months and were no longer in use. As part of our process improvement, we now make a point of cleaning up such environments once projects are in production and declared “complete,” to avoid unnecessary costs. We check metrics such as CPU use, memory use, and network traffic, which reveal problems or unusual patterns that need to be examined.
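The idle-resource check described above can be sketched as a filter over monthly utilization metrics. This is only an illustration. The `MonthlyUsage` record, the resource names, and the idle thresholds for CPU, memory, and network traffic are hypothetical and would need tuning for a real environment. The sketch also assumes one record per resource per month with no gaps.

```python
from dataclasses import dataclass


@dataclass
class MonthlyUsage:
    month: str           # ISO "YYYY-MM", so string order equals date order
    avg_cpu_pct: float
    avg_mem_pct: float
    network_gb: float


# Hypothetical thresholds below which a month counts as inactive
CPU_IDLE_PCT = 5.0
MEM_IDLE_PCT = 10.0
NET_IDLE_GB = 1.0


def is_inactive(u: MonthlyUsage) -> bool:
    # All three metrics must be low for the month to count as inactive.
    return (u.avg_cpu_pct < CPU_IDLE_PCT
            and u.avg_mem_pct < MEM_IDLE_PCT
            and u.network_gb < NET_IDLE_GB)


def cleanup_candidates(usage: dict[str, list[MonthlyUsage]], months: int = 2) -> list[str]:
    """Return resources whose most recent `months` monthly records are all inactive.
    Assumes monthly records without gaps, so the last N records are consecutive."""
    candidates = []
    for resource, history in usage.items():
        recent = sorted(history, key=lambda u: u.month)[-months:]
        if len(recent) == months and all(is_inactive(u) for u in recent):
            candidates.append(resource)
    return candidates


usage = {
    "dev-env-a": [MonthlyUsage("2024-01", 40.0, 55.0, 20.0),
                  MonthlyUsage("2024-02", 2.0, 6.0, 0.1),
                  MonthlyUsage("2024-03", 1.0, 5.0, 0.0)],
    "prod-etl": [MonthlyUsage("2024-02", 60.0, 70.0, 500.0),
                 MonthlyUsage("2024-03", 55.0, 65.0, 480.0)],
    "dev-env-b": [MonthlyUsage("2024-03", 1.0, 3.0, 0.0)],  # only one month of history
}
print(cleanup_candidates(usage))
```

Every metric must be low, not just one of them. That keeps a resource that is idle on CPU but still serving network traffic off the cleanup list.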

For example, one month our analysis showed that March's costs were higher than February’s, and not solely because March has more days. Much of the rise came from increased usage of our relational database service. This kind of increase happens when long-running transactions hold locks on the database for extended periods, which drives up resource usage.
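Separating the calendar effect from the usage effect is simple arithmetic. Compute February's daily run rate, project it over March's 31 days, and attribute whatever remains to changed usage. The figures below are hypothetical and serve only to illustrate the calculation. `calendar.monthrange` from Python's standard library supplies the number of days in each month, including leap-year Februaries.

```python
import calendar


def days_in_month(year: int, month: int) -> int:
    # monthrange returns (weekday of the first day, number of days in the month)
    return calendar.monthrange(year, month)[1]


def split_month_over_month(prev_total: float, prev_year: int, prev_month: int,
                           curr_total: float, curr_year: int, curr_month: int) -> dict:
    """Split a month-over-month cost change into a calendar-length effect and a
    usage (daily run rate) effect. The two parts sum to the total change."""
    prev_days = days_in_month(prev_year, prev_month)
    curr_days = days_in_month(curr_year, curr_month)
    prev_daily = prev_total / prev_days
    curr_daily = curr_total / curr_days
    # What the current month would cost at the previous month's daily rate
    calendar_effect = prev_daily * curr_days - prev_total
    # Extra (or reduced) cost from the change in daily run rate
    usage_effect = (curr_daily - prev_daily) * curr_days
    return {
        "prev_daily": prev_daily,
        "curr_daily": curr_daily,
        "calendar_effect": calendar_effect,
        "usage_effect": usage_effect,
        "total_change": curr_total - prev_total,
    }


# Hypothetical figures: February 2024 (29 days) vs. March 2024 (31 days)
result = split_month_over_month(29_000.0, 2024, 2, 37_200.0, 2024, 3)
for key, value in result.items():
    print(f"{key}: {value:,.2f}")
```

With these made-up figures, the total increase is 8,200. March's two extra days account for 2,000 of it, and the higher daily run rate accounts for 6,200. The run-rate portion is the one worth investigating, as with the database service usage described above.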

A Final Word

By using the practical examples offered here, organizations can effectively apply FinOps practices to manage cloud spending, optimize operations, and achieve cost savings.

About the Author

Derick Ohmar Adil is the head of data strategy and operations at Globe Telecoms, where he aligns data strategy with business objectives. He fosters a data culture by promoting data literacy, enabling teams, and driving adoption of data-driven practices, including FinOps. Adil oversees day-to-day data operations. He manages end-to-end operations of data platforms and products, handles incident management and problem resolution, and serves as the primary point of contact for stakeholders on day-to-day issues.

