Managing the Cost of Cloud Computing

September 9, 2015

At Kurtosys, we talk a lot about the benefits of cloud computing and what it means for our clients to use private cloud, public cloud, or a hybrid blend of both. In our last article on this topic, we addressed the cost difference between public and private clouds. Let’s dive deeper into how to manage cloud computing expenses, which can accumulate rapidly if mismanaged.

What does managing the cost of the cloud involve?

Managing the costs of the cloud starts with understanding the different cost models of cloud providers. For example, public cloud provider Amazon Web Services (AWS) charges based on the number of virtual machines (VMs) bought and the size of those VMs (the amount of memory and CPU). The cost doesn’t change based on whether or not the VM is being used productively. In a private cloud, compute nodes are bought (or leased) upfront and virtual machines are then allocated on top of them: you manage the utilization of your compute nodes for a fixed and known cost.
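To make the difference between the two cost models concrete, here is a minimal sketch in Python. The prices, VM counts, and function names are purely hypothetical assumptions for illustration; they are not actual AWS or Kurtosys figures.

```python
HOURS_PER_MONTH = 730  # assumed average number of hours in a month


def public_cloud_monthly_cost(vm_count, hourly_rate_per_vm):
    """Pay-per-VM model: cost scales with how many VMs exist,
    regardless of how busy each one is."""
    return vm_count * hourly_rate_per_vm * HOURS_PER_MONTH


def private_cloud_monthly_cost(node_count, monthly_lease_per_node):
    """Leased-node model: cost is fixed per compute node,
    however many VMs are placed on top of it."""
    return node_count * monthly_lease_per_node


# Hypothetical figures, chosen only to show the shape of the comparison.
public = public_cloud_monthly_cost(vm_count=20, hourly_rate_per_vm=0.10)
private = private_cloud_monthly_cost(node_count=2, monthly_lease_per_node=600.0)

print(f"Public cloud, 20 VMs:  {public:.2f} per month")
print(f"Private cloud, 2 nodes: {private:.2f} per month")
```

In the first function, adding a VM always adds cost. In the second, adding a VM costs nothing extra until the nodes run out of capacity.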

The challenge of managing costs — instant sprawl

Very often, the challenge with paying by the VM is “instant sprawl”. If there isn’t tight control over the creation of VMs and over how effectively those machines are used, costs will increase without delivering business value. VMs are often allocated with the intention of doing work, but in reality that work is frequently delayed by other unforeseen activity, while the idle VMs continue to incur charges.
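One rough way to keep sprawl visible is to review the VM inventory regularly and flag machines that are doing very little. The sketch below assumes you already have an average CPU figure per VM from your own monitoring. The VM names, rates, and the 5% threshold are hypothetical.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumed average number of hours in a month


@dataclass
class VM:
    name: str
    hourly_rate: float       # what the provider charges per hour
    avg_cpu_percent: float   # average CPU use over the review period


def find_idle_vms(vms, cpu_threshold=5.0):
    """Return the VMs whose average CPU use is below the threshold."""
    return [vm for vm in vms if vm.avg_cpu_percent < cpu_threshold]


def monthly_spend(vms):
    """Monthly charge for the given VMs, busy or not."""
    return sum(vm.hourly_rate * HOURS_PER_MONTH for vm in vms)


# Hypothetical inventory for illustration only.
inventory = [
    VM("web-01", 0.10, 42.0),
    VM("test-env-old", 0.20, 1.5),
    VM("batch-pending", 0.40, 0.3),
]

idle = find_idle_vms(inventory)
print("Idle VMs:", [vm.name for vm in idle])
print(f"Monthly spend on idle VMs: {monthly_spend(idle):.2f}")
```

Average CPU alone is a crude signal; memory-bound or bursty machines may need a different test before being treated as waste.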

In a private cloud, when the capacity to create new virtual machines is exhausted, it’s critical to examine the inventory and analyze how machines are being utilized. You are forced to evaluate usage efficiency and decide whether to expand the compute estate. Expansion is generally not automatic, but part of capacity planning. When there is a need for more virtual machines, we look at those already allocated and decide whether they are configured and used correctly, so that the underlying CPU and memory are used effectively. Because of this check step, optimizing CPU and memory effectiveness and controlling cost reinforce each other.

The concern of cloud engineers

Today’s cloud engineers are deeply concerned with the economics of running on the cloud. The large gains of going from raw servers to virtualized servers have already been accomplished by most organizations in the last decade. Consolidation ratios from 8:1 to 12:1 were easily achieved. This brought a great deal of savings, not only on servers, but also on electricity, cooling, rack space, and manpower. The reality is that these consolidation ratios were achievable because most servers were grossly under-utilized from the start.

The sad fact is that the same effect is seen today, as the industry produces many VMs that hardly use the underlying power of the physical server. The reasons for that inefficiency have not been fundamentally addressed.

Best way to cut costs

By effectively allocating work across the two cost models, private and public, we have been able to cut our overall compute costs by over 30% here at Kurtosys. The appeal of being able to create instances and kill them in the “charge-by-hour” model lies in elastic computing, or in instances that have predictable and sustained usage. Elastic computing is very attractive for burst or peak loads that are infrequent. For non-elastic computing, it comes down to the utilization of the server itself: if the server is poorly utilized, you’re wasting the underlying compute power and paying for it.

How it all works

Virtual server A might be using 10% of the CPU (or memory) power, virtual server B might be using 10%, and virtual server C might be using 50%, but only half the time. Put that all together and one gets an average of roughly 45% of power used (10% + 10% + 25%), with spikes to 70%. This allows buying one physical server even though you’re virtually running three. Really analyzing the utilization of compute resources is expensive and hardly anyone does it, but the money wasted by not doing so can be tremendous.
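The arithmetic behind this example can be sketched in a few lines of Python. The sketch assumes that all percentages are measured against the same physical server and that server C is completely idle the other half of the time. The four time slots and the function name are illustrative assumptions.

```python
def combined_load(profiles):
    """Add up the per-VM load in each time slot to get the load
    the physical server would see."""
    return [sum(slot) for slot in zip(*profiles)]


# Load as a percentage of one physical server, per time slot.
server_a = [10, 10, 10, 10]   # steady 10%
server_b = [10, 10, 10, 10]   # steady 10%
server_c = [50, 0, 50, 0]     # 50%, but only half the time

total = combined_load([server_a, server_b, server_c])
average = sum(total) / len(total)
peak = max(total)

print("Combined load per slot:", total)
print("Average utilization:", average)
print("Peak utilization:", peak)
```

Under these assumptions the combined load never exceeds what a single server can deliver, which is why one physical machine can host all three virtual ones.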

The consolidation ratios of 8:1 and 12:1 refer to this ratio of VMs to physical servers. To me, an 8:1 consolidation ratio says that we bought 8 times as many servers as were needed, but didn’t do the necessary engineering to run the workloads on a single server. The reason for today’s ineffective use of VM resources is the same as before; it has just been pushed into another level of abstraction (VM utilization rather than physical server utilization). Often the limitations are memory-based rather than CPU-based, and CPU continues to go underutilized.

I think that cost models that effectively sell the VM by the hour in fact rely on that incomplete usage. Providers achieve a much larger aggregate consolidation factor because the VMs themselves still don’t use a lot of resources, yet customers are paying for them. So the true cost of a VM has to do with how much actual physical compute power is being used. When you own the compute nodes, as in a private cloud, CPU and memory are a fixed cost rather than a variable one. You’re paying so much per month for compute power, and you analyze how much of that compute power you’re using. If you are still only using, say, 30% of the available compute power, that shows your VMs are not efficient, and you can go do something about it. Optimizing your software and architecture results in using more compute power per VM without paying for more compute nodes (which could be sitting idle).
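One way to picture why utilization matters under a fixed cost is to look at what you effectively pay for the compute power you actually use. The following is a minimal sketch; the monthly cost and the 60% target are hypothetical, and only the 30% figure comes from the example above.

```python
def cost_per_utilized_unit(monthly_node_cost, utilization):
    """Effective monthly cost of the compute power actually used.

    utilization is a fraction between 0 (exclusive) and 1 (inclusive).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return monthly_node_cost / utilization


monthly_cost = 1000.0  # hypothetical fixed cost of the compute nodes

before = cost_per_utilized_unit(monthly_cost, 0.30)  # 30% of capacity used
after = cost_per_utilized_unit(monthly_cost, 0.60)   # after optimization

print(f"Effective cost at 30% utilization: {before:.2f}")
print(f"Effective cost at 60% utilization: {after:.2f}")
```

The bill stays the same in both cases. Doubling utilization simply halves the effective price of each unit of work done.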

At Kurtosys we are in tune with these issues, and the resulting savings in infrastructure costs have been substantial. Of course, this translates into better pricing for our customers because our underlying compute costs are managed better.