Working with Azure during high demand: What to do when the cloud won’t scale


In times of exceptional resource demand (like we're facing with COVID-19), Microsoft takes necessary actions to secure cloud resources for critical organizations and functions. This can be frustrating for users who need to scale their usage; however, there are actions you can take that ease this pain while ensuring you're still being a good corporate citizen.

There are several large "promises" we've come to expect when we think about cloud computing platforms like Microsoft's Azure, including major advantages in the areas of:

  • Cost
  • Scale
  • Performance
  • Security
  • Speed
  • Productivity
  • Reliability

We've all come to enjoy getting to market faster, more agile development, five-nines uptime, and so on, but to me one of the most important promises is massive scalability.

However, during this time of exceptional demand on cloud infrastructure, it is particularly important to understand a few things: how provisioning works, how demand is changing for current needs, the ways in which Microsoft is currently allocating and prioritizing resources, and the best ways to secure the computing resources necessary to maintain essential functions in a crisis.

Looking for more guidance on Azure? Register for Deploy, an expert-led, one-day online event focused on Microsoft Azure governance (Thursday, May 7, 9am-5pm ET).


The truth about cloud capacity

If I had to guess, the first assumption most people make about cloud computing (which may be the fault of marketing departments at cloud computing companies) is that providers have unlimited computing power across their data centers to provision at any time. I wish!

Cloud providers are businesses like any other, and they do offer huge capacity to keep you a happy customer. Trust me, they have a lot of hardware in those data centers, but like anything else in this world, it isn't infinite.

Businesses exist to make a profit, and one of a cloud provider's particularly difficult challenges is efficient capacity planning. This is both an art and a science: planning and provisioning just enough hardware to sustain a bit more than actual demand, without too much costly computing power sitting in a corner doing nothing and generating no revenue.

When is capacity an issue?

Over the past decade or so of working with Azure, I've seen a few times where this "fortune telling" of capacity planning has been off, leading to capacity issues in a particular geographic Azure region. Usually it's for a very short period of time and localized to that specific region.

Again, it is rare to see this happen, and normally, it's not an issue at all.

However, in a world learning to cope with COVID-19, people are forced to stay at home, with many of them doing remote work and remote meetings while others are gaming, streaming, and so on. This is adding capacity demands across all cloud providers that no one has ever experienced.

Microsoft’s response to the COVID crisis

Since late February, when COVID-19 started having a major impact in Europe, Azure scalability has been a problem in that region. Similarly, as the virus has spread across the Atlantic, North and South America are now seeing issues. And as of writing, capacity issues have started to appear in Brazil and some US regions.

Importantly, Microsoft has identified that until it can ramp up overall capacity, it needs to ensure first responders and mission-critical organizations can run their increased or newly critical workloads. If working from home has changed our day-to-day, we can easily imagine how rapidly demands have changed across the systems that enable health care providers, first responders, governments, and others to do their critical work.

On March 28, Microsoft made the following announcement to clarify how it is keeping these people top of mind, deciding to prioritize computing demand in the following way:

Our top priority remains support for critical health and safety organizations and ensuring remote workers stay up and running with the core functionality of Teams.

Microsoft Azure blog

"Specifically, we are providing the highest level of monitoring during this time for the following:

  • First Responders (fire, EMS, and police dispatch systems)
  • Emergency routing and reporting applications
  • Medical supply management and delivery systems
  • Applications to alert emergency response teams for accidents, fires, and other issues
  • Healthbots, health screening applications, and websites
  • Health management applications and record systems"

What will be impacted by capacity limits?

First, you won't have problems with resources that are already running if you leave them as is.

Where you may see issues, however, is when you attempt to create a new resource, scale up (larger compute), scale out (more instances), or start a deallocated resource (usually a virtual machine that is deallocated). To illustrate this, let's look at a hypothetical scenario:

You have a virtual machine that has been running for months. During a maintenance procedure, you need to delete (for any number of good reasons) the VM and then recreate it only two minutes later. If there is a restriction for the VM family size in that particular region—even if you had a VM of the same size running non-stop for months—you may not be able to create it, even if it is only seconds after deleting your existing VM.
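Before recreating a VM, it's worth checking whether its size currently carries a restriction in your target region; in practice you'd query `az vm list-skus --location <region>` or the Resource SKUs API. As a rough illustration, here's a minimal sketch of that check over data shaped loosely like the API's response; the function name and the sample SKUs are hypothetical, not an official tool:

```python
# Minimal sketch: flag VM sizes that carry a location restriction in a
# target region. The dict shape loosely mirrors the output of
# `az vm list-skus`; the sample data below is made up for illustration.

def restricted_sizes(skus, region):
    """Return the names of SKUs that are restricted in `region`."""
    restricted = []
    for sku in skus:
        for restriction in sku.get("restrictions", []):
            if region in restriction.get("values", []):
                restricted.append(sku["name"])
                break
    return restricted

# Hypothetical sample: one size restricted in westeurope, one unrestricted.
sample = [
    {"name": "Standard_D4s_v3",
     "restrictions": [{"type": "Location", "values": ["westeurope"],
                       "reasonCode": "NotAvailableForSubscription"}]},
    {"name": "Standard_B2s", "restrictions": []},
]

print(restricted_sizes(sample, "westeurope"))  # ['Standard_D4s_v3']
```

If the size you just deleted shows up as restricted, recreating it seconds later will fail, which is exactly the scenario described above.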

Identify both critical and nonessential resources

I want to put emphasis on one thing: it's important that we all think twice about what we consider critical. This matters both for securing our truly critical workloads and for being a responsible cloud citizen.

For example, if you have five environments for a particular solution, certainly not all of them are truly critical.

At all times, it's good practice to target and reduce waste in the cloud. In a crisis like COVID-19, where demand is very high, you could deallocate or even delete some of those non-essential resources/environments to free them up for others who need them more. You may even be looking to tighten your budget, and the fewer resources you use, the lower your costs. A win-win.

Here are some questions you can ask yourself to help determine if a workload/resource…

  • ...is critical:
    • Would you experience any negative financial or reputational impact if this resource were not running?
    • Would one of your teams be unable to work at all if this resource were not running?
  • ...could be downscaled:
    • Is less than 50% of this resource's capacity (CPU, RAM, IO, and network) being used?
  • ...could be deleted:
    • Are there multiple workloads for the same stage (development, integration, QA, staging, testing, UAT)?
    • Do you use it less than once every few weeks/months?
    • Is this resource used less than 5% of the time over a 30-day period?
    • Is it an expired resource? (e.g. website with an expired HTTPS certificate)
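The thresholds in this checklist can be turned into a rough triage helper. The sketch below is illustrative only: the function name, inputs, and rules are my own reading of the questions above, and real utilization numbers would come from a monitoring source such as Azure Monitor:

```python
def triage(avg_utilization_pct, used_in_last_30_days_pct,
           has_duplicate_stage=False):
    """Rough triage of a resource using the checklist thresholds.

    avg_utilization_pct: average utilization across CPU/RAM/IO/network (0-100).
    used_in_last_30_days_pct: share of a 30-day window the resource was used.
    has_duplicate_stage: another workload exists for the same stage
                         (development, QA, staging, ...).
    """
    # Delete candidates: duplicated stage, or used less than 5% of the time.
    if has_duplicate_stage or used_in_last_30_days_pct < 5:
        return "delete candidate"
    # Downscale candidates: less than half the capacity is actually used.
    if avg_utilization_pct < 50:
        return "downscale candidate"
    return "keep as is"

print(triage(avg_utilization_pct=85, used_in_last_30_days_pct=100))  # keep as is
print(triage(avg_utilization_pct=30, used_in_last_30_days_pct=60))   # downscale candidate
print(triage(avg_utilization_pct=30, used_in_last_30_days_pct=2))    # delete candidate
```

Treat the output as a starting point for a conversation, not an automated decision; the financial and reputational questions above still need a human answer.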

For more on cost and resource optimization, you can also look to tools like Microsoft's Azure Advisor or ShareGate Overcast to help lighten the load.

How to secure your critical resources

Let's assume at this point that, having run through these questions and leveraged the tools above, whatever you have left can be considered required for business continuity.

As stated above, you won't run into capacity issues if you leave your Azure resources untouched. For critical workloads, consider taking the following actions to reduce the chances of them being impacted by capacity shortages. For each, I've made a note indicating the effect the change may have on your bottom line:

  • Run your compute 24/7. If you deallocate resources, Azure may allocate that capacity elsewhere, and if none is available later, you might not be able to start them when needed.
    Result: This will cost you more
  • Similarly, turn off automated tools or scripts that stop/start or scale in/down computing resources. You might not be able to get back to where you were when needed.
    Result: This will cost you more
  • Pay attention to the Infrastructure as Code (IaC) tools you are using. Prefer tools like ARM templates, which are idempotent and non-destructive, over tools like Terraform (and it's not the only one) that might delete a resource and recreate it a few seconds later to apply a configuration change.
    Result: You may not be able to secure this resource again
  • Think twice before scaling down/in your computing resources; you might not be able to scale them up/out again.
    Result: This will cost you more
  • If your solution has autoscaling enabled with a significant difference between the minimum and maximum number of instances, consider increasing the minimum to reduce the gap between them. If scaling from that minimum to the maximum number of instances were to fail, you'd at least start from a bit more compute power.
    Result: This will cost you more
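To make the last point concrete, here's a small sketch of the arithmetic: given current autoscale bounds, raise the minimum to cover some fraction of the gap to the maximum. The function and the 50% default are illustrative choices of mine, not an Azure API; you'd apply the new minimum through your autoscale settings (for example via `az monitor autoscale update`):

```python
def raised_minimum(current_min, current_max, cover_fraction=0.5):
    """Raise an autoscale minimum to cover part of the gap to the maximum.

    cover_fraction is an illustrative knob: 0.5 means the new minimum sits
    halfway between the old minimum and the maximum instance count.
    """
    if not 0 <= cover_fraction <= 1:
        raise ValueError("cover_fraction must be between 0 and 1")
    gap = current_max - current_min
    return current_min + int(gap * cover_fraction)

# With autoscale bounds of 2-10 instances, a 0.5 fraction raises the floor to 6.
print(raised_minimum(2, 10))  # 6
```

The trade-off is the same as the other items in the list: a higher floor means more guaranteed capacity during a shortage, and a higher bill the rest of the time.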

Conclusion

This should be no surprise, but if you increase the number of compute hours/instances to keep them secure for your needs, your invoice will increase. This may be necessary for your needs or the needs of your clients, so it is crucial to evaluate your situation carefully and act on what you consider critical.

With that in mind, it's still important to be a good corporate citizen; don't be selfish and use only what you really need. These extra resources can be provided to organizations that need them for the important job of saving lives.

With all that said, be sure to take care of yourselves, your teams, and those around you.

If you have any questions about how the current situation may impact your deployments, please drop them in the comments below.

This article was co-authored by Microsoft MVP Stephane Lapointe (https://www.codeisahighway.com/) & the ShareGate content team.

