From the dev team: Detecting anomalies in Azure costs with TensorFlow and Azure Functions

Creating Azure cost anomaly detection

ShareGate Overcast’s Azure cost anomaly detection feature is one of our customer favourites. In this blog, our dev team explains how they built this feature for our Azure cost intelligence platform.

Here at ShareGate Overcast, we have access to a lot of Azure cost data. Every day, we scan users’ cloud costs data to compute potential savings, forecast future costs, and display the data in a user-friendly manner.

A few years ago, we asked ourselves: How can we use that data even more? Could we try to interpret the data and extract meaningful insights from it?

With the help of Hugo Tremblay-Ledoux and Étienne Labrie-Dion from the GLab—our innovation department—we set out to detect anomalies in a user’s spending. This would potentially help discover wrong configurations in the cloud, bugs that have been deployed, and even cyber attacks! We called the project: Oddity.

In this blog, I’m going to walk you through how we developed ShareGate Overcast’s anomaly detection.


Getting the Azure spending data we needed

The data we had consisted of users’ Azure resource cost history. In addition to the cost history of the resources, we had access to some metadata like the resource type, the configurations, and a log of the actions performed on the resource.

Screen shot of Azure resource usage

A cost history, as seen in ShareGate Overcast.

However, we did not have a dataset of anomalies that could be used to try things like supervised learning on a neural network.

We considered looking at our own data, identifying anomalies manually and using that as a starting point. But that kind of manual work would take a lot of time and would probably not generate enough samples to train a neural network.

Screenshot of a spike in Azure resource usage cost

We can see one or two anomalies in the past 6 months.

Our approach to identifying Azure cost anomalies

We realized that we knew what some anomalies looked like. One anomaly pattern is a sharp, unpredicted increase in costs. We called those “cost spikes.” To identify those, we could try to predict a resource’s costs (a function we already had that we discussed in this blog). We figured that if our predictions were too far off, that would mean we had an anomaly.

To achieve this, we trained a neural network using historical data to predict the costs of the next few days. Then, all we had to do was to compare the predictions with the actual costs and determine if the actual costs were unpredictable enough to be considered anomalous.

However, we realized that the algorithm would often make poor predictions in the days following anomalies. That's because in a normal Azure cost pattern, when your costs increase, they don't come back down; in the case of an anomaly, they do!

For example, let’s say you have a storage account that costs you $10 a day. On one day, perhaps your backup routine failed and ended up attempting backups multiple times. The cost on that day went up to $50 because of this.

You fixed the problem, and the day after, costs went back to the usual $10 a day. But the algorithm had seen very few examples of this, since the usual cost pattern is to go up and stay up. As a result, it predicted $40 for the day following the anomaly.

Azure resource usage increases

Costs generally go up, not down.

To circumvent that, we predicted the cost for every day of the given data and took the errors of the previous days into consideration to determine if an anomaly occurred.

This makes the algorithm sensitive to errors when it has been predicting correctly, and more tolerant of errors in the days following an anomaly or when the costs are unpredictable.

Graph of Azure resource usage cost predictions

The orange zone is the sensitivity of the algorithm. It goes up after an anomaly, then gradually comes back down.

Writing the code to detect Azure cost anomalies

Our anomaly detection runs inside Azure Functions. We won’t go into detail about how we set up an Azure Functions project, since there are already lots of good tutorials about that (we like this one from Microsoft).

I will mention that we used the azure-functions-core-tools CLI package, which you can install with Chocolatey, and the VSCode extension ms-azuretools.vscode-azurefunctions to get started quickly. They are well documented, and they did all the heavy lifting for us until we needed a proper deployment pipeline.

Also, we used Python libraries like NumPy, Pandas, and TensorFlow (specifically, TensorFlow Lite, but more on that later).

Once you have an Azure Functions project set up, you'll have an __init__.py file, which is your function's entry point. In there, the code is quite simple.

We parse the request, call our algorithm, format the response, and send it back. It looks a bit like this (I’ve removed the logging and error handling for clarity).
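In sketch form, assuming the request and response shapes shown here (they are placeholders, not our exact schema):

```python
# __init__.py -- a minimal sketch of the HTTP-triggered entry point
import json

import azure.functions as func

from .oddity import detect_cost_spikes


def main(req: func.HttpRequest) -> func.HttpResponse:
    # Parse the cost history sent by the caller (shape is an assumption)
    data = req.get_json()

    # Run the anomaly detection algorithm
    anomalies = detect_cost_spikes(data)

    # Format the result and send it back as JSON
    return func.HttpResponse(
        json.dumps({"anomalies": anomalies}),
        mimetype="application/json",
    )
```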

All the anomaly detection work was done inside the detect_cost_spikes function, which we put in another file named oddity.py.

Oddity

The first thing detect_cost_spikes does is load the TensorFlow model in memory.

We used a global variable as an optimization. As it turns out, Azure Functions can keep their global context between executions, so by doing this we avoided having to load the model on every call. This was really good for performance since we planned on calling our Azure Function a lot in a short amount of time.

Microsoft offers helpful info about the global context in its documentation.
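Here's a rough sketch of that pattern. The tflite_runtime import and the model path are assumptions (the full tf.lite.Interpreter exposes the same interface); the point is simply that the interpreter lives in a module-level global so a warm instance reuses it:

```python
# oddity.py -- load the TensorFlow Lite model once per instance (sketch)
from tflite_runtime.interpreter import Interpreter

_interpreter = None  # survives between executions on a warm instance


def _load_model():
    global _interpreter
    if _interpreter is None:
        # Hypothetical path: the .tflite file ships with the function code
        _interpreter = Interpreter(model_path="model.tflite")
        _interpreter.allocate_tensors()
    return _interpreter
```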

Next, we did a bit more parsing on the input data. At first, we simply parsed the JSON, but we needed a bit more formatting.

Then, we had three steps. First, we made predictions based on the current costs. Second, we analyzed which days had anomalies. Finally, we formatted the results to have something nice to output.

Here’s what the detect_cost_spikes method looks like:
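As a rough sketch (the helper names and the exact input format are assumptions), the three steps read like this:

```python
# oddity.py -- the top-level detection routine (sketch)
import pandas as pd


def detect_cost_spikes(data):
    interpreter = _load_model()

    # A bit more formatting: turn the parsed JSON into dates and daily costs
    costs = pd.DataFrame(data["costs"], columns=["date", "cost"])

    # 1. Predict each day's cost from the preceding window
    dates, actuals, predictions = generate_predictions(interpreter, costs)

    # 2. Flag the days where the prediction error is too large
    mask, errors, stds = detect_anomalies(actuals, predictions)

    # 3. Keep only the anomalous days and attach contextual data
    return format_results(dates, actuals, errors, stds, mask)
```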

Generating predictions

The generate_predictions method uses our trained model to predict costs. In order to detect anomalies in the given cost history, we predicted the cost for each day, as mentioned earlier.

To achieve this, we created every possible contiguous subsequence of fixed size from the cost history. We then ran these subsequences through the model.

Note that we used normalization since our model was trained on normalized data.

Finally, we returned the predictions, but also the dates and costs associated with the predictions. Since we created subsequences of size MODEL_INPUT_SIZE and we predicted the following day, the first MODEL_INPUT_SIZE dates had no predictions, so we didn’t return those.
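Put together, a sketch of generate_predictions could look like this, assuming a (1, MODEL_INPUT_SIZE, 1) input shape and simple max-scaling as the normalization (a placeholder for whichever scheme the model was actually trained with):

```python
import numpy as np

MODEL_INPUT_SIZE = 84  # the 12-week window described below


def generate_predictions(interpreter, costs):
    values = costs["cost"].to_numpy(dtype=np.float32)

    # Every contiguous window of MODEL_INPUT_SIZE days in the cost history
    windows = np.stack([
        values[i:i + MODEL_INPUT_SIZE]
        for i in range(len(values) - MODEL_INPUT_SIZE)
    ])

    # Normalize each window the same way the model was trained
    # (max-scaling here is a placeholder assumption)
    scale = windows.max(axis=1, keepdims=True)
    scale[scale == 0] = 1.0

    input_index = interpreter.get_input_details()[0]["index"]
    output_index = interpreter.get_output_details()[0]["index"]

    predictions = []
    for window, s in zip(windows / scale, scale[:, 0]):
        # Assumed input shape: (1, MODEL_INPUT_SIZE, 1)
        interpreter.set_tensor(input_index, window[np.newaxis, :, np.newaxis])
        interpreter.invoke()
        # The model predicts the next 4 days; keep the first, de-normalized
        predictions.append(interpreter.get_tensor(output_index)[0][0] * s)

    # The first MODEL_INPUT_SIZE dates have no prediction, so skip them
    return (
        costs["date"].to_numpy()[MODEL_INPUT_SIZE:],
        values[MODEL_INPUT_SIZE:],
        np.array(predictions, dtype=np.float32),
    )
```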

Detecting Azure cost anomalies


Now that we had the cost history and the predictions for each day, we could detect which costs were anomalous. To do this, we first determined the prediction error on each day.

Then, we figured out the standard deviation of the errors. We used that to determine which errors are acceptable and which aren’t. However, we used a few tricks to compute the std.

Firstly, we computed the exponentially weighted moving standard deviation of these errors using Pandas' .ewm().std() function. This gave us an std that takes recent history into account.

In other words, the std will be larger if a large error happened recently and will slowly diminish while errors are small. This is useful to prevent detecting multiple anomalies when costs go crazy in a short period of time.

Secondly, we put a lower bound on the std values. We did this because we noticed that the std would often drop near 0, and then a little noise in the data would be considered an anomaly. We picked a lower bound that gave us good results.

Finally, we generated a mask by comparing the errors to the stds: if a day's error exceeded its std by more than a chosen threshold factor, we put True in the mask; otherwise, we put False. We chose the threshold empirically.

The output of this was a vector with a True or False for each day, where True means there’s an anomaly on that day.
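Put together, the detection step could look something like this sketch; the span, lower bound, and threshold values are placeholders, not our production values:

```python
import numpy as np
import pandas as pd

EWM_SPAN = 14          # how quickly the std "forgets" old errors (placeholder)
STD_LOWER_BOUND = 1.0  # floor so flat costs aren't flagged on noise (placeholder)
THRESHOLD = 3.0        # how many stds an error must exceed (placeholder)


def detect_anomalies(actuals, predictions):
    # Prediction error on each day
    errors = actuals - predictions

    # Exponentially weighted moving std: a large recent error raises it,
    # and it slowly comes back down while errors stay small
    stds = pd.Series(errors).ewm(span=EWM_SPAN).std().to_numpy()

    # Lower bound so a near-zero std doesn't turn noise into anomalies
    stds = np.maximum(np.nan_to_num(stds, nan=STD_LOWER_BOUND), STD_LOWER_BOUND)

    # Compare each day's error against the std built from the previous days
    mask = errors > THRESHOLD * np.roll(stds, 1)
    mask[0] = False  # the first day has no history to compare against

    return mask, errors, stds
```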

Return format

Our last operation was to return the dates at which we found anomalies. We simply used the mask we created earlier to do this.

We also returned more information, like the cost on that day, the error, and the std, to give our customers more contextual data to help them find and fix the issue.
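A small sketch of that last formatting step (the field names are assumptions):

```python
def format_results(dates, actuals, errors, stds, mask):
    # Keep only the anomalous days, with enough context to investigate them
    return [
        {
            "date": str(date),
            "cost": float(cost),
            "error": float(error),
            "std": float(std),
        }
        for date, cost, error, std, is_anomaly
        in zip(dates, actuals, errors, stds, mask)
        if is_anomaly
    ]
```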

That’s all for the code!

About the TensorFlow model

I won’t go into detail about our prediction model, since that would be an article in and of itself, but here’s an overview.

We used a TensorFlow CNN with multiple Conv1D layers using ReLU activations and BatchNormalization. We trained it to predict the costs of a given Azure resource for the next day based on its cost history.

For performance reasons, we chose to use a window of 84 days (12 weeks) and to predict the next 4 days (but we only consider the first).

Here’s a quick glance at our configuration, for those who are curious:
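The Keras sketch below only mirrors the overall shape described above (Conv1D layers with ReLU and BatchNormalization, an 84-day input, a 4-day output); the layer count, filter sizes, and kernel widths are placeholders rather than our exact hyperparameters:

```python
import tensorflow as tf
from tensorflow.keras import layers

MODEL_INPUT_SIZE = 84   # 12 weeks of daily costs
MODEL_OUTPUT_SIZE = 4   # predict the next 4 days (only the first is used)

model = tf.keras.Sequential([
    layers.Input(shape=(MODEL_INPUT_SIZE, 1)),
    layers.Conv1D(64, kernel_size=7, padding="causal", activation="relu"),
    layers.BatchNormalization(),
    layers.Conv1D(64, kernel_size=7, padding="causal", activation="relu"),
    layers.BatchNormalization(),
    layers.Conv1D(32, kernel_size=3, padding="causal", activation="relu"),
    layers.BatchNormalization(),
    layers.GlobalAveragePooling1D(),
    layers.Dense(MODEL_OUTPUT_SIZE),
])

model.compile(optimizer="adam", loss="mse")
```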

Deploying anomaly detection with Azure Functions

Once everything was working fine locally, it was time to deploy! We use Azure DevOps, so we built a release pipeline using the “Install Azure Func Core Tools (Preview)” task and an “Azure CLI” task. Using the Func tools CLI, publishing is as easy as:

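(The function app name below is a placeholder.)

```bash
func azure functionapp publish <FunctionAppName>
```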

The Function runs on a Linux Consumption plan, which gives us almost infinite scaling out of the box. In theory, the number of instances automatically scales with the number of incoming requests. So we went ahead and ran a load test, but, unfortunately, it didn't quite work out that way.

As we observed the response times and the number of instances, we noticed something very odd. While responses were fast, anytime a new instance was created, all the other instances stopped processing requests until the new instance was ready, which would take about 10 seconds.

This would cause timeouts on most of the queued requests, which would even crash some of the instances, requiring new ones again.

With these frequent downtimes, running anomaly detection was taking about 30 to 45 minutes, just for our environment. This was very problematic, since we wanted to run anomaly detection on all our users’ environments, every day.

We found out that the problem was the size of our package dependencies, the biggest one being TensorFlow. To fix this, we switched to TensorFlow Lite, which was about ten times smaller than the full version.

Now with the smaller dependencies, the Azure Function was running blazingly fast! It only took a few minutes to run anomaly detection on our environment, and the scaling was no longer an issue.

We could hammer the service with millions of requests, and we saw the instance count go up to 50+ in a matter of minutes, handling around 800 requests per second and scaling back down automatically when the work was done.

At the moment, we run a bit over 2 million requests every night, and it takes about an hour to process. And the good news is, it only costs about $30 a month, with about half of that going to our Application Insights monitoring!

Anomaly detection in ShareGate Overcast

We integrated this engine in the new Anomalies section of Overcast. With a bit of UI magic, here’s what it looks like:

ShareGate Overcast detects anomalies in Azure costs

An anomaly, as seen in ShareGate Overcast.

We scan all of our ShareGate Overcast users' environments every night, and if we detect an anomaly in their spending, they automatically receive an email from us with information on the affected resource's name, type, administrator, resource group, and subscription, as well as the last action taken on it before the spike.


I hope you enjoyed this look into the development process for the ShareGate Overcast anomaly detection feature. We learned a lot while building it, and we're always looking for ways to optimize it.

If you have suggestions for us, if you'd like to know more, or if you have any other questions, really, just let us know in the comments!

