Troubleshooting guide using Azure Storage Metrics
Fortunately, Microsoft offers an easy-to-use tool called Azure Storage Metrics to measure cloud performance, usage, availability, latency, and more. These metrics are broadly categorized into capacity and requests.
Capacity metrics provide information related to storage capacity, such as the total number of containers used, the total number of objects stored, and so on. These metrics, though, are currently available only for the blob service.
Request metrics, on the other hand, relate to the requests executed by the service. Typically, they include the total ingress/egress, average latency, total number of failures, and so on.
Both these categories of metrics give you insight into the health of your platform at any time. Network-related metrics, for example, tell you whether there is a problem with the network and point to its possible cause.
However, remember to turn on diagnostics on your platform, as the metrics are turned off by default. While doing this, also set the retention period, which is 90 days by default. If you want the system to retain data for a shorter or longer period, adjust this setting accordingly.
Now that you have a basic idea of storage metrics, let's look into some simple problems that occur on Azure, and how you can troubleshoot them using the information collected by Azure Storage Metrics. While this doesn't cover all the possible problems that can occur, it is sure to give you an idea of how to debug your platform.
If a request takes too long to complete, it needs to be investigated. Start by looking at the values of two metrics: AverageE2ELatency and AverageServerLatency.
Sometimes, you may encounter a situation where your AverageE2ELatency is significantly higher than your AverageServerLatency. AverageE2ELatency measures the end-to-end latency of successful requests made to Azure Storage, including the time the client takes to send the data and receive the acknowledgement from the storage service. AverageServerLatency, on the other hand, measures only the time the storage service takes to process a request, and doesn't include any network latency.
Your AverageE2ELatency will always be higher than your AverageServerLatency, but sometimes the difference can be extreme. For example, if your AverageServerLatency is 2.17 minutes and your AverageE2ELatency is 32.40 minutes, then something is definitely wrong.
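The gap between the two metrics is itself a useful number. The sketch below shows one way to compute it and to flag the symptom described above; the function names and the 10x ratio threshold are my own assumptions for illustration, not anything Azure defines.

```python
def client_side_latency_ms(avg_e2e_ms, avg_server_ms):
    """Approximate the time a request spends outside the storage
    service (network transfer plus client processing) as the gap
    between the two latency metrics."""
    return avg_e2e_ms - avg_server_ms

def looks_client_or_network_bound(avg_e2e_ms, avg_server_ms, ratio=10.0):
    """Flag cases where end-to-end latency dwarfs server latency,
    the symptom described above. The 10x ratio is an assumed
    threshold, not an Azure-defined one."""
    return avg_server_ms > 0 and avg_e2e_ms / avg_server_ms >= ratio
```

A reading like (3240, 217) would be flagged, while (250, 200) would not.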
Possible causes include a limited number of threads on the client or low resources such as CPU and bandwidth. You can fix this by making the client more efficient, for example by making asynchronous calls to the storage service, or by using a larger system with more cores and memory.
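To illustrate the asynchronous approach, here is a minimal sketch using Python's asyncio. The `upload_blob` coroutine is a hypothetical stand-in for an asynchronous storage call, simulated with a sleep; a real client would invoke an async SDK method instead.

```python
import asyncio

async def upload_blob(name: str) -> str:
    # Hypothetical stand-in for an asynchronous storage call; a real
    # client would invoke an async SDK method here instead.
    await asyncio.sleep(0.01)  # simulated network round trip
    return name

async def upload_all(names):
    # Issuing the calls concurrently keeps a single slow request from
    # serialising the rest on a thread-starved client.
    return await asyncio.gather(*(upload_blob(n) for n in names))

results = asyncio.run(upload_all([f"blob-{i}" for i in range(10)]))
```

Because the calls overlap, ten requests take roughly as long as one instead of ten times as long.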
If the client's system is not over-worked, look into network problems such as packet loss. You can use specific network tools like Microsoft Message Analyzer to look into these network-related issues.
Besides troubleshooting this particular problem, the movements of AverageE2ELatency and AverageServerLatency can give deep insights into your application's performance.
For example, low AverageE2ELatency and AverageServerLatency combined with high client-perceived latency mean there is a delay in the requests reaching the storage service, possibly because of a limited number of available connections on the client. A good way to troubleshoot this problem is to look at the ClientRequestId to see whether the client is making multiple retries to send requests.
On the other hand, if your AverageServerLatency is high for table operations, it usually points to poor table design, and you'll probably have to rework your tables.
Unexpected delays in message delivery
If you experience a long delay between the time an application sends a message to the queue and the time it becomes available to read, check the value of the AverageServerLatency metric. Again, if this value is significantly higher than its usual average, take the following steps to resolve the issue.
- First off, verify whether there is a clock skew between the worker role that adds messages to the queue and the one that reads from the queue.
- Next, check if the application is able to successfully add messages to the queue.
- If the first two are not the cause, look at the worker role reading the message. Does it fail?
- Sometimes, the queue client fails to acknowledge a message, so the message stays invisible until its visibility timeout expires.
- Lastly, the number of worker roles may not be sufficient to process all the messages in the queue.
In most cases, one of the above reasons will be the cause of the unexpected delay.
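The first and fourth checks above can be automated with a rough triage on enqueue and dequeue timestamps. This is a sketch under assumptions: the 30-second visibility timeout and the function name are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

VISIBILITY_TIMEOUT = timedelta(seconds=30)  # assumed queue setting

def classify_delivery_delay(enqueued_at, dequeued_at):
    """Rough triage for the checklist above. A negative delay points
    at clock skew between the two worker roles; a delay beyond the
    visibility timeout suggests a reader failed to acknowledge the
    message in time, or that there are too few workers."""
    delay = dequeued_at - enqueued_at
    if delay < timedelta(0):
        return "clock skew"
    if delay > VISIBILITY_TIMEOUT:
        return "missed ack or backlog"
    return "ok"
```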
Throttling errors
A throttling error occurs when you exceed the scalability limits of your storage. This is, in fact, a built-in protection mechanism that ensures no single client consumes a disproportionate share of cloud resources at the cost of others.
To identify a throttling error, look at the PercentThrottlingError metric, which shows the percentage of requests that failed with a throttling error. If you monitor this metric, you should see it increase only when the number of requests to a storage account goes up significantly. The service may also return a "503 Server Busy" or "500 Operation Timeout" response to the client.
To fix this problem, first identify whether the error is transient or permanent. A transient error is when the value of PercentThrottlingError rises only while the request rate is high; to fix it, implement an exponential back-off strategy, which should bring down the load quickly.
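An exponential back-off can be sketched as below. `ThrottledError` is a made-up stand-in for a throttled response such as "503 Server Busy"; a real client would catch the SDK's own exception type. The delay doubles after each failed attempt, with jitter so that many retrying clients don't all hit the service at the same moment.

```python
import random
import time

class ThrottledError(Exception):
    """Stand-in for a 503 Server Busy / 500 Operation Timeout response."""

def with_backoff(operation, max_retries=6, base_delay=0.5, max_delay=30.0):
    """Retry `operation`, doubling the wait after each throttled
    attempt and adding jitter so retrying clients don't stampede."""
    for attempt in range(max_retries):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.0))
```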
If the errors are permanent, you're experiencing a sustained increase in transactions. The best way to handle this is to increase your queue size. If you're seeing these throttling errors on your tables, consider a different partitioning scheme to spread your transactions across multiple partitions. While doing this, just remember that all these partitions still contribute to the same overall scalability limit.
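One way to spread transactions across partitions is to derive the partition key from a hash of the entity's natural key. This is a hypothetical scheme for illustration; the function name and the bucket count of 16 are assumptions, and in practice you'd pick a scheme that still supports your query patterns.

```python
import hashlib

def spread_partition_key(entity_id: str, buckets: int = 16) -> str:
    """Hypothetical re-partitioning scheme: hash the natural key into
    one of N buckets so writes spread across partitions instead of
    all landing on one hot partition."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).hexdigest()
    return f"p{int(digest, 16) % buckets:02d}"
```

The same entity id always maps to the same partition, so reads stay deterministic while the write load is distributed.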
Network errors
When the storage service detects a network error, the value of the PercentNetworkError metric increases. This metric is an aggregation of other metrics such as NetworkError, AnonymousNetworkError, and SASNetworkError.
Network errors occur mostly when a client disconnects from the network before the timeout expires. If this happens frequently, investigate the client code to see why it is disconnecting from the storage service.
So, these are some of the common problems you can resolve by looking at your Azure Storage Metrics.
To conclude, Azure Storage Metrics give you insight into the health of your cloud platform and the apps using it. This is why it's best to examine them when you're facing a problem, as they can lead you to its cause. The examples above should give you an idea of the power of these storage metrics, and of how you can use them effectively to make the most of your cloud platform.
So, get familiar with these different metrics and what they mean, so you can troubleshoot problems quickly, and more importantly, optimize the cloud platform for your business.