Keeping Track of Your Service Health
As a service owner, catching subtle issues in your application before they become major live-site incidents can be the difference between a quiet night and a sleepless one, and between a happy customer and a very unhappy one.
To keep our service running smoothly, our team applies several strategies, including active and passive monitoring, as well as a daily cadence in which we analyse the operational and business telemetry generated by the service.
The service is instrumented with Application Insights, which provides a rich semantic model describing the activities that take place within the service: client-side page views, server-side requests, dependencies, exceptions, traces (commonly known as logs) and other, more customised data types.
Typically, we keep track of daily average request duration, the number of exceptions and overall service availability as service health indicators. However, traces are often overlooked for detection purposes and are typically used only for diagnosing the root cause of issues. This made me wonder whether traces could serve as an early warning sign that something may be off with our service.
The Case for a Trace
Traces are generally unstructured, but there is one thing we do know about them – their severity level. The idea was to keep track of the ratio between the “good” traces (Info and Verbose) and the “bad” traces (Warning, Error and Critical). The assumption is that error traces happen regularly to some extent, for any number of reasons (transient network issues, for instance). It is only when their relative proportion rises significantly that a real problem may be growing beneath the surface. If only we could keep track of such a metric.
Luckily, Application Insights Analytics makes it easy to run this kind of analysis. I start by opening our Application Insights resource and clicking the Analytics icon.
That in turn opens Application Analytics, where I open a new tab to start querying my data.
Analytics is a very powerful language that allows you to query your telemetry data in a SQL-like manner. The first thing I would like to do is find out how many traces the service emits daily.
traces
| where timestamp > startofday(now() - 7d)
| summarize count() by bin(timestamp, 1d)
| render timechart
traces is the name of the table that holds all my traces. I limit the query to the last week, since that is what I am going to use as a baseline for reference. Next, the summarize operator counts the number of traces per day.
Click GO and you will get a chart similar to this:
Now, what I would actually like to know is how many of those traces were what I have defined as “good” (Verbose and Info) and how many were “bad”.
| summarize Verbose = countif(severityLevel == 0),
            Info = countif(severityLevel == 1),
            Warning = countif(severityLevel == 2),
            Error = countif(severityLevel == 3),
            Fatal = countif(severityLevel == 4)
  by bin(timestamp, 1d)
| extend Bad = (Error + Fatal + Warning), Good = (Verbose + Info)
I have given each trace severity level its proper name, and used the extend operator to compute the two new verbosity categories. Here is what the result looks like:
Luckily for us, the number of bad traces is very low compared to the good ones, but has the ratio changed over the last week? I will use the extend operator again to define a combined metric.
| extend Ratio = (todouble(Bad) / todouble(Good + Bad)) * 10000
| project timestamp, Ratio
The ratio is multiplied by 10,000 to improve readability by working with whole numbers. This doesn’t really change anything, since we are mostly interested in changes in the ratio rather than its actual value. The project operator lets me focus on the one metric I actually care about.
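To make the arithmetic concrete, here is the same ratio computed in Python on hypothetical trace counts (the numbers are made up for illustration):

```python
def bad_trace_ratio(good: int, bad: int) -> float:
    """Share of 'bad' traces among all traces, scaled by 10,000
    for readability -- mirrors the Ratio column in the query above."""
    return (bad / (good + bad)) * 10000

# A hypothetical day with 1,000,000 benign traces and 2,000 warning/error traces:
print(bad_trace_ratio(good=1_000_000, bad=2_000))  # -> roughly 19.96
```

The scaling factor only changes the units; a day that doubles the proportion of bad traces doubles this number all the same.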
The ratio sits at a bit over 20 and has remained fairly steady over the past week, but has it always been at that level? Let’s widen the time window to 30 days and have a look.
Well, clearly something happened in the middle of this period that significantly increased the ratio of warning and error traces relative to the benign ones. Fairly quickly we realized that this was due to a code bug introduced in the version deployed that day. Luckily, the only impact was wasted compute cycles and a lot of unwanted telemetry sent to Application Insights, but we may not be that lucky next time. The question is whether we can automate this kind of analysis and be alerted on spikes or trends in the metrics we care about.
Don’t Patternize Me!
The first question we must address is how to define and detect this kind of increase “pattern”. For that purpose we will use several tools provided by Application Insights Analytics: we will create a time series, apply a linear regression to it, and then look for trends.
First of all, we will convert our data points into a time series using the make-series operator.
| make-series RatioSeries=any(Ratio) default=0 on timestamp in range(startofday(now() - 30d), startofday(now()), 1d) by 'TraceSeverity'
The make-series operator converts a table of timestamps and values into a single “row” of data representing the entire time series. RatioSeries=any(Ratio) simply means: pick any value for this timestamp and name the resulting series RatioSeries. make-series supports a wide array of aggregation functions, but we don’t need them here, since our data is already pre-aggregated. Next, we define the time range and bin size (1 day) that we would like our time series to contain. The by clause can be used to group series by a certain dimension; we don’t have one in this case, so I am just providing a descriptive name. Note that the use of make-series here is somewhat naive, since all the data has already been prepared – we just need the data structure to change. Once we’ve got that covered, we can move on to the linear regression.
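As a rough illustration of the densification make-series performs (a sketch of the semantics, not the actual operator), here is the same idea in Python: sparse (day, value) rows become one dense array, with missing bins filled by the default:

```python
from datetime import date, timedelta

def make_daily_series(points, start, end, default=0.0):
    """Turn sparse (day, value) rows into one dense list covering every
    day from start through end, filling gaps with default -- a sketch
    of make-series ... default=0 with a 1d step."""
    lookup = dict(points)
    days = (end - start).days + 1  # cover both endpoints
    return [lookup.get(start + timedelta(days=i), default) for i in range(days)]

# Example dates are arbitrary, chosen only for the demonstration:
rows = [(date(2017, 5, 2), 21.5), (date(2017, 5, 4), 19.9)]
print(make_daily_series(rows, date(2017, 5, 1), date(2017, 5, 4)))
# -> [0.0, 21.5, 0.0, 19.9]
```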
We will start with a simple single-line regression and see how that works out. For that purpose we will use the series_fit_line function.
| project timestamp, RatioSeries, series_fit_line(RatioSeries)
And here is what we got:
The series_fit_line function returns various values describing the linear regression, including the R-squared, which indicates how well the line fits the time series, and the slope of the line. In this particular case the R-squared we got is relatively low (0.68), whereas 0.8 is generally considered a reasonable threshold, so the fit isn’t that great. The chart supports the same conclusion.
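For intuition, the two numbers we rely on, the slope and the R-squared, can be reproduced with a plain ordinary least-squares fit. This Python sketch is my own illustration, not the series_fit_line implementation; it computes both values for a series indexed 0, 1, 2, …:

```python
def fit_line(y):
    """Least-squares line fit of y against x = 0, 1, 2, ...
    Returns (slope, rsquare), the two values the trend test uses."""
    n = len(y)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((v - (slope * x + intercept)) ** 2 for x, v in zip(xs, y))
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    return slope, 1 - ss_res / ss_tot

# A steady rise fits a single line perfectly:
print(fit_line([20, 22, 24, 26, 28, 30]))  # -> (2.0, 1.0)

# A sudden "step" fits a single line poorly (R-squared well under 0.8):
print(fit_line([20] * 15 + [60] * 15))
```

The second call is exactly the failure mode described next: a step-shaped series drags the R-squared below a useful threshold even though something clearly changed.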
Simple line fits are great for identifying smooth, gradual changes, but in this case the pattern is more of a “step” than a line. Here we can try a linear fit of two lines instead. Luckily, there is a series_fit_2lines function for that as well.
| project timestamp, RatioSeries, series_fit_2lines(RatioSeries)
Bingo. This time the line fit is significantly better, and the R-squared of 0.98 reflects that as well.
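The split-and-fit idea behind a two-line fit can be sketched as follows (again my own illustration, not the actual series_fit_2lines implementation): try every split point, fit a separate line to each side, and keep the split that minimises the total squared error.

```python
def _residual(y, xs):
    """Sum of squared residuals of a least-squares line through (xs, y)."""
    n = len(y)
    mx, my = sum(xs) / n, sum(y) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = (sum((x - mx) * (v - my) for x, v in zip(xs, y)) / sxx) if sxx else 0.0
    intercept = my - slope * mx
    return sum((v - (slope * x + intercept)) ** 2 for x, v in zip(xs, y))

def two_line_rsquare(y):
    """Best R-squared achievable with two line segments split at any point."""
    n = len(y)
    mean_y = sum(y) / n
    ss_tot = sum((v - mean_y) ** 2 for v in y)
    best = min(_residual(y[:k], range(k)) + _residual(y[k:], range(k, n))
               for k in range(2, n - 1))
    return 1 - best / ss_tot

# The same "step" series that a single line fits poorly:
print(two_line_rsquare([20] * 15 + [60] * 15))  # -> 1.0
```

Because one segment covers each plateau exactly, the combined fit is near-perfect, which is why the two-line R-squared jumps where the single-line one stayed low.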
To get a simple yes/no answer, just use iff and apply the thresholds most relevant to your case. I use a minimum R-squared of 0.8 and any positive slope (for the single-line case). The full example for the single-line fit is given below for reference; it can be extended to the two-line case using similar constructs.
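The threshold test itself is tiny; sketched in Python (threshold values as in the query, everything else is just illustration):

```python
MIN_RSQUARE = 0.8  # minimum acceptable fit quality
MIN_SLOPE = 0      # any positive slope counts as rising

def pattern_match(rsquare: float, slope: float) -> str:
    """Yes/no classification mirroring the iff() test in the query."""
    if rsquare > MIN_RSQUARE and slope > MIN_SLOPE:
        return "Rising Trend"
    return "No Match"

print(pattern_match(0.98, 1.3))  # -> Rising Trend
print(pattern_match(0.68, 1.3))  # -> No Match
```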
//
// Define thresholds relevant to the specific metric
let minRsquare = 0.8;
let minSlope = 0;
//
// Classify traces as either Good or Bad
traces
| where timestamp > startofday(now() - 30d)
| summarize Verbose = countif(severityLevel == 0),
            Info = countif(severityLevel == 1),
            Warning = countif(severityLevel == 2),
            Error = countif(severityLevel == 3),
            Fatal = countif(severityLevel == 4)
  by bin(timestamp, 1d)
| extend Bad = (Error + Fatal + Warning), Good = (Verbose + Info)
//
// Determine the ratio between bad traces and all traces
| extend Ratio = (todouble(Bad) / todouble(Good + Bad)) * 10000
| project timestamp, Ratio
//
// Create a time series
| make-series RatioSeries=any(Ratio) default=0 on timestamp in range(startofday(now() - 30d), startofday(now()), 1d) by 'TraceSeverity'
//
// Apply a linear regression to the time series
| project timestamp, RatioSeries, series_fit_line(RatioSeries)
| project timestamp, RatioSeries, series_fit_line_RatioSeries_rsquare, series_fit_line_RatioSeries_slope
//
// Test whether the line fit matches the defined thresholds
| project PatternMatch = iff(series_fit_line_RatioSeries_rsquare > minRsquare and series_fit_line_RatioSeries_slope > minSlope, "Rising Trend", "No Match")
Can I have the Auto, Mate?
Now that we have established how to use Application Insights Analytics to identify trends and step patterns in any metric we are interested in, how do we go about automating the whole thing? We could, of course, pin the chart to an Azure dashboard and inspect it daily for trends, but that doesn’t scale well and might not even be a feasible option.
An automatic alert based on the results of an Analytics query is one of the most frequently requested features in Application Insights, and it has already been set in motion. Until that feature is generally available, you can use this cool technique suggested by Sergey Kanzhelev.
The approach presented here allows finding value-change patterns in any metric that can be defined in Application Insights. It can be automated and used as an alert, or as a first indicator of a change in the key performance indicators that interest you, which could make the difference between a live service and a broken one.
You can find the sample data and query discussed above in the Application Insights Analytics demo portal along with many other cool examples.