This is by no means a big deal, nor am I trying to take credit. But it is something I set up a few days ago and I was pleased to see it in action today, so I wanted to post it somewhere. :)
So as I said earlier I have been reading up on Azure monitoring these past few days. I needed something to aim towards and this was one of the things I tried out.
When you install the “Agent Health” solution it adds a tile to the OMS home page that shows the status of all the agents: basically their offline/online status, based on whether an agent is responsive or not.
The problem with this tile is that it only looks for servers that are offline for more than 24 hours! So it is pretty useless if a server went down say 10 mins ago – I can keep staring at the tile for the whole day and that server will not pop up.
I looked at creating something of my own and this is what I came up with –
If you click on the tile it shows a list of servers with the offline ones on top. :)
I removed the computer names from the screenshot, which is why it is blank.
So how did I create this?
I went into View Designer and added the “Donut” as my overview tile.
Changed the name to “Agent Status”. Left description blank for now. And filled the following for the query:
Heartbeat
| summarize LastSeen = max(TimeGenerated) by Computer
| extend Status = iff(LastSeen < ago(15m),"Offline","Online")
| summarize Count = count() by Status
| order by Count desc
Here’s what this query does. First it collects all the Heartbeat events. These are piped to a summarize operator. This summarizes the events by Computer name (which is an attribute of each event) and for each computer it computes a new attribute called LastSeen, which is the maximum TimeGenerated timestamp of all its events. (You need to summarize to do this. The concept feels a bit alien to me and I am still getting my head around it. But I am getting there).
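If summarize feels alien, it may help to see the same grouping written out by hand. This is only a rough Python sketch of the idea, not how Log Analytics implements it; the computer names and timestamps are made up for illustration.

```python
from datetime import datetime

# Hypothetical heartbeat rows as (Computer, TimeGenerated) pairs.
# The names and times are invented for this example.
heartbeats = [
    ("web01", datetime(2018, 1, 1, 10, 0)),
    ("web01", datetime(2018, 1, 1, 10, 5)),
    ("db01", datetime(2018, 1, 1, 9, 30)),
    ("db01", datetime(2018, 1, 1, 9, 20)),
]


def last_seen_by_computer(rows):
    """Mimic: summarize LastSeen = max(TimeGenerated) by Computer."""
    last_seen = {}
    for computer, time_generated in rows:
        # Keep only the newest timestamp seen so far for each computer.
        if computer not in last_seen or time_generated > last_seen[computer]:
            last_seen[computer] = time_generated
    return last_seen


summary = last_seen_by_computer(heartbeats)
```

Many heartbeat rows go in, and one row per computer comes out, which is the whole point of the summarize step.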
This summary is then piped to an extend operator which adds a new attribute called Status. (BTW attributes can also be thought of as columns in a table. So each event is a row with the attributes corresponding to columns). This new attribute is set to Offline if the previously computed LastSeen is more than 15 mins old, and Online otherwise.
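The comparison is easy to read backwards, so here is a small Python sketch of the same check. It assumes a fixed "now" so the result is repeatable; the function name add_status and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def add_status(last_seen_by_computer, now, threshold=timedelta(minutes=15)):
    """Mimic: extend Status = iff(LastSeen < ago(15m), "Offline", "Online")."""
    cutoff = now - threshold  # plays the role of ago(15m)
    return {
        # A LastSeen earlier than the cutoff means the heartbeat is too old.
        computer: (last_seen, "Offline" if last_seen < cutoff else "Online")
        for computer, last_seen in last_seen_by_computer.items()
    }


# Made-up values: web01 was seen 5 mins ago, db01 40 mins ago.
now = datetime(2018, 1, 1, 10, 10)
rows = add_status(
    {
        "web01": datetime(2018, 1, 1, 10, 5),
        "db01": datetime(2018, 1, 1, 9, 30),
    },
    now,
)
```

Note that "less than ago(15m)" means an earlier timestamp, i.e. an older heartbeat.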
The output of this is sent to another summarize, which now groups it by Status with a count of the number of computers in each state.
And this output is piped to an order operator to sort it by Count in descending order. (I don’t need it for this overview tile but I use the same query later on too so wanted to keep it consistent).
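The last two steps amount to counting computers per status and sorting by that count. A minimal Python sketch, assuming a made-up set of three computers:

```python
from collections import Counter

# Hypothetical output of the earlier steps: one Status per computer.
statuses = {"web01": "Online", "web02": "Online", "db01": "Offline"}

# Mimic: summarize Count = count() by Status
counts = Counter(statuses.values())

# Mimic: order by Count desc
ordered = counts.most_common()
```

The donut then gets one slice per entry in that ordered list.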
All good? Now scroll down and change the colors if you want to. I went with Color 1 = #008272 (a dark green) and Color 2 = #ba141a (a dark red).
That’s it, do an apply and you will see the donut change to reflect the result of the query.
Now for the view dashboard – which is what you get when someone clicks the donut!
I went with a “Donut & list” for this one. In the General section I changed Group Title to “Agent Status”, in the Header section I changed Title to “Status”, and in the Donut section I pasted the same query as above. I also changed the colors to match the ones above. Basically the donut part is the same as before because you want to see the same output. It’s the list where we make some changes.
In the List section I put the following query:
Heartbeat
| summarize LastSeen = max(TimeGenerated) by Computer
| extend Status = iff(LastSeen < ago(15m),"Offline","Online")
| sort by bin(LastSeen,1min) asc
Not much of a difference from before, except that I don’t do the second summarize. Instead I sort by the LastSeen attribute after rounding it down to the minute (bin rounds down to the start of each 1 min bucket). This way the oldest heartbeat event comes up on top, i.e. the server that has been offline for the longest. In the Column Titles section I changed the Name to “Computer” and Value to “Last Seen”. I think there is some way to add a heading for the Offline/Online column too but I couldn’t figure it out. Also, the Thresholds feature seemed cool. It would be nice if I could color the offline ones red for instance, but I couldn’t figure that out either.
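To see what the binning and sort do to the list, here is a rough Python sketch. The helper bin_to_minute is a hypothetical stand-in for bin(LastSeen, 1min), and the timestamps are invented.

```python
from datetime import datetime


def bin_to_minute(ts):
    """Round a timestamp down to the start of its minute, like a 1 min bin."""
    return ts.replace(second=0, microsecond=0)


# Made-up LastSeen values, one per computer.
last_seen = {
    "web01": datetime(2018, 1, 1, 10, 5, 42),
    "db01": datetime(2018, 1, 1, 9, 30, 17),
    "app01": datetime(2018, 1, 1, 10, 5, 3),
}

# Mimic: sort by bin(LastSeen, 1min) asc, so the oldest heartbeat comes first.
ordered = sorted(last_seen.items(), key=lambda item: bin_to_minute(item[1]))
```

Computers whose last heartbeats fall in the same minute compare as equal here, so their relative order in the list is not decided by the seconds.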
Lastly I changed the click-through navigation action to be “Log Search” and put the following:
Heartbeat
| summarize LastCall = max(TimeGenerated) by Computer
| where LastCall < ago(15m)
This just gives a list of computers that have been offline for more than 15 mins. I did this because the default action tries to search on my Status attribute and fails; so thought it’s best I put something of my own.
And that’s it really! Like I said no biggie, but it’s my first OMS tile and so I am proud. :)
ps. This blog post brought to you by the Tamil version of the song “Move Your Body” from the Bollywood movie “Johnny Gaddar” which for some reason has been playing in my head ever since I got home today. Which is funny coz that movie is heavily inspired by the books of James Hadley Chase and I was searching for his books at Waterstones when I was in London a few weeks ago (and also yesterday online).