At Datalink, we use a slew of industry-leading tools to monitor our customers’ environments. Yes, a “slew” of tools! In our more than 25 years of experience, we have learned that no single tool can monitor an entire environment completely and effectively.
When IT organizations confine themselves to one tool, they often make sacrifices and trade-offs in certain areas. However, when an IT organization uses multiple tools, it often experiences inconsistencies in the escalation process. Sound management practices around the tools themselves are key to making them work together. That is why using multiple monitoring tools is only practical with an effective, automated, event-based escalation process. Managing these tools and processes can be a real burden to an IT organization, which is why Datalink offers this as a fully managed service for our customers.
The Datalink managed services organization has an event-based escalation process that each of our customers gets plugged into. We constantly evaluate our process and make adjustments as needed to ensure effectiveness. Having multiple customer environments contribute to the continuous improvement of this process enables Datalink to maintain a higher standard than would be possible in a single environment. We dedicate time to process creation and evaluation to ensure we deliver the best service possible to our customers. This, in turn, helps ensure that our customers’ IT organizations stay relevant to the business. This is not an easy undertaking, but done right, it pays off in the end.
Based on our experience, here are the four key elements of an effective event-based escalation process.
1. Automation
Automation is perhaps the most important key, but done incorrectly, it can be a cause of failure. The first step is to establish automation goals. Without set goals, your process can spin out of control. Datalink’s goals for automation include:
Improving systems availability
Removing constraints during growth
Obtaining operational consistency
2. Reduce the noise – “garbage in, garbage out”
The quickest way to derail the adoption of any process is to have it create too much noise. Beyond the risk to adoption, noise can stress engineers and reduce their work-life quality (a very important factor for Datalink). Garbage events also create inconsistency in operations: some engineers may start to ignore them and miss real events in the process, which hurts both operational consistency and productivity. Reducing noise is a task that is never done and needs to be reviewed regularly. Datalink managed services continually reviews this to make sure we provide the highest quality to our customers.
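One common way to cut noise is to suppress repeats of the same event inside a short window. The sketch below is only an illustration of that idea, not Datalink's actual implementation: the `Event` fields, the `suppress_duplicates` helper, and the 300-second default window are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # hypothetical: host or device that raised the event
    check: str        # hypothetical: name of the failing check
    timestamp: float  # seconds since an arbitrary reference point


def suppress_duplicates(events, window_seconds=300.0):
    """Forward an event only if the same (source, check) pair has not
    already been forwarded within the last `window_seconds`."""
    last_forwarded = {}
    kept = []
    for event in sorted(events, key=lambda e: e.timestamp):
        key = (event.source, event.check)
        previous = last_forwarded.get(key)
        if previous is None or event.timestamp - previous >= window_seconds:
            kept.append(event)
            last_forwarded[key] = event.timestamp
    return kept


raw = [
    Event("db01", "disk", 0),
    Event("db01", "disk", 60),    # repeat inside the window, suppressed
    Event("web01", "cpu", 30),
    Event("db01", "disk", 400),   # window has passed, forwarded again
]
filtered = suppress_duplicates(raw)
```

The window length is exactly the kind of setting that would need the regular review described above: too short and the noise returns, too long and a recurring real problem may be hidden.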
3. Timing, Timing, Timing
I cannot stress the importance of this factor enough, as it has the biggest impact on the effectiveness of your escalations. In real estate, the most important factor is “location, location, and location.” In event-based escalation, the most important factor is “timing, timing, and timing.” There is a fine line to follow here: escalate too soon and you create unnecessary work; escalate too late and you experience unwanted downtime. The ideal approach is to evaluate escalation timing on a continual basis to ensure the correct reaction.
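To make the timing trade-off concrete, here is a minimal sketch of a per-severity escalation schedule. The delay values, the `ESCALATION_DELAYS` table, and the `escalation_tier` function are hypothetical and chosen only for illustration; real thresholds would come out of the continual evaluation described above.

```python
# Hypothetical delays, in minutes, before an unacknowledged incident
# moves to the next escalation tier. These numbers are assumptions.
ESCALATION_DELAYS = {
    "Critical": [0, 5, 15],
    "Major": [0, 15, 45],
    "Minor": [0, 60, 240],
}


def escalation_tier(severity, minutes_unacknowledged):
    """Return the index of the highest tier whose delay has elapsed.

    Severities without a schedule (for example Warning or Informational)
    stay at tier 0 in this sketch.
    """
    delays = ESCALATION_DELAYS.get(severity, [0])
    tier = 0
    for index, delay in enumerate(delays):
        if minutes_unacknowledged >= delay:
            tier = index
    return tier
```

Shortening a delay escalates sooner and risks unnecessary work; lengthening it risks downtime, so keeping the schedule in one table makes it easy to revisit.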
4. Effective Communication
Effective communication is the next key to a successful automated event-based escalation process. The best approach is to use the same communication paths that engineers use to communicate with each other. For our managed services team, those are email, SMS (text), and phone, in that order.
(NOTE: For a more robust process, account for differences in communication based on time of day. For example, engineers may be more likely to respond to a text at 7:00 PM than to an email. This, in turn, would give you a more effective response time.)
The Datalink managed services event-based escalation process
Jason D. Anderson
If an event is triggered in one of our customers’ environments, our system goes through specifically defined steps, which were developed based on the above-mentioned key elements. All of the tasks are completed without human intervention, ensuring operational consistency as well as improved systems availability.
Determine event severity (Critical, Major, Minor, Warning, or Informational).
Identify the system(s) affected by the event.
Create an incident in the ticketing system.
Route to the correct resources as determined by the system(s) affected (e.g., Network, Storage, Server).
Based on severity, ascertain the speed and type of escalation (e.g., Critical events are escalated faster, with more notification paths).
Based on day and time of day, establish the speed and type of escalation (e.g., after hours, notifications may go out by phone call and SMS, whereas during the day, notifications may go out by email and SMS).
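The steps above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not Datalink's actual system: the `Incident` record stands in for a real ticketing system, and the `ROUTING` table, the weekday 08:00 to 18:00 business-hours definition, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime

SEVERITIES = ("Critical", "Major", "Minor", "Warning", "Informational")

# Hypothetical mapping from affected system type to the owning team.
ROUTING = {"network": "Network", "storage": "Storage", "server": "Server"}


@dataclass
class Incident:
    """Stand-in for a ticket; a real system would call a ticketing API."""
    severity: str
    systems: list
    teams: list
    channels: list


def notification_channels(severity, when):
    """Choose channels from severity and time of day.

    Business hours are assumed to be weekdays, 08:00 to 18:00.
    """
    business_hours = when.weekday() < 5 and 8 <= when.hour < 18
    channels = ["email", "sms"] if business_hours else ["phone", "sms"]
    if severity == "Critical" and "phone" not in channels:
        channels.append("phone")  # critical events get more notification paths
    return channels


def handle_event(severity, systems, when):
    """Validate severity, route by affected systems, and build the incident."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    teams = sorted({ROUTING[system] for system in systems})
    channels = notification_channels(severity, when)
    return Incident(severity, list(systems), teams, channels)


# A Critical event on a Wednesday morning, and a Minor one on a Saturday night.
daytime = handle_event("Critical", ["storage", "network"], datetime(2024, 1, 3, 10, 0))
after_hours = handle_event("Minor", ["server"], datetime(2024, 1, 6, 22, 0))
```

Keeping routing and channel selection in small, separate functions means each rule can be reviewed and adjusted on its own as the process is re-evaluated.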