Tracking Is Better Than Notifications


Your team most likely uses hardware or software monitoring tools. They send notifications when something goes wrong, e.g. CPU usage is very high or free memory is very low. These alerts go out by email or SMS to make the responsible person aware of trouble with the hardware or software the service depends on. It sounds like a simple and reliable way to ensure your service's availability, but there are weaknesses you should also be aware of.

Focus on Critical Issues Only

Critical issues should definitely be discovered as soon as possible. If your service is down, development is suspended and the team's efforts are focused on recovering it. Email or SMS is the better choice here, but critical issues are rare compared with the dozens of regular incidents such as exceptions, errors in logs, slow queries, low throughput and others you can get from tools like NewRelic. The team should be aware of all these regular incidents to keep the quality of the service high. It is unrealistic to handle all of this by email or SMS.

Service is Becoming More Complex and Chatty

As your service evolves, more components need to be monitored and supported. Multiply the number of components by the number of incidents a single component produces and you get an estimate of how many incidents your team has to handle. Moreover, incidents are generated programmatically, which is why the same message can repeat hundreds of times. Because the application is scaled out, there will also be duplicates from each instance. Would you like to find a couple of hundred duplicate emails in your mailbox at the start of the workday?
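The multiplication above can be sketched in a few lines of Python. The function name and all numbers below are hypothetical and chosen only to illustrate how quickly the volume grows; they are not measurements from a real service.

```python
def estimated_messages(components, incidents_per_component, instances, repeats=1):
    """Rough daily message estimate (illustrative assumption, not a real metric).

    components              -- number of monitored components
    incidents_per_component -- distinct incidents one component produces per day
    instances               -- running instances of each component
    repeats                 -- how many times each instance re-sends the same incident
    """
    return components * incidents_per_component * instances * repeats


# Hypothetical service: 8 components, 5 incidents each, 4 instances per component.
print(estimated_messages(components=8, incidents_per_component=5, instances=4))  # 160
```

Even with these modest made-up numbers, the mailbox receives 160 messages a day although there are only 40 distinct incidents behind them.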

Lack of Visibility

Nobody has access to their teammates' mailboxes, so nobody knows who is involved in resolving an incident. There are no details on resolution progress either, so you can't tell clients or other team members what the current state is or when the issue will be resolved. If somebody on your team becomes unavailable, you simply lose the tickets they worked on and the context of those issues. Strong communication between team members can solve this, but it takes too much time.

Team is Growing

With a successful service your team will grow. Should you notify every team member about every incident? Or should you make one person responsible for all incidents, on a rotating basis for example? You can actually do this with the help of escalation rules and tools like PagerDuty. But remember that each team member has specific knowledge, so a single person obviously can't resolve every kind of incident effectively. The bad news is that the responsible person can turn into a full-time dispatcher and become seriously demotivated by the monkey work.

Lack of Statistics and Process Metrics

Email apps have no features to classify or aggregate emails by incident kind or priority, and they do not track the incident lifecycle either. So you can't gather metrics on your incident resolution process, can't see where the process fails and can't improve it, and therefore your team can't provide a quality service.

Use a Proper Tool Instead

To stay aware of the health of your service components, to identify issues proactively and to resolve incidents effectively, use more suitable tools such as an electronic board and an incident tracker, both of which are part of the DevOpsBoard service. The board holds all incidents gathered automatically from server, software and exception monitoring tools and from other incident sources. DevOpsBoard automatically deduplicates similar incidents and builds their context, such as server name, application version number, stack trace and more.
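To make the idea of deduplication with context more concrete, here is a minimal Python sketch. It is not how DevOpsBoard works internally; the `Incident` and `IncidentGroup` types and the fingerprint rule (lowercase the title and mask digit runs) are assumptions made purely for illustration.

```python
import hashlib
import re
from dataclasses import dataclass, field


@dataclass
class Incident:
    # Hypothetical raw incident as a monitoring tool might report it.
    title: str
    server: str
    app_version: str


@dataclass
class IncidentGroup:
    # One deduplicated entry with the context collected from its duplicates.
    title: str
    count: int = 0
    servers: set = field(default_factory=set)
    versions: set = field(default_factory=set)


def fingerprint(title: str) -> str:
    # Assumed similarity rule: ignore case and replace digit runs, so that
    # "Timeout after 30s" and "timeout after 31s" produce the same key.
    normalized = re.sub(r"\d+", "N", title.strip().lower())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def deduplicate(incidents):
    groups = {}
    for inc in incidents:
        group = groups.setdefault(fingerprint(inc.title), IncidentGroup(title=inc.title))
        group.count += 1
        group.servers.add(inc.server)
        group.versions.add(inc.app_version)
    return list(groups.values())
```

Feeding this function the same exception reported by four application instances yields one group with `count == 4` and the four server names attached, instead of four separate emails.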

Any team member can put a suitable incident into their work queue when they are ready to take the next task. Teammates have clear visibility of what is going on and who is working on which incident; they can ask about progress and ETA in comments, change priority and reassign an incident. The tracker provides a highly adaptable workflow consisting of all the stages each incident should pass through. The tracker keeps historical information on all incidents and lets you build charts that reflect the process and its common metrics.
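As an illustration of the kind of process metric that stored history makes possible, the sketch below computes the mean time to resolve from opened and resolved timestamps. The record layout and the function name are hypothetical and do not describe the tracker's actual data model.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(records):
    """Average resolution time over (opened, resolved) timestamp pairs.

    Incidents that are still open (resolved is None) are skipped.
    Returns None when nothing has been resolved yet.
    """
    durations = [resolved - opened for opened, resolved in records if resolved is not None]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical history: two resolved incidents and one still open.
history = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),   # resolved in 1 hour
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 12, 0)),   # resolved in 3 hours
    (datetime(2024, 1, 2, 9, 0), None),                          # still open
]
print(mean_time_to_resolve(history))  # 2:00:00
```

A mailbox cannot answer this question at all, because it stores neither the moment work started nor the moment the incident was closed.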

The board and tracker are very helpful, but who will assign priorities and owners to incidents? DevOpsBoard has a useful feature for this. Automatic actions let you describe rules for assigning an incident's owner, priority, type and more. Rules are based on the title or description of an incident. For example, you can set a higher priority if the original incident contains the keyword Critical, or automatically assign a DB-related issue to the teammate skilled in databases.
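A keyword-based rule of this kind might look roughly like the following Python sketch. It only illustrates the concept: the `Rule` and `Incident` classes, the field names and the owner name `dba` are all hypothetical and are not DevOpsBoard's configuration format or API.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Incident:
    # Hypothetical incident record with the fields the rules act on.
    title: str
    description: str = ""
    priority: str = "normal"
    owner: Optional[str] = None


@dataclass
class Rule:
    # Hypothetical rule: if the keyword occurs, set priority and/or owner.
    keyword: str
    priority: Optional[str] = None
    owner: Optional[str] = None

    def matches(self, incident: Incident) -> bool:
        text = f"{incident.title} {incident.description}"
        # Whole-word, case-insensitive match, so "DB" does not fire on "feedback".
        pattern = r"\b" + re.escape(self.keyword) + r"\b"
        return re.search(pattern, text, re.IGNORECASE) is not None


def apply_rules(incident: Incident, rules) -> Incident:
    # Apply every matching rule in order; later rules override earlier ones.
    for rule in rules:
        if rule.matches(incident):
            if rule.priority is not None:
                incident.priority = rule.priority
            if rule.owner is not None:
                incident.owner = rule.owner
    return incident


rules = [
    Rule(keyword="Critical", priority="high"),
    Rule(keyword="DB", owner="dba"),
]
incident = apply_rules(Incident(title="Critical: DB connection pool exhausted"), rules)
print(incident.priority, incident.owner)  # high dba
```

Matching on whole words rather than substrings is a deliberate choice in this sketch: short keywords like DB would otherwise trigger on unrelated titles.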

--
Evgeny Savitsky