The Challenge

The client was a national franchise network with a robust e-commerce program, generating over 23 million user sessions per month. The DORIAN Group had previously worked with the client to implement a comprehensive Unified Platform Monitoring system for their operations. The new observability tools gave greater transparency into platform defects and outages, which exposed an opportunity to optimize the client’s incident triage processes.

 

Prior to the project, the client relied primarily on a reactive incident response process: incidents were reported by customers or stores, then resolved with ad hoc fixes by teams assembled through a manual on-call assignment process.

 

“If there was any issue with our technology, people would have to manually call in and connect with the right person. If the manager or someone forgot to update the number of the ‘on call’ responder, the process became extremely inefficient and we had no way to cleanly escalate issues.”

 

The client worked with The DORIAN Group to implement an incident resolution solution that would:

 

  • Create unified incident triage across the entire organization
  • Establish a proactive and preventative approach to incident response: resolve issues before they cause business disruption
  • Establish assignment rules: on-call rotation and service-specific escalation pathways
  • Integrate with existing organization tools such as ServiceNow, Slack, and Jira
  • Institute a blameless and transparent postmortem process to improve operations
 

The primary goal was to resolve incidents quickly by alerting the right team members for each incident and implementing a smooth resolution process.

 

The Solution

With a deep understanding of the client’s organizational requirements and structure, The DORIAN Group helped the client select a solution: PagerDuty. A platform for operations management, PagerDuty ensured the right team members responded quickly to each incident. The DORIAN Group managed the scoping, development, and implementation process.

 

 

With PagerDuty, the client’s new comprehensive triage process was built as follows:

 

 

  1. Datadog, the client’s unified platform monitoring system, proactively identified incidents prior to customer impact. Rich incident event data was automatically dispatched to PagerDuty.
  2. PagerDuty’s service-specific escalation rules connected the right team members to swarm each incident. Team members were selected based on on-call schedules, service levels, and more. PagerDuty automatically directed notifications, bridged video calls, and handled incident escalations.
  3. After incident resolution, PagerDuty initiated a transparent, blameless postmortem process to guide root cause analysis (RCA) and encourage iterative improvement. All incident timeline notes fed automatically into the postmortem, further accelerating the analysis.
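
For readers who want a concrete picture of steps 1 and 2, the sketch below is illustrative only and is not the client’s actual configuration. It assumes a monitoring alert is forwarded as a PagerDuty Events API v2 "trigger" event, and it models on-call rotation and timed escalation with small hypothetical helpers (EscalationLevel, on_call, and current_level are invented here for illustration; in practice PagerDuty manages schedules and escalation policies itself). The routing key, service names, and responder names are placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class EscalationLevel:
    """One tier of a hypothetical service-specific escalation policy."""
    responders: list        # rotation order for this tier (placeholder names)
    timeout_minutes: int    # escalate if unacknowledged for this long


def build_trigger_event(routing_key, summary, source, severity, dedup_key):
    """Build the JSON body of an Events API v2 'trigger' event.

    The body would be POSTed to https://events.pagerduty.com/v2/enqueue;
    no network call is made in this sketch.
    """
    return {
        "routing_key": routing_key,      # identifies the target service integration
        "event_action": "trigger",
        "dedup_key": dedup_key,          # repeated alerts with this key are grouped
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,        # critical, error, warning, or info
        },
    }


def on_call(level, rotation_start, now, shift_hours=168):
    """Return the responder for the current shift (weekly rotation by default)."""
    shifts_elapsed = (now - rotation_start) // timedelta(hours=shift_hours)
    return level.responders[shifts_elapsed % len(level.responders)]


def current_level(policy, minutes_unacknowledged):
    """Return the index of the tier that should currently be notified."""
    elapsed = 0
    for index, level in enumerate(policy):
        elapsed += level.timeout_minutes
        if minutes_unacknowledged < elapsed:
            return index
    return len(policy) - 1               # stay on the final tier once exhausted


# Usage with placeholder data: a two-tier policy for a hypothetical service.
policy = [
    EscalationLevel(responders=["alice", "bob"], timeout_minutes=15),
    EscalationLevel(responders=["carol"], timeout_minutes=30),
]
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
event = build_trigger_event(
    routing_key="PLACEHOLDER_ROUTING_KEY",
    summary="Checkout latency above threshold",
    source="checkout-service",
    severity="critical",
    dedup_key="checkout-latency",
)
tier = current_level(policy, minutes_unacknowledged=20)
responder = on_call(policy[tier], start, start + timedelta(days=8))
```

The point of the model is the separation of concerns described above: the monitoring system only emits events, while the policy decides who is notified and when an unacknowledged incident moves to the next tier.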

 

Results

Teams across the organization could now work collaboratively to resolve incidents faster. The PagerDuty implementation ensured that the right team members were brought onto each incident, and visibility was maintained all the way through resolution.

A new blameless postmortem process allowed the client to become more agile, iteratively improving their operations with every new incident.

 

With these new tools in place, the client immediately saw results: a 60% reduction in MTTR, measurable customer satisfaction improvements, and reduced operational downtime.

“We were so used to our old processes and it had become difficult to enact real change inside our organization. Having The DORIAN Group bring in a new perspective helped us see everything in a different light. Their leadership and expertise got us to where we are today, and they were able to coordinate tremendous tasks across multiple teams and silos. Our process became automated: it’s now so easy to connect people, find the right resources, and automatically escalate.”