Incident clustering

A cluster is a group of incidents whose text is semantically similar within a configurable time window. The agent never sees the cluster mechanism directly, only a banner on a ticket saying "this looks related to N others." This page explains how the AI decides which tickets belong together and how a cluster grows, ages, and resolves.

What gets compared

Every incoming incident's text (title and description) is compared against recent incidents using NLP embeddings. The match is semantic, not keyword-based:

  • "Outlook won't send mail" and "Email delivery failing for marketing team" match.
  • "Cannot reach SharePoint" and "SharePoint site down for finance" match.

This is deliberate. Categories on incoming tickets are filled inconsistently by requestors, so a category-based match would miss most real patterns. Free text is the most reliable signal.
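
As a rough illustration of what a semantic comparison looks like, the sketch below scores pairs of texts by the cosine similarity of their embedding vectors. The three-dimensional vectors and the 0.8 threshold are made-up values for illustration only; this page does not document the embedding model, the similarity measure, or the threshold the product actually uses.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings; a real model produces far longer vectors.
TOY_EMBEDDINGS = {
    "Outlook won't send mail": [0.9, 0.1, 0.0],
    "Email delivery failing for marketing team": [0.8, 0.2, 0.1],
    "Cannot reach SharePoint": [0.1, 0.9, 0.2],
}

SIMILARITY_THRESHOLD = 0.8  # assumed value, not a documented default

def is_match(text_a, text_b):
    """True if the two texts' embeddings are at least threshold-similar."""
    score = cosine_similarity(TOY_EMBEDDINGS[text_a], TOY_EMBEDDINGS[text_b])
    return score >= SIMILARITY_THRESHOLD
```

With these toy vectors the two mail-related texts match even though they share no keywords, while the mail and SharePoint texts do not.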

When a cluster forms

A cluster is created when N or more incidents with semantically similar text are observed within a time window. The defaults:

  • N = 4 incidents
  • window = 4 hours

Both are configurable per tenant. A cluster does not require all N incidents to be open at the same moment; what matters is the rolling count inside the window.
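
The rolling-count rule can be sketched as a sliding window over the creation times of incidents that have already been judged similar. The function name `forms_cluster` and the choice to treat the window boundary as inclusive are assumptions made for this example; only the defaults of 4 incidents and 4 hours come from this page.

```python
from datetime import datetime, timedelta

MIN_INCIDENTS = 4             # default N from this page
WINDOW = timedelta(hours=4)   # default window from this page

def forms_cluster(timestamps, min_incidents=MIN_INCIDENTS, window=WINDOW):
    """Return True if any span of length `window` holds at least
    `min_incidents` of the given incident creation times."""
    ordered = sorted(timestamps)
    start = 0
    for end, ts in enumerate(ordered):
        # Drop incidents that have fallen out of the window ending at `ts`.
        while ts - ordered[start] > window:
            start += 1
        if end - start + 1 >= min_incidents:
            return True
    return False
```

Only creation times are counted, so whether the earlier incidents are still open has no effect on the result.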

Living clusters

Clusters keep growing as new incidents arrive. When a new incident comes in, the agent checks every active cluster first. If the text matches an existing cluster, the incident joins that cluster. A new cluster is only created when nothing matches.
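
The join-before-create order might look like the following sketch. The `Cluster` class, the `route_incident` function, and the `matches` callable are hypothetical names introduced here; `matches` stands in for the semantic comparison, whose real implementation is not described on this page.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    cluster_id: int
    incident_ids: list = field(default_factory=list)

def route_incident(incident_id, text, active_clusters, matches):
    """Add the incident to the first active cluster whose text matches.

    Returns the cluster that was joined, or None when nothing matched.
    A None result only makes the incident a candidate for a new cluster;
    the N-incidents-in-window rule still has to be met before one forms.
    """
    for cluster in active_clusters:
        if matches(text, cluster):
            cluster.incident_ids.append(incident_id)
            return cluster
    return None
```

Checking existing clusters first is what keeps one ongoing outage from being split across several clusters.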

This is important for the agent reading the banner. "N other tickets in the last 3 hours" reflects the current cluster size at the moment the agent opens the ticket, not the size when the cluster was first detected.

For every cluster, the agent writes a plain-language summary describing what the incidents have in common, when the cluster started, and how fast it's growing.

Lifecycle

Each cluster moves through a small set of states:

State | Definition (default) | What it means
Emerging | Average growth rate ≤ 1 incident per hour | A pattern is forming but it might still be a coincidence. Visible to Problem Managers; not yet broadcast to agents.
Active | Average growth rate > 1 incident per hour | The pattern is real-time. Inline banner appears on every member ticket; cluster shows on the Problem Manager dashboard.
HyperCare | Set after a fix is applied | A workaround or fix is in place; the agent is monitoring to see if new incidents stop arriving. Defaults to a 24-hour window.
Resolved | HyperCare window closes with no new incidents | The cluster moves to Resolved automatically. Past clusters remain searchable for historical pattern matching.
Stale | Set when an Emerging cluster stops growing | The pattern looked promising but didn't materialise. The cluster stays in the system in case the pattern resumes, but it is hidden from active surfaces.

All four growth and timing thresholds are configurable. The detail is on the Configuration page.
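
The Emerging/Active split can be illustrated with a small classifier. This is a sketch under one assumption: that "average growth rate" means the number of incidents in the cluster divided by the hours since the cluster started. The page does not spell out the exact formula, and `growth_state` is a hypothetical name.

```python
ACTIVE_RATE_PER_HOUR = 1.0  # default threshold from the lifecycle table

def growth_state(incident_count, hours_elapsed, threshold=ACTIVE_RATE_PER_HOUR):
    """Classify a growing cluster as "Emerging" or "Active".

    Assumes average growth rate = incident_count / hours_elapsed.
    """
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    rate = incident_count / hours_elapsed
    return "Active" if rate > threshold else "Emerging"
```

Under this assumption, a cluster that just meets the default formation rule (4 incidents across 4 hours) has a rate of exactly 1 per hour and is therefore Emerging, while 9 incidents in 3 hours would be Active.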

Historical pattern matching

When a new cluster forms, the agent checks whether the current cluster looks similar to a cluster from the past that was linked to a Problem or Major Incident. If a match is found, the agent silently links the new cluster to the previous record and surfaces the prior resolution to the on-call team:

Similar problem occurred on Feb 12 - resolved as KE-0042 with workaround "Restart transport service."

This shortens time-to-resolution when a known pattern reappears, because the on-call team starts from the fix that worked last time.
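
A lookup of this kind could be sketched as a best-match search over past clusters that have a linked record. Everything named here (`PastCluster`, `find_prior_resolution`, the `similarity` callable, the 0.8 threshold) is hypothetical and chosen for illustration; the page does not describe how past clusters are stored or scored.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PastCluster:
    summary_vector: list           # embedding of the cluster's text (assumed)
    linked_record: Optional[str]   # Problem or Major Incident ID, if any
    workaround: str
    occurred_on: str

def find_prior_resolution(new_vector, past_clusters, similarity, threshold=0.8):
    """Return the most similar past cluster that was linked to a record,
    or None if no linked cluster reaches the (assumed) threshold."""
    best, best_score = None, threshold
    for past in past_clusters:
        if past.linked_record is None:
            continue  # only clusters tied to a Problem or Major Incident count
        score = similarity(new_vector, past.summary_vector)
        if score >= best_score:
            best, best_score = past, score
    return best
```

The returned record carries the fields needed to build a note like the one shown above: the date, the linked record, and the workaround.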

What clustering does not do

Clustering does not read change tickets, monitoring logs, or CMDB health data in this release. Those are planned correlation sources for a later milestone. Today, the only signal that drives cluster formation is the text of the incident.