Kymeca
FinOps Engineering

FinOps Engineering: Attribution Intelligence and Anomaly Detection

Mark · 9 min read · Part 3 of 5, FinOps Engineering

Cathedral Quarter, Belfast — Korng Sok / Unsplash

The Two Attribution Problems

The tag-based attribution built in Post 1 has two systematic weaknesses that become apparent quickly in production. Both undermine the reliability of cost reports and forecasts if left unaddressed.

The singleton problem. A tag grouping containing only one resource is not a team or department in any meaningful sense. It is usually a misconfiguration, a test resource, or a one-off tag that one person applied and nobody else on the team uses. Treating a single EC2 instance tagged team=johns-test as a legitimate cost centre adds noise to reports, inflates the count of apparent teams, and creates budget orphans: costs that don’t fit into any real organisational unit.

The attribution gap. Untagged resources don’t just represent missing data — they represent real spend that is attributed to nobody. In most organisations, untagged spend starts at 20–40% of total cloud cost and only improves with active effort. Rather than waiting for tagging hygiene to catch up, the correlation engine infers likely ownership from cost behaviour patterns and surfaces suggestions for human review.

Tag-based attribution is only as good as the tags. Attribution intelligence is what you build when you accept that the tags will never be perfect.

Cardinality Filtering

The cardinality filter sits between the raw tag extraction and the daily rollup. It promotes a tag grouping — a specific department/team combination — to a legitimate cost centre only if it contains at least N distinct resources within a rolling window. Groupings below the threshold are reclassified to a __sparse__ category, keeping their costs visible without polluting attribution reports.

Computing tag group cardinality

SQL attribution/cardinality_view.sql
-- Tag groupings with sufficient resource cardinality over rolling 30 days
CREATE MATERIALIZED VIEW valid_tag_groups AS
SELECT
    department,
    team,
    COUNT(DISTINCT "ResourceId")   AS resource_count,
    SUM(COALESCE("EffectiveCost", "BilledCost")) AS rolling_cost_usd,  -- same expression as the rollup
    MIN(charge_date)                  AS first_seen,
    MAX(charge_date)                  AS last_seen
FROM  cloud_cost_facts
WHERE charge_date >= current_date - 30
  AND  department IS NOT NULL
  AND  team       IS NOT NULL
GROUP BY department, team
HAVING  COUNT(DISTINCT "ResourceId") >= 2  -- minimum cardinality threshold
WITH DATA;

CREATE UNIQUE INDEX ON valid_tag_groups (department, team);

Applying the filter in the rollup

SQL attribution/filtered_rollup.sql
-- Extend the daily rollup to reclassify sparse groups
CREATE MATERIALIZED VIEW daily_cost_attributed AS
SELECT
    f.charge_date,
    f."ProviderName",
    f."ServiceCategory",

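    -- Untagged rows can never match valid_tag_groups (NULL = NULL is not
    -- true), so label them first; any other unmatched group is sparse
    CASE
        WHEN f.department IS NULL OR f.team IS NULL THEN '__untagged__'
        WHEN v.team IS NOT NULL THEN f.department
        ELSE '__sparse__'
    END AS department,

    CASE
        WHEN f.department IS NULL OR f.team IS NULL THEN '__untagged__'
        WHEN v.team IS NOT NULL THEN f.team
        ELSE '__sparse__'
    END AS team,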

    SUM(f."BilledCost")                              AS billed_cost_usd,
    SUM(COALESCE(f."EffectiveCost", f."BilledCost"))  AS effective_cost_usd,
    COUNT(DISTINCT f."ResourceId")                    AS resource_count

FROM  cloud_cost_facts f
LEFT JOIN valid_tag_groups v
       ON  f.department = v.department
      AND  f.team       = v.team
GROUP BY 1,2,3,4,5
WITH DATA;
Threshold calibration

Start with a minimum cardinality of 2 resources and review the __sparse__ share of total spend after the first week. If you have raised the threshold and sparse costs exceed 10% of total spend, it is probably too aggressive and should be lowered. At the starting value of 2 there is nothing left to lower (a threshold of 1 disables the filter), so a sparse share that high more likely points to inconsistent tag values that need cleaning up. If legitimate small teams are being reclassified, consider a secondary condition: promote any group that has been consistently present for 60+ days regardless of resource count, on the grounds that it is a small but stable team rather than a misconfiguration.
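
The promotion rule, including the optional stability condition, might be sketched as follows. The `TagGroupStats` shape, the `days_present` input, and the function name are illustrative assumptions, not part of the platform; the SQL view above implements only the resource-count check.

```python
from dataclasses import dataclass


@dataclass
class TagGroupStats:
    """Per-group statistics (hypothetical shape for this sketch)."""
    department: str
    team: str
    resource_count: int   # distinct resources seen in the rolling window
    days_present: int     # distinct days on which the group had any cost


def classify_group(
    stats: TagGroupStats,
    min_resources: int = 2,        # cardinality threshold from the text
    stable_after_days: int = 60,   # assumed reading of "present for 60+ days"
) -> str:
    """Return 'valid' if the group should be promoted, else '__sparse__'."""
    if stats.resource_count >= min_resources:
        return "valid"
    # Secondary condition: a small but long-lived group is treated as a real team
    if stats.days_present >= stable_after_days:
        return "valid"
    return "__sparse__"


singleton = TagGroupStats("engineering", "johns-test", resource_count=1, days_present=3)
small_team = TagGroupStats("finance", "treasury", resource_count=1, days_present=75)
print(classify_group(singleton), classify_group(small_team))
```

Keeping both thresholds as parameters makes it easy to replay a week of data with different values before changing the view.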

Time-Series Cost Correlation

Resources that belong to the same workload tend to exhibit correlated spend over time. They scale together, they get created and destroyed together, their costs peak and trough in tandem — because they are driven by the same traffic, the same batch schedules, the same deployment events. This behavioural similarity is a signal that tag-based attribution cannot see but correlation can.

The correlation engine computes pairwise Pearson correlation of daily cost time series between each untagged resource and each valid tag group. High correlation suggests probable membership. The output is a ranked list of attribution suggestions for human review — not automatic re-attribution, which would carry too high a false-positive risk.
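
As a small illustration of why Pearson correlation suits this job, the sketch below uses made-up daily costs: a resource that follows its group's spend pattern at a fraction of the scale correlates almost perfectly, while an unrelated series does not. All figures are hypothetical.

```python
import numpy as np

# Hypothetical daily costs (USD) over ten days
group_cost = np.array([400, 420, 610, 650, 410, 405, 630, 640, 415, 400], dtype=float)
# Same shape as the group, roughly twenty times smaller
resource_cost = group_cost * 0.05 + 2.0
# A resource whose spend follows no related pattern
unrelated_cost = np.array([30, 12, 25, 31, 18, 40, 22, 15, 35, 27], dtype=float)


def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    return float(np.corrcoef(a, b)[0, 1])


# Pearson r ignores scale and offset, so the small resource still matches
print(round(pearson(resource_cost, group_cost), 3))
# The unrelated series falls well below a 0.75 suggestion threshold
print(round(pearson(unrelated_cost, group_cost), 3))
```

Because the coefficient is insensitive to scale, a $20/day resource can be matched against a $500/day group without any rescaling.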

Building the resource cost matrix

Python attribution/correlation.py
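import numpy  as np
import pandas as pd
from dataclasses import dataclass
# NOTE: fetch_resource_meta is an assumed helper (not shown in this series)
# returning {resource_id: meta} where meta has .provider, .service_name, .tags
from .db import (
    fetch_resource_daily_costs,
    fetch_group_daily_costs,
    fetch_resource_meta,
)

@dataclass
class AttributionSuggestion:
    resource_id:     str
    provider:        str
    service_name:    str
    current_tags:    dict
    suggested_dept:  str
    suggested_team:  str
    correlation:     float    # Pearson r; only values >= min_correlation are kept
    confidence:      str      # 'high' | 'medium' | 'low'
    window_days:     int


def suggest_attribution(
    window_days:          int   = 30,
    min_correlation:      float = 0.75,
    min_overlap_days:     int   = 10,
    resource_ids:         list[str] | None = None,   # optional: restrict to these resources
) -> list[AttributionSuggestion]:

    # Fetch untagged / sparse resources with sufficient cost history
    untagged = fetch_resource_daily_costs(
        tagged=False, window_days=window_days, min_days=min_overlap_days
    )
    if resource_ids is not None:
        wanted   = set(resource_ids)
        untagged = {rid: s for rid, s in untagged.items() if rid in wanted}
    # Metadata used to populate each suggestion (assumed helper, see import note)
    untagged_meta = fetch_resource_meta(list(untagged))

    # Fetch daily cost series for each valid tag group
    groups   = fetch_group_daily_costs(window_days=window_days)

    suggestions = []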

    for resource_id, r_series in untagged.items():
        best_corr  = 0.0
        best_group = None

        for (dept, team), g_series in groups.items():
            # Align on common dates
            common = r_series.index.intersection(g_series.index)
            if len(common) < min_overlap_days:
                continue

            r_vals = r_series[common].values
            g_vals = g_series[common].values

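            # Pearson r is undefined for a constant series; skip it rather
            # than let np.corrcoef return NaN with a runtime warning
            if r_vals.std() == 0 or g_vals.std() == 0:
                continue

            # Pearson r is already invariant to scale and offset, so large
            # and small groups can be compared without explicit normalisation
            corr = float(np.corrcoef(r_vals, g_vals)[0, 1])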

            if corr > best_corr:
                best_corr  = corr
                best_group = (dept, team)

        if best_corr >= min_correlation and best_group:
            dept, team = best_group
            meta = untagged_meta[resource_id]
            suggestions.append(AttributionSuggestion(
                resource_id    = resource_id,
                provider       = meta.provider,
                service_name   = meta.service_name,
                current_tags   = meta.tags,
                suggested_dept = dept,
                suggested_team = team,
                correlation    = best_corr,
                confidence     = 'high' if best_corr >= 0.90
                                 else 'medium' if best_corr >= 0.80
                                 else 'low',
                window_days    = window_days,
            ))

    return sorted(suggestions, key=lambda s: s.correlation, reverse=True)
Suggestions, not facts

Correlation does not imply causation or ownership. Two resources that happen to exhibit similar scaling patterns — perhaps driven by the same traffic source rather than belonging to the same team — will produce false positive suggestions. Always route suggestions through a human review step. The platform stores confirmed attributions in an attribution_overrides table that takes precedence over tag-derived values in the rollup.

Storing and Applying Attribution Overrides

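When a human confirms a suggestion, the confirmed attribution is written to an overrides table and applied during the next rollup refresh. Because overrides sit at the top of the priority chain, an override stays in effect until it expires or is deleted. Once the resource has been tagged correctly in the cloud console, expire the override (for example by setting expires_at) so that the tag becomes the source of truth again.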

SQL schema/attribution_overrides.sql
CREATE TABLE attribution_overrides (
    resource_id   varchar      NOT NULL,
    provider      varchar      NOT NULL,
    department    varchar      NOT NULL,
    team          varchar      NOT NULL,
    confirmed_by  varchar,
    confirmed_at  timestamptz  DEFAULT now(),
    expires_at    timestamptz, -- NULL = until tag is corrected
    source        varchar      DEFAULT 'correlation_suggestion',
    correlation   numeric,     -- r value that prompted the suggestion
    PRIMARY KEY (resource_id, provider)
);

-- Apply overrides in the rollup via a JOIN priority chain:
-- 1. attribution_overrides  (human-confirmed)
-- 2. tag-derived department/team  (from FOCUS Tags column)
-- 3. __untagged__ / __sparse__ sentinel
Feedback loop

Every confirmed suggestion should trigger a tagging remediation task — a Jira ticket, a GitHub issue, or a direct Slack message to the resource owner — asking them to apply the correct tag in the cloud console. The override is a temporary fix; correct tagging is the permanent solution. Track override count per team as a proxy for tagging debt.
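
The priority chain listed in the schema comments can be expressed as a small resolver. This is a sketch under assumptions: the in-memory `overrides` dict and `valid_groups` set stand in for the attribution_overrides table and the valid_tag_groups view, and the function name is hypothetical.

```python
from datetime import datetime, timezone

UNTAGGED = "__untagged__"
SPARSE = "__sparse__"


def resolve_attribution(resource_key, tags, overrides, valid_groups, now=None):
    """Return (department, team, source) for one resource.

    resource_key: (resource_id, provider)
    tags:         (department, team) derived from tags; either may be None
    overrides:    {(resource_id, provider): {"department", "team", "expires_at"}}
    valid_groups: set of (department, team) pairs that passed the cardinality filter
    """
    now = now or datetime.now(timezone.utc)

    # 1. Human-confirmed override, unless it has expired
    override = overrides.get(resource_key)
    if override and (override["expires_at"] is None or override["expires_at"] > now):
        return override["department"], override["team"], "override"

    # 2. Tag-derived values, if present and part of a valid group
    dept, team = tags
    if dept is not None and team is not None:
        if (dept, team) in valid_groups:
            return dept, team, "tag"
        return SPARSE, SPARSE, "sentinel"

    # 3. Nothing to go on
    return UNTAGGED, UNTAGGED, "sentinel"


overrides = {("i-1", "aws"): {"department": "platform", "team": "data-infra", "expires_at": None}}
valid_groups = {("platform", "data-infra"), ("web", "checkout")}
print(resolve_attribution(("i-1", "aws"), (None, None), overrides, valid_groups))
```

Returning the source alongside the attribution makes it straightforward to count overrides per team when tracking tagging debt.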

From Attribution to Anomaly Detection

The attribution intelligence built above has a direct partner: anomaly detection. The correlation engine doesn’t just improve cost reports in retrospect — it feeds in real time into the anomaly alert system. When an anomalous resource turns out to be untagged, the system immediately runs the correlation engine to suggest an owner, rather than routing the alert to an unowned queue.

The alert payload for an untagged anomalous resource includes an attribution hint that routes the alert to the most likely owning team — with the confidence score attached so the team knows whether to treat it as definitive or provisional.

With that handoff established, here is how anomaly detection works at the resource level.

The Detection Model

Post 2 built forecast-based budget alerts — a slow-burn signal that fires when projected end-of-period spend looks likely to breach a budget. That signal operates at the department or team level and gives teams days to act. Anomaly detection is a different and complementary signal: it fires when spend deviates suddenly from expected patterns, operates at the individual resource level, and may require action within hours rather than days.

A team might be tracking well within budget at the aggregate level while a single runaway process — a misconfigured auto-scaler, a loop instantiating clients on every request, a forgotten data export — generates $300/day in unexpected spend. The budget forecast won’t catch this until it accumulates enough to move the aggregate. Anomaly detection catches it the day it happens.

Budget alerts tell you the trend. Anomaly alerts tell you what changed and where to look.

For each resource with sufficient cost history, the detector computes a rolling baseline — the expected cost range based on recent history — and compares each day’s actual cost against that baseline. Days that exceed the upper control limit are flagged as anomalies.

The baseline uses a rolling mean and standard deviation over a configurable window (default 14 days). The control limit is mean + N × stddev, where N defaults to 2.5, a threshold that balances sensitivity against false positive rate for typical cloud spend patterns. An absolute floor on the excess (default $5) suppresses spurious alerts on low-spend resources, where a small fluctuation can be statistically large but financially trivial.
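
A worked example with made-up numbers shows how the control limit and the floor interact. The 14-day series below is hypothetical, and `is_anomalous` is a helper named only for this sketch.

```python
import numpy as np

# Hypothetical 14-day baseline of daily costs (USD) for one resource
baseline = np.array([40, 44, 39, 42, 45, 41, 43, 40, 44, 42, 41, 43, 42, 42], dtype=float)

mean = float(baseline.mean())          # 42.0 for this series
stddev = float(baseline.std())         # population standard deviation
upper_limit = mean + 2.5 * stddev      # control limit with N = 2.5


def is_anomalous(actual: float, floor_usd: float = 5.0) -> bool:
    """Flag a day only if it exceeds the control limit by at least the floor."""
    return (actual - upper_limit) >= floor_usd


print(f"mean={mean:.2f} stddev={stddev:.2f} upper_limit={upper_limit:.2f}")
print(is_anomalous(48.0))    # above the limit, but by less than the $5 floor
print(is_anomalous(280.0))   # far above the limit
```

A day at $48 is above the statistical limit but is not flagged, because the excess is under the floor; this is the noise the floor is meant to filter out.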

Python anomaly/detector.py
from dataclasses import dataclass
from datetime  import date, timedelta
import numpy   as np

@dataclass
class ResourceAnomaly:
    resource_id:      str
    provider:         str
    service_name:     str
    service_category: str
    department:       str | None
    team:             str | None
    anomaly_date:     date
    actual_cost:      float
    baseline_mean:    float
    baseline_stddev:  float
    upper_limit:      float
    excess_cost:      float       # actual - upper_limit
    sigma:            float       # how many stddevs above mean
    is_untagged:      bool


def detect_resource_anomalies(
    resource_daily_costs: dict[str, list[tuple[date, float]]],
    resource_meta:        dict,
    window_days:          int   = 14,
    sigma_threshold:      float = 2.5,
    abs_floor_usd:        float = 5.0,   # ignore anomalies under $5 excess
    target_date:          date | None = None,
) -> list[ResourceAnomaly]:

    target = target_date or date.today()
    window_start = target - timedelta(days=window_days)
    anomalies = []

    for resource_id, history in resource_daily_costs.items():
        # Sort by date, split into baseline window and target day
        history = sorted(history, key=lambda x: x[0])
        baseline_pts = [(d, c) for d, c in history
                        if window_start <= d < target]
        target_pts   = [(d, c) for d, c in history if d == target]

        if len(baseline_pts) < 5 or not target_pts:
            continue

        costs         = np.array([c for _, c in baseline_pts])
        mean          = float(costs.mean())
        stddev        = float(costs.std()) + 0.01   # avoid div-by-zero
        upper_limit   = mean + sigma_threshold * stddev
        actual        = target_pts[0][1]
        excess        = actual - upper_limit

        if excess < abs_floor_usd:
            continue

        meta = resource_meta[resource_id]
        anomalies.append(ResourceAnomaly(
            resource_id      = resource_id,
            provider         = meta.provider,
            service_name     = meta.service_name,
            service_category = meta.service_category,
            department       = meta.department,
            team             = meta.team,
            anomaly_date     = target,
            actual_cost      = actual,
            baseline_mean    = mean,
            baseline_stddev  = stddev,
            upper_limit      = upper_limit,
            excess_cost      = excess,
            sigma            = (actual - mean) / stddev,
            is_untagged      = meta.department is None,
        ))

    return sorted(anomalies, key=lambda a: a.excess_cost, reverse=True)
Seasonality

The simple rolling mean/stddev model works well for resources with stable spend patterns. For resources with strong weekly seasonality — a batch job that runs every Sunday — consider computing the baseline using only same-day-of-week historical values, or using day-of-week z-scores rather than raw values. Post 4 will revisit this when we introduce utilisation data that can disambiguate scheduled load from genuine anomalies.
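
One way to build the same-day-of-week baseline described above is sketched here. The function name, the eight-week lookback, and the minimum of four points are assumptions chosen for illustration.

```python
from datetime import date, timedelta

import numpy as np


def same_weekday_baseline(history, target, weeks=8, min_points=4):
    """Mean and stddev of costs on the same weekday as `target`.

    history: list of (date, cost) tuples for one resource
    Returns (mean, stddev), or None if there are too few matching days.
    """
    earliest = target - timedelta(weeks=weeks)
    costs = [
        cost for day, cost in history
        if earliest <= day < target and day.weekday() == target.weekday()
    ]
    if len(costs) < min_points:
        return None
    arr = np.array(costs, dtype=float)
    return float(arr.mean()), float(arr.std())


# Hypothetical resource: $100 on the weekly batch day, $10 otherwise
target = date(2024, 6, 30)
history = [
    (target - timedelta(days=n),
     100.0 if (target - timedelta(days=n)).weekday() == target.weekday() else 10.0)
    for n in range(1, 57)
]
print(same_weekday_baseline(history, target))
```

With a plain rolling mean, the weekly $100 day would sit far above the average of mostly $10 days and be flagged every week; compared only against earlier batch days it looks normal. The trade-off is a longer lookback, since each week contributes one data point.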

Rolling Up to Team-Level Anomaly Alerts

Individual resource anomalies are rolled up to team level before alerting. This prevents a team from receiving five separate alerts when five resources in the same service all spike simultaneously — usually a sign of a single underlying cause. The rollup groups anomalies by team and publishes one alert per team per evaluation cycle, with all anomalous resources listed in the payload.

This is where the correlation engine from the first half of this post integrates directly. For any untagged resources in the anomaly batch, the rollup immediately runs suggest_attribution and embeds the result in the alert payload as an attribution_hint. The alert is then routed to the suggested team’s channel rather than an orphaned alerts queue.

Python anomaly/alert_rollup.py
from collections    import defaultdict
from .detector      import ResourceAnomaly
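# correlation.py lives in the attribution package, not alongside this module;
# this assumes attribution/ and anomaly/ share a parent package
from ..attribution.correlation import suggest_attribution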
from .queue         import publish_anomaly_alert


def roll_up_and_alert(anomalies: list[ResourceAnomaly]) -> None:
    # Resolve correlation-based hints for untagged resources up front, so
    # those anomalies can be grouped under (and routed to) the suggested team
    attribution_hints = {}
    untagged_ids = [a.resource_id for a in anomalies if a.is_untagged]
    if untagged_ids:
        suggestions = suggest_attribution(
            resource_ids=untagged_ids,
            min_correlation=0.75,
        )
        attribution_hints = {s.resource_id: s for s in suggestions}

    # Group by owning team; untagged resources fall back to their hint,
    # then to __untagged__ if no suggestion cleared the threshold
    by_team: dict[tuple, list] = defaultdict(list)
    for a in anomalies:
        hint = attribution_hints.get(a.resource_id) if a.is_untagged else None
        if hint:
            key = (hint.suggested_dept, hint.suggested_team)
        else:
            key = (a.department or '__untagged__', a.team or '__untagged__')
        by_team[key].append(a)

    for (dept, team), team_anomalies in by_team.items():
        total_excess = sum(a.excess_cost for a in team_anomalies)

        payload = {
            "event_type":    "SPEND_ANOMALY",
            "department":    dept,
            "team":          team,
            "total_excess_usd": total_excess,
            "resource_count": len(team_anomalies),
            "anomalous_resources": [
                {
                    "resource_id":      a.resource_id,
                    "service_name":     a.service_name,
                    "service_category": a.service_category,
                    "actual_cost_usd":  a.actual_cost,
                    "expected_usd":     a.baseline_mean,
                    "excess_usd":       a.excess_cost,
                    "sigma":            round(a.sigma, 1),
                    "is_untagged":      a.is_untagged,
                    "attribution_hint": (
                        {"department": attribution_hints[a.resource_id].suggested_dept,
                         "team":       attribution_hints[a.resource_id].suggested_team,
                         "confidence": attribution_hints[a.resource_id].confidence}
                        if a.resource_id in attribution_hints else None
                    ),
                }
                for a in sorted(team_anomalies, key=lambda x: x.excess_cost, reverse=True)
            ],
        }
        publish_anomaly_alert(payload)

What a rich anomaly alert looks like

An alert that names specific resources and their excess cost — rather than just reporting a team-level spend increase — gives the receiving engineer an immediate starting point for investigation:

SPEND_ANOMALY · platform-engineering / data-infra
3 resources · +$412 excess

i-0a4b2c8d9e1f · Amazon EC2 · Compute
  Actual $280 vs expected $42 baseline (6.7σ)              +$238

db-prod-analytics · Amazon RDS · Database
  Actual $110 vs expected $58 baseline (3.1σ)              +$52

i-0b9d3f2a1c7e · Amazon EC2 · Compute [UNTAGGED]
  Actual $122 vs expected $0 baseline
  Suggested: data-infra (r=0.91 HIGH)                      +$122

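The third resource in this example is untagged but has been provisionally attributed to data-infra by the correlation engine with high confidence. The alert routes to that team’s Slack channel rather than an unowned alerts queue, and includes the attribution suggestion so the team can confirm and fix the tag. Note that the excess figures in this example are actual cost minus the baseline mean, whereas the detector computes excess_cost as actual cost minus the upper control limit, so values in a real payload will be smaller.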
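
A formatter that turns the rollup payload into text like the example might look like the sketch below. The function name and exact layout are assumptions; the payload keys are the ones built in roll_up_and_alert.

```python
def format_anomaly_alert(payload: dict) -> str:
    """Render a SPEND_ANOMALY payload as a plain-text alert."""
    lines = [
        f"{payload['event_type']} · {payload['department']} / {payload['team']}",
        f"{payload['resource_count']} resources · +${payload['total_excess_usd']:,.0f} excess",
    ]
    for r in payload["anomalous_resources"]:
        marker = " [UNTAGGED]" if r["is_untagged"] else ""
        lines.append("")
        lines.append(
            f"{r['resource_id']} · {r['service_name']} · {r['service_category']}{marker}"
        )
        lines.append(
            f"  Actual ${r['actual_cost_usd']:,.0f} vs expected "
            f"${r['expected_usd']:,.0f} baseline ({r['sigma']}σ)  +${r['excess_usd']:,.0f}"
        )
        hint = r.get("attribution_hint")
        if hint:
            # Suggestions are provisional, so show the confidence next to the team
            lines.append(f"  Suggested: {hint['team']} ({hint['confidence'].upper()})")
    return "\n".join(lines)


# Hypothetical payload with a single untagged resource
example = {
    "event_type": "SPEND_ANOMALY",
    "department": "platform-engineering",
    "team": "data-infra",
    "total_excess_usd": 122.0,
    "resource_count": 1,
    "anomalous_resources": [{
        "resource_id": "i-0b9d3f2a1c7e", "service_name": "Amazon EC2",
        "service_category": "Compute", "actual_cost_usd": 122.0,
        "expected_usd": 0.0, "excess_usd": 122.0, "sigma": 9.9, "is_untagged": True,
        "attribution_hint": {"department": "platform-engineering",
                             "team": "data-infra", "confidence": "high"},
    }],
}
print(format_anomaly_alert(example))
```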

Separating Anomaly and Budget Alert Topics

Anomaly alerts and budget forecast alerts are published to separate SNS topics. This separation matters for routing and filtering. Budget alerts are slow-burn signals appropriate for daily digest delivery or manager-facing dashboards. Anomaly alerts may indicate active runaway processes and should route to on-call channels where they will be seen within minutes.

Python anomaly/queue.py
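import json
import boto3

ANOMALY_TOPIC = "arn:aws:sns:us-east-1:123456789:finops-spend-anomalies"

sns = boto3.client("sns")

# MessageAttribute filters allow per-team SNS subscriptions:
# - Slack integration subscribes to all teams' anomalies
# - PagerDuty subscribes only to anomalies where excess_usd > 100
# - Team-specific Lambda formats and posts to team Slack channel

def publish_anomaly_alert(payload: dict) -> None:
    # Sketch of the publisher imported by alert_rollup.py; the wrapper and
    # the numeric excess_usd attribute are assumptions added for completeness
    total_excess = payload["total_excess_usd"]
    attributes = {
        'department':   {'DataType': 'String', 'StringValue': payload["department"]},
        'team':         {'DataType': 'String', 'StringValue': payload["team"]},
        # Number attribute so a filter policy can match excess_usd > 100
        'excess_usd':   {'DataType': 'Number', 'StringValue': str(round(total_excess, 2))},
        'excess_bucket':{'DataType': 'String',
                         'StringValue': 'critical' if total_excess > 200
                                        else 'warning' if total_excess > 50
                                        else 'info'},
    }
    sns.publish(
        TopicArn=ANOMALY_TOPIC,
        Message=json.dumps(payload, default=str),
        MessageAttributes=attributes,
    )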
Alert routing pattern

A recommended routing setup: the SNS anomaly topic fans out to a Lambda that reads a team_alert_routing table mapping department/team pairs to Slack channel IDs and PagerDuty service IDs. Teams opt into PagerDuty routing by setting a threshold — “page me only if excess exceeds $200” — preventing alert fatigue from small noise anomalies while ensuring genuine cost incidents are escalated.
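
The routing decision inside that Lambda could be sketched as below. The `TeamRoute` fields, the fallback channel name, and the in-memory dict standing in for the team_alert_routing table are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TeamRoute:
    """One row of the team_alert_routing table (hypothetical columns)."""
    slack_channel_id: str
    pagerduty_service_id: Optional[str] = None
    page_threshold_usd: Optional[float] = None   # None means the team never pages


def route_alert(payload: dict, routing_table: dict, default_channel: str = "finops-unowned"):
    """Return a list of (destination_type, destination_id) for one alert."""
    route = routing_table.get((payload["department"], payload["team"]))
    if route is None:
        # No routing row: fall back to a shared channel rather than dropping the alert
        return [("slack", default_channel)]

    destinations = [("slack", route.slack_channel_id)]
    opted_in = route.pagerduty_service_id and route.page_threshold_usd is not None
    if opted_in and payload["total_excess_usd"] > route.page_threshold_usd:
        destinations.append(("pagerduty", route.pagerduty_service_id))
    return destinations


routing_table = {
    ("platform-engineering", "data-infra"): TeamRoute("C0123", "PD-DATA", 200.0),
}
alert = {"department": "platform-engineering", "team": "data-infra", "total_excess_usd": 412.0}
print(route_alert(alert, routing_table))
```

Keeping the threshold per team, rather than global, lets a team with large and volatile spend set a higher paging bar than a team whose whole budget is a few hundred dollars a day.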

Storing Anomaly History

Each detected anomaly is written to a history table. This enables trend analysis — is a resource showing repeated anomalies suggesting a structural problem? — and feeds Post 5’s code analysis layer, which uses anomaly timestamps to correlate with deployment events.

SQL schema/anomaly_history.sql
CREATE TABLE anomaly_history (
    id               uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
    detected_at      timestamptz DEFAULT now(),
    anomaly_date     date        NOT NULL,
    resource_id      varchar     NOT NULL,
    provider         varchar     NOT NULL,
    service_name     varchar,
    service_category varchar,
    department       varchar,
    team             varchar,
    actual_cost      numeric(12,4),
    baseline_mean    numeric(12,4),
    excess_cost      numeric(12,4),
    sigma            numeric(6,2),
    alert_sent       boolean     DEFAULT true,
    resolved_at      timestamptz -- set when spend returns to baseline
);

CREATE INDEX ON anomaly_history (resource_id, anomaly_date);
CREATE INDEX ON anomaly_history (team,         anomaly_date);
CREATE INDEX ON anomaly_history (detected_at);  -- for Post 5 deployment join
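
The trend question raised above (is a resource showing repeated anomalies?) can be answered from rows of this table. A minimal sketch, assuming rows have already been fetched as dicts with resource_id and anomaly_date keys; the function name and default thresholds are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def repeat_offenders(rows, as_of, lookback_days=30, min_anomalies=3):
    """Resources with at least `min_anomalies` anomalies in the lookback window.

    rows: iterable of dicts with 'resource_id' and 'anomaly_date' (a date)
    Returns a list of (resource_id, count), most frequent first.
    """
    cutoff = as_of - timedelta(days=lookback_days)
    counts = Counter(
        row["resource_id"] for row in rows
        if cutoff <= row["anomaly_date"] <= as_of
    )
    return [(rid, n) for rid, n in counts.most_common() if n >= min_anomalies]


# Hypothetical history: one resource spikes repeatedly, another only once
rows = [
    {"resource_id": "i-0a4b2c8d9e1f", "anomaly_date": date(2024, 6, 3)},
    {"resource_id": "i-0a4b2c8d9e1f", "anomaly_date": date(2024, 6, 10)},
    {"resource_id": "i-0a4b2c8d9e1f", "anomaly_date": date(2024, 6, 17)},
    {"resource_id": "db-prod-analytics", "anomaly_date": date(2024, 6, 12)},
]
print(repeat_offenders(rows, as_of=date(2024, 6, 30)))
```

A resource that appears here week after week is more likely a structural problem, such as an undersized baseline or a recurring job, than a one-off incident.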

What This Adds to the Platform

With attribution intelligence and anomaly detection in place, the platform has meaningfully improved on two fronts. Attribution is more reliable: singleton groupings no longer distort department reports, and untagged resources have a path to correct attribution without waiting for engineers to fix their tags. Anomaly detection provides same-day visibility into cost spikes at the resource level, with the correlation engine wiring the two systems together so untagged anomalous resources are never left without a destination.

The anomaly history table will become the key input to Post 5, where deployment correlation joins these timestamps against git history and CI/CD events to identify the code change responsible for each anomaly.

Post 4 extends the optimisation layer with utilisation data — pulling from CloudWatch, Azure Monitor, and GCP to answer not just “what did you spend?” but “what did you get for it?” — before Post 5 closes the loop entirely.

Attribution and anomaly detection are team sports

Getting these systems tuned for a real organisation — calibrating thresholds, establishing tagging policies, routing alerts to the right channels — requires coordination across engineering, finance, and platform teams. If you’re working through this and want to talk it through, get in touch.