The tag-based attribution built in Post 1 has two systematic weaknesses that become apparent quickly in production. Both undermine the reliability of cost reports and forecasts if left unaddressed.
The singleton problem. A tag grouping containing only one resource is not a team or department in any meaningful sense. It is usually a misconfiguration, a test resource, or someone who applied a tag that nobody else on the team uses. Treating a single EC2 instance tagged team=johns-test as a legitimate cost centre adds noise to reports, inflates the count of apparent teams, and creates budget orphans — costs that don’t fit neatly into any real organisational unit.
The attribution gap. Untagged resources don’t just represent missing data — they represent real spend that is attributed to nobody. In most organisations, untagged spend starts at 20–40% of total cloud cost and only improves with active effort. Rather than waiting for tagging hygiene to catch up, the correlation engine infers likely ownership from cost behaviour patterns and surfaces suggestions for human review.
Tag-based attribution is only as good as the tags. Attribution intelligence is what you build when you accept that the tags will never be perfect.
The cardinality filter sits between the raw tag extraction and the daily rollup. It promotes a tag grouping — a specific department/team combination — to a legitimate cost centre only if it contains at least N distinct resources within a rolling window. Groupings below the threshold are reclassified to a __sparse__ category, keeping their costs visible without polluting attribution reports.
-- Tag groupings with sufficient resource cardinality over rolling 30 days
CREATE MATERIALIZED VIEW valid_tag_groups AS
SELECT
    department,
    team,
    COUNT(DISTINCT "ResourceId")                 AS resource_count,
    SUM(COALESCE("EffectiveCost", "BilledCost")) AS rolling_cost_usd,
    MIN(charge_date)                             AS first_seen,
    MAX(charge_date)                             AS last_seen
FROM cloud_cost_facts
WHERE charge_date >= current_date - 30
  AND department IS NOT NULL
  AND team IS NOT NULL
GROUP BY department, team
HAVING COUNT(DISTINCT "ResourceId") >= 2  -- minimum cardinality threshold
WITH DATA;

CREATE UNIQUE INDEX ON valid_tag_groups (department, team);

-- Extend the daily rollup to reclassify sparse groups
CREATE MATERIALIZED VIEW daily_cost_attributed AS
SELECT
    f.charge_date,
    f."ProviderName",
    f."ServiceCategory",
    -- Valid groups keep their tags, rows with a missing tag stay '__untagged__',
    -- and fully tagged rows below the cardinality threshold become '__sparse__'
    CASE WHEN v.team IS NOT NULL THEN f.department
         WHEN f.department IS NULL OR f.team IS NULL
              THEN COALESCE(f.department, '__untagged__')
         ELSE '__sparse__'
    END AS department,
    CASE WHEN v.team IS NOT NULL THEN f.team
         WHEN f.department IS NULL OR f.team IS NULL
              THEN COALESCE(f.team, '__untagged__')
         ELSE '__sparse__'
    END AS team,
    SUM(f."BilledCost")                              AS billed_cost_usd,
    SUM(COALESCE(f."EffectiveCost", f."BilledCost")) AS effective_cost_usd,
    COUNT(DISTINCT f."ResourceId")                   AS resource_count
FROM cloud_cost_facts f
LEFT JOIN valid_tag_groups v
       ON f.department = v.department
      AND f.team       = v.team
GROUP BY 1, 2, 3, 4, 5
WITH DATA;

Start with a minimum cardinality of 2 resources and review the __sparse__ cost percentage after the first week. If sparse costs exceed 10% of total spend, the threshold is probably too aggressive and should be lowered. If legitimate small teams are being incorrectly reclassified, consider adding a secondary condition: a group that has been consistently present for 60+ days is promoted regardless of resource count, because it is a small but stable team rather than a misconfiguration. Note that this condition needs a longer lookback than the 30-day window used in valid_tag_groups, since first_seen in that view can never be more than 30 days old.
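The promotion rule and the 10% review check described above can be sketched in a few lines of Python. This is a minimal illustration rather than part of the platform: the GroupStats record, the days_present field (a count of distinct charge dates for the group) and the function names are assumptions introduced here, and the figures are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupStats:
    department: str
    team: str
    resource_count: int  # distinct resources seen in the lookback window
    days_present: int    # distinct charge dates with spend (assumed column)
    cost_usd: float


def is_promoted(g: GroupStats, min_resources: int = 2, stable_days: int = 60) -> bool:
    """A group is a cost centre if it is large enough or has been stable long enough."""
    return g.resource_count >= min_resources or g.days_present >= stable_days


def sparse_cost_share(groups: list[GroupStats]) -> float:
    """Fraction of tagged spend that would land in the __sparse__ bucket."""
    total = sum(g.cost_usd for g in groups)
    if total == 0:
        return 0.0
    sparse = sum(g.cost_usd for g in groups if not is_promoted(g))
    return sparse / total


groups = [
    GroupStats("platform-engineering", "data-infra", 14, 30, 9000.0),
    GroupStats("platform-engineering", "johns-test", 1, 3, 400.0),  # singleton
    GroupStats("finance", "reporting", 1, 75, 600.0),               # small but stable
]

for g in groups:
    print(g.team, "promoted" if is_promoted(g) else "__sparse__")
print(f"sparse share: {sparse_cost_share(groups):.1%}")
```

Running a check like this against a week of real data before changing the SQL threshold makes it easier to see which groups a new rule would promote or demote.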
Resources that belong to the same workload tend to exhibit correlated spend over time. They scale together, they get created and destroyed together, their costs peak and trough in tandem — because they are driven by the same traffic, the same batch schedules, the same deployment events. This behavioural similarity is a signal that tag-based attribution cannot see but correlation can.
The correlation engine computes pairwise Pearson correlation of daily cost time series between each untagged resource and each valid tag group. High correlation suggests probable membership. The output is a ranked list of attribution suggestions for human review — not automatic re-attribution, which would carry too high a false-positive risk.
import numpy as np
import pandas as pd
from dataclasses import dataclass

# fetch_resource_meta is an assumed helper (not shown in this series) returning
# {resource_id: meta}, where meta exposes .provider, .service_name and .tags
from .db import fetch_resource_daily_costs, fetch_group_daily_costs, fetch_resource_meta


@dataclass
class AttributionSuggestion:
    resource_id: str
    provider: str
    service_name: str
    current_tags: dict
    suggested_dept: str
    suggested_team: str
    correlation: float  # Pearson r, between min_correlation and 1
    confidence: str     # 'high' | 'medium' | 'low'
    window_days: int


def suggest_attribution(
    window_days: int = 30,
    min_correlation: float = 0.75,
    min_overlap_days: int = 10,
    resource_ids: list[str] | None = None,  # optionally restrict to these resources
) -> list[AttributionSuggestion]:
    # Fetch untagged / sparse resources with sufficient cost history
    untagged = fetch_resource_daily_costs(
        tagged=False, window_days=window_days, min_days=min_overlap_days
    )
    if resource_ids is not None:
        wanted = set(resource_ids)
        untagged = {rid: s for rid, s in untagged.items() if rid in wanted}
    untagged_meta = fetch_resource_meta(list(untagged))

    # Fetch daily cost series for each valid tag group
    groups = fetch_group_daily_costs(window_days=window_days)

    suggestions = []
    for resource_id, r_series in untagged.items():
        best_corr = 0.0
        best_group = None
        for (dept, team), g_series in groups.items():
            # Align on common dates
            common = r_series.index.intersection(g_series.index)
            if len(common) < min_overlap_days:
                continue
            r_vals = r_series[common].values
            g_vals = g_series[common].values
            # Pearson r is undefined for a constant series, so skip flat costs
            if np.ptp(r_vals) == 0 or np.ptp(g_vals) == 0:
                continue
            # Pearson r is scale-invariant, so large and small groups compare
            # directly without any extra normalisation
            corr = float(np.corrcoef(r_vals, g_vals)[0, 1])
            if corr > best_corr:
                best_corr = corr
                best_group = (dept, team)

        if best_corr >= min_correlation and best_group:
            dept, team = best_group
            meta = untagged_meta[resource_id]
            suggestions.append(AttributionSuggestion(
                resource_id=resource_id,
                provider=meta.provider,
                service_name=meta.service_name,
                current_tags=meta.tags,
                suggested_dept=dept,
                suggested_team=team,
                correlation=best_corr,
                confidence='high' if best_corr >= 0.90
                           else 'medium' if best_corr >= 0.80
                           else 'low',
                window_days=window_days,
            ))
    return sorted(suggestions, key=lambda s: s.correlation, reverse=True)

Correlation does not imply causation or ownership. Two resources that happen to exhibit similar scaling patterns, perhaps because they are driven by the same traffic source rather than because they belong to the same team, will produce false-positive suggestions. Always route suggestions through a human review step. The platform stores confirmed attributions in an attribution_overrides table that takes precedence over tag-derived values in the rollup.
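The false-positive risk is easy to reproduce with synthetic data. The sketch below is illustrative only: the team names, scale factors and noise levels are invented, and the series are generated rather than drawn from real billing data. It builds two teams whose spend follows the same traffic curve and shows that an untagged resource correlates strongly with both, so the ranking alone cannot say which team owns it.

```python
import numpy as np

rng = np.random.default_rng(seed=7)
days = np.arange(30)

# Synthetic shared driver: site traffic with a weekly cycle
traffic = 100 + 40 * np.sin(2 * np.pi * days / 7)

# Two different teams both scale with that traffic; a third runs unrelated batch work
checkout_team = 5.0 * traffic + rng.normal(0, 5, size=30)
search_team = 0.8 * traffic + rng.normal(0, 1, size=30)
batch_team = rng.normal(200, 30, size=30)

# The untagged resource under review also follows traffic, at a much smaller scale
untagged_vm = 0.05 * traffic + rng.normal(0, 0.1, size=30)


def pearson(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    return float(np.corrcoef(a, b)[0, 1])


for name, series in [("checkout", checkout_team),
                     ("search", search_team),
                     ("batch", batch_team)]:
    print(f"{name:>8}: r = {pearson(untagged_vm, series):+.2f}")
```

Both traffic-driven teams clear a 0.75 threshold comfortably even though at most one of them owns the resource, which is why the engine's output is treated as a suggestion and never applied automatically.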
When a human confirms a suggestion, the confirmed attribution is written to an overrides table and applied during the next rollup refresh. The override persists until the resource is tagged correctly in the cloud console, at which point the tag takes precedence again.
CREATE TABLE attribution_overrides (
    resource_id   varchar NOT NULL,
    provider      varchar NOT NULL,
    department    varchar NOT NULL,
    team          varchar NOT NULL,
    confirmed_by  varchar,
    confirmed_at  timestamptz DEFAULT now(),
    expires_at    timestamptz,  -- NULL = until tag is corrected
    source        varchar DEFAULT 'correlation_suggestion',
    correlation   numeric,      -- r value that prompted the suggestion
    PRIMARY KEY (resource_id, provider)
);

-- Apply overrides in the rollup via a JOIN priority chain:
--   1. attribution_overrides (human-confirmed)
--   2. tag-derived department/team (from FOCUS Tags column)
--   3. __untagged__ / __sparse__ sentinel

Every confirmed suggestion should trigger a tagging remediation task (a Jira ticket, a GitHub issue, or a direct Slack message to the resource owner) asking them to apply the correct tag in the cloud console. The override is a temporary fix; correct tagging is the permanent solution. Track override count per team as a proxy for tagging debt.
The attribution intelligence built above has a direct partner: anomaly detection. The correlation engine doesn’t just improve cost reports in retrospect — it feeds in real time into the anomaly alert system. When an anomalous resource turns out to be untagged, the system immediately runs the correlation engine to suggest an owner, rather than routing the alert to an unowned queue.
The alert payload for an untagged anomalous resource includes an attribution hint that routes the alert to the most likely owning team — with the confidence score attached so the team knows whether to treat it as definitive or provisional.
With that handoff established, here is how anomaly detection works at the resource level.
Post 2 built forecast-based budget alerts — a slow-burn signal that fires when projected end-of-period spend looks likely to breach a budget. That signal operates at the department or team level and gives teams days to act. Anomaly detection is a different and complementary signal: it fires when spend deviates suddenly from expected patterns, operates at the individual resource level, and may require action within hours rather than days.
A team might be tracking well within budget at the aggregate level while a single runaway process — a misconfigured auto-scaler, a loop instantiating clients on every request, a forgotten data export — generates $300/day in unexpected spend. The budget forecast won’t catch this until it accumulates enough to move the aggregate. Anomaly detection catches it the day it happens.
Budget alerts tell you the trend. Anomaly alerts tell you what changed and where to look.
For each resource with sufficient cost history, the detector computes a rolling baseline — the expected cost range based on recent history — and compares each day’s actual cost against that baseline. Days that exceed the upper control limit are flagged as anomalies.
The baseline uses a rolling mean and standard deviation over a configurable window (default 14 days). The control limit is mean + N × stddev, where N defaults to 2.5, a threshold that balances sensitivity against false-positive rate for typical cloud spend patterns. An absolute floor on the excess (default $5) suppresses spurious alerts on resources whose spend is so low that ordinary noise clears the statistical limit, and resources with fewer than five days of baseline history are skipped.
from dataclasses import dataclass
from datetime import date, timedelta

import numpy as np


@dataclass
class ResourceAnomaly:
    resource_id: str
    provider: str
    service_name: str
    service_category: str
    department: str | None
    team: str | None
    anomaly_date: date
    actual_cost: float
    baseline_mean: float
    baseline_stddev: float
    upper_limit: float
    excess_cost: float  # actual - upper_limit
    sigma: float        # how many stddevs above mean
    is_untagged: bool


def detect_resource_anomalies(
    resource_daily_costs: dict[str, list[tuple[date, float]]],
    resource_meta: dict,
    window_days: int = 14,
    sigma_threshold: float = 2.5,
    abs_floor_usd: float = 5.0,  # ignore anomalies under $5 excess
    target_date: date | None = None,
) -> list[ResourceAnomaly]:
    target = target_date or date.today()
    window_start = target - timedelta(days=window_days)
    anomalies = []
    for resource_id, history in resource_daily_costs.items():
        # Split history into the baseline window and the target day
        baseline_costs = [c for d, c in history if window_start <= d < target]
        target_costs = [c for d, c in history if d == target]
        if len(baseline_costs) < 5 or not target_costs:
            continue

        costs = np.array(baseline_costs)
        mean = float(costs.mean())
        stddev = float(costs.std()) + 0.01  # avoid div-by-zero
        upper_limit = mean + sigma_threshold * stddev
        actual = target_costs[0]
        excess = actual - upper_limit
        if excess < abs_floor_usd:
            continue

        meta = resource_meta[resource_id]
        anomalies.append(ResourceAnomaly(
            resource_id=resource_id,
            provider=meta.provider,
            service_name=meta.service_name,
            service_category=meta.service_category,
            department=meta.department,
            team=meta.team,
            anomaly_date=target,
            actual_cost=actual,
            baseline_mean=mean,
            baseline_stddev=stddev,
            upper_limit=upper_limit,
            excess_cost=excess,
            sigma=(actual - mean) / stddev,
            is_untagged=meta.department is None,
        ))
    return sorted(anomalies, key=lambda a: a.excess_cost, reverse=True)

The simple rolling mean/stddev model works well for resources with stable spend patterns. For resources with strong weekly seasonality, such as a batch job that runs every Sunday, consider computing the baseline using only same-day-of-week historical values, or using day-of-week z-scores rather than raw values. Post 4 will revisit this when we introduce utilisation data that can disambiguate scheduled load from genuine anomalies.
Individual resource anomalies are rolled up to team level before alerting. This prevents a team from receiving five separate alerts when five resources in the same service all spike simultaneously — usually a sign of a single underlying cause. The rollup groups anomalies by team and publishes one alert per team per evaluation cycle, with all anomalous resources listed in the payload.
This is where the correlation engine from the first half of this post integrates directly. For any untagged resources in the anomaly batch, the rollup immediately runs suggest_attribution and embeds the result in the alert payload as an attribution_hint. The alert is then routed to the suggested team’s channel rather than an orphaned alerts queue.
from collections import defaultdict

from .detector import ResourceAnomaly
from .correlation import suggest_attribution
from .queue import publish_anomaly_alert


def roll_up_and_alert(anomalies: list[ResourceAnomaly]) -> None:
    # For untagged resources, attempt correlation-based attribution (one call per batch)
    attribution_hints = {}
    untagged_ids = [a.resource_id for a in anomalies if a.is_untagged]
    if untagged_ids:
        suggestions = suggest_attribution(
            resource_ids=untagged_ids,
            min_correlation=0.75,
        )
        attribution_hints = {s.resource_id: s for s in suggestions}

    # Group by team. An untagged resource joins its suggested team's alert, or
    # falls back to __untagged__ when no suggestion clears the threshold.
    by_team: dict[tuple, list] = defaultdict(list)
    for a in anomalies:
        hint = attribution_hints.get(a.resource_id)
        if a.is_untagged and hint is not None:
            key = (hint.suggested_dept, hint.suggested_team)
        else:
            key = (a.department or '__untagged__', a.team or '__untagged__')
        by_team[key].append(a)

    for (dept, team), team_anomalies in by_team.items():
        total_excess = sum(a.excess_cost for a in team_anomalies)
        payload = {
            "event_type": "SPEND_ANOMALY",
            "department": dept,
            "team": team,
            "total_excess_usd": total_excess,
            "resource_count": len(team_anomalies),
            "anomalous_resources": [
                {
                    "resource_id": a.resource_id,
                    "service_name": a.service_name,
                    "service_category": a.service_category,
                    "actual_cost_usd": a.actual_cost,
                    "expected_usd": a.baseline_mean,
                    "excess_usd": a.excess_cost,
                    "sigma": round(a.sigma, 1),
                    "is_untagged": a.is_untagged,
                    "attribution_hint": (
                        {"department": attribution_hints[a.resource_id].suggested_dept,
                         "team": attribution_hints[a.resource_id].suggested_team,
                         "confidence": attribution_hints[a.resource_id].confidence}
                        if a.resource_id in attribution_hints else None
                    ),
                }
                for a in sorted(team_anomalies, key=lambda x: x.excess_cost, reverse=True)
            ],
        }
        publish_anomaly_alert(payload)

An alert that names specific resources and their excess cost, rather than just reporting a team-level spend increase, gives the receiving engineer an immediate starting point for investigation:
SPEND_ANOMALY · platform-engineering / data-infra
3 resources · +$412 excess
i-0a4b2c8d9e1f · Amazon EC2 · Compute
Actual $280 vs expected $42 baseline (6.7σ) +$238
db-prod-analytics · Amazon RDS · Database
Actual $110 vs expected $58 baseline (3.1σ) +$52
i-0b9d3f2a1c7e · Amazon EC2 · Compute [UNTAGGED]
Actual $122 vs expected $0 baseline
Suggested: data-infra (r=0.91 HIGH) +$122
The third resource in this example is untagged but has been attributed to data-infra by the correlation engine with high confidence. The alert routes to that team’s Slack channel rather than an unowned alerts queue, and includes the attribution suggestion so the team can confirm and fix the tag.
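For completeness, here is one way a team-facing formatter might turn the rollup payload into text of that shape. It is a sketch only: format_anomaly_alert is a hypothetical name, the layout is an approximation of the example above, and it prints the confidence label alone because the payload built by the rollup does not carry the r value.

```python
def format_anomaly_alert(payload: dict) -> str:
    """Render a SPEND_ANOMALY payload as a plain-text message."""
    lines = [
        f"{payload['event_type']} · {payload['department']} / {payload['team']}",
        f"{payload['resource_count']} resource(s) · "
        f"+${payload['total_excess_usd']:,.0f} excess",
    ]
    for r in payload["anomalous_resources"]:
        flag = " [UNTAGGED]" if r["is_untagged"] else ""
        lines.append(
            f"{r['resource_id']} · {r['service_name']} · {r['service_category']}{flag}"
        )
        lines.append(
            f"  Actual ${r['actual_cost_usd']:,.0f} vs expected "
            f"${r['expected_usd']:,.0f} baseline ({r['sigma']}σ) "
            f"+${r['excess_usd']:,.0f}"
        )
        hint = r.get("attribution_hint")
        if hint:
            lines.append(f"  Suggested: {hint['team']} ({hint['confidence'].upper()})")
    return "\n".join(lines)


# Illustrative payload using the keys produced by roll_up_and_alert
sample = {
    "event_type": "SPEND_ANOMALY",
    "department": "platform-engineering",
    "team": "data-infra",
    "total_excess_usd": 360.0,
    "resource_count": 2,
    "anomalous_resources": [
        {"resource_id": "i-0a4b2c8d9e1f", "service_name": "Amazon EC2",
         "service_category": "Compute", "actual_cost_usd": 280.0,
         "expected_usd": 42.0, "excess_usd": 238.0, "sigma": 6.7,
         "is_untagged": False, "attribution_hint": None},
        {"resource_id": "i-0b9d3f2a1c7e", "service_name": "Amazon EC2",
         "service_category": "Compute", "actual_cost_usd": 122.0,
         "expected_usd": 0.0, "excess_usd": 122.0, "sigma": 9.9,
         "is_untagged": True,
         "attribution_hint": {"department": "platform-engineering",
                              "team": "data-infra", "confidence": "high"}},
    ],
}
print(format_anomaly_alert(sample))
```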
Anomaly alerts and budget forecast alerts are published to separate SNS topics. This separation matters for routing and filtering. Budget alerts are slow-burn signals appropriate for daily digest delivery or manager-facing dashboards. Anomaly alerts may indicate active runaway processes and should route to on-call channels where they will be seen within minutes.
ANOMALY_TOPIC = "arn:aws:sns:us-east-1:123456789012:finops-spend-anomalies"  # placeholder account ID

# MessageAttribute filters allow per-team SNS subscriptions:
# - Slack integration subscribes to all teams' anomalies
# - PagerDuty subscribes only to anomalies where excess_bucket is 'critical'
# - Team-specific Lambda formats and posts to team Slack channel
# (dept, team and total_excess come from the rollup loop)
MessageAttributes = {
    'department': {'DataType': 'String', 'StringValue': dept},
    'team': {'DataType': 'String', 'StringValue': team},
    'excess_bucket': {'DataType': 'String',
                      'StringValue': 'critical' if total_excess > 200
                                     else 'warning' if total_excess > 50
                                     else 'info'},
}

A recommended routing setup: the SNS anomaly topic fans out to a Lambda that reads a team_alert_routing table mapping department/team pairs to Slack channel IDs and PagerDuty service IDs. Teams opt into PagerDuty routing by setting a threshold (“page me only if excess exceeds $200”), which prevents alert fatigue from small noise anomalies while ensuring genuine cost incidents are escalated.
Each detected anomaly is written to a history table. This enables trend analysis — is a resource showing repeated anomalies suggesting a structural problem? — and feeds Post 5’s code analysis layer, which uses anomaly timestamps to correlate with deployment events.
CREATE TABLE anomaly_history (
    id                uuid PRIMARY KEY DEFAULT gen_random_uuid(),
    detected_at       timestamptz DEFAULT now(),
    anomaly_date      date NOT NULL,
    resource_id       varchar NOT NULL,
    provider          varchar NOT NULL,
    service_name      varchar,
    service_category  varchar,
    department        varchar,
    team              varchar,
    actual_cost       numeric(12,4),
    baseline_mean     numeric(12,4),
    excess_cost       numeric(12,4),
    sigma             numeric(6,2),
    alert_sent        boolean DEFAULT true,
    resolved_at       timestamptz  -- set when spend returns to baseline
);

CREATE INDEX ON anomaly_history (resource_id, anomaly_date);
CREATE INDEX ON anomaly_history (team, anomaly_date);
CREATE INDEX ON anomaly_history (detected_at);  -- for Post 5 deployment join

With attribution intelligence and anomaly detection in place, the platform has meaningfully improved on two fronts. Attribution is more reliable: singleton groupings no longer distort department reports, and untagged resources have a path to correct attribution without waiting for engineers to fix their tags. Anomaly detection provides same-day visibility into cost spikes at the resource level, with the correlation engine wiring the two systems together so untagged anomalous resources are never left without a destination.
The anomaly history table will become the key input to Post 5, where deployment correlation joins these timestamps against git history and CI/CD events to identify the code change responsible for each anomaly.
Post 4 extends the optimisation layer with utilisation data — pulling from CloudWatch, Azure Monitor, and GCP to answer not just “what did you spend?” but “what did you get for it?” — before Post 5 closes the loop entirely.
Getting these systems tuned for a real organisation — calibrating thresholds, establishing tagging policies, routing alerts to the right channels — requires coordination across engineering, finance, and platform teams. If you’re working through this and want to talk it through, get in touch.