
FinOps Engineering: Forecasting and Budget Alerts

Mark · 4 min read · Part 2 of 5, FinOps Engineering
A dramatic silhouette of a lighthouse acting as an early warning system against incoming storms.

Mussenden Temple, Castlerock — K. Mitch Hodge / Unsplash

From Attribution to Accountability

Post 1 established that we can reliably answer “how much did each team spend?” For that answer to drive behaviour, it needs to be compared against a target. That target is the budget: a pre-agreed upper bound on what a department or team should spend in a given period. The forecast engine projects current spend trends to the end of the period. The alert queue fires when the projected total reaches the budget's alert threshold (80% of budget by default) and escalates when the projection exceeds the budget itself.

The goal is never to report overages after they happen. It is to give teams enough warning to act before the period closes.

The Budget Table

Budgets are stored as period-bounded records with a configurable alert threshold and a cost basis field that determines whether the forecast runs against EffectiveCost or BilledCost.

SQL schema/budgets.sql
CREATE TABLE budgets (
    id                  uuid          PRIMARY KEY DEFAULT gen_random_uuid(),
    department          varchar       NOT NULL,
    team                varchar,      -- NULL means the budget applies to the whole department
    period_start        date          NOT NULL,
    period_end          date          NOT NULL,
    budget_usd          numeric(12,2) NOT NULL,
    cost_basis          varchar       NOT NULL DEFAULT 'effective',
    alert_threshold_pct numeric       NOT NULL DEFAULT 80,  -- percent of budget_usd
    created_by          varchar,
    created_at          timestamptz   NOT NULL DEFAULT now(),
    CONSTRAINT valid_period    CHECK (period_end > period_start),
    CONSTRAINT valid_basis     CHECK (cost_basis IN ('effective', 'billed')),
    -- The evaluator divides by budget_usd, so zero or negative budgets are rejected here
    CONSTRAINT positive_budget CHECK (budget_usd > 0),
    CONSTRAINT valid_threshold CHECK (alert_threshold_pct > 0)
);

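-- Track alert history for deduplication
CREATE TABLE budget_alerts_sent (
    id             uuid        PRIMARY KEY DEFAULT gen_random_uuid(),
    budget_id      uuid        NOT NULL REFERENCES budgets(id),
    sent_at        timestamptz NOT NULL DEFAULT now(),
    severity       varchar     NOT NULL,  -- 'WARNING' | 'CRITICAL'
    forecast_pct   numeric     NOT NULL,
    forecast_total numeric     NOT NULL
);

-- The deduplication check looks up the most recent alert for a budget
CREATE INDEX budget_alerts_sent_latest ON budget_alerts_sent (budget_id, sent_at DESC);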
Cost basis choice

Most teams should start with effective — it reflects true amortised consumption and is the right basis for engineering accountability. Set billed only when a budget must reconcile directly to a cloud invoice, typically for external reporting or contractual commitments.
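
As a minimal sketch of how the cost_basis value might be turned into a rollup column, the helper below maps the two allowed values onto column names. The column names effective_cost_usd and billed_cost_usd are taken from the comment in the evaluator code; the helper name cost_column and the dictionary are hypothetical, not part of the platform as published.

```python
# Hypothetical helper: maps budgets.cost_basis to a rollup column name.
COST_COLUMNS = {
    "effective": "effective_cost_usd",  # amortised consumption (FOCUS EffectiveCost)
    "billed": "billed_cost_usd",        # invoice-reconcilable charges (FOCUS BilledCost)
}


def cost_column(cost_basis: str) -> str:
    """Return the rollup column for a cost basis, rejecting unknown values."""
    try:
        return COST_COLUMNS[cost_basis]
    except KeyError:
        raise ValueError(f"unknown cost_basis: {cost_basis!r}") from None
```

Resolving the column through a fixed lookup, rather than interpolating the raw value into a query, keeps the set of selectable columns aligned with the valid_basis constraint on the table.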

The Forecast Model

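The forecast answers one question: given spending so far in this budget period, where will we land by the period end if current trends continue? A weighted linear regression over a rolling 14-day window provides the projection. Recent days are weighted more heavily than older ones, so the forecast responds to acceleration in spend without being dominated by a single spike. The fitted value at the most recent day is taken as the daily run rate and applied to every remaining day in the period. With fewer than three days of data, the model falls back to a simple daily average.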
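
The effect of the weighting can be seen in isolation with a small sketch. It uses the same 0.4 to 1.0 linear weights and degree-1 fit as the forecast function, applied to three synthetic cost series; the function name weighted_run_rate and the sample figures are illustrative assumptions, not real data.

```python
import numpy as np


def weighted_run_rate(costs: list[float]) -> float:
    """Fit a weighted line through daily costs and return its value at the last day."""
    xs = np.arange(len(costs), dtype=float)
    ys = np.asarray(costs, dtype=float)
    weights = np.linspace(0.4, 1.0, len(costs))  # newest day counts most
    slope, intercept = np.polyfit(xs, ys, deg=1, w=weights)
    return max(float(slope * xs[-1] + intercept), 0.0)


steady = [100.0] * 14                        # flat spend
ramp = [100.0 + 5.0 * i for i in range(14)]  # spend growing by $5 per day
spike = [100.0] * 13 + [400.0]               # one-off spike on the last day

for name, series in (("steady", steady), ("ramp", ramp), ("spike", spike)):
    print(f"{name}: run rate ${weighted_run_rate(series):.2f}/day")
```

The flat series yields a run rate of about $100/day and the ramp yields its latest value of about $165/day. The spike pulls the run rate above $100 but well short of $400, which is the damping behaviour described above.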

Python forecasting/period_forecast.py
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

import numpy as np


@dataclass
class ForecastResult:
    department:       str
    team:             str
    period_start:     date
    period_end:       date
    cost_basis:       str
    actual_to_date:   float
    forecast_total:   float
    daily_rate:       float    # projected spend per remaining day
    budget_usd:       float
    overage_forecast: float    # negative = under budget
    confidence:       str      # 'high' | 'medium' | 'low'
    days_remaining:   int


def forecast_period(
    daily_costs:  list[tuple[date, float]],
    period_start: date,
    period_end:   date,
    budget_usd:   float,
    cost_basis:   str  = 'effective',
    window_days:  int  = 14,
    dept:         str  = '',
    team:         str  = '',
    as_of:        Optional[date] = None,   # override "today" for backfills and tests
) -> ForecastResult:

    today      = as_of or date.today()
    budget_usd = float(budget_usd)         # DB drivers often return numeric as Decimal

    # Keep in-period rows only, oldest first, so the window below is the most recent days
    period_costs = sorted(
        (d, float(c)) for d, c in daily_costs
        if period_start <= d <= min(today, period_end)
    )
    actual_to_date = float(sum(c for _, c in period_costs))

    # The rollup lags behind "today", so count days from the last day that has data
    last_day       = period_costs[-1][0] if period_costs else period_start - timedelta(days=1)
    days_elapsed   = max((last_day - period_start).days + 1, 1)
    days_remaining = max((period_end - last_day).days, 0)

    window = period_costs[-window_days:]
    if len(window) < 3:
        # Too early in the period for a fit: use the simple daily average
        daily_rate = actual_to_date / days_elapsed
        confidence = 'low'
    else:
        # Index-based x values assume one row per day in the window
        xs      = np.arange(len(window), dtype=float)
        ys      = np.array([c for _, c in window], dtype=float)
        weights = np.linspace(0.4, 1.0, len(xs))   # newest day counts most
        coeffs  = np.polyfit(xs, ys, deg=1, w=weights)
        # Run rate = fitted value at the most recent day, floored at zero
        daily_rate = max(float(np.polyval(coeffs, xs[-1])), 0.0)
        confidence = 'high' if len(window) >= 10 else 'medium'

    forecast_total = actual_to_date + daily_rate * days_remaining

    return ForecastResult(
        department=dept, team=team,
        period_start=period_start, period_end=period_end,
        cost_basis=cost_basis,
        actual_to_date=actual_to_date,
        forecast_total=forecast_total,
        daily_rate=daily_rate,
        budget_usd=budget_usd,
        overage_forecast=forecast_total - budget_usd,
        confidence=confidence,
        days_remaining=days_remaining,
    )
Model limitations

The weighted linear model handles steady-state workloads well. For teams with strong weekly seasonality (batch jobs that run only at weekends, for example) it may over- or under-project, depending on where in the weekly cycle the evaluation falls. Post 3’s anomaly detection layer will flag deviations from expected patterns. For orgs with mature cost history, the open-source Prophet library (originally released by Facebook) models trend plus seasonality automatically and is worth adopting when the simpler model proves insufficient.
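
Before reaching for Prophet, one lightweight middle ground is a day-of-week profile: project each remaining day at the mean cost seen on that weekday so far. This is a different technique from the weighted regression above, not part of the platform as described, and the function name weekday_profile_forecast is hypothetical. It is a sketch that assumes at least one full week of history and ignores trend entirely.

```python
from collections import defaultdict
from datetime import date, timedelta


def weekday_profile_forecast(
    daily_costs: list[tuple[date, float]],
    period_end: date,
) -> float:
    """Project remaining spend using the mean cost observed on each weekday."""
    if not daily_costs:
        raise ValueError("need at least one day of cost history")

    by_weekday: dict[int, list[float]] = defaultdict(list)
    for day, cost in daily_costs:
        by_weekday[day.weekday()].append(cost)

    means = {wd: sum(costs) / len(costs) for wd, costs in by_weekday.items()}
    overall = sum(cost for _, cost in daily_costs) / len(daily_costs)

    # Walk the days after the last observation, through period_end inclusive
    remaining = 0.0
    day = max(d for d, _ in daily_costs) + timedelta(days=1)
    while day <= period_end:
        # Fall back to the overall mean for a weekday that has no history yet
        remaining += means.get(day.weekday(), overall)
        day += timedelta(days=1)
    return remaining


# Two weeks of history: $10 on weekdays, $100 on Saturday and Sunday
history = [
    (date(2024, 1, 1) + timedelta(days=i), 100.0 if i % 7 >= 5 else 10.0)
    for i in range(14)
]
print(weekday_profile_forecast(history, period_end=date(2024, 1, 21)))
```

For the weekend-heavy series in the example, the remaining week is projected as five weekdays at $10 plus two weekend days at $100, which a single flat run rate would not capture.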

The Budget Evaluation Loop

The evaluator runs nightly after the daily rollup refreshes. For every active budget it fetches historical costs, runs the forecast, and publishes an alert when the projected total breaches the configured threshold.

Python alerting/budget_evaluator.py
from datetime import date

# Assumes alerting/ and forecasting/ are sibling packages under one parent package
from ..forecasting.period_forecast import forecast_period
from .queue import publish_budget_alert
from .db import fetch_active_budgets, fetch_daily_costs, should_suppress


def evaluate_all_budgets() -> None:
    today   = date.today()
    budgets = fetch_active_budgets(as_of=today)

    for budget in budgets:
        budget_usd = float(budget.budget_usd)  # numeric columns often arrive as Decimal
        if budget_usd <= 0:
            continue  # a zero budget has no meaningful forecast percentage

        costs = fetch_daily_costs(
            department = budget.department,
            team       = budget.team,
            from_date  = budget.period_start,
            to_date    = today,
            # 'effective' or 'billed'; selects effective_cost_usd or billed_cost_usd
            cost_basis = budget.cost_basis,
        )

        result = forecast_period(
            daily_costs  = costs,
            period_start = budget.period_start,
            period_end   = budget.period_end,
            budget_usd   = budget_usd,
            cost_basis   = budget.cost_basis,
            dept         = budget.department,
            team         = budget.team or '',
        )

        fcst_pct = (result.forecast_total / budget_usd) * 100

        if fcst_pct < float(budget.alert_threshold_pct):
            continue  # forecast is below the alert threshold: no alert

        # CRITICAL once the forecast reaches the budget itself, WARNING below that
        severity = 'CRITICAL' if fcst_pct >= 100 else 'WARNING'

        # Suppress if we already sent this severity and forecast hasn't worsened ≥5%
        if should_suppress(budget.id, severity, fcst_pct):
            continue

        publish_budget_alert(budget=budget, result=result,
                             severity=severity, fcst_pct=fcst_pct)
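
The should_suppress helper is imported above but its body is not shown in this post. A minimal sketch of the decision it has to make might look like the following. It is written as a pure function over the most recent alert record so it can be tested without a database; the SentAlert type, the function name, and the reading of "worsened 5%" as five percentage points of budget are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class SentAlert:
    """Hypothetical view of the latest budget_alerts_sent row for a budget."""
    severity: str
    forecast_pct: float
    sent_at: datetime


def should_suppress_alert(
    last: Optional[SentAlert],
    severity: str,
    fcst_pct: float,
    now: datetime,
    window: timedelta = timedelta(hours=24),
    worsen_points: float = 5.0,   # assumed to mean percentage points of budget
) -> bool:
    """Return True when a new alert would only repeat the previous one."""
    if last is None:
        return False                      # nothing sent yet
    if last.severity != severity:
        return False                      # severity changed: always send
    if now - last.sent_at >= window:
        return False                      # previous alert is outside the window
    return (fcst_pct - last.forecast_pct) < worsen_points


now = datetime(2024, 1, 10, 8, 0, tzinfo=timezone.utc)
previous = SentAlert("WARNING", 85.0, now - timedelta(hours=12))
print(should_suppress_alert(previous, "WARNING", 87.0, now))   # repeat of a recent alert
print(should_suppress_alert(previous, "CRITICAL", 101.0, now)) # escalation
```

In a real deployment the last argument would come from a lookup against budget_alerts_sent keyed on the budget id.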

Alert Queue Design and Routing

Every alert is published to a message queue — AWS SNS, GCP Pub/Sub, or Azure Service Bus depending on your primary cloud. The queue decouples the evaluation engine from delivery destinations. Downstream consumers handle routing to Slack, PagerDuty, email, or any webhook endpoint.

Python alerting/queue.py
import json
from datetime import datetime, timezone

import boto3

from .db import record_alert_sent, fetch_top_services

sns   = boto3.client('sns')
# Placeholder ARN (AWS account IDs are 12 digits); substitute your own region and account
TOPIC = "arn:aws:sns:us-east-1:123456789012:finops-budget-alerts"


def publish_budget_alert(budget, result, severity, fcst_pct) -> None:
    top_services = fetch_top_services(
        department   = budget.department,
        team         = budget.team,
        period_start = budget.period_start,
        limit        = 5,
        cost_basis   = budget.cost_basis,
    )

    payload = {
        "event_type":     "BUDGET_FORECAST_OVERAGE",
        "severity":       severity,
        "timestamp":      datetime.now(timezone.utc).isoformat(),
        "department":     budget.department,
        "team":           budget.team,
        "cost_basis":     budget.cost_basis,
        "period_start":   budget.period_start.isoformat(),
        "period_end":     budget.period_end.isoformat(),
        "budget_usd":     float(budget.budget_usd),  # Decimal is not JSON-serialisable
        "actual_to_date": result.actual_to_date,
        "forecast_total": result.forecast_total,
        "overage_usd":    result.overage_forecast,
        "forecast_pct":   fcst_pct,
        "daily_rate_usd": result.daily_rate,
        "days_remaining": result.days_remaining,
        "confidence":     result.confidence,
        "top_services": [
            {"ServiceName": s.name, "ServiceCategory": s.category,
             "cost_usd": float(s.cost)}
            for s in top_services
        ],
    }

    subject = (
        f"[{severity}] {budget.department}/{budget.team or 'all'} "
        f"forecast {fcst_pct:.0f}% of budget"
    )

    sns.publish(
        TopicArn  = TOPIC,
        Message   = json.dumps(payload),
        Subject   = subject[:99],  # SNS requires subjects shorter than 100 characters
        MessageAttributes={
            'severity':   {'DataType': 'String', 'StringValue': severity},
            'department': {'DataType': 'String', 'StringValue': budget.department},
        },
    )
    # Record only after a successful publish, so a failed send is retried on the next run
    record_alert_sent(budget.id, severity, fcst_pct, result.forecast_total)
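
Because the publisher sets severity and department as MessageAttributes, a consumer can subscribe with an SNS filter policy and receive only the alerts it cares about. The sketch below builds the arguments for boto3's sns.subscribe call without making the call itself. The helper name, the Lambda endpoint ARN, and the department value are placeholders.

```python
import json


def build_subscription_kwargs(
    topic_arn: str,
    endpoint_arn: str,
    department: str,
    severities: tuple[str, ...] = ("WARNING", "CRITICAL"),
) -> dict:
    """Build keyword arguments for sns.subscribe with a per-department filter."""
    # Keys match the MessageAttributes set by the publisher; keys are ANDed,
    # values within a list are ORed.
    filter_policy = {
        "department": [department],
        "severity": list(severities),
    }
    return {
        "TopicArn": topic_arn,
        "Protocol": "lambda",
        "Endpoint": endpoint_arn,
        "Attributes": {"FilterPolicy": json.dumps(filter_policy)},
        "ReturnSubscriptionArn": True,
    }


kwargs = build_subscription_kwargs(
    topic_arn="arn:aws:sns:us-east-1:123456789012:finops-budget-alerts",
    endpoint_arn="arn:aws:lambda:us-east-1:123456789012:function:platform-alerts",  # placeholder
    department="platform",
)
print(kwargs["Attributes"]["FilterPolicy"])
# With credentials configured: boto3.client("sns").subscribe(**kwargs)
```

Filtering at the subscription keeps routing decisions out of the evaluator, which only needs to publish once per alert.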

Alert flow end-to-end

The nightly evaluation sequence is:

  1. Rollup refreshes (07:00 UTC): the previous day’s FOCUS data is aggregated by team
  2. Evaluator runs (08:00 UTC): a forecast is computed for every active budget and thresholds are checked
  3. Deduplication check: suppress if the same severity was sent within 24h and the forecast hasn’t worsened by 5% or more
  4. Published to SNS: MessageAttributes enable per-department subscription filter policies
  5. Routed to Slack / PagerDuty: a Lambda subscriber reads the routing config and formats the message with the top services breakdown
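
The final step in the sequence above can be sketched as a Lambda subscriber. The event shape (Records, then Sns, then Message) is the standard SNS-to-Lambda envelope; the ROUTES table and destination names are a hypothetical stand-in for whatever routing config you keep, and actual delivery to Slack or PagerDuty is left out.

```python
import json

# Hypothetical routing config: which destinations receive each severity
ROUTES = {
    "CRITICAL": ["pagerduty", "slack"],
    "WARNING": ["slack"],
}


def route_alerts(event: dict) -> list[tuple[str, dict]]:
    """Turn an SNS-triggered Lambda event into (destination, payload) pairs."""
    deliveries = []
    for record in event.get("Records", []):
        # SNS delivers the published message body as a JSON string
        payload = json.loads(record["Sns"]["Message"])
        for destination in ROUTES.get(payload["severity"], ["slack"]):
            deliveries.append((destination, payload))
    return deliveries


def handler(event, context):
    """Lambda entry point; real senders would replace the print call."""
    for destination, payload in route_alerts(event):
        print(f"would send {payload['severity']} alert for "
              f"{payload['department']} to {destination}")
```

Keeping route_alerts free of network calls means the routing rules can be unit-tested with a hand-built event.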
Alert content matters

An alert that says “you are 85% through your budget” produces no action. An alert that says “you are 85% through your budget, forecast to land at 112%, current burn rate $340/day, top driver is Amazon EC2 Compute at $210/day — here are the 3 largest instances” gives the receiving team something concrete to investigate. The top_services array in the payload is what enables this. Post 3 will enrich this further with specific anomalous resources.
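
As a rough illustration of turning the payload into that kind of message, the sketch below formats a plain-text summary from the fields published to the queue. The function name and wording are assumptions. Note that cost_usd in top_services is treated here as a period-to-date figure, not a per-day rate.

```python
def format_alert_text(payload: dict) -> str:
    """Render a budget alert payload as a short, actionable text message."""
    spent_pct = payload["actual_to_date"] / payload["budget_usd"] * 100
    scope = f"{payload['department']}/{payload['team'] or 'all'}"
    lines = [
        f"[{payload['severity']}] {scope}: {spent_pct:.0f}% of budget spent, "
        f"forecast to land at {payload['forecast_pct']:.0f}%",
        f"Current burn rate ${payload['daily_rate_usd']:,.0f}/day "
        f"with {payload['days_remaining']} days remaining",
    ]
    if payload.get("top_services"):
        lines.append("Top services this period:")
        for service in payload["top_services"]:
            lines.append(f"  - {service['ServiceName']}: ${service['cost_usd']:,.0f}")
    return "\n".join(lines)


# Illustrative payload, not real data
sample = {
    "severity": "CRITICAL",
    "department": "platform",
    "team": None,
    "budget_usd": 10000.0,
    "actual_to_date": 8500.0,
    "forecast_pct": 112.0,
    "daily_rate_usd": 340.0,
    "days_remaining": 8,
    "top_services": [
        {"ServiceName": "Amazon EC2", "ServiceCategory": "Compute", "cost_usd": 5200.0},
    ],
}
print(format_alert_text(sample))
```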

What This Adds to the Platform

With the forecasting layer complete, the platform now actively watches budgets and notifies teams before periods close over budget. The combination of the FOCUS-aligned rollup from Post 1 and the forecast engine here means every alert carries both the historical trend and the projected outcome — not just a point-in-time snapshot.

Post 3 comes back upstream to the attribution layer and addresses the two most common sources of noise in any tag-driven system: singleton groupings that don’t represent real teams, and untagged resources whose ownership can be inferred from cost pattern similarity. These improvements also directly feed the anomaly detection alert payload — when an anomalous resource is untagged, the correlation engine runs immediately to suggest ownership.
