Notebook Code Explanations and Math Reference¶

Lab 00: Warmup¶

Basic Probability Calculation¶

# Normalize raw scores by dividing each value by the total
probs = [v / sum(exp_shifted) for v in exp_shifted]

Math: normalization creates a valid distribution where probabilities sum to 1. This makes outputs comparable across different runs and inputs.

\[ P(i) = \frac{v_i}{\sum_j v_j} \]

Softmax Formula¶

Also check Pre-Activation

\[ P(i) = \frac{e^{z_i}}{\sum_j e^{z_j}} \]

Softmax converts raw scores (logits) into probabilities in [0, 1]. It also magnifies relative score differences so the model can rank choices clearly.

Lab 01: Tokenization¶

Token Counting¶

def pseudo_token_count(text):
    return len(text.split())

This notebook uses a simple approximation. Real models use specialized tokenizers. Use this approximation to reason about budget pressure before adding production tokenizers.

Budget Constraint Equation¶

\[ T_{total} = T_{system} + T_{history} + T_{user} + T_{output} \]

Constraint:

\[ T_{total} \le T_{window} \]

Operational decisions:

Short tickets: keep full history.
Long tickets: summarize or trim history.

The key idea is that token budget is a capacity constraint, similar to memory limits.

Lab 02: Probability¶

Normalization Step¶

total = sum(raw_scores.values())
probs = {k: v / total for k, v in raw_scores.items()}

This turns arbitrary scores into valid confidence values for policy thresholds.

Expected Value Formula¶

\[ \mathbb{E}[X] = \sum_i x_i P(i) \]

Example workload estimate:

6 * 0.6 + 14 * 0.25 + 32 * 0.15 = 11.9 minutes per ticket.

Expected value helps estimate staffing load from predicted intent mix.

Lab 03: Logits and Softmax¶

Numerical Stability¶

shift = max(scaled)
exps = [math.exp(x - shift) for x in scaled]

Equivalent identity:

\[ \frac{e^{x-c}}{\sum_j e^{y_j-c}} = \frac{e^x}{\sum_j e^{y_j}} \]

Subtracting a constant avoids overflow but keeps the same probability ratios.

Temperature Scaling¶

\[ P_T(i) = \frac{e^{z_i/T}}{\sum_j e^{z_j/T}} \]

T < 1: sharper, more deterministic.
T = 1: baseline.
T > 1: flatter, more stochastic.

Use lower temperature for routing and higher temperature for exploratory drafting.

Lab 04: Entropy¶

Shannon Entropy¶

\[ H(P) = -\sum_i P(i) \log_2 P(i) \]

Higher entropy means more uncertainty spread across classes.

Normalized Entropy¶

\[ H_{norm} = \frac{H(P)}{\log_2 n} \]

Policy example:

Auto-route only if top probability is high and entropy is low.

Combining both signals reduces overconfident automation mistakes.

Lab 05: Variance and Standard Deviation¶

Variance Formula¶

\[ \mathrm{Var}(X) = \frac{1}{n}\sum_i (x_i - \mu)^2 \]

\[ \sigma = \sqrt{\mathrm{Var}(X)} \]

Mean shows center; variance and standard deviation show stability.

Stability Policy¶

def stability_decision(std, good=0.20, watch=0.35):
    if std <= good:
        return "go"
    if std <= watch:
        return "monitor"
    return "hold"

Low spread: stable for production.
High spread: investigate before rollout.

This keeps quality predictable across repeated runs.

Lab 06: Determinism¶

Argmax Operation¶

idx = max(range(len(p)), key=lambda i: p[i])
choice = labels[idx]

Argmax is deterministic for fixed inputs. It is preferred in high-control workflows where reproducibility matters.

Sampling Operation¶

def sample_choice(labels, p, seed=None):
    rng = random.Random(seed)
    ...

Sampling is stochastic; seeding enables reproducibility. This is useful when you want diversity but still need auditability.

Lab 07: Regression¶

Linear Regression Model¶

\[ \hat{y} = b_0 + b_1 x \]

b0: intercept.
b1: slope.

Interpret slope as "expected output change per one-unit input increase."

Residual Analysis¶

\[ r_i = y_i - \hat{y}_i \]

Persistent residual patterns indicate bias. Residual shape often reveals missing features or non-linear behavior.

Error Metrics¶

\[ \mathrm{MAE} = \frac{1}{n}\sum_i |r_i| \]

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_i r_i^2} \]

RMSE penalizes large misses more heavily than MAE.

Lab 08: Classification and Calibration¶

Precision and Recall¶

\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]

\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]

Threshold tuning is a business trade-off between false positives and false negatives.

Calibration Check¶

Compare predicted probabilities vs observed frequencies per bin.

Well-calibrated model: predicted and observed rates are close.

If they diverge, confidence thresholds should be recalibrated.

Lab 09: Correlation vs Causation¶

Correlation Coefficient¶

\[ r = \frac{\mathrm{cov}(x, y)}{\sigma_x\sigma_y} \]

r measures association, not causality. Treat correlation as an investigation signal, not a policy proof.

Confounder Segmentation¶

Segment by severity/queue/time before making policy conclusions.
Use controlled experiments for causal claims.

Segmentation helps reveal confounders before rollout decisions.

Lab 10: Sampling Controls¶

Top-k Filtering¶

pairs = sorted(zip(labels, probs), key=lambda x: x[1], reverse=True)[:k]
renormalize(pairs)

Top-k caps candidate count; top-p caps cumulative probability mass.

Top-p Filtering¶

keep = []
total = 0.0
for label, pr in sorted_pairs:
    keep.append((label, pr))
    total += pr
    if total >= p:
        break

Renormalization¶

\[ P_{new}(i) = \frac{P_{old}(i)}{\sum_{j \in S} P_{old}(j)} \]

Renormalization is required so filtered probabilities still sum to 1.

Lab 11: Guardrails and Thresholds¶

Three-Band Policy¶

def action(p, low=0.55, high=0.80):
    if p >= high:
        return 'auto'
    if p <= low:
        return 'abstain'
    return 'human-review'

Auto: p >= high.
Review: low < p < high.
Abstain: p <= low.

This three-band policy maps confidence directly to operational action.

Threshold Tuning¶

More auto: faster, riskier.
More review: safer, costlier.

Tune by risk tier rather than using one global threshold.

Lab 12: Evaluation in Production¶

Weighted KPI¶

KPI = 0.4 * precision + 0.4 * sla_hit_rate + 0.2 * inverse_override_rate

Weights should reflect business priorities and compliance risk.

Rollout Gate¶

delta = candidate_kpi - baseline_kpi
if delta >= 0.01:
    rollout
elif delta <= -0.01:
    rollback
else:
    hold

Decision bands:

delta >= 0.01: expand rollout.
-0.01 < delta < 0.01: hold and monitor.
delta <= -0.01: rollback.

Predefined gates prevent subjective rollout decisions under pressure.

General Tips for Code and Math¶

Validate assumptions (probabilities sum to 1, thresholds documented).
Write formulas first, then code.
Test edge cases.
Monitor operational metrics, not only prediction outputs.