
🧮 Thresholds, Utility and Confusion Matrices

Author: Maksim Ohvrill

Published: April 28, 2025

This evaluation compares the decision utility of the CNN and Inferno, both operating on raw logits. For the CNN, utility is assessed at two thresholds: the commonly used default of 0.5 and a threshold optimized on the same training set that was used to train the Inferno model. Inferno's utility is derived from its calibrated probabilistic output. All comparisons use a fixed utility matrix constructed from findings in the clinical research literature.
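To make this scoring concrete, here is a minimal sketch of the utility bookkeeping used throughout this page: each case contributes the utility of the chosen decision given the true state, and a model's score is the average over all cases. The 2×2 matrix and the decision/state vectors below are made up for illustration; the actual 4×4 matrix over joint (effusion, atelectasis) states is defined in the threshold-search code further down.

Code
import numpy as np

# Toy utility matrix: rows = chosen decision, columns = true state.
toy_utility = np.array([
    [1.0, 0.2],   # decide "negative": utility against true negative / true positive
    [0.6, 0.9],   # decide "positive": utility against true negative / true positive
])

decisions   = np.array([0, 1, 1, 0])   # hypothetical decisions
true_states = np.array([0, 1, 0, 1])   # hypothetical true states

# Average utility = mean of toy_utility[decision_i, true_state_i] over all cases
print(f"Average utility: {toy_utility[decisions, true_states].mean():.3f}")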

Load data for evaluation

Code
import torch
import numpy as np
import pandas as pd
from rich import print
from torch.nn import BCEWithLogitsLoss

from CNN import CALIB_DIR


def load_logits_and_calculate_loss(csv_path):
    df = pd.read_csv(csv_path)

    logits = df[["LOGIT_EFFUSION", "LOGIT_ATELECTASIS"]].to_numpy(dtype="float32")
    labels = df[["LABEL_EFFUSION", "LABEL_ATELECTASIS"]].to_numpy(dtype="float32")

    logits_tensor = torch.tensor(logits)
    labels_tensor = torch.tensor(labels)

    loss_fn = BCEWithLogitsLoss()
    avg_loss = loss_fn(logits_tensor, labels_tensor).item()

    return df, logits_tensor, labels_tensor, avg_loss


df_full, preds_full, targets_full, loss_full = load_logits_and_calculate_loss(
    CALIB_DIR / "calibration_full.csv"
)

df_train, preds_train, targets_train, loss_train = load_logits_and_calculate_loss(
    CALIB_DIR / "calibration_train.csv"
)

df_test, preds_test, targets_test, loss_test = load_logits_and_calculate_loss(
    CALIB_DIR / "calibration_test.csv"
)

Find best threshold from Training Set

Code
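# utility_matrix[decision, truth]: rows index the chosen joint decision and columns the
# true joint state, both encoded via label_index(e, a) = e + 2*a
# (0 = neither, 1 = effusion only, 2 = atelectasis only, 3 = both).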
utility_matrix = np.array([
    [1.00, 0.55, 0.60, 0.40],
    [0.90, 0.60, 0.65, 0.75],
    [0.90, 0.65, 0.60, 0.75],
    [0.80, 0.85, 0.85, 0.60],
])

label_index = lambda e, a: e + 2 * a

def find_best_threshold(preds_tensor, targets_tensor, avg_loss):
    probs = torch.sigmoid(preds_tensor).numpy()
    targets = targets_tensor.numpy()

    print("\n--- Threshold Search on Train Set ---")
    print(f"Train BCEWithLogitsLoss: {avg_loss:.4f}")

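    # Candidate thresholds from 0.10 to 0.90 in steps of 0.01 (81 values)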
    thresholds = np.linspace(0.1, 0.9, 81)
    best_utility = -1
    best_threshold = 0.5

    for threshold in thresholds:
        decisions_util = []
        true_vals_util = []
        for prob_vec, true_vec in zip(probs, targets):
            pred_bin = (prob_vec > threshold).astype(int)
            true_bin = true_vec.astype(int)

            pred_idx = int(label_index(pred_bin[0], pred_bin[1]))
            true_idx = int(label_index(true_bin[0], true_bin[1]))

            decisions_util.append(pred_idx)
            true_vals_util.append(true_idx)

        avg_utility = np.mean([utility_matrix[p, t] for p, t in zip(decisions_util, true_vals_util)])

        if avg_utility > best_utility:
            best_utility = avg_utility
            best_threshold = threshold

        if threshold == 0.5:
            print(f"Utility Score at Threshold 0.50: {avg_utility:.4f}")

    print(f"\nBest Threshold for Utility: {best_threshold:.2f}")
    print(f"Utility Score (Train Set): {best_utility:.4f}")

    return best_threshold


best_thresh = find_best_threshold(preds_train, targets_train, loss_train)
--- Threshold Search on Train Set ---
Train BCEWithLogitsLoss: 0.4301
Utility Score at Threshold 0.50: 0.8939
Best Threshold for Utility: 0.28
Utility Score (Train Set): 0.9131

Evaluate CNN (sigmoid) on the test set with the best threshold derived from the training set

Code
def evaluate_at_threshold(preds_tensor, targets_tensor, threshold, avg_loss):
    probs = torch.sigmoid(preds_tensor).numpy()
    targets = targets_tensor.numpy()

    print(f"\n--- Evaluation at Threshold {threshold:.2f} ---")
    print(f"Test BCEWithLogitsLoss: {avg_loss:.4f}")

    decisions_util = []
    true_vals_util = []

    for prob_vec, true_vec in zip(probs, targets):
        pred_bin = (prob_vec > threshold).astype(int)
        true_bin = true_vec.astype(int)

        pred_idx = int(label_index(pred_bin[0], pred_bin[1]))
        true_idx = int(label_index(true_bin[0], true_bin[1]))

        decisions_util.append(pred_idx)
        true_vals_util.append(true_idx)

    avg_utility = np.mean([utility_matrix[p, t] for p, t in zip(decisions_util, true_vals_util)])

    # Utility score with best threshold from train set 
    print(f"Utility (Test Set): {avg_utility:.4f}")
    return probs, targets


kde_probs, scatter_targets = evaluate_at_threshold(preds_test, targets_test, best_thresh, loss_test)
--- Evaluation at Threshold 0.28 ---
Test BCEWithLogitsLoss: 0.4488
Utility (Test Set): 0.9077

Calculate and print confusion matrices for both thresholds

Code
def print_confusion_matrices(preds_tensor, targets_tensor, best_threshold):
    probs = torch.sigmoid(preds_tensor).numpy()
    targets = targets_tensor.numpy()

    thresholds = [0.5, best_threshold]
    rows = []

    for thresh in thresholds:
        for i, disease in enumerate(["Effusion", "Atelectasis"]):
            preds_bin = (probs[:, i] > thresh).astype(int)
            true_bin = targets[:, i].astype(int)

            tp = int(((preds_bin == 1) & (true_bin == 1)).sum())
            fp = int(((preds_bin == 1) & (true_bin == 0)).sum())
            fn = int(((preds_bin == 0) & (true_bin == 1)).sum())
            tn = int(((preds_bin == 0) & (true_bin == 0)).sum())

            rows.append({
                "Threshold": round(thresh, 2),
                "Disease": disease,
                "TP": tp,
                "FP": fp,
                "FN": fn,
                "TN": tn
            })

    df_summary = pd.DataFrame(rows)
    print(df_summary.to_string(index=False))

print("[bold green]Confussion matrices for thresholds:")
print_confusion_matrices(preds_test, targets_test, best_thresh)
Confusion matrices for thresholds:
 Threshold     Disease  TP  FP  FN  TN
      0.50    Effusion 285  90 162 922
      0.50 Atelectasis 173  91 261 934
      0.28    Effusion 366 185  81 827
      0.28 Atelectasis 323 371 111 654
Code
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(
    context='notebook',
    style='whitegrid',
    palette=sns.diverging_palette(220, 20, n=2),
    font='sans-serif',
    font_scale=1,
    color_codes=True,
    rc=None
)

def plot_kde_scatter_curves(probs, targets, best_threshold, labels=("Effusion", "Atelectasis"), save=False):
    plt.rcParams['axes.grid'] = True
    plt.rcParams['font.family'] = 'serif'
    plt.rcParams['font.weight'] = 'bold'
    plt.rcParams['axes.labelweight'] = 'bold'
    plt.rcParams['axes.titleweight'] = 'bold'
    plt.rcParams['xtick.labelsize'] = 'medium'
    plt.rcParams['ytick.labelsize'] = 'medium'
    plt.rc('grid', linestyle='--', linewidth=0.5, color='lightblue')

    for i, label in enumerate(labels):
        p = probs[:, i]
        t = targets[:, i]

        plt.figure(figsize=(12, 5))

        plt.subplot(1, 2, 1)
        sns.scatterplot(x=np.arange(len(p))[t == 0], y=p[t == 0], marker='x', color="#d62728", alpha=0.7)
        sns.scatterplot(x=np.arange(len(p))[t == 1], y=p[t == 1], marker='o', color="#1f77b4", alpha=0.7)
        plt.axhline(y=0.5, linestyle="--", color="black", linewidth=1)
        plt.axhline(y=best_threshold, linestyle=":", color="#741B3C", linewidth=2)
        plt.xlabel("Sample Index", fontweight='bold')
        plt.ylabel("Sigmoid Confidence Score", fontweight='bold')
        plt.title(f"Scatter Plot: {label}", fontweight='bold')

        plt.subplot(1, 2, 2)
        sns.kdeplot(p[t == 0], label="Negative", fill=True, color="#d62728", alpha=0.3, linewidth=2)
        sns.kdeplot(p[t == 1], label="Positive", fill=True, color="#1f77b4", alpha=0.3, linewidth=2)
        plt.axvline(x=0.5, linestyle="--", color="black", linewidth=1)
        plt.axvline(x=best_threshold, linestyle=":", color="#741B3C", linewidth=2)
        plt.xlabel("Sigmoid Confidence Score", fontweight='bold')
        plt.ylabel("Density", fontweight='bold')
        plt.title(f"KDE Plot: {label}", fontweight='bold')
        plt.legend()

        plt.tight_layout()
        if save:
            plt.savefig(f"kde_scatter_{label.lower()}.pdf", format="pdf", bbox_inches="tight")
        plt.show()


plot_kde_scatter_curves(kde_probs, scatter_targets, best_thresh, save=False)

Code
from sklearn.metrics import (
    precision_score,
    recall_score,
    f1_score,
    accuracy_score,
    roc_auc_score,
    average_precision_score,
    matthews_corrcoef,
    roc_curve,
    precision_recall_curve
)
import matplotlib.pyplot as plt

all_roc_data = []
all_pr_data = []
all_rows = []

color_map = {
    "Effusion (Full)": "#1f77b4",
    "Effusion (Train)": "#aec7e8",
    "Effusion (Test)": "#004c6d",
    "Atelectasis (Full)": "#d62728",
    "Atelectasis (Train)": "#ff9896",
    "Atelectasis (Test)": "#7f0000"
}

def print_prediction_stats(preds_tensor, targets_tensor, split_name, loss):
    probs = torch.sigmoid(preds_tensor).numpy()
    targets = targets_tensor.numpy()

    for i, label in enumerate(["Effusion", "Atelectasis"]):
        p = probs[:, i]
        t = targets[:, i]
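        # Thresholded metrics below use the default 0.5 cutoff; ROC/PR AUC use the raw probabilities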
        b = (p > 0.5).astype(int)

        all_rows.append({
            "Split": split_name,
            "Label": label,
            "Loss": round(loss, 4),
            "Precision": round(precision_score(t, b), 4),
            "Recall": round(recall_score(t, b), 4),
            "F1 Score": round(f1_score(t, b), 4),
            "Accuracy": round(accuracy_score(t, b), 4),
            "ROC AUC": round(roc_auc_score(t, p), 4),
            "PR AUC": round(average_precision_score(t, p), 4),
            "MCC": round(matthews_corrcoef(t, b), 4)
        })

        fpr, tpr, _ = roc_curve(t, p)
        prec, rec, _ = precision_recall_curve(t, p)

        all_roc_data.append((fpr, tpr, f"{label} ({split_name})"))
        all_pr_data.append((rec, prec, f"{label} ({split_name})"))

def show_combined_plots(save=False):
    df_stats = pd.DataFrame(all_rows)
    print("\n", df_stats.to_string(index=False), sep="")

    plt.rcParams['font.weight'] = 'bold'
    plt.rcParams['axes.labelweight'] = 'bold'
    plt.rcParams['axes.titleweight'] = 'bold'

    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    for fpr, tpr, lbl in all_roc_data:
        plt.plot(fpr, tpr, label=lbl, color=color_map.get(lbl))
    plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
    plt.title("ROC Curve", fontweight='bold')
    plt.xlabel("False Positive Rate", fontweight='bold')
    plt.ylabel("True Positive Rate", fontweight='bold')
    plt.legend()

    plt.subplot(1, 2, 2)
    for rec, prec, lbl in all_pr_data:
        plt.plot(rec, prec, label=lbl, color=color_map.get(lbl))
    plt.title("Precision-Recall Curve", fontweight='bold')
    plt.xlabel("Recall", fontweight='bold')
    plt.ylabel("Precision", fontweight='bold')
    plt.legend()

    plt.tight_layout()
    if save:
        plt.savefig("roc_pr_curves.pdf", format="pdf", bbox_inches="tight")
    plt.show()


print_prediction_stats(preds_full, targets_full, "Full", loss_full)
print_prediction_stats(preds_train, targets_train, "Train", loss_train)
print_prediction_stats(preds_test, targets_test, "Test", loss_test)

show_combined_plots(save=False)
Split       Label   Loss  Precision  Recall  F1 Score  Accuracy  ROC AUC  PR AUC    MCC
 Full    Effusion 0.4473     0.7791  0.6732    0.7223    0.8354   0.8879  0.7797 0.6095
 Full Atelectasis 0.4473     0.6496  0.3985    0.4939    0.7520   0.7791  0.6106 0.3598
Train    Effusion 0.4301     0.7850  0.7065    0.7437    0.8507   0.8935  0.7832 0.6405
Train Atelectasis 0.4301     0.6424  0.4176    0.5062    0.7593   0.7929  0.6262 0.3708
 Test    Effusion 0.4488     0.7600  0.6376    0.6934    0.8273   0.8831  0.7619 0.5788
 Test Atelectasis 0.4488     0.6553  0.3986    0.4957    0.7587   0.7759  0.5993 0.3679

Evaluate Inferno improvement (eMatrix)

⚙️ Method comparison

🧮 Fixed Threshold (e.g., 0.5)

  • ✅ Advantages: very simple and fast; easy to implement and explain; no additional computation required.
  • ❌ Disadvantages: same threshold for all labels; ignores class imbalance; ignores prediction uncertainty; no adaptation to task-specific costs.

🎯 Per-Label Threshold Search

  • ✅ Advantages: tailors the threshold per label; can incorporate task-specific utility via validation; usually better than a fixed threshold.
  • ❌ Disadvantages: treats labels independently; still relies on point estimates; slower due to the validation loop.

🧠 Inferno (Expected Utility)

  • ✅ Advantages: makes joint predictions; incorporates uncertainty; uses the full predictive distribution; directly maximizes expected utility; models label dependencies; especially suited for critical domains (e.g., medicine). (A minimal expected-utility sketch follows this list.)
  • ❌ Disadvantages: higher computational cost; requires proper probabilistic modeling and inference algorithms.
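Purely for intuition, here is a minimal sketch of the expected-utility rule described in the Inferno entry above, under the assumption of a calibrated joint probability vector over the four states; the probability vector below is made up, and Inferno's actual decisions are computed by the R script that follows. The snippet reuses numpy and the utility_matrix defined in the threshold-search section earlier on this page.

Code
# Hypothetical calibrated probabilities over (neither, effusion only, atelectasis only, both)
p_joint = np.array([0.55, 0.25, 0.12, 0.08])

# Expected utility of each of the four possible joint decisions, then pick the best one
expected_utility = utility_matrix @ p_joint
best_action = int(np.argmax(expected_utility))   # 0 = neither, 1 = effusion, 2 = atelectasis, 3 = both

print(f"Expected utilities: {np.round(expected_utility, 3)}, chosen action: {best_action}")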
Code
import subprocess
from pathlib import Path

RSCRIPT_PATH = Path("RScripts/utility.R")
WORKDIR = Path("..").resolve()

subprocess.run(["Rscript", str(RSCRIPT_PATH)], check=True, cwd=WORKDIR)

Registered doParallelSNOW with 10 workers

Closing connections to cores.
🔍 Average Expected Utility from Inferno Decisions: 0.918163

📊 Confusion Matrix - Effusion
    Pred
True   0   1
   0 847 165
   1  96 351

📊 Confusion Matrix - Atelectasis
    Pred
True   0   1
   0 669 356
   1 130 304
CompletedProcess(args=['Rscript', 'RScripts/utility.R'], returncode=0)

🧪 Confusion Matrix Comparison and Performance Summary

📋 Combined Confusion Matrices for CNN & Inferno

Model    Threshold  Disease      TP   FP   FN   TN
CNN      0.50       Effusion     285   90  162  922
CNN      0.50       Atelectasis  173   91  261  934
CNN      0.28       Effusion     366  185   81  827
CNN      0.28       Atelectasis  323  371  111  654
Inferno  -          Effusion     351  165   96  847
Inferno  -          Atelectasis  304  356  130  669

📊 Performance Summary

CNN at threshold 0.50 provides a more conservative decision boundary with lower false positives but also higher false negatives, particularly in Atelectasis.

CNN at threshold 0.28 boosts recall significantly (TP increases), but at the cost of many more false positives, especially for Atelectasis.

Inferno offers a well-balanced tradeoff between sensitivity and specificity across both diseases:

  • Outperforms the CNN on average utility: 0.9182 vs 0.9077.
  • Achieves higher TP counts than the CNN at 0.50, and fewer FP/FN than the CNN at 0.28 (see the quick check below).
  • Suggests robust calibration and confidence-aware decision making without threshold tuning.
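As a purely illustrative cross-check, the per-disease sensitivity and specificity behind these claims can be recomputed directly from the counts in the combined table above:

Code
# Recompute sensitivity and specificity from the combined confusion-matrix counts above
rows = [
    # (model, disease, TP, FP, FN, TN)
    ("CNN @0.50", "Effusion",    285,  90, 162, 922),
    ("CNN @0.50", "Atelectasis", 173,  91, 261, 934),
    ("CNN @0.28", "Effusion",    366, 185,  81, 827),
    ("CNN @0.28", "Atelectasis", 323, 371, 111, 654),
    ("Inferno",   "Effusion",    351, 165,  96, 847),
    ("Inferno",   "Atelectasis", 304, 356, 130, 669),
]

for model, disease, tp, fp, fn, tn in rows:
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    print(f"{model:<10} {disease:<12} sensitivity={sensitivity:.3f} specificity={specificity:.3f}")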

✅ Conclusion

Inferno demonstrates improved utility and a more balanced performance profile, making it a compelling option for deployment in scenarios where reliable probabilistic decision-making is important. In contrast, the CNN offers flexibility through threshold adjustment but depends heavily on careful tuning to manage the tradeoff between sensitivity and specificity.

 
