🎯 Utility-Based Evaluation Under Altered Base Rates
Author
Maksim Ohvrill
Published
May 1, 2025
This notebook studies how changing the prevalence of clinical conditions (effusion, atelectasis) affects decision accuracy and utility, by evaluating on a test set with adjusted base rates.
Setup
Code
# -------------------------------
# Setup
# -------------------------------

# Set working directory one level up
setwd("..")

library(inferno)

# Load reusable utilities
source("RScripts/reusableUtils.R")

# Define general output directory
output_dir <- "data/inferno"
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

# Define model and test data paths
learnt_dir <- "data/inferno/combinedML50"
setup <- load_metadata_testdata(learnt_dir)
metadata <- setup$metadata
test_data <- setup$testdata

# Define predictand and predictor names
y_names <- c("LABEL_EFFUSION", "LABEL_ATELECTASIS")
x_names <- setdiff(metadata$name, y_names)
Analyze Current Base Rates
Analyze and display the original distribution of outcomes ("none", "atel", "effu", "both") in the test set. This shows the natural class imbalance present in the original dataset.
Code
# -------------------------------
# Analyze Current Base Rates
# -------------------------------

# Define all possible label combinations
cases <- expand.grid(LABEL_ATELECTASIS = 0:1, LABEL_EFFUSION = 0:1)
rownames(cases) <- c("none", "atel", "effu", "both")

# Calculate old base rates
old_counts <- apply(cases, 1, function(x) {
  nrow(test_data[
    test_data[["LABEL_ATELECTASIS"]] == x[["LABEL_ATELECTASIS"]] &
      test_data[["LABEL_EFFUSION"]] == x[["LABEL_EFFUSION"]],
  ])
})
names(old_counts) <- rownames(cases)
old_rates <- old_counts / sum(old_counts)

cat("Old base rates:\n")
print(old_rates)
Old base rates:
none atel effu both
0.4674435 0.2261823 0.2350925 0.0712817
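As a quick sanity check, the same rates can be read off a base-R cross-tabulation. A minimal sketch, using the same test_data columns as above:
Code
# Cross-tabulate the two labels and normalize to proportions
prop.table(table(
  atel = test_data[["LABEL_ATELECTASIS"]],
  effu = test_data[["LABEL_EFFUSION"]]
))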
Create New Test Set with Altered Base Rates
Resample the calibration test set to impose new base rates (5% none, 45% atelectasis, 45% effusion, 5% both). The goal is to simulate a base rate shift and test how well the inference adapts under distributional changes. This adjustment probes whether the model can still decide accurately when the prevalence of conditions differs from the original data.
Code
# -------------------------------
# Create New Test Set with Altered Base Rates
# -------------------------------

# Desired new base rates
new_rates <- c(none = 0.05, atel = 0.45, effu = 0.45, both = 0.05)

# Resample test data: take the first "pivot" case whose
# rescaled counts fit within the available counts of every case
new_counts <- NULL
for (acase in seq_len(nrow(cases))) {
  testcounts <- floor(new_rates * old_counts[acase] / new_rates[acase])
  if (all(testcounts <= old_counts)) {
    new_counts <- testcounts
    break
  }
}

new_test_data <- test_data[0, ]
for (acase in seq_len(nrow(cases))) {
  tochoose <- sample(which(
    test_data[["LABEL_ATELECTASIS"]] == cases[acase, "LABEL_ATELECTASIS"] &
      test_data[["LABEL_EFFUSION"]] == cases[acase, "LABEL_EFFUSION"]
  ), size = new_counts[acase], replace = FALSE)
  new_test_data <- rbind(new_test_data, test_data[tochoose, ])
}

# Validate new base rates
adjusted_counts <- apply(cases, 1, function(x) {
  nrow(new_test_data[
    new_test_data[["LABEL_ATELECTASIS"]] == x[["LABEL_ATELECTASIS"]] &
      new_test_data[["LABEL_EFFUSION"]] == x[["LABEL_EFFUSION"]],
  ])
})
names(adjusted_counts) <- rownames(cases)
adjusted_rates <- adjusted_counts / sum(adjusted_counts)

cat("New base rates:\n")
print(adjusted_rates)

# Save the new test set
write.csv(new_test_data,
  file.path(output_dir, "calibration_test_newbaserate.csv"),
  row.names = FALSE, quote = TRUE, na = ""
)
New base rates:
none atel effu both
0.04918033 0.45081967 0.45081967 0.04918033
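The achieved rates are 0.0492/0.4508 rather than exactly 0.05/0.45 because the resampling keeps one "pivot" case in full and scales the others down with floor(), which rounds. A minimal sketch of this arithmetic, with hypothetical old counts chosen to match the printed old rates at N = 10,000:
Code
# Hypothetical counts reproducing the old rates above (N = 10000)
old_counts_demo <- c(none = 4674, atel = 2262, effu = 2351, both = 713)
target <- c(none = 0.05, atel = 0.45, effu = 0.45, both = 0.05)

# Pivoting on "none" is infeasible (it would need 9 * 4674 cases
# of "atel" and "effu"); pivoting on "atel" works:
demo_counts <- floor(target * old_counts_demo[["atel"]] / target[["atel"]])
demo_counts                               # 251 2262 2262 251
all(demo_counts <= old_counts_demo)       # TRUE: feasible
round(demo_counts / sum(demo_counts), 4)  # 0.0499 0.4501 0.4501 0.0499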
Utility-Based Evaluation (Inferno)
Infer probabilities for all data points using Inferno, correcting for the altered base rates by explicitly passing the new prior. Calculate expected utilities using a unit-diagonal utility matrix: utility 1 for a correct decision and 0 otherwise, so average utility equals classification accuracy. Decisions are made by selecting the action with the highest expected utility for each case, with random tie-breaking to avoid bias. Evaluate consistency and compute the average utility.
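Because the utility matrix is the identity, the expected utility of each candidate decision equals that outcome's posterior probability, so maximizing expected utility here reduces to picking the most probable outcome. A minimal sketch with hypothetical probabilities for a single case:
Code
# Hypothetical posterior probabilities for one test case
p <- c(E0_A0 = 0.10, E1_A0 = 0.55, E0_A1 = 0.25, E1_A1 = 0.10)

# Unit-diagonal utilities: 1 for a correct decision, 0 otherwise
U <- diag(4)
rownames(U) <- colnames(U) <- names(p)

eu <- U %*% p                # expected utility of each action = P(outcome)
rownames(eu)[which.max(eu)]  # "E1_A0": choose the most probable outcome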
Code
# -------------------------------
# Utility-Based Evaluation on New Test Set
# -------------------------------

# Create the 2x2 grid of possible outcomes
y_grid <- setNames(expand.grid(0:1, 0:1), as.list(y_names))

# Extract true labels and predictors from new test set
true_y <- new_test_data[, y_names, drop = FALSE]
x <- new_test_data[, x_names, drop = FALSE]

# Use Pr directly with prior adjustment
probs <- Pr(
  Y = y_grid, X = x,
  prior = new_rates,
  learnt = learnt_dir,
  parallel = parallel, # assumed to be set in reusableUtils.R
  quantiles = c(0.055, 0.945), nsamples = NULL
)

# Define utility matrix
outcome_names <- apply(y_grid, 1, function(x) paste0("E", x[1], "_A", x[2]))
e_matrix <- diag(4)
colnames(e_matrix) <- outcome_names
rownames(e_matrix) <- outcome_names

# Compute expected utilities and make decisions;
# rep(..., 2) guards against sample()'s special case for a single
# integer argument, giving an unbiased random tie-break
exp_utilities <- e_matrix %*% probs$values
choosemax <- function(x) {
  sample(rep(which(x == max(x)), 2), 1)
}
decisions <- apply(exp_utilities, 2, choosemax)

# Translate true labels into outcome indices 1-4
true_values <- apply(true_y, 1, function(x) (x[1] + 2 * x[2]) + 1)

# Accuracy analysis
cat("Distribution of true labels (%):\n")
print(round(table(true_values) / sum(table(true_values)) * 100, 2))

# Check label consistency
true_outcome_names <- apply(true_y, 1, function(x) paste0("E", x[1], "_A", x[2]))
cat(
  "Consistency check (TRUE expected): ",
  all(true_outcome_names == outcome_names[true_values]), "\n"
)

# Calculate average yield
avg_yield <- mean(e_matrix[cbind(decisions, true_values)])
cat("Average utility with adjusted base rates (Inferno): ", avg_yield, "\n")
Registered doParallelSNOW with 15 workers
Closing connections to cores.
Distribution of true labels (%):
true_values
1 2 3 4
4.92 45.08 45.08 4.92
Consistency check (TRUE expected): TRUE
Average utility with adjusted base rates (Inferno): 0.7117486
🧮 Evaluate Simple Neural Net Rule
Apply a simple decision rule based purely on the neural network's raw outputs: predict positive if the logit is greater than or equal to zero. Evaluate the consistency of these predictions and calculate the corresponding average utility. This method ignores the altered base rates, exposing its inflexibility under major distribution shifts.
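Thresholding the logit at zero is the same as thresholding the sigmoid probability at 0.5: a fixed cutoff that never moves, whatever the prevalence. A minimal sketch with made-up logit values:
Code
# logit >= 0 is equivalent to sigmoid(logit) >= 0.5
sigmoid <- function(z) 1 / (1 + exp(-z))
z <- c(-1.2, 0, 0.7)  # hypothetical raw logits
cbind(logit_rule = z >= 0, prob_rule = sigmoid(z) >= 0.5)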
Code
# -------------------------------
# Evaluate Simple Neural Net Rule
# -------------------------------

# Simple threshold decision
responses_nn <- apply(
  new_test_data[, c("LOGIT_EFFUSION", "LOGIT_ATELECTASIS")],
  1, function(x) 1 * (x >= 0)
)
decisions_nn <- apply(responses_nn, 2, function(x) (x[1] + 2 * x[2]) + 1)

# Neural net label consistency
response_names <- apply(responses_nn, 2, function(x) paste0("E", x[1], "_A", x[2]))
cat(
  "Consistency check for NN outputs (TRUE expected): ",
  all(response_names == outcome_names[decisions_nn]), "\n"
)

# Neural net average yield
avg_yield_nn <- mean(e_matrix[cbind(decisions_nn, true_values)])
cat("Average utility for neural net rule: ", avg_yield_nn, "\n")
Consistency check for NN outputs (TRUE expected): TRUE
Average utility for neural net rule: 0.5122951
End of Experiment
Inferno achieves an average utility of approximately 0.712 on the altered test set, whereas the naive neural net rule achieves only about 0.512. Because the utility matrix is unit-diagonal, these values are simply the fraction of correct decisions: adjusting the inference for the new base rates yields roughly 20 percentage points higher accuracy than the prevalence-blind threshold rule.