🎯 Utility-Based Evaluation Under Altered Base Rates
Author
Maksim Ohvrill
Published
May 1, 2025
This notebook studies how changing the prevalence of clinical conditions (effusion, atelectasis) affects decision accuracy and utility, by evaluating on a test set with adjusted base rates.
Setup
Code
# -------------------------------
# Setup
# -------------------------------

# Set working directory one level up
setwd("..")

library(inferno)

# Load reusable utilities
source("RScripts/reusableUtils.R")

# Define general output directory
output_dir <- "data/inferno"
if (!dir.exists(output_dir)) {
  dir.create(output_dir)
}

# Define model and test data paths
learnt_dir <- "data/inferno/combinedML50"
setup <- load_metadata_testdata(learnt_dir)
metadata <- setup$metadata
test_data <- setup$testdata

# Define predictand and predictor names
y_names <- c("LABEL_EFFUSION", "LABEL_ATELECTASIS")
x_names <- setdiff(metadata$name, y_names)
Analyze Current Base Rates
Analyze and display the original distribution of outcomes ("none", "atel", "effu", "both") in the test set. This shows the natural class imbalance present in the original dataset.
Code
# -------------------------------
# Analyze Current Base Rates
# -------------------------------

# Define all possible label combinations
cases <- expand.grid(LABEL_ATELECTASIS = 0:1, LABEL_EFFUSION = 0:1)
rownames(cases) <- c("none", "atel", "effu", "both")

# Calculate old base rates
old_counts <- apply(cases, 1, function(x) {
  nrow(test_data[
    test_data[["LABEL_ATELECTASIS"]] == x[["LABEL_ATELECTASIS"]] &
      test_data[["LABEL_EFFUSION"]] == x[["LABEL_EFFUSION"]],
  ])
})
names(old_counts) <- rownames(cases)
old_rates <- old_counts / sum(old_counts)

cat("Old base rates:\n")
print(old_rates)
Old base rates:
none atel effu both
0.4674435 0.2261823 0.2350925 0.0712817
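As a quick sanity check, the same rates can be read off a base-R cross-tabulation. A minimal sketch, using the same test_data columns as above:
Code
# Cross-tabulate the two labels and normalize to proportions
prop.table(table(
  atel = test_data[["LABEL_ATELECTASIS"]],
  effu = test_data[["LABEL_EFFUSION"]]
))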
Create New Test Set with Altered Base Rates
Resample the calibration test set to impose new base rates (5% none, 45% atelectasis, 45% effusion, 5% both). The goal is to simulate a base rate shift and test how well the inference adapts under distributional changes. This adjustment probes whether the model can still decide accurately when the prevalence of conditions differs from the original data.
Code
# -------------------------------
# Create New Test Set with Altered Base Rates
# -------------------------------

# Desired new base rates
new_rates <- c(none = 0.05, atel = 0.45, effu = 0.45, both = 0.05)

# Resample test data: take the first "pivot" case whose
# rescaled counts fit within the available counts of every case
new_counts <- NULL
for (acase in seq_len(nrow(cases))) {
  testcounts <- floor(new_rates * old_counts[acase] / new_rates[acase])
  if (all(testcounts <= old_counts)) {
    new_counts <- testcounts
    break
  }
}

new_test_data <- test_data[0, ]
for (acase in seq_len(nrow(cases))) {
  tochoose <- sample(which(
    test_data[["LABEL_ATELECTASIS"]] == cases[acase, "LABEL_ATELECTASIS"] &
      test_data[["LABEL_EFFUSION"]] == cases[acase, "LABEL_EFFUSION"]
  ), size = new_counts[acase], replace = FALSE)
  new_test_data <- rbind(new_test_data, test_data[tochoose, ])
}

# Validate new base rates
adjusted_counts <- apply(cases, 1, function(x) {
  nrow(new_test_data[
    new_test_data[["LABEL_ATELECTASIS"]] == x[["LABEL_ATELECTASIS"]] &
      new_test_data[["LABEL_EFFUSION"]] == x[["LABEL_EFFUSION"]],
  ])
})
names(adjusted_counts) <- rownames(cases)
adjusted_rates <- adjusted_counts / sum(adjusted_counts)

cat("New base rates:\n")
print(adjusted_rates)

# Save the new test set
write.csv(new_test_data,
  file.path(output_dir, "calibration_test_newbaserate.csv"),
  row.names = FALSE, quote = TRUE, na = ""
)
New base rates:
none atel effu both
0.04918033 0.45081967 0.45081967 0.04918033
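The achieved rates are 0.0492/0.4508 rather than exactly 0.05/0.45 because the resampling keeps one "pivot" case in full and scales the others down with floor(), which rounds. A minimal sketch of this arithmetic, with hypothetical old counts chosen to match the printed old rates at N = 10,000:
Code
# Hypothetical counts reproducing the old rates above (N = 10000)
old_counts_demo <- c(none = 4674, atel = 2262, effu = 2351, both = 713)
target <- c(none = 0.05, atel = 0.45, effu = 0.45, both = 0.05)

# Pivoting on "none" is infeasible (it would need 9 * 4674 cases
# of "atel" and "effu"); pivoting on "atel" works:
demo_counts <- floor(target * old_counts_demo[["atel"]] / target[["atel"]])
demo_counts                               # 251 2262 2262 251
all(demo_counts <= old_counts_demo)       # TRUE: feasible
round(demo_counts / sum(demo_counts), 4)  # 0.0499 0.4501 0.4501 0.0499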
Utility-Based Evaluation (Inferno)
Infer probabilities for all data points using Inferno, correcting for the altered base rates by explicitly passing the new prior. Calculate expected utilities using a unit-diagonal utility matrix: utility 1 for a correct decision and 0 otherwise, so average utility equals classification accuracy. Decisions are made by selecting the action with the highest expected utility for each case, with random tie-breaking to avoid bias. Evaluate consistency and compute the average utility.
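Because the utility matrix is the identity, the expected utility of each candidate decision equals that outcome's posterior probability, so maximizing expected utility here reduces to picking the most probable outcome. A minimal sketch with hypothetical probabilities for a single case:
Code
# Hypothetical posterior probabilities for one test case
p <- c(E0_A0 = 0.10, E1_A0 = 0.55, E0_A1 = 0.25, E1_A1 = 0.10)

# Unit-diagonal utilities: 1 for a correct decision, 0 otherwise
U <- diag(4)
rownames(U) <- colnames(U) <- names(p)

eu <- U %*% p                # expected utility of each action = P(outcome)
rownames(eu)[which.max(eu)]  # "E1_A0": choose the most probable outcome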
Code
# -------------------------------
# Utility-Based Evaluation on New Test Set
# -------------------------------

# Create the 2x2 grid of possible outcomes
y_grid <- setNames(expand.grid(0:1, 0:1), as.list(y_names))

# Extract true labels and predictors from new test set
true_y <- new_test_data[, y_names, drop = FALSE]
x <- new_test_data[, x_names, drop = FALSE]

# Use Pr directly with prior adjustment
probs <- Pr(
  Y = y_grid, X = x,
  prior = new_rates,
  learnt = learnt_dir,
  parallel = parallel, # assumed to be set in reusableUtils.R
  quantiles = c(0.055, 0.945), nsamples = NULL
)

# Define utility matrix
outcome_names <- apply(y_grid, 1, function(x) paste0("E", x[1], "_A", x[2]))
e_matrix <- diag(4)
colnames(e_matrix) <- outcome_names
rownames(e_matrix) <- outcome_names

# Compute expected utilities and make decisions;
# rep(..., 2) guards against sample()'s special case for a single
# integer argument, giving an unbiased random tie-break
exp_utilities <- e_matrix %*% probs$values
choosemax <- function(x) {
  sample(rep(which(x == max(x)), 2), 1)
}
decisions <- apply(exp_utilities, 2, choosemax)

# Translate true labels into outcome indices 1-4
true_values <- apply(true_y, 1, function(x) (x[1] + 2 * x[2]) + 1)

# Accuracy analysis
cat("Distribution of true labels (%):\n")
print(round(table(true_values) / sum(table(true_values)) * 100, 2))

# Check label consistency
true_outcome_names <- apply(true_y, 1, function(x) paste0("E", x[1], "_A", x[2]))
cat(
  "Consistency check (TRUE expected): ",
  all(true_outcome_names == outcome_names[true_values]), "\n"
)

# Calculate average yield
avg_yield <- mean(e_matrix[cbind(decisions, true_values)])
cat("Average utility with adjusted base rates (Inferno): ", avg_yield, "\n")
Registered doParallelSNOW with 15 workers
Closing connections to cores.
Distribution of true labels (%):
true_values
1 2 3 4
4.92 45.08 45.08 4.92
Consistency check (TRUE expected): TRUE
Average utility with adjusted base rates (Inferno): 0.7117486
🧮 Evaluate Simple Neural Net Rule
Apply a simple decision rule based purely on the neural network's raw outputs: predict positive if the logit is greater than or equal to zero. Evaluate the consistency of these predictions and calculate the corresponding average utility. This method ignores the altered base rates, exposing its inflexibility under major distribution shifts.
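Thresholding the logit at zero is the same as thresholding the sigmoid probability at 0.5: a fixed cutoff that never moves, whatever the prevalence. A minimal sketch with made-up logit values:
Code
# logit >= 0 is equivalent to sigmoid(logit) >= 0.5
sigmoid <- function(z) 1 / (1 + exp(-z))
z <- c(-1.2, 0, 0.7)  # hypothetical raw logits
cbind(logit_rule = z >= 0, prob_rule = sigmoid(z) >= 0.5)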
Code
# -------------------------------
# Evaluate Simple Neural Net Rule
# -------------------------------

# Simple threshold decision
responses_nn <- apply(
  new_test_data[, c("LOGIT_EFFUSION", "LOGIT_ATELECTASIS")],
  1, function(x) 1 * (x >= 0)
)
decisions_nn <- apply(responses_nn, 2, function(x) (x[1] + 2 * x[2]) + 1)

# Neural net label consistency
response_names <- apply(responses_nn, 2, function(x) paste0("E", x[1], "_A", x[2]))
cat(
  "Consistency check for NN outputs (TRUE expected): ",
  all(response_names == outcome_names[decisions_nn]), "\n"
)

# Neural net average yield
avg_yield_nn <- mean(e_matrix[cbind(decisions_nn, true_values)])
cat("Average utility for neural net rule: ", avg_yield_nn, "\n")
Consistency check for NN outputs (TRUE expected): TRUE
Average utility for neural net rule: 0.5122951
End of Experiment
Inferno achieves an average utility of approximately 0.712 on the altered test set, whereas the naive neural net rule achieves only about 0.512. Because the utility matrix is unit-diagonal, these values are simply the fraction of correct decisions: adjusting the inference for the new base rates yields roughly 20 percentage points higher accuracy than the prevalence-blind threshold rule.