📝 Evaluation
Evaluation compared the Inferno model and a convolutional neural network (CNN) across multiple utility matrices designed to reflect different aspects of prediction quality and clinical relevance.
The clinical utility matrix incorporated graded penalties for misclassification, assigning higher rewards to clinically less severe errors and lower rewards to more critical misdiagnoses. This matrix is designed to reflect the true impact of prediction errors on patient outcomes. To model individual variation, a patient-specific variance version of the clinical matrix introduced controlled random variations to the off-diagonal elements while preserving the overall clinical structure.
A bare-diagonal correctness matrix was also employed. This matrix rewards only exact matches between predicted and true outcomes, assigning a utility of 1 for a correct classification and 0 otherwise. It isolates strict label correctness without considering the varying clinical consequences of errors.
Additionally, a raw label accuracy matrix was used, which fully rewards exact predictions and assigns partial credit for near misses based on similarity in label structure. This matrix captures pure technical classification performance without clinical weighting, emphasizing raw label agreement over patient-outcome importance.
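The three matrix families above can be sketched in a few lines. The 3-class label set, the specific penalty values, and the noise scale below are illustrative assumptions, not the study's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-class setting; the study's actual label set and
# penalty values are not specified here.
n_classes = 3

# Clinical utility matrix U[pred, true]: full reward on the diagonal,
# graded off-diagonal rewards that shrink as the clinical severity of
# the misclassification grows (values illustrative).
clinical = np.array([
    [1.0, 0.6, 0.2],
    [0.5, 1.0, 0.6],
    [0.1, 0.5, 1.0],
])

def patient_specific(base, scale=0.05):
    """Patient-specific variant: controlled random noise on the
    off-diagonal elements only, preserving the clinical structure."""
    noise = rng.normal(0.0, scale, base.shape)
    np.fill_diagonal(noise, 0.0)            # keep the diagonal intact
    return np.clip(base + noise, 0.0, 1.0)  # stay in the [0, 1] utility range

# Bare-diagonal correctness matrix: utility 1 for exact matches, 0 otherwise.
bare_diagonal = np.eye(n_classes)
```

The raw label accuracy matrix would follow the same pattern, with off-diagonal entries set by label similarity rather than clinical severity.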
Under patient-specific variance, the Inferno model achieved an expected utility of 0.854 (1.71 quality-adjusted life years, QALYs), slightly outperforming its result on the original clinical matrix (0.852, 1.70 QALYs). Against the bare-diagonal matrix, Inferno's utility dropped to 0.645 (1.29 QALYs), demonstrating the importance of clinical weighting in evaluating model value. Against the raw label accuracy matrix, Inferno achieved a utility of 0.918 (1.84 QALYs), the highest nominal score but one that does not fully represent clinical impact.
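One way to read the figures above: expected utility is the cohort mean of each patient's utility-maximizing decision under the model's predictive distribution, and the QALY figures appear to be utility scaled by a factor of 2 (e.g. 0.854 → 1.71). A minimal sketch under those assumptions:

```python
import numpy as np

def expected_utility(probs, utility, qaly_scale=2.0):
    """Mean expected utility of utility-maximizing decisions.

    probs   : (n_patients, n_classes) predictive probabilities
    utility : (n_actions, n_classes) utility matrix U[action, true]
    The qaly_scale of 2.0 is inferred from the reported figures
    (utility 0.854 -> 1.71 QALYs); it is an assumption.
    """
    # Expected utility of each action for each patient.
    eu_per_action = probs @ utility.T
    # Each patient receives the action with the highest expected utility.
    best = eu_per_action.max(axis=1)
    mean_u = best.mean()
    return mean_u, mean_u * qaly_scale
```

Under the bare-diagonal matrix this reduces to the mean probability assigned to the chosen label, which is why the nominal score drops when clinical weighting is removed.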
The CNN model was evaluated at two thresholds: the common default of 0.5 and an optimized value of 0.28. At threshold 0.5, the CNN achieved an expected utility of 0.786 (1.57 QALYs); at threshold 0.28, performance improved to 0.812 (1.62 QALYs). CNN performance was unchanged under patient-specific variance, indicating limited adaptability to personalized clinical utility variation.
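An optimized threshold like 0.28 can be found with a simple sweep over candidate cutoffs. The grid resolution and the 2×2 utility values in the sketch below are illustrative assumptions about how such a search might look, not the study's actual procedure:

```python
import numpy as np

def best_threshold(scores, labels, utility, grid=None):
    """Return the decision threshold on binary scores that maximizes
    mean utility under a 2x2 matrix U[pred, true] (illustrative sketch;
    the grid resolution is an arbitrary choice)."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    best_t, best_u = grid[0], -np.inf
    for t in grid:
        preds = (scores >= t).astype(int)
        # Index the per-patient utility of each (prediction, label) pair.
        u = utility[preds, labels].mean()
        if u > best_u:
            best_t, best_u = t, u
    return best_t, best_u
```

With asymmetric off-diagonal rewards (a missed positive costing more than a false alarm), the maximizing threshold typically falls below 0.5, which is consistent with the reported optimum of 0.28.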
Over the evaluation cohort of 1500 patients, Inferno's performance corresponds to an estimated 2565 total QALYs, compared to 2430 QALYs for the CNN at its best threshold. The difference of approximately 135 QALYs equates to giving 135 individuals an additional year of high-quality life. These findings emphasize the importance of utility-based evaluation in clinical decision support modeling, favoring systems aligned with patient-centered outcomes over those optimized purely for label accuracy.
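The cohort totals follow directly from the per-patient figures reported above:

```python
cohort = 1500
inferno_qaly_pp = 1.71   # Inferno, patient-specific variance matrix
cnn_qaly_pp = 1.62       # CNN at its optimized threshold of 0.28

inferno_total = cohort * inferno_qaly_pp   # ~2565 QALYs
cnn_total = cohort * cnn_qaly_pp           # ~2430 QALYs
gain = inferno_total - cnn_total           # ~135 QALYs
```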
Extrapolation to larger populations underscores the clinical importance of utility-based optimization. Generalizing the observed per-patient difference between models to a cohort of 150,000 patients would correspond to an estimated 13,500 additional quality-adjusted life years gained. This scale of benefit approaches the magnitude commonly associated with major therapeutic innovations, where even small QALY gains justify widespread adoption [1], [2]. Clinical utility, rather than predictive accuracy alone, offers a more meaningful measure of a system’s value [1]. Although a valuation of $100,000 per quality-adjusted life year is often used for conservative estimates, willingness-to-pay studies suggest that the true median value of a QALY may exceed $265,000, with most empirical estimates falling well above $100,000 [3].
Based on the conservative valuation, the 135 QALYs gained in the present cohort equate to an estimated $13.5 million in societal benefit. These findings emphasize the need to align predictive modeling practices with patient-centered outcomes and economic efficiency [4].
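The valuation and the population-scale extrapolation are likewise simple products of the figures above, under the conservative $100,000-per-QALY assumption:

```python
value_per_qaly = 100_000   # conservative; WTP studies suggest > $265,000 [3]
cohort_gain = 135          # QALYs gained over the 1500-patient cohort
societal_benefit = cohort_gain * value_per_qaly        # $13,500,000

per_patient_gain = 1.71 - 1.62                         # 0.09 QALYs per patient
extrapolated_gain = round(per_patient_gain * 150_000)  # 13,500 QALYs at scale
```

At the higher $265,000 median valuation, the same 135-QALY gain would correspond to roughly $35.8 million, which is why the $100,000 figure is described as conservative.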