📝 Evaluation
Evaluation compared the Inferno model and a convolutional neural network (CNN) across multiple utility matrices designed to reflect different aspects of prediction quality and clinical relevance.
The clinical utility matrix incorporated graded penalties for misclassification, assigning higher rewards to clinically less severe errors and lower rewards to more critical misdiagnoses. This matrix is designed to reflect the true impact of prediction errors on patient outcomes. To model individual variation, a patient-specific variance version of the clinical matrix introduced controlled random variations to the off-diagonal elements while preserving the overall clinical structure.
A bare-diagonal correctness matrix was also employed. This matrix rewards only exact matches between predicted and true outcomes, assigning a utility of 1 for a correct classification and 0 otherwise. It isolates strict label correctness without considering the varying clinical consequences of errors.
Additionally, a raw label accuracy matrix was used, which fully rewards exact predictions and assigns partial credit for near misses based on similarity in label structure. This matrix captures pure technical classification performance without clinical weighting, emphasizing raw label agreement over patient-outcome importance.
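The three matrix families above can be sketched in a few lines. The 3-class label set, the specific penalty values, and the noise scale below are illustrative assumptions, not the study's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-class setting; the study's actual label set and
# penalty values are not specified here.
n_classes = 3

# Clinical utility matrix U[pred, true]: full reward on the diagonal,
# graded off-diagonal rewards that shrink as the clinical severity of
# the misclassification grows (values illustrative).
clinical = np.array([
    [1.0, 0.6, 0.2],
    [0.5, 1.0, 0.6],
    [0.1, 0.5, 1.0],
])

def patient_specific(base, scale=0.05):
    """Patient-specific variant: controlled random noise on the
    off-diagonal elements only, preserving the clinical structure."""
    noise = rng.normal(0.0, scale, base.shape)
    np.fill_diagonal(noise, 0.0)            # keep the diagonal intact
    return np.clip(base + noise, 0.0, 1.0)  # stay in the [0, 1] utility range

# Bare-diagonal correctness matrix: utility 1 for exact matches, 0 otherwise.
bare_diagonal = np.eye(n_classes)
```

The raw label accuracy matrix would follow the same pattern, with off-diagonal entries set by label similarity rather than clinical severity.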
Under patient-specific variance, the Inferno model achieved an expected utility of 0.854 (1.71 quality-adjusted life years, QALYs), slightly outperforming its result on the original clinical matrix (0.852, 1.70 QALYs). Against the bare-diagonal matrix, Inferno's utility dropped to 0.645 (1.29 QALYs), demonstrating the importance of clinical weighting in evaluating model value. Against the raw label accuracy matrix, Inferno achieved a utility of 0.918 (1.84 QALYs), the highest nominal score but one that does not fully represent clinical impact.
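One way to read the figures above: expected utility is the cohort mean of each patient's utility-maximizing decision under the model's predictive distribution, and the QALY figures appear to be utility scaled by a factor of 2 (e.g. 0.854 → 1.71). A minimal sketch under those assumptions:

```python
import numpy as np

def expected_utility(probs, utility, qaly_scale=2.0):
    """Mean expected utility of utility-maximizing decisions.

    probs   : (n_patients, n_classes) predictive probabilities
    utility : (n_actions, n_classes) utility matrix U[action, true]
    The qaly_scale of 2.0 is inferred from the reported figures
    (utility 0.854 -> 1.71 QALYs); it is an assumption.
    """
    # Expected utility of each action for each patient.
    eu_per_action = probs @ utility.T
    # Each patient receives the action with the highest expected utility.
    best = eu_per_action.max(axis=1)
    mean_u = best.mean()
    return mean_u, mean_u * qaly_scale
```

Under the bare-diagonal matrix this reduces to the mean probability assigned to the chosen label, which is why the nominal score drops when clinical weighting is removed.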
The CNN model was evaluated at two thresholds: the common default of 0.5 and an optimized value of 0.28. At threshold 0.5, the CNN achieved an expected utility of 0.786 (1.57 QALYs); at threshold 0.28, performance improved to 0.812 (1.62 QALYs). CNN performance was unchanged under patient-specific variance, indicating limited adaptability to personalized clinical utility variation.
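An optimized threshold like 0.28 can be found with a simple sweep over candidate cutoffs. The grid resolution and the 2×2 utility values in the sketch below are illustrative assumptions about how such a search might look, not the study's actual procedure:

```python
import numpy as np

def best_threshold(scores, labels, utility, grid=None):
    """Return the decision threshold on binary scores that maximizes
    mean utility under a 2x2 matrix U[pred, true] (illustrative sketch;
    the grid resolution is an arbitrary choice)."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    best_t, best_u = grid[0], -np.inf
    for t in grid:
        preds = (scores >= t).astype(int)
        # Index the per-patient utility of each (prediction, label) pair.
        u = utility[preds, labels].mean()
        if u > best_u:
            best_t, best_u = t, u
    return best_t, best_u
```

With asymmetric off-diagonal rewards (a missed positive costing more than a false alarm), the maximizing threshold typically falls below 0.5, which is consistent with the reported optimum of 0.28.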
Over the evaluation cohort of 1500 patients, Inferno's performance corresponds to an estimated 2565 total QALYs, compared to 2430 QALYs for the CNN at its best threshold. The difference of approximately 135 QALYs equates to giving 135 individuals an additional year of high-quality life. These findings emphasize the importance of utility-based evaluation in clinical decision support modeling, favoring systems aligned with patient-centered outcomes over those optimized purely for label accuracy.
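The cohort totals follow directly from the per-patient figures reported above:

```python
cohort = 1500
inferno_qaly_pp = 1.71   # Inferno, patient-specific variance matrix
cnn_qaly_pp = 1.62       # CNN at its optimized threshold of 0.28

inferno_total = cohort * inferno_qaly_pp   # ~2565 QALYs
cnn_total = cohort * cnn_qaly_pp           # ~2430 QALYs
gain = inferno_total - cnn_total           # ~135 QALYs
```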
Extrapolation to larger populations underscores the clinical importance of utility-based optimization. Generalizing the observed per-patient difference between models to a cohort of 150,000 patients would correspond to an estimated 13,500 additional quality-adjusted life years gained. This scale of benefit approaches the magnitude commonly associated with major therapeutic innovations, where even small QALY gains justify widespread adoption [1], [2]. Clinical utility, rather than predictive accuracy alone, offers a more meaningful measure of a system’s value [1]. Although a valuation of $100,000 per quality-adjusted life year is often used for conservative estimates, willingness-to-pay studies suggest that the true median value of a QALY may exceed $265,000, with most empirical estimates falling well above $100,000 [3].
Based on the conservative valuation, the 135 QALYs gained in the present cohort equate to an estimated $13.5 million in societal benefit. These findings emphasize the need to align predictive modeling practices with patient-centered outcomes and economic efficiency [4].
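The valuation and the population-scale extrapolation are likewise simple products of the figures above, under the conservative $100,000-per-QALY assumption:

```python
value_per_qaly = 100_000   # conservative; WTP studies suggest > $265,000 [3]
cohort_gain = 135          # QALYs gained over the 1500-patient cohort
societal_benefit = cohort_gain * value_per_qaly        # $13,500,000

per_patient_gain = 1.71 - 1.62                         # 0.09 QALYs per patient
extrapolated_gain = round(per_patient_gain * 150_000)  # 13,500 QALYs at scale
```

At the higher $265,000 median valuation, the same 135-QALY gain would correspond to roughly $35.8 million, which is why the $100,000 figure is described as conservative.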