🧪 Evaluation Metrics Overview
🎯 Precision
- Definition: How many of the predicted positives are actually correct?
- Formula: TP / (TP + FP)
- Good for: Avoiding false alarms. Important when false positives are costly.
🎯 Recall (Sensitivity)
- Definition: How many of the actual positives did the model find?
- Formula: TP / (TP + FN)
- Good for: Avoiding missed cases. Critical in healthcare to ensure no true cases are overlooked.
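A minimal sketch of both formulas above on hypothetical labels (not this project's code); the scikit-learn calls are included only as a cross-check:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical binary predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", tp / (tp + fn), recall_score(y_true, y_pred))     # TP / (TP + FN)
```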
📊 Precision vs Recall
- Precision emphasizes correctness of positive predictions.
- Recall emphasizes completeness of finding all true positives.
- In medical AI, recall is often prioritized, but both matter depending on clinical consequences.
📊 ROC Curve (Receiver Operating Characteristic)
- Axes: X = False Positive Rate, Y = True Positive Rate
- Good for: Overall class separation ability across all thresholds.
- Note: Can be misleading with class imbalance.
- How to read: The closer the curve is to the top-left corner, the better the model. An AUC of 1.0 is perfect; 0.5 is random.
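A minimal sketch of how the curve's points are obtained, using hypothetical scores and scikit-learn's roc_curve (not this project's code):

```python
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                   # hypothetical labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]  # hypothetical predicted probabilities

# One (FPR, TPR) point per decision threshold; plotting tpr against fpr draws the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUROC:", auc(fpr, tpr))  # area under the curve; 1.0 = perfect, 0.5 = random ranking
```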
📊 AUROC (Area Under the ROC Curve)
- Definition: Single number summarizing the ROC curve.
- Range: 0.0 to 1.0 (1.0 = perfect classifier)
- Limitation: Considers both classes equally, so it is not ideal when positives are rare.
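In practice the summary number is usually computed in one call from labels and scores; a minimal sketch with scikit-learn's roc_auc_score (inputs hypothetical):

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

print(roc_auc_score(y_true, y_score))  # equivalent to auc(fpr, tpr) from the ROC curve above
```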
📊 PR Curve (Precision-Recall)
- Axes: X = Recall, Y = Precision
- Good for: Measuring performance on the positive class only.
- More useful than ROC when positives are rare (e.g., disease detection).
- How to read: The closer the curve stays to the top-right corner, the better. A steep drop in precision as recall rises indicates the model is accumulating false positives.
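A minimal sketch of the curve's points on the same hypothetical scores, using scikit-learn's precision_recall_curve (not this project's code):

```python
from sklearn.metrics import precision_recall_curve

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

# One (recall, precision) pair per threshold; plotting precision against recall draws the PR curve.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r in zip(precision, recall):
    print(f"recall={r:.2f}  precision={p:.2f}")
```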
📊 Average Precision (AP)
- Definition: Area under the PR curve.
- Not simply the mean of precision values: each precision is weighted by the increase in recall at its threshold.
- Best for: Summarizing model performance on minority class detection.
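A minimal sketch of that definition on hypothetical inputs: the weighted sum of precisions over recall gains matches what scikit-learn's average_precision_score returns.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

precision, recall, _ = precision_recall_curve(y_true, y_score)

# AP = sum over thresholds of precision weighted by the gain in recall at that threshold.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
print(ap_manual, average_precision_score(y_true, y_score))  # the two should match
```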
📊 AUROC vs Average Precision
| Metric | Focus | Robustness to class imbalance | Emphasis on rare positives |
| --- | --- | --- | --- |
| AUROC | Overall discrimination | Poor | Low |
| Average Precision | Positive class only | Good | High |
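A small, hand-built illustration of the contrast above (all numbers hypothetical): with 5 positives among 100 cases, a model that ranks most positives highly but lets a few negatives outrank them keeps a near-perfect AUROC while its AP drops noticeably.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# 5 positives among 100 cases; a handful of negatives score higher than some positives.
y_true  = [1, 0, 1, 0, 1, 0, 1, 0, 1] + [0] * 91
y_score = [0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5] + [0.1] * 91

print("AUROC:", roc_auc_score(y_true, y_score))            # ~0.98: ranking looks excellent
print("AP:   ", average_precision_score(y_true, y_score))  # ~0.68: the few false positives near the top hurt precision badly
```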
✅ Metric Use in This Project
- AUROC and AP are both reported to evaluate model quality.
- PR curves and Average Precision are emphasized due to dataset imbalance.
- Precision, Recall, and F1-score help assess decision quality at a 0.5 threshold.
- Utility matrix evaluation is included to account for clinical relevance beyond binary metrics.
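As an illustration of the threshold-based metrics in the third bullet (not the project's actual evaluation code), a minimal sketch that binarizes hypothetical probabilities at 0.5 and reports precision, recall, and F1:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                  # hypothetical labels
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]  # hypothetical predicted probabilities

y_pred = [int(p >= 0.5) for p in y_prob]           # binarize at the 0.5 threshold

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```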