🧪 Evaluation Metrics Overview
🎯 Precision
- Definition: How many of the predicted positives are actually correct?
- Formula: TP / (TP + FP)
- Good for: Avoiding false alarms. Important when false positives are costly.
🎯 Recall (Sensitivity)
- Definition: How many of the actual positives did the model find?
- Formula: TP / (TP + FN)
- Good for: Avoiding missed cases. Critical in healthcare to ensure no true cases are overlooked.
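A minimal sketch of both formulas above on hypothetical labels (not this project's code); the scikit-learn calls are included only as a cross-check:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical binary predictions

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

print("precision:", tp / (tp + fp), precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", tp / (tp + fn), recall_score(y_true, y_pred))     # TP / (TP + FN)
```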
📊 Precision vs Recall
- Precision emphasizes correctness of positive predictions.
- Recall emphasizes completeness of finding all true positives.
- In medical AI, recall is often prioritized, but both matter depending on clinical consequences.
📊 ROC Curve (Receiver Operating Characteristic)
- Axes: X = False Positive Rate, Y = True Positive Rate
- Good for: Overall class separation ability across all thresholds.
- Note: Can be misleading with class imbalance.
- How to read: The closer the curve is to the top-left corner, the better the model. An AUC of 1.0 is perfect; 0.5 is random.
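A minimal sketch of how the curve's points are obtained, using hypothetical scores and scikit-learn's roc_curve (not this project's code):

```python
from sklearn.metrics import roc_curve, auc

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                   # hypothetical labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]  # hypothetical predicted probabilities

# One (FPR, TPR) point per decision threshold; plotting tpr against fpr draws the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUROC:", auc(fpr, tpr))  # area under the curve; 1.0 = perfect, 0.5 = random ranking
```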
📊 AUROC (Area Under the ROC Curve)
- Definition: Single number summarizing the ROC curve.
- Range: 0.0 to 1.0 (1.0 = perfect classifier)
- Limitation: Considers both classes equally, so it is not ideal when positives are rare.
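In practice the summary number is usually computed in one call from labels and scores; a minimal sketch with scikit-learn's roc_auc_score (inputs hypothetical):

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

print(roc_auc_score(y_true, y_score))  # equivalent to auc(fpr, tpr) from the ROC curve above
```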
📊 PR Curve (Precision-Recall)
- Axes: X = Recall, Y = Precision
- Good for: Measuring performance on the positive class only.
- More useful than ROC when positives are rare (e.g., disease detection).
- How to read: The closer the curve stays to the top-right corner, the better. A steep drop in precision as recall rises indicates the model is accumulating false positives.
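A minimal sketch of the curve's points on the same hypothetical scores, using scikit-learn's precision_recall_curve (not this project's code):

```python
from sklearn.metrics import precision_recall_curve

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

# One (recall, precision) pair per threshold; plotting precision against recall draws the PR curve.
precision, recall, thresholds = precision_recall_curve(y_true, y_score)
for p, r in zip(precision, recall):
    print(f"recall={r:.2f}  precision={p:.2f}")
```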
📊 Average Precision (AP)
- Definition: Area under the PR curve.
- Not simply the mean of precision values: each precision is weighted by the increase in recall at its threshold.
- Best for: Summarizing model performance on minority class detection.
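A minimal sketch of that definition on hypothetical inputs: the weighted sum of precisions over recall gains matches what scikit-learn's average_precision_score returns.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.9]

precision, recall, _ = precision_recall_curve(y_true, y_score)

# AP = sum over thresholds of precision weighted by the gain in recall at that threshold.
ap_manual = -np.sum(np.diff(recall) * precision[:-1])
print(ap_manual, average_precision_score(y_true, y_score))  # the two should match
```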
📊 AUROC vs Average Precision
| Metric | Focus | Robustness to class imbalance | Emphasis on rare positives |
| --- | --- | --- | --- |
| AUROC | Overall discrimination | Poor | Low |
| Average Precision | Positive class only | Good | High |
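A small, hand-built illustration of the contrast above (all numbers hypothetical): with 5 positives among 100 cases, a model that ranks most positives highly but lets a few negatives outrank them keeps a near-perfect AUROC while its AP drops noticeably.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# 5 positives among 100 cases; a handful of negatives score higher than some positives.
y_true  = [1, 0, 1, 0, 1, 0, 1, 0, 1] + [0] * 91
y_score = [0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5] + [0.1] * 91

print("AUROC:", roc_auc_score(y_true, y_score))            # ~0.98: ranking looks excellent
print("AP:   ", average_precision_score(y_true, y_score))  # ~0.68: the few false positives near the top hurt precision badly
```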
✅ Metric Use in This Project
- AUROC and AP are both reported to evaluate model quality.
- PR curves and Average Precision are emphasized due to dataset imbalance.
- Precision, Recall, and F1-score help assess decision quality at a 0.5 threshold.
- Utility matrix evaluation is included to account for clinical relevance beyond binary metrics.
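As an illustration of the threshold-based metrics in the third bullet (not the project's actual evaluation code), a minimal sketch that binarizes hypothetical probabilities at 0.5 and reports precision, recall, and F1:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                  # hypothetical labels
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.7, 0.8, 0.3]  # hypothetical predicted probabilities

y_pred = [int(p >= 0.5) for p in y_prob]           # binarize at the 0.5 threshold

print("precision:", precision_score(y_true, y_pred))
print("recall:   ", recall_score(y_true, y_pred))
print("f1:       ", f1_score(y_true, y_pred))
```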