In the world of machine learning and artificial intelligence, evaluating the performance of a model is crucial. While accuracy is a commonly used metric, it is often insufficient when dealing with imbalanced datasets. This is where the F1 metric comes into play.
But what exactly is this metric, and what is considered a good value? In this article, we will explore:
- What the metric represents and how it is calculated
- The role of precision and recall in determining its value
- What constitutes a good range in different scenarios
- How to improve the performance of your model
By the end, you will have a solid understanding of when and how to use this evaluation metric and what values to aim for based on your specific use case.
1. What Is the F1 Score?
The F1 score is a measure of a classification model’s performance that considers both precision and recall. It is particularly useful in cases where there is an imbalance between classes, such as fraud detection, medical diagnosis, or spam filtering.
It is defined as the harmonic mean of precision and recall:

$$F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Where:
- Precision = $\frac{TP}{TP + FP}$ (measures how many of the predicted positive cases are actually positive)
- Recall = $\frac{TP}{TP + FN}$ (measures how many of the actual positive cases were correctly identified)
- TP (True Positives): Correctly predicted positive cases.
- FP (False Positives): Incorrectly predicted positive cases.
- FN (False Negatives): Missed positive cases.
The range is 0 to 1, where:
- 1 means perfect precision and recall.
- 0 means the model identified no true positives at all (either precision or recall is zero).
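To see these definitions in action, here is a minimal sketch using scikit-learn’s built-in metrics; the labels below are made up purely for illustration:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy labels, made up for illustration (1 = positive class)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two = 0.75
print(precision, recall, f1)
```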
2. Understanding Precision and Recall
Before determining what a good value is, it is important to understand the trade-off between precision and recall.
| Scenario | High Precision, Low Recall | Low Precision, High Recall |
|---|---|---|
| Definition | Predictions labeled positive are usually correct, but many actual positives are missed. | Most actual positives are captured, but many negatives are misclassified as positive. |
| Example | Fraud detection: We want to be very sure before labeling a transaction as fraud. | Medical diagnosis: It’s better to flag more cases for further testing. |
| Impact | Fewer false positives but more false negatives. | Fewer false negatives but more false positives. |
The F1 score balances precision and recall, making it a more informative measure than accuracy, especially on imbalanced data.
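To make the trade-off concrete, the sketch below (with made-up labels and scores) shows how lowering the decision threshold increases recall at the cost of precision:

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities for illustration
y_true = [0, 1, 1, 0, 1, 0, 1, 0]
y_scores = [0.6, 0.9, 0.4, 0.3, 0.8, 0.1, 0.75, 0.2]

for threshold in (0.7, 0.5, 0.25):
    y_pred = [1 if score >= threshold else 0 for score in y_scores]
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```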
3. What Is a Good F1 Score?
General Interpretation
| Score Range | Performance Interpretation |
|---|---|
| 0.9 – 1.0 | Excellent model performance, rare in real-world applications. |
| 0.8 – 0.9 | Very good performance, suitable for most applications. |
| 0.7 – 0.8 | Good performance, acceptable for many use cases. |
| 0.5 – 0.7 | Fair performance, may need improvement depending on the problem. |
| Below 0.5 | Poor performance, needs significant improvement. |
Industry-Specific Standards
The definition of a “good” value varies by industry and application:
| Use Case | Acceptable Range | Reasoning |
|---|---|---|
| Spam Detection | 0.8 – 0.9 | Balance between precision (avoiding false spam flags) and recall (catching spam emails). |
| Medical Diagnosis | 0.85 – 0.95 | High recall is critical to minimize false negatives. |
| Fraud Detection | 0.7 – 0.85 | Precision is crucial to avoid false alarms, but recall matters too. |
| Search Engines (Relevance) | 0.6 – 0.8 | Some irrelevant results are acceptable, but high recall improves user satisfaction. |
| Speech Recognition | 0.7 – 0.9 | Errors should be minimized, but perfect accuracy is unrealistic. |
The ideal value depends on the trade-offs acceptable for a specific task.
4. How to Improve Your Model’s Performance
If your model is underperforming, consider the following strategies:
1. Balance Precision and Recall
- Increase precision by reducing false positives (e.g., stricter classification thresholds).
- Increase recall by reducing false negatives (e.g., more inclusive classification rules).
- Adjust the classification threshold to optimize for both precision and recall.
2. Handle Class Imbalance
- Oversample the minority class or undersample the majority class.
- Use class weighting in machine learning models (e.g., `class_weight='balanced'` in scikit-learn), as sketched below.
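Here is a minimal sketch of both options in scikit-learn; the synthetic data stands in for your own dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Hypothetical imbalanced data; X and y would come from your own dataset
rng = np.random.RandomState(0)
X = rng.randn(1000, 5)
y = (rng.rand(1000) < 0.05).astype(int)  # roughly 5% positive class

# Option 1: weight classes inversely to their frequency
clf_weighted = LogisticRegression(class_weight="balanced").fit(X, y)

# Option 2: oversample the minority class before training
X_minority, y_minority = X[y == 1], y[y == 1]
X_up, y_up = resample(X_minority, y_minority, replace=True,
                      n_samples=int((y == 0).sum()), random_state=0)
X_balanced = np.vstack([X[y == 0], X_up])
y_balanced = np.concatenate([y[y == 0], y_up])
clf_oversampled = LogisticRegression().fit(X_balanced, y_balanced)
```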
3. Use More Advanced Models
- Upgrade to more sophisticated models (e.g., random forests, gradient boosting, deep learning).
- Train models on more data to improve generalization.
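For example, the sketch below trains a gradient boosting classifier on a synthetic imbalanced dataset and evaluates it with the F1 score:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (for illustration only)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("F1:", f1_score(y_test, model.predict(X_test)))
```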
4. Feature Engineering & Selection
- Remove irrelevant features and focus on the most informative ones.
- Use domain knowledge to create better features.
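For instance, a simple univariate filter such as scikit-learn’s SelectKBest can drop uninformative features; the synthetic data and choice of k below are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 20 features, only 5 of which are actually informative
X, y = make_classification(n_samples=500, n_features=20, n_informative=5, random_state=0)

# Keep the k features with the highest ANOVA F-statistic against the target
X_selected = SelectKBest(f_classif, k=5).fit_transform(X, y)
print(X_selected.shape)  # (500, 5)
```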
5. Hyperparameter Tuning
- Fine-tune model parameters using grid search or Bayesian optimization.
- Try different algorithms and model architectures.
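For example, a grid search can be scored directly on F1 rather than accuracy; the parameter grid below is just an illustrative starting point:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Hypothetical parameter grid; scoring="f1" optimizes for F1 instead of accuracy
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```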
6. Use Cross-Validation
- Implement k-fold cross-validation to assess model performance on different data subsets.
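For example, with scikit-learn (synthetic data for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# Five-fold cross-validated F1 is a more reliable estimate than a single train/test split
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, scoring="f1", cv=5)
print(scores.mean(), scores.std())
```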
Example: Adjusting Classification Threshold in Scikit-Learn
```python
from sklearn.metrics import precision_recall_curve

y_true = [0, 1, 1, 0, 1, 0, 1, 0]
y_scores = [0.2, 0.9, 0.7, 0.3, 0.8, 0.1, 0.85, 0.4]

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# precision and recall have one more entry than thresholds, so drop the last point
f1_scores = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1])
optimal_threshold = thresholds[f1_scores.argmax()]  # threshold that maximizes F1
```
Adjusting the decision threshold can help optimize the balance between precision and recall.
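Continuing the example above, the chosen threshold can then be used to turn scores into class predictions:

```python
# Convert scores into hard class predictions using the chosen threshold
y_pred = [1 if score >= optimal_threshold else 0 for score in y_scores]
```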
Conclusion
The F1 metric is a powerful tool for evaluating models, particularly in imbalanced classification tasks. A good value depends on the specific application, with higher scores required for critical tasks like medical diagnosis and fraud detection, while moderate scores may suffice for search engines and recommendation systems.
To improve your model’s performance, consider balancing precision and recall, handling class imbalance, and fine-tuning hyperparameters. By doing so, you can optimize your machine learning models for better decision-making and real-world applications.
Related Articles:
- Precision vs. Recall: Which is More Important?
- How to Handle Class Imbalance in Machine Learning
- Evaluating Machine Learning Models: Accuracy vs. F1 Score