In a previous post, Danna explained confusion matrices. Now she discusses common classification scoring metrics.
In "What is a Confusion Matrix?" I discussed the accuracy metric as a way to assess a classification confusion matrix. There are some cons to using accuracy as a model scoring metric, one being that it is biased towards more frequently observed classes. In this post I am going to explain other classification scoring metrics: precision, recall, and F-score.
The Nexosis API calculates these metrics automatically behind the scenes to select the best model for your dataset, but they are still good to know.
Binary classification
Binary classification assigns elements of a given set to one of two groups: 1 denotes a positive classification and 0 a negative one. Unless stated otherwise, the positive class is considered the class of interest, so more emphasis is placed on positives. Depending on the problem, this can be an advantage or a drawback.
| | Predicted: 1 | Predicted: 0 |
|---|---|---|
| Actual: 1 | True positives | False negatives |
| Actual: 0 | False positives | True negatives |
Here are some definitions based on the confusion matrix above.
- True positive (\(TP\)) - The number of correctly classified positives.
- False positive (\(FP\)) - The number of results classified as positive that were actually negative.
- True negative (\(TN\)) - The number of correctly classified negatives.
- False negative (\(FN\)) - The number of results classified as negative that were actually positive.
- True negative rate - \(\frac{TN}{TN+FP}\) - The proportion of actual negatives that are correctly classified.
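The definitions above can be sketched in a few lines of Python. The label lists here are made up purely for illustration:

```python
# Counting TP, FP, TN, and FN for a binary problem (1 = positive, 0 = negative).
# These example labels are invented for illustration only.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # correctly classified positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # negatives classified as positive
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # correctly classified negatives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # positives classified as negative

print(tp, fp, tn, fn)  # 3 1 3 1
```

These four counts are all you need to fill in the confusion matrix and compute the metrics below.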
In addition to these definitions, let's discuss other classification scoring metrics that can be calculated from a binary class confusion matrix.
- Precision - \(\frac{TP}{TP+FP}\) - The number of correctly classified positives out of all observations classified as positive. Because this metric ignores false negatives, a model can have high precision while still missing many actual positives, so it shouldn't be used alone.
- Recall - \(\frac{TP}{TP+FN}\) - The number of correctly classified positives out of the actual number of positives. This metric is useful when correctly identifying positives matters more than correctly identifying negatives. Because it ignores false positives, it shouldn't be used alone either.
- F1 score - \(\frac{2}{\frac{1}{precision}+\frac{1}{recall}} = \frac{2 \cdot precision \cdot recall}{precision + recall}\) - The harmonic mean of precision and recall. If precision and recall are not equally important, a weighted F-score can be used to weigh one more heavily.
- Harmonic mean - This isn't a classification scoring metric itself, but it's worth defining since it is used to calculate the F1 score. It is usually the appropriate average when the quantities being averaged are rates. It can be expressed as the reciprocal of the arithmetic mean of the reciprocals of the given set of observations. Let's calculate the harmonic mean of 3, 7, and 9: \(\left(\frac{3^{-1} + 7^{-1} + 9^{-1}}{3}\right)^{-1} = \frac{3}{\frac{1}{3} + \frac{1}{7} + \frac{1}{9}} = \frac{3}{0.587} \approx 5.11\)
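Using the binary definitions above, precision, recall, and F1 can be computed directly. The counts here are illustrative values, not from a real model:

```python
# Precision, recall, and F1 from illustrative confusion-matrix counts.
tp, fp, fn = 3, 1, 1

precision = tp / (tp + fp)  # correct positives out of everything predicted positive
recall    = tp / (tp + fn)  # correct positives out of all actual positives

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)  # 0.75 0.75 0.75
```

Note how F1 only rewards models that do well on both metrics: if either precision or recall drops toward zero, the harmonic mean drags F1 down with it.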
Multiclass classification
In the "What is a Confusion Matrix" post I used a multiclass problem to explain the purpose of a confusion matrix. Classification scoring metrics are calculated differently for multiclass problems. Precision, recall, and F1 score are all calculated using either the micro or macro average. In the micro-average method, you sum up the individual true positives, false positives, and false negatives across the classes and then use these pooled values to calculate precision and recall. The macro-average method is more straightforward and is what we use for classification models in the Nexosis API: you find the precision and recall of each class and take the average. Note that \(k\) is the number of classes.
- Precision, micro method - \(\frac{TP_1+TP_2+\cdots+TP_k}{TP_1+TP_2+\cdots+TP_k+FP_1+FP_2+\cdots+FP_k}\)
- Precision, macro method - \(\frac{Precision_1+Precision_2+\cdots+Precision_k}{k}\)
- Recall, micro method - \(\frac{TP_1+TP_2+\cdots+TP_k}{TP_1+TP_2+\cdots+TP_k+FN_1+FN_2+\cdots+FN_k}\)
- Recall, macro method - \(\frac{Recall_1+Recall_2+\cdots+Recall_k}{k}\)
- F1 score - The macro and micro F1 scores are calculated as the harmonic mean of the macro- or micro-averaged precision and recall, respectively. Either way, the F1 score always falls between 0 and 1, and a higher score indicates a better model.
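The difference between the two averaging methods is easiest to see in code. This sketch uses made-up per-class counts for a three-class problem:

```python
# Micro- vs. macro-averaged precision and recall for k = 3 classes.
# Per-class TP/FP/FN counts below are invented for illustration.
counts = {
    "A": {"tp": 10, "fp": 2, "fn": 1},
    "B": {"tp": 4,  "fp": 1, "fn": 3},
    "C": {"tp": 2,  "fp": 5, "fn": 4},
}

# Micro average: pool the counts across all classes, then compute once.
tp = sum(c["tp"] for c in counts.values())  # 16
fp = sum(c["fp"] for c in counts.values())  # 8
fn = sum(c["fn"] for c in counts.values())  # 8
micro_precision = tp / (tp + fp)
micro_recall    = tp / (tp + fn)

# Macro average: compute per-class metrics first, then take the plain mean.
precisions = [c["tp"] / (c["tp"] + c["fp"]) for c in counts.values()]
recalls    = [c["tp"] / (c["tp"] + c["fn"]) for c in counts.values()]
macro_precision = sum(precisions) / len(precisions)
macro_recall    = sum(recalls) / len(recalls)

print(round(micro_precision, 3), round(macro_precision, 3))  # 0.667 0.64
```

Because the macro average weights every class equally, class C's poor precision pulls the macro score below the micro score, which is dominated by the well-predicted class A. That sensitivity to rare classes is exactly why the macro average is often preferred when every class matters.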
Keep in mind that these metrics together help you score your classification models against each other.