# CM 3 : Classification

- MNIST Dataset = 70,000 images (28x28) of handwritten digits, each with a label

## Binary Classifier

Check if an image is a 5 or 'Not 5'.

- SGDClassifier: scores of 3-fold cross-validation = \[0.95035, 0.96035, 0.9604\]
- DummyClassifier: scores of 3-fold cross-validation = \[0.90965, 0.90965, 0.90965\]

-> Skewed dataset: 90% of the instances are 'Not 5', so always answering 'No' already gives ~90% accuracy.

### Confusion Matrices

| TN | FP |
| -- | -- |
| FN | TP |

With:

- TN = model predicts negative, label is negative (OK)
- TP = model predicts positive, label is positive (OK)
- FN = model predicts negative, label is positive (KO)
- FP = model predicts positive, label is negative (KO)

### Precision/Recall formulas

Precision (the banker): the model only classifies an instance as positive if it is sure about the prediction;

$$ Precision = \frac{TP}{TP+FP} $$

Recall (the doctor): when in doubt, the model classifies the instance as positive;

$$ Recall = \frac{TP}{TP+FN} $$

### F score

Combines Precision and Recall into a single metric.

#### F1 score

The harmonic mean (which gives more weight to low values) of precision and recall:

$$ F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{TP}{TP + \frac{FN + FP}{2}} $$

### Decision function

Each instance gets a score from the decision function. If this score is greater than a threshold, the classifier assigns the instance to the positive class; otherwise it assigns it to the negative class.
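The confusion-matrix cells and the precision/recall/F1 formulas above can be checked on toy labels; a minimal sketch with scikit-learn (the label values below are illustrative, not actual MNIST results):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels: 1 = "is a 5", 0 = "not a 5" (illustrative values, not MNIST)
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 1, 0, 1, 0, 1])

# scikit-learn's layout: rows = true class, columns = predicted class
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # same value as precision_score(y_true, y_pred)
recall = tp / (tp + fn)      # same value as recall_score(y_true, y_pred)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```

Note that scikit-learn's confusion matrix puts TN in the top-left cell, matching the table above.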
(Ex: SGD classifier) (see the curve in the slides)

### Precision/Recall curve

Recall on the X axis, Precision on the Y axis => makes it easy to pick a threshold that yields a classifier with the desired precision.

### ROC Curve

- ROC = Receiver Operating Characteristic: a common tool used with binary classifiers
- Very similar to the precision/recall curve
- Plots the TP rate (recall) vs the FP rate (also called fall-out)
- FPR = ratio of negative instances that are incorrectly classified as positive

$$ FPR = 1 - TNR $$

- TNR: ratio of negative instances that are correctly classified as negative, also called specificity
- So the ROC curve plots sensitivity (recall) versus 1 - specificity
- Once again, it is a trade-off
- One way to compare classifiers is to measure the Area Under the Curve (ROC AUC)
  - A perfect classifier has a ROC AUC equal to 1
  - A purely random classifier has a ROC AUC of 0.5

### ROC curve or PR curve

- Prefer the PR curve when:
  - the positive class is rare
  - you care more about false positives than false negatives
- Otherwise use the ROC curve
- Example: looking at the previous ROC curve you may think the classifier is really good, but this is mostly because there are few positives (5s) compared to negatives (non-5s). In contrast, the PR curve makes it clear that the classifier has room for improvement.

## Multiclass Classification

To distinguish between more than two classes, aka multinomial classification.

- Some classifiers can handle multiple classes natively (Logistic regression, Random forest, Gaussian NB, ...)
- Others are strictly binary classifiers (SGD, SVC, ...)

### How to perform multiclass classification with multiple binary classifiers

#### One-versus-all (OvA/OvR)

Classify instances into k classes by training k binary classifiers. To classify a new instance:

- get the decision score from each classifier
- select the class whose classifier outputs the highest score

On MNIST: 10 binary classifiers, one per digit: 0-detector, 1-detector, ...
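The OvA/OvR scheme above can be sketched with scikit-learn's `OneVsRestClassifier` wrapping a binary `SGDClassifier`; this uses the small built-in digits dataset (8x8 images) as a stand-in for MNIST:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# Small built-in digits dataset (8x8 images), a stand-in for MNIST
X, y = load_digits(return_X_y=True)

# OvR: trains one binary SGD detector per digit (10 in total)
ovr = OneVsRestClassifier(SGDClassifier(random_state=42))
ovr.fit(X, y)

scores = ovr.decision_function(X[:1])  # shape (1, 10): one score per detector
pred = scores.argmax(axis=1)           # pick the class with the highest score
```

Note that passing `SGDClassifier` directly to `fit` with 10 classes would also work: scikit-learn applies OvR automatically for strictly binary classifiers; the explicit wrapper just makes the mechanism visible.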
#### One-versus-One (OvO)

Train a binary classifier for every pair of labels: one to distinguish 0s and 1s, another to distinguish 0s and 2s, ... For N classes => needs N x (N-1) / 2 classifiers.

To classify an image:

- run the image through all the classifiers
- see which class wins the most duels

Each classifier only needs to be trained on the part of the training set containing the two classes it must distinguish. Some algorithms (e.g., SVM) scale poorly with the size of the training set ⇒ OvO is preferred because it is faster to train many classifiers on small training sets than a few classifiers on a large training set.

### Error Analysis

Use ConfusionMatrixDisplay from the module sklearn.metrics.

### Data augmentation

Create new instances by taking existing instances and tweaking them a bit. E.g.: shifting the digits in some direction in the MNIST dataset.

## Multilabel Classification

In some cases, you may want your classifier to output multiple classes for each instance.

- Face recognition: several people in the same picture.
- News: an article may have several topics (e.g., diplomacy, sport, politics, business).

=> A system that outputs multiple binary tags is called a multilabel classification system.

To go further: have a look at ClassifierChain (sklearn.multioutput) to capture dependencies between labels.

## Multioutput Classification

Also called Multioutput-multiclass Classification. A generalization of multilabel classification where each label can be multiclass (i.e., can have more than two possible values).

Example: image denoising.
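The image-denoising example can be sketched as a multioutput task: each of the 64 pixels of a digits image becomes one multiclass label, and a `KNeighborsClassifier` (which handles multioutput natively) predicts all clean pixels from the noisy image. The noise level and train/test split below are arbitrary choices for illustration, not from the course:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
X, _ = load_digits(return_X_y=True)        # pixel values in 0..16

X_noisy = X + rng.integers(0, 5, X.shape)  # corrupt every pixel with noise
y_clean = X.astype(int)                    # targets = the 64 clean pixel values

# Multioutput: the target is a matrix with one multiclass label per pixel
knn = KNeighborsClassifier()
knn.fit(X_noisy[:1500], y_clean[:1500])

denoised = knn.predict(X_noisy[1500:])     # one predicted clean value per pixel
```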