From 5d49c6f6c1e17027bcb94e0672756fbbebf9dd7e Mon Sep 17 00:00:00 2001
From: Martial Simon
Date: Fri, 6 Mar 2026 00:32:32 +0100
Subject: feat: ML week of March 1

---
 ML/03_classif/recap_cm3.md | 140 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 140 insertions(+)
 create mode 100644 ML/03_classif/recap_cm3.md

diff --git a/ML/03_classif/recap_cm3.md b/ML/03_classif/recap_cm3.md
new file mode 100644
index 0000000..711acc3
--- /dev/null
+++ b/ML/03_classif/recap_cm3.md
@@ -0,0 +1,140 @@

# CM 3 : Classification

- MNIST dataset = 70,000 images (28x28) of handwritten digits, each with a label

## Binary Classifier

Check whether a digit is a 5 or 'not 5'

SGDClassifier:
3-fold cross-validation scores = \[0.95035, 0.96035, 0.9604\]

DummyClassifier:
3-fold cross-validation scores = \[0.90965, 0.90965, 0.90965\]

-> Skewed dataset: about 90% of instances are 'not 5', so always answering 'no' already gives ~90% accuracy

### Confusion Matrices

|                 | Predicted negative | Predicted positive |
| --------------- | ------------------ | ------------------ |
| Actual negative | TN                 | FP                 |
| Actual positive | FN                 | TP                 |

With :

- TN = model predicts negative, label is negative (correct)
- TP = model predicts positive, label is positive (correct)
- FN = model predicts negative, label is positive (error)
- FP = model predicts positive, label is negative (error)

### Precision/Recall formulas

Precision (the banker): the model only assigns the positive class when it is sure about the prediction;
$$ Precision = \frac{TP}{TP+FP} $$

Recall (the doctor): when in doubt, the model still assigns the instance to the positive class;
$$ Recall = \frac{TP}{TP+FN} $$

### F score

Combines precision and recall into a single metric

#### F1 score

It is the harmonic mean (more weight to low values) of precision and recall

$$ F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{TP}{TP + \frac{FN + FP}{2}} $$

### Decision function

The classifier computes a score for each instance; if this score is greater than a threshold, it assigns the instance to the positive class, otherwise to the negative class. (Ex: SGD classifier)

(see the curve in the slides)

### Precision/Recall curve

Precision plotted against recall (recall on the X axis, precision on the Y axis) => easy to pick the threshold that gives a classifier with the desired precision

### ROC Curve

- ROC = Receiver Operating Characteristic: a common tool used with binary classifiers
  - very similar to the precision/recall curve
  - plots the TP rate (recall) vs the FP rate (also called fall-out)
  - FPR = ratio of negative instances that are incorrectly classified as positive
    $$ FPR = 1 - TNR $$
  - TNR = ratio of negative instances that are correctly classified as negative, also called specificity
- So the ROC curve plots sensitivity (recall) versus 1 - specificity
- Once again, it is a trade-off
- One way to compare classifiers is to measure the area under the curve (ROC AUC)
- A perfect classifier has a ROC AUC equal to 1
- A purely random classifier: ROC AUC = 0.5

### ROC curve or PR curve

- Prefer the PR curve when:
  - the positive class is rare
  - you care more about the false positives than the false negatives
- Otherwise use the ROC curve
- Example: looking at the previous ROC curve you may think the classifier is really good, but this is mostly because there are few positives (5s) compared to negatives (non-5s). In contrast, the PR curve makes it clear that the classifier has room for improvement.

## Multiclass Classification

To distinguish between more than two classes, aka multinomial classifiers

- Some classifiers handle multiple classes natively (logistic regression, random forest, Gaussian NB, ...)
- Others are strictly binary classifiers (SGD, SVC, ...)

### How to perform multiclass classification with multiple binary classifiers

#### One-versus-all (OVA/OVR)

Create a system that can classify instances into k classes by training k binary classifiers.
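The one-versus-all idea just described can be sketched with scikit-learn's `OneVsRestClassifier` wrapper. A minimal sketch, assuming scikit-learn is installed and using the built-in 8x8 `load_digits` dataset as a small stand-in for MNIST:

```python
# One-versus-all sketch: wrap the strictly binary SGDClassifier into
# ten one-vs-rest detectors, one per digit. Prediction picks the
# detector that outputs the highest decision score.
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

ova = OneVsRestClassifier(SGDClassifier(random_state=42))
ova.fit(X_train, y_train)

print(len(ova.estimators_))    # one binary classifier per class -> 10
print(ova.predict(X_test[:5])) # predicted digit for the first 5 test images
```

Note that scikit-learn already applies an OvA/OvR strategy automatically when a strictly binary estimator such as `SGDClassifier` is fit on a multiclass target; the explicit wrapper just makes the k underlying detectors visible.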
To classify a new instance :

- get the decision score from each classifier
- select the class whose classifier outputs the highest score

On MNIST: 10 binary classifiers, one per digit: a 0-detector, a 1-detector, ...

#### One-versus-One (OVO)

Train a binary classifier for every pair of labels:
one to distinguish 0s and 1s, another to distinguish 0s and 2s, ...
For N classes => needs N x (N-1) / 2 classifiers

To classify an image:

- run the image through all the classifiers
- see which class wins the most duels

Each classifier only needs to be trained on the part of the training set containing the two classes it must distinguish

Some algorithms (e.g., SVM) scale poorly with the size of the training set => OvO is preferred because it is faster to train many classifiers on small training sets than a few classifiers on a large training set

### Error Analysis

Use ConfusionMatrixDisplay from the sklearn.metrics module

### Data augmentation

Create new instances by taking existing instances and tweaking them a bit

E.g.: shifting the digits in some direction in the MNIST dataset

## Multilabel Classification

In some cases, you may want your classifier to output multiple classes for each instance.

- Face recognition: several people in the same picture.
- News: one article may have several topics (e.g., diplomacy, sport, politics, business).

=> A system that outputs multiple binary tags is called a multilabel classification system

To go further: have a look at ClassifierChain (in sklearn.multioutput) to capture dependencies between labels

## Multioutput Classification

Also called multioutput-multiclass classification.

A generalization of multilabel classification where each label can be multiclass (i.e., can have more than two possible values).

Example: image denoising
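The multilabel setup can be sketched by deriving two binary tags from the digit labels. The "large" and "odd" tags below are made up for illustration; `KNeighborsClassifier` handles multilabel targets natively (assuming scikit-learn, again on the small built-in `load_digits` set):

```python
# Multilabel sketch: each digit gets two made-up binary tags,
# "large" (digit >= 7) and "odd". The classifier must output
# both tags for every instance.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
y_multilabel = np.c_[y >= 7, y % 2 == 1]  # shape (n_samples, 2)

knn = KNeighborsClassifier()
knn.fit(X, y_multilabel)

# One row of two booleans per instance: [is_large, is_odd]
print(knn.predict(X[:3]))
```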
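To recap the binary-classification metrics from the top of these notes, here is a small hand-checkable sketch; the labels and predictions are made up so the formulas can be verified by eye:

```python
# Hand-checkable recap of the confusion matrix and P/R/F1 formulas
# on made-up labels (1 = '5', 0 = 'not 5').
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]  # 3 TP, 1 FN, 1 FP, 5 TN

# sklearn's layout is [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))  # TN=5 FP=1 / FN=1 TP=3
print(precision_score(y_true, y_pred))   # TP/(TP+FP) = 3/4 = 0.75
print(recall_score(y_true, y_pred))      # TP/(TP+FN) = 3/4 = 0.75
print(f1_score(y_true, y_pred))          # 0.75 (P == R here)
```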