# CM 3 : Classification

- MNIST dataset = 70,000 images (28x28) of handwritten digits, each with its label

## Binary Classifier

Example task: detect whether a digit is a 5 or 'not 5'

SGDClassifier:
Score of 3-fold-cross-validation = \[0.95035, 0.96035, 0.9604\]

DummyClassifier:
Score of 3-fold-cross-validation = \[0.90965, 0.90965, 0.90965\]

-> Skewed dataset: ~90% of instances are 'not 5', so always predicting 'no' already gives ~90% accuracy
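The accuracy trap above can be reproduced without downloading MNIST. This is a sketch on a hypothetical skewed binary dataset (synthetic stand-in for the "5 vs not-5" task):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the "5 vs not-5" task:
# a skewed binary dataset where ~90% of labels are negative
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = (rng.random(1000) < 0.1).astype(int)  # ~10% positives

# DummyClassifier always predicts the majority class ('not 5')
dummy = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(dummy, X, y, cv=3, scoring="accuracy")
print(scores)  # ~0.90 on every fold: accuracy alone is misleading here
```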

### Confusion Matrices

|                     | Predicted Negative | Predicted Positive |
| ------------------- | ------------------ | ------------------ |
| **Actual Negative** | TN                 | FP                 |
| **Actual Positive** | FN                 | TP                 |

With :

- TN = model predicts negative, label is negative (OK)
- TP = model predicts positive, label is positive (OK)
- FN = model predicts negative, label is positive (KO)
- FP = model predicts positive, label is negative (KO)
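These four counts can be read directly from scikit-learn's `confusion_matrix`, whose layout (row = actual, column = predicted) matches the table above. Toy labels, purely illustrative:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = "is a 5", 0 = "not a 5" (illustrative, not MNIST)
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Row = actual class, column = predicted class
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 4 1 1 2
```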

### Precision/Recall formulas

Precision (the banker): the model only predicts positive when it is confident in the prediction;
$$ Precision = \frac{TP}{TP+FP} $$

Recall (the doctor): when in doubt, the model still classifies the instance as positive;
$$ Recall = \frac{TP}{TP+FN} $$

### F score

Combines Precision and Recall in a single metric

#### F1 score

It is the harmonic mean (more weight to low values) of precision and recall

$$ F_1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} = \frac{TP}{TP + \frac{FN + FP}{2}} $$
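A quick sanity check of the three formulas on the toy counts TP=2, FP=1, FN=1 (so precision = recall = F1 = 2/3), using scikit-learn's metric functions:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Same toy labels as before: TP=2, FP=1, FN=1
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

p = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
r = recall_score(y_true, y_pred)     # TP / (TP + FN) = 2/3
f1 = f1_score(y_true, y_pred)        # harmonic mean, also 2/3 here
print(round(p, 3), round(r, 3), round(f1, 3))
```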

### Decision function

If this score is greater than a threshold, the model assigns the instance to the positive class; otherwise it assigns it to the negative class. (Ex: SGD classifier)

(see the curve in the slides)
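A minimal sketch of the thresholding idea, on a synthetic dataset (`make_classification` is a stand-in; the threshold value 3.0 is arbitrary, chosen only to illustrate the trade-off):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = SGDClassifier(random_state=0).fit(X, y)

# decision_function returns one score per instance;
# predict() is equivalent to thresholding that score at 0
scores = clf.decision_function(X)
assert np.array_equal(clf.predict(X), (scores > 0).astype(int))

# Raising the threshold predicts positive less often:
# fewer FP (higher precision) but more FN (lower recall)
high_precision_pred = (scores > 3.0).astype(int)  # 3.0 is arbitrary
```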

### Precision/Recall curve

Recall on the X axis and Precision on the Y axis => makes it easy to build a classifier with a desired precision by picking the corresponding threshold
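Sketch of picking a threshold from the curve with `precision_recall_curve` (synthetic skewed data again; the 90% precision target is an arbitrary example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Skewed synthetic dataset: ~90% negatives
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SGDClassifier(random_state=0).fit(X_tr, y_tr)
scores = clf.decision_function(X_te)

# One (precision, recall) point per candidate threshold
precisions, recalls, thresholds = precision_recall_curve(y_te, scores)

# First threshold reaching >= 90% precision (target is illustrative;
# min() guards the edge case where only the final precision=1.0 qualifies)
idx = (precisions >= 0.90).argmax()
threshold_90 = thresholds[min(idx, len(thresholds) - 1)]
```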

### ROC Curve

- ROC = Receiver operating characteristic : common tool used with binary classifier
  - very similar to precision/recall curve
  - plots the TP rate (recall) vs the FP rate (also called fall-out)
  - FPR = ratio of negative instances that are incorrectly classified as positive
  $$ FPR = 1-TNR $$
  - TNR : ratio of negative instances that are correctly classified as negative, it is also called specificity
- ROC curve plots the sensitivity (recall) versus 1-specificity
- Once again, it is a trade-off
- One way to compare classifiers is to measure the area under the curve (ROC AUC)
- A perfect classifier will have a ROC AUC equal to 1
- A purely random classifier : ROC AUC = 0.5
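The curve points and the AUC come from `roc_curve` and `roc_auc_score` (sketch on the same kind of synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
scores = SGDClassifier(random_state=0).fit(X_tr, y_tr).decision_function(X_te)

fpr, tpr, thresholds = roc_curve(y_te, scores)  # points of the ROC curve
auc = roc_auc_score(y_te, scores)               # area under it
print(auc)  # 1.0 = perfect classifier, 0.5 = purely random
```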

### ROC curve or PR curve

- Prefer PR curve when :
  - the positive class is rare
  - you care more about the false positives than the false negatives
- Otherwise use the ROC curve
- Example : Considering the previous ROC curve you may think that the classifier is really good, but this is mostly because there are few positives (5s) compared to the negatives (non-5s). In contrast, the PR curve makes it clear that the classifier has room for improvement.

## Multiclass Classification

Distinguishing between more than two classes; such models are also called multinomial classifiers

- Some classifiers are able to handle multiple classes natively (Logistic reg, Random forest, gaussian NB, ...)
- Others are strictly binary classifiers (SGD, SVC, ...)

### How to perform multiclass classification with multiple binary classifiers

#### One-versus-all (OVA/OVR)

Create a system that can classify the instances into k classes by training k binary classifiers.
To classify a new instance :

- take the decision for each classifier
- Select the class whose classifier outputs the highest score

On MNIST : 10 binary classifiers, one per digit: a 0-detector, a 1-detector, ...
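OvR can be forced explicitly with `OneVsRestClassifier`. Sketch on scikit-learn's small built-in 8x8 digits dataset (a lightweight stand-in for MNIST):

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.multiclass import OneVsRestClassifier

# Small stand-in for MNIST: 1797 8x8 digit images
X, y = load_digits(return_X_y=True)

# One binary SGD classifier per digit; predict() picks the class
# whose detector outputs the highest decision score
ovr = OneVsRestClassifier(SGDClassifier(random_state=0)).fit(X, y)
print(len(ovr.estimators_))  # 10, one detector per digit
```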

#### One-versus-One (OVO)

Train a binary classifier for every pair of labels.
One to distinguish 0s and 1s, another to distinguish 0s and 2s …
For N classes => needs N × (N-1) / 2 classifiers

To classify an image:

- run the image through all the classifiers
- see which class wins the most duels

Each classifier only needs to be trained on the part of the training set containing the two classes it must distinguish

Some algorithms (e.g., SVM) scale poorly with the size of the training set ⇒ OvO is preferred because it is faster to train many classifiers on small training sets than a few classifiers on large training sets
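OvO can likewise be forced with `OneVsOneClassifier`; on the 10-class digits dataset it trains 10 × 9 / 2 = 45 pairwise classifiers, each on only its two classes' samples:

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# One SVC per pair of digits, each trained only on the
# subset of the training set containing its two classes
ovo = OneVsOneClassifier(SVC()).fit(X, y)
print(len(ovo.estimators_))  # 45 = 10 * 9 / 2
```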

### Error Analysis

Use ConfusionMatrixDisplay from the module sklearn.metrics
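A sketch of the underlying analysis (the matrix below is the same one that `ConfusionMatrixDisplay.from_predictions(y, y_pred, normalize="true")` would plot; computing it directly keeps the example free of a plotting backend). Normalizing by row turns counts into per-class error rates, so systematic confusions stand out:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

X, y = load_digits(return_X_y=True)
# Out-of-fold predictions, so errors are not measured on training data
y_pred = cross_val_predict(SGDClassifier(random_state=0), X, y, cv=3)

# Row-normalized matrix = per-class rates (what the display would plot)
cm = confusion_matrix(y, y_pred, normalize="true")
np.fill_diagonal(cm, 0)  # hide correct predictions to highlight errors
i, j = np.unravel_index(cm.argmax(), cm.shape)
print(f"most frequent confusion: true {i} predicted as {j}")
```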

### Data augmentation

Create new instances by taking existing instances and tweaking them a bit

Eg : shifting the digits a few pixels in some direction in the MNIST dataset
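A sketch of the shifting idea with plain NumPy (`shift_image` is a hypothetical helper, not a scikit-learn function; `np.roll` wraps around, so the wrapped rows/columns are zeroed out):

```python
import numpy as np

def shift_image(image, dx, dy):
    """Shift a 2-D image by (dx, dy) pixels, filling with zeros.

    Hypothetical helper for illustration; np.roll wraps around,
    so the rows/columns that wrapped are cleared afterwards.
    """
    shifted = np.roll(image, (dy, dx), axis=(0, 1))
    if dy > 0:
        shifted[:dy, :] = 0
    elif dy < 0:
        shifted[dy:, :] = 0
    if dx > 0:
        shifted[:, :dx] = 0
    elif dx < 0:
        shifted[:, dx:] = 0
    return shifted

# Each training image yields 4 extra shifted copies (same label)
image = np.zeros((28, 28))
image[10:18, 10:18] = 1.0  # dummy "digit"
augmented = [shift_image(image, dx, dy)
             for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]]
```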

## Multilabel Classification

In some cases, you may want your classifier to output multiple classes for each instance.

- Face recognition : several people in the same picture.
- News : may have several topics (e.g., diplomacy, sport, politics, business).
=> A system that outputs multiple binary tags is called a multilabel classification system

To go further : have a look at ClassifierChain (sklearn.multioutput) to capture dependencies between labels
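A minimal multilabel sketch: attach two illustrative binary labels to each digit ("is large", i.e. ≥ 7, and "is odd"); `KNeighborsClassifier` handles multilabel targets natively:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)

# Two illustrative binary tags per digit: "is large" and "is odd"
y_multilabel = np.c_[y >= 7, y % 2 == 1]

# KNeighborsClassifier supports multilabel targets out of the box
knn = KNeighborsClassifier().fit(X, y_multilabel)
print(knn.predict(X[:1]))  # a (1, 2) array of booleans, one per tag
```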

## Multioutput Classification

Multioutput-multiclass Classification or just Multioutput Classification

A generalization of multilabel classification where each label can be multiclass (i.e., can have more than two possible values).

Example : image denoising
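A toy denoising sketch: the input is a noisy image, the target is the clean image, i.e. one multiclass label (a pixel intensity, 0-16 in the digits dataset) per pixel:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neighbors import KNeighborsClassifier

# Clean 8x8 digit images, pixel intensities 0-16
X_clean, _ = load_digits(return_X_y=True)

# Corrupt them with random noise: noisy image in, clean image out
rng = np.random.default_rng(0)
X_noisy = X_clean + rng.integers(0, 5, X_clean.shape)

# Each of the 64 outputs is a multiclass label (a pixel intensity)
knn = KNeighborsClassifier().fit(X_noisy, X_clean)
denoised = knn.predict(X_noisy[:1])
print(denoised.shape)  # (1, 64): one full "label vector" per image
```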