Diffstat (limited to 'ML/02_e2e/MACH_2.md')
 ML/02_e2e/MACH_2.md | 9 +++++++++
 1 file changed, 9 insertions(+), 0 deletions(-)
diff --git a/ML/02_e2e/MACH_2.md b/ML/02_e2e/MACH_2.md
new file mode 100644
index 0000000..3e2240b
--- /dev/null
+++ b/ML/02_e2e/MACH_2.md
@@ -0,0 +1,9 @@
+MACH: End-to-end ML pipeline
+==
+1. Try a Support Vector Machine regressor (`sklearn.svm.SVR`) with various hyperparameters, such as `kernel="linear"` (with various values for the `C` hyperparameter) or `kernel="rbf"` (with various values for the `C` and `gamma` hyperparameters). Note that SVMs don't scale well to large datasets, so you should probably train your model on just the first 5,000 instances of the training set and use only 3-fold cross-validation, or else it will take hours. Don't worry about what the hyperparameters mean for now (we will study them further in the course). How does the best SVR predictor perform?
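A sketch of what this search might look like. Synthetic data stands in for the housing set here (real code would use your own `X_train`/`y_train` and the first 5,000 instances); the grid values are illustrative, not prescribed:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the housing data; replace with your X_train, y_train.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(1000, 8))
y_train = X_train @ rng.normal(size=8) + rng.normal(scale=0.1, size=1000)

param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1.0, 10.0]},
    {"kernel": ["rbf"], "C": [1.0, 10.0], "gamma": [0.01, 0.1]},
]
grid_search = GridSearchCV(SVR(), param_grid, cv=3,
                           scoring="neg_root_mean_squared_error")
grid_search.fit(X_train, y_train)  # on housing data: X_train[:5000]
print(grid_search.best_params_)
```

The negative-RMSE scoring lets `GridSearchCV` maximize the score while you minimize the error.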
+2. Try replacing the `GridSearchCV` with a `RandomizedSearchCV`.
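For the swap in item 2, `RandomizedSearchCV` samples hyperparameter values from distributions instead of enumerating a grid. A minimal sketch on synthetic data (the log-uniform ranges are assumptions, not values from the exercise):

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.svm import SVR
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=1000)

# Sample C and gamma from log-uniform distributions rather than a fixed grid.
param_distribs = {
    "kernel": ["rbf"],
    "C": loguniform(1e0, 1e3),
    "gamma": loguniform(1e-3, 1e0),
}
rnd_search = RandomizedSearchCV(SVR(), param_distribs, n_iter=10, cv=3,
                                scoring="neg_root_mean_squared_error",
                                random_state=42)
rnd_search.fit(X, y)
print(rnd_search.best_params_)
```

Unlike a grid, `n_iter` caps the budget regardless of how many hyperparameters you search.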
+3. Try adding a `SelectFromModel` transformer in the preparation pipeline to select only the most important attributes.
+4. Create a new pipeline that runs the previously defined preparation pipeline and adds a `SelectFromModel` transformer based on a `RandomForestRegressor` before the final regressor.
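Items 3 and 4 might be sketched like this, with a `StandardScaler` standing in for the previously defined preparation pipeline and synthetic data in place of the housing set (the `threshold="median"` choice is an assumption):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=500)

selector_pipeline = Pipeline([
    ("preparation", StandardScaler()),  # placeholder for the real prep pipeline
    ("selector", SelectFromModel(RandomForestRegressor(random_state=42),
                                 threshold="median")),  # keep the top half
    ("svr", SVR(C=1.0, kernel="rbf")),
])
selector_pipeline.fit(X, y)
print(selector_pipeline.named_steps["selector"].get_support())
```

`SelectFromModel` fits the forest during `fit()`, then drops every feature whose importance falls below the threshold before the data reaches the final regressor.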
+5. Try creating a custom transformer that trains a k-Nearest Neighbors regressor (`sklearn.neighbors.KNeighborsRegressor`) in its `fit()` method, and outputs the model's predictions in its `transform()` method. Then add this feature to the preprocessing pipeline, using latitude and longitude as the inputs to this transformer. This will add a feature in the model that corresponds to the housing median price of the nearest districts.
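One possible shape for item 5's transformer, shown on toy latitude/longitude data (the class name and `n_neighbors` default are assumptions):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils.validation import check_is_fitted

class KNNPredictionTransformer(BaseEstimator, TransformerMixin):
    """Fits a KNN regressor in fit() and outputs its predictions in transform()."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        self.knn_ = KNeighborsRegressor(n_neighbors=self.n_neighbors)
        self.knn_.fit(X, y)
        return self

    def transform(self, X):
        check_is_fitted(self)
        # One new column: the KNN model's prediction for each instance.
        return self.knn_.predict(X).reshape(-1, 1)

# Toy stand-in for the districts' latitude/longitude and median prices.
rng = np.random.default_rng(2)
latlon = rng.uniform(size=(200, 2))
prices = 100 * latlon[:, 0] + rng.normal(scale=1.0, size=200)

knn_tr = KNNPredictionTransformer(n_neighbors=3).fit(latlon, prices)
print(knn_tr.transform(latlon[:5]).shape)  # (5, 1)
```

Fed latitude and longitude, the output column approximates the median price of each district's nearest neighbors.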
+6. Automatically explore some preparation options using `RandomizedSearchCV`.
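The key mechanism for item 6 is scikit-learn's double-underscore syntax, which lets a search reach into pipeline steps. A minimal sketch on synthetic data with missing values (the steps and candidate values are illustrative):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 4))
X[rng.random(X.shape) < 0.1] = np.nan  # inject missing values
y = np.nan_to_num(X[:, 0]) + rng.normal(scale=0.1, size=300)

pipe = Pipeline([
    ("imputer", SimpleImputer()),
    ("scaler", StandardScaler()),
    ("model", Ridge()),
])
# "step__param" names let the search tune preparation and model together.
param_distribs = {
    "imputer__strategy": ["mean", "median", "most_frequent"],
    "model__alpha": [0.1, 1.0, 10.0],
}
search = RandomizedSearchCV(pipe, param_distribs, n_iter=5, cv=3,
                            random_state=42)
search.fit(X, y)
print(search.best_params_)
```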
+7. Try to implement the `StandardScalerClone` class again from scratch, then add support for the `inverse_transform()` method: executing `scaler.inverse_transform(scaler.fit_transform(X))` should return an array very close to `X`. Then add support for feature names: set `feature_names_in_` in the `fit()` method if the input is a DataFrame. This attribute should be a NumPy array of column names. Lastly, implement the `get_feature_names_out()` method: it should have one optional `input_features=None` argument. If `input_features` is passed, the method should check that its length matches `n_features_in_` and that it matches `feature_names_in_` if that attribute is defined, then return `input_features`. If `input_features` is `None`, the method should return `feature_names_in_` if it is defined, or `np.array(["x0", "x1", ...])` with length `n_features_in_` otherwise.
\ No newline at end of file
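One way item 7's spec could be satisfied (a sketch, not the reference solution; the `with_mean` option is an assumption):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.validation import check_array, check_is_fitted

class StandardScalerClone(BaseEstimator, TransformerMixin):
    def __init__(self, with_mean=True):
        self.with_mean = with_mean

    def fit(self, X, y=None):
        X_orig = X
        X = check_array(X)  # validates that X is a 2-D array of finite floats
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]
        if hasattr(X_orig, "columns"):  # DataFrame input
            self.feature_names_in_ = np.array(X_orig.columns, dtype=object)
        return self

    def transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        assert X.shape[1] == self.n_features_in_
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_

    def inverse_transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        assert X.shape[1] == self.n_features_in_
        X = X * self.scale_
        return X + self.mean_ if self.with_mean else X

    def get_feature_names_out(self, input_features=None):
        if input_features is None:
            return getattr(self, "feature_names_in_",
                           np.array([f"x{i}" for i in range(self.n_features_in_)]))
        if len(input_features) != self.n_features_in_:
            raise ValueError("Invalid number of features")
        if hasattr(self, "feature_names_in_") and not np.all(
                self.feature_names_in_ == input_features):
            raise ValueError("input_features != feature_names_in_")
        return np.asarray(input_features, dtype=object)

# Round-trip check: inverse_transform should undo transform.
X = np.random.default_rng(4).normal(size=(100, 3))
scaler = StandardScalerClone().fit(X)
print(np.allclose(scaler.inverse_transform(scaler.transform(X)), X))  # True
```

Note the scikit-learn convention: learned attributes (`mean_`, `scale_`, `feature_names_in_`) take a trailing underscore and are set only in `fit()`, so `check_is_fitted()` can tell a fitted instance from a fresh one.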