Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow
Aurélien Géron, 2019
Preparing data:
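- corr_matrix = df.corr(), then corr_matrix["regressand"] shows how strongly each feature correlates linearly with the target, i.e. which features might be of interest
- scatter_matrix (from pandas.plotting) helps spot features that have predictive power without necessarily being linearly correlated with the target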
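- missing values (NA): the feature can be removed, NAs can be replaced with the median, or rows with NAs can be dropped
- in pandas: df.dropna(subset=["feature"]) drops the rows, df.drop("feature", axis=1) drops the feature, median = df["feature"].median() followed by df["feature"].fillna(median) fills with the median; SimpleImputer(strategy="median") is the Scikit-Learn equivalent of the last option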
- renormalisations (ratios) such as house/households or bedrooms/rooms can help uncover better features
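- string data needs to be converted to ordinals (OrdinalEncoder) or to a boolean array (OneHotEncoder); the latter adds one column per category, so it is expensive in space when there are many categories (the output is a sparse matrix by default), use sparingly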
- feature scaling: min-max scaling (normalisation, rescales each feature to [0, 1]) and standardisation (subtract the mean, divide by the stdev)
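
Minimal sketch of the steps above on a toy DataFrame. The data and column names (rooms, bedrooms, households, region, price) are made up for illustration and are not taken from the book; "price" plays the role of the regressand. It assumes pandas, NumPy and Scikit-Learn are installed.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# Hypothetical toy data; all names and values are illustrative only.
df = pd.DataFrame({
    "rooms": [6.0, 8.0, 5.0, 7.0, 4.0],
    "bedrooms": [2.0, np.nan, 1.0, 3.0, 1.0],
    "households": [2.0, 4.0, 1.0, 2.0, 2.0],
    "region": ["coast", "inland", "coast", "island", "inland"],
    "price": [300.0, 200.0, 350.0, 400.0, 150.0],  # the regressand
})
num_cols = ["rooms", "bedrooms", "households"]

# Linear correlation of each numeric feature with the regressand.
corr_matrix = df[num_cols + ["price"]].corr()
corr_with_price = corr_matrix["price"].sort_values(ascending=False)

# Replace missing values with the per-column median.
imputer = SimpleImputer(strategy="median")
num = pd.DataFrame(
    imputer.fit_transform(df[num_cols]), columns=num_cols, index=df.index
)

# Ratio features built from existing columns.
num["rooms_per_household"] = num["rooms"] / num["households"]
num["bedrooms_per_room"] = num["bedrooms"] / num["rooms"]

# String column to ordinals (one column) or one-hot (one column per category).
ordinal = OrdinalEncoder().fit_transform(df[["region"]])
one_hot = OneHotEncoder().fit_transform(df[["region"]])  # sparse by default

# Feature scaling: min-max to [0, 1], or standardisation to mean 0, stdev 1.
min_max = MinMaxScaler().fit_transform(num)
standard = StandardScaler().fit_transform(num)
```

In a real project the imputer, encoders and scalers would be fitted on the training set only and then applied to the test set with transform().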