Wiki Wiki Web

Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow

Aurélien Géron, 2019

Preparing data:

  • corr_matrix = df.corr(); corr_matrix["regressand"] shows how strongly each feature correlates linearly with the target, which hints at the features that might be of interest
  • scatter_matrix helps spot features that have predictive power without necessarily having a strong linear correlation
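  • missing values (NA): the feature can be removed, the rows containing NA can be dropped, or NA can be replaced with the median
  • in pandas: df.dropna(subset=["feature"]) drops the rows, df.drop("feature", axis=1) drops the feature, and median = df["feature"].median() followed by df["feature"].fillna(median) fills in the median; SimpleImputer(strategy="median") does the median fill in scikit-learn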
  • renormalisations (ratios of existing features) such as house/households or bedroom/room can help uncover better features
  • string data needs to be converted to ordinals (OrdinalEncoder) or to a boolean array (OneHotEncoder); the latter adds one column per category and is expensive in space, use sparingly
  • feature scaling: min-max scaling (normalisation, rescales values to the 0-1 range) and standardisation (subtract the mean, then divide by the standard deviation)
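
The correlation check and the ratio features from the notes above can be sketched as follows. This is a minimal sketch on made-up data: the column names (rooms, households, price) and the values are assumptions for illustration only, with "price" playing the role of the regressand.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
df = pd.DataFrame({
    "rooms": [4, 6, 8, 10, 12],
    "households": [2, 2, 4, 4, 6],
    "price": [100, 160, 190, 260, 300],
})

# A ratio of two raw columns can carry more signal than either one alone.
df["rooms_per_household"] = df["rooms"] / df["households"]

# Pairwise Pearson (linear) correlation between all numeric columns.
corr_matrix = df.corr()

# Correlation of every feature with the target, strongest first.
target_corr = corr_matrix["price"].sort_values(ascending=False)
print(target_corr)

# For a visual check of non-linear relationships (needs matplotlib):
# from pandas.plotting import scatter_matrix
# scatter_matrix(df)
```

Pearson correlation only measures linear dependence, which is why the scatter plots are worth a look even for features with a low coefficient.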
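
The three ways of handling missing values listed above, plus the scikit-learn equivalent of the median fill, might look like this. The DataFrame and its column names ("feature", "other") are assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in "feature".
df = pd.DataFrame({
    "feature": [1.0, np.nan, 3.0, 8.0],
    "other": [10.0, 20.0, 30.0, 40.0],
})

# Option 1: drop the rows where "feature" is missing.
rows_dropped = df.dropna(subset=["feature"])

# Option 2: drop the whole feature.
column_dropped = df.drop("feature", axis=1)

# Option 3: replace missing values with the median of the feature.
median = df["feature"].median()
filled = df["feature"].fillna(median)

# scikit-learn equivalent of option 3, applied to every numeric column.
# fit() stores the medians in imputer.statistics_ so that the same values
# can be reused later on a test set.
imputer = SimpleImputer(strategy="median")
imputed = imputer.fit_transform(df)  # returns a NumPy array
```

None of the pandas calls modify df in place; each returns a new object, so the result has to be assigned.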
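
A rough sketch of the encoding and scaling steps above, using scikit-learn's preprocessing classes. The category labels and numeric values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    OrdinalEncoder,
    StandardScaler,
)

# Hypothetical string feature, one column, four samples.
categories = np.array([["inland"], ["coast"], ["island"], ["coast"]])

# Ordinal encoding: one integer code per category (sorted alphabetically).
# Note that this implies an ordering the categories may not really have.
ordinal = OrdinalEncoder().fit_transform(categories)

# One-hot encoding: one column per category. The result is a SciPy sparse
# matrix by default, which limits the memory cost of the extra columns.
one_hot = OneHotEncoder().fit_transform(categories)
one_hot_dense = one_hot.toarray()

# Hypothetical numeric feature.
values = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Min-max scaling: (x - min) / (max - min), giving values in [0, 1].
min_max = MinMaxScaler().fit_transform(values)

# Standardisation: (x - mean) / std, giving zero mean and unit variance.
standardised = StandardScaler().fit_transform(values)
```

Standardisation does not bound the values to a fixed range, but it is much less affected by outliers than min-max scaling.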