Dec 11, 2020
By Claudio Reggiani
The preprocessing steps described in the previous blog are useful for all datasets, and they ensure high-quality data. The next step is to transform that data into a format that is convenient for the machine learning model we intend to use.
We use feature engineering to transform or create variables (features) that ease the modeling phase; it is the fourth phase in the CRISP-DM model, whereas data preprocessing is the third. Here is an example of why feature engineering matters for machine learning. Suppose we have a dataset of individuals with age as an independent variable and a binary dependent variable, class, corresponding to whether they performed an action such as buying a product or not.
The objective is to understand whether age provides enough information to distinguish between the two groups of individuals defined by class. Age in its original form may be too granular for the machine learning task, whereas grouping ages into ranges may provide better information. By doing so, we have just performed the feature engineering step called binning, but more on that later.
Keep in mind that binning may help one machine learning algorithm and be irrelevant for others. It is important to remember that we need to build a feature engineering pipeline for each machine learning model we would like to use, because each requires the data to be prepared differently. It is also worth highlighting that building the right variables reduces the complexity of the model, and simple models are generally preferable to complex ones.
Among the multitude of feature engineering topics, we introduce feature scaling, how to handle categorical variables, binning and model stacking. To help you follow the article, we use the toy dataset from the previous blog, which currently looks like this:
Feature scaling
Some machine learning models rely on Euclidean distances between data points for the prediction task. One entry of a dataset with five variables corresponds to a data point in a five-dimensional space. Differences in magnitudes between values of variables will negatively affect model performance.
There are three strategies to address this issue; the first two work at the variable (column) level, whereas the third works at the instance (row) level:
• Rescaling maps each variable's values into the range [0, 1]
• Standardization replaces each variable's values with their Z-scores, so the variable has the properties of a standard normal distribution (a mean of zero and a standard deviation of one)
• Normalization scales every entry (row) of the dataset to a vector of unit length
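As a minimal sketch of these three strategies, assuming scikit-learn and a small made-up numeric table (the column names are only illustrative, not the HEW dataset):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

# Small, made-up numeric table (column names are illustrative only)
df = pd.DataFrame({
    "age": [18, 25, 40, 80],
    "pages": [3, 10, 55, 120],
})

# Rescaling: each column is mapped into the range [0, 1]
rescaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Standardization: each column is replaced by its Z-scores (mean 0, std 1)
standardized = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

# Normalization: each row is scaled to a vector of unit (Euclidean) length
normalized = pd.DataFrame(Normalizer().fit_transform(df), columns=df.columns)

print(rescaled, standardized, normalized, sep="\n\n")
```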
Machine learning algorithms such as logistic regression, k-nearest neighbors, support vector machines, neural networks and others require feature scaling. This is not the case for tree-based algorithms, such as decision trees and random forests, because variable values are compared with a threshold rather than with another variable.
We rescale pages, sessions and products using the rescaling technique described above, whereas for age we rescale against a fixed minimum of 18 and a fixed maximum of 80. This is the result:
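As a minimal sketch, assuming a pandas DataFrame with hypothetical column names matching the toy dataset (the values are made up), the rescaling step could look like this, including the fixed [18, 80] range for age:

```python
import pandas as pd

# Hypothetical extract of the toy dataset (values are made up)
df = pd.DataFrame({
    "age": [22, 35, 47, 63],
    "pages": [3, 10, 55, 120],
    "sessions": [1, 4, 9, 20],
    "products": [0, 1, 3, 7],
})

# Rescaling with each column's observed minimum and maximum
for col in ["pages", "sessions", "products"]:
    df[col] = (df[col] - df[col].min()) / (df[col].max() - df[col].min())

# For age we use a fixed range instead: min = 18, max = 80
df["age"] = (df["age"] - 18) / (80 - 18)

print(df)
```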
Categorical variables
Some models can handle categorical variables directly, while others require them to be represented differently. Let us first define the types of categorical variables and then explain the alternative representations.
Depending on the nature of the variable we can have:
• Ordinal categorical variables, where discrete values have an order. For instance, this can be the case of T-shirt sizes: S, M, L, XL. From the size perspective, S is smaller than M, M is smaller than L and L is smaller than XL.
• Nominal categorical variables, where discrete values do not have an intrinsic order. Gender (male, female) as well as colors (blue, red, purple) are nominal categorical variables.
It is important to understand that we preprocess these two types in different ways.
We transform ordinal variables into a sequence of ordered numerical values. In our case, we map the original values as follows: S = 1, M = 2, L = 3, XL = 4, because S < M < L < XL and 1 < 2 < 3 < 4, and we assume that the “distance” between S and M is equivalent to that between M and L, and between L and XL. This assumption may be incorrect in other cases, where a different mapping should be considered.
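A minimal sketch of that mapping with pandas, using the T-shirt sizes from the example above:

```python
import pandas as pd

df = pd.DataFrame({"size": ["M", "S", "XL", "L", "M"]})

# Ordinal encoding: the mapping preserves the natural order S < M < L < XL
size_mapping = {"S": 1, "M": 2, "L": 3, "XL": 4}
df["size_encoded"] = df["size"].map(size_mapping)

print(df)
```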
It is a mistake to apply that mapping to nominal variables, because we would introduce an ordering that does not exist. Blue is neither greater nor smaller than red; it is simply different. We transform nominal variables using one-hot encoding, creating one binary column for each unique value of the discrete variable. The binary column has value 1 when the original entry corresponds to its unique value and 0 otherwise. For instance, the color nominal variable is transformed into three columns: color_blue, color_red and color_purple. The column color_blue has value 1 when the color column has value blue, and likewise for color_red and color_purple.
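A minimal sketch of one-hot encoding with pandas, using the color example above:

```python
import pandas as pd

df = pd.DataFrame({"color": ["blue", "red", "purple", "blue"]})

# One-hot encoding: one binary column per unique value of the nominal variable
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
df = pd.concat([df.drop(columns="color"), one_hot], axis=1)

print(df)  # columns: color_blue, color_purple, color_red
```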
In the HEW dataset, the variable city goes through this preprocessing step. There are five different cities, therefore the original variable is converted into five binary variables.
Binning
As we saw in the earlier example with age as the independent variable, binning groups data into ranges. Ranges may have a fixed width (fixed-width binning) or adapt to the distribution of the data (quantile binning).
There are a couple of reasons to use binning in our preprocessing pipeline: (a) intuitively, the high granularity of the data may be irrelevant to the learning process; (b) as introduced in the feature scaling section, values spanning several orders of magnitude influence model training.
A well-known example of binning, performed during exploratory analysis, is the histogram. One variable is split according to fixed ranges plotted on the x-axis, whereas the y-axis shows the number of elements in each range.
The result of binning is a categorical variable that may need to go through another preprocessing step (see the previous section on how to handle categorical variables) before being fed to the machine learning algorithm.
In our toy dataset, we apply binning to the salary variable. We have three salary ranges: [0, 50k] is encoded as 1, (50k, 100k] is encoded as 2 and anything above 100k is encoded as 3.
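A minimal sketch with pandas, assuming made-up salary values; pd.cut reproduces the fixed ranges above, while pd.qcut illustrates quantile binning:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"salary": [20_000, 45_000, 60_000, 95_000, 130_000, 250_000]})

# Fixed-edge binning: [0, 50k] -> 1, (50k, 100k] -> 2, above 100k -> 3
edges = [0, 50_000, 100_000, np.inf]
df["salary_bin"] = pd.cut(df["salary"], bins=edges, labels=[1, 2, 3], include_lowest=True)

# Quantile binning: edges adapt to the distribution so each bin holds roughly the same count
df["salary_qbin"] = pd.qcut(df["salary"], q=3, labels=[1, 2, 3])

print(df)
```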
Model stacking
The feature engineering techniques presented up to this point each considered a single independent variable. In this last section, we would like to introduce how machine learning itself can be used to create a new variable.
We enrich the HEW dataset with an additional independent variable. This new variable is the result of segmenting customers into groups by means of an unsupervised machine learning method (k-means). In other words, we use one machine learning model to create a new variable that another model then uses for the prediction task. This is model stacking.
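A minimal sketch of this idea with scikit-learn; the column names, the number of clusters and the values are illustrative assumptions, not the actual HEW setup:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical, already-scaled independent variables (values are made up)
X = pd.DataFrame({
    "age":      [0.1, 0.2, 0.8, 0.9, 0.5, 0.4],
    "pages":    [0.3, 0.2, 0.9, 0.7, 0.5, 0.6],
    "sessions": [0.2, 0.1, 0.8, 0.9, 0.4, 0.5],
})

# Unsupervised step: segment customers into groups with k-means
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
X["segment"] = kmeans.fit_predict(X)

# The new 'segment' column is then fed, together with the original features,
# to the downstream supervised model -- this is the stacking idea.
print(X)
```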
Conclusion
We now have a dataset that is ready to train a binary classifier, with the task of predicting the purchased variable from the data. Performing data preprocessing is a journey worth taking to ensure that the machine learning model receives the best input we can possibly feed into it.
Downstream steps, such as the prediction task and model interpretation, will benefit from this as well.