Feature Selection
  1. Feature Engineering
  2. Exploratory Data Analysis (EDA)
  3. Feature Engineering on Numeric data
  4. Forward selection
  5. Backward elimination
  6. Mixed selection
  7. Regularizing Models
  8. Python code Example

Feature Engineering

The variables (parameters) used to construct a model are critical to its success. In their raw form, these variables are usually not in a state where they can be used for modeling.

Feature engineering is the process of transforming data from its raw state into a state suitable for modeling. It transforms the data columns into features that represent a given situation more clearly. How distinctly a feature represents an entity affects how well the model predicts the behavior of that entity.

Exploratory Data Analysis (EDA) is the first step towards feature engineering, as it is critical to assess the quality of the raw data and plan the transformations required.

Exploratory Data Analysis (EDA)

Some of the key activities performed in EDA include:

  1. Giving meaningful, standardized names to the attributes
  2. Recording meta information about the data: column-level details such as what each column is, how it was collected, its units of measurement, frequency of measurement, possible range of values, etc.
  3. Listing and addressing the challenges of using the data in its existing form, for example missing values, outliers, data shift, and sampling bias
  4. Descriptive statistics: central values, spread, skew, tails, and whether the data looks like a mixture of Gaussians
  5. Data distribution across different target classes (for classification problems)
  6. Outlier analysis and a strategy for imputation
  7. Assessing the impact of the actions taken on the data
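A few of the checks listed above can be sketched in code. This is a minimal illustration using NumPy only; the `values` array is made-up toy data, and the 1.5 × IQR outlier rule and median imputation are common conventions assumed here, not something the list prescribes.

```python
import numpy as np

# Hypothetical toy column with one missing value and one extreme value.
values = np.array([12.0, 15.0, 14.0, np.nan, 13.0, 16.0, 95.0, 14.0])

# Missing values: count them, then describe the observed values only.
n_missing = int(np.isnan(values).sum())
observed = values[~np.isnan(values)]

# Descriptive statistics: central values, spread and skewness.
mean = observed.mean()
median = float(np.median(observed))
std = observed.std()
skewness = float(np.mean(((observed - mean) / std) ** 3))

# Outlier analysis with the common 1.5 * IQR rule (an assumed convention).
q1, q3 = np.percentile(observed, [25, 75])
iqr = q3 - q1
is_outlier = (observed < q1 - 1.5 * iqr) | (observed > q3 + 1.5 * iqr)
outliers = observed[is_outlier]

# One possible imputation strategy: fill missing entries with the median,
# which is less sensitive to the extreme value than the mean.
imputed = np.where(np.isnan(values), median, values)

print(f"missing={n_missing}, mean={mean:.2f}, median={median}, skew={skewness:.2f}")
print("outliers:", outliers)
```

Comparing the mean and median before and after such steps is one simple way to assess the impact of the actions taken on the data.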

Some of the key activities performed in feature engineering include:

  • Transforming the raw data into useful attributes by deriving new attributes from existing ones, when the derived attributes are likely to carry more information than the originals.
  • Transforming the data attributes using valid mathematical transformations, such as a log transformation of the distribution, when the transformed data is likely to help create a simpler model without losing information.
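Both kinds of transformation can be shown with a small sketch. The column names (`weight_kg`, `height_m`, `income`) and the simulated data are hypothetical, chosen only to illustrate a derived attribute and a log transformation.

```python
import numpy as np

# Hypothetical raw attributes: weight in kg and height in m.
weight_kg = np.array([60.0, 72.0, 85.0, 95.0])
height_m = np.array([1.65, 1.80, 1.75, 1.70])

# Derived attribute: body-mass index combines two raw columns into one
# feature that may be more informative than either column alone.
bmi = weight_kg / height_m ** 2

def skewness(x):
    """Sample skewness: mean of the cubed standardized values."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Log transformation: simulated right-skewed (log-normal) data becomes
# approximately symmetric after taking logs.
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10.0, sigma=1.0, size=5000)
log_income = np.log(income)

print("skew before:", round(skewness(income), 2))
print("skew after: ", round(skewness(log_income), 2))
```

The log is only defined for positive values; for columns that contain zeros, `np.log1p` is a common alternative.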

Feature Engineering on Numeric data

  1. Integers and floats are the most common data types and are often used directly in building models. However, transforming them before modelling may yield better results.
  2. Feature engineering on numerical columns may take the form of:
    – Scaling the data when using algorithms that rely on distance-based similarity measurements
    – Transforming distributions with mathematical functions, for example applying a log function to bring a right-skewed distribution closer to normal
    – Binning the numeric data followed by binarization, for example using one-hot encoding
  3. Binning can make linear models more powerful when the data is spread out across the predictor's range but still follows a trend
  4. Interaction and polynomial features: another way to enrich feature representation, especially in linear models, is to use interaction features and polynomial features
  5. In the binning example, the linear model learns a constant value (an intercept) for each bin; however, we can also make it learn a slope by including the original feature
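The points above can be compared on a toy problem. The sketch below is an illustration under assumptions: the data is simulated, five equal-width bins are an arbitrary choice, and plain least squares via NumPy stands in for whatever linear model is actually used.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=300)
y = np.sin(x) + 0.5 * x + rng.normal(scale=0.1, size=x.size)  # trend plus wiggle

# Scaling: standardize to zero mean and unit variance (useful for distance-based models).
x_scaled = (x - x.mean()) / x.std()

# Binning followed by one-hot encoding: 5 equal-width bins over [0, 10].
edges = np.linspace(0.0, 10.0, 6)
bin_idx = np.digitize(x, edges[1:-1])   # integer bin index 0..4
one_hot = np.eye(5)[bin_idx]            # shape (300, 5)

def rss(design, target):
    """Residual sum of squares of a least-squares fit on the given design matrix."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return float(np.sum((target - design @ coef) ** 2))

ones = np.ones((x.size, 1))
col = x[:, None]

rss_linear = rss(np.hstack([ones, col]), y)                   # single straight line
rss_poly = rss(np.hstack([ones, col, col**2, col**3]), y)     # polynomial features
rss_binned = rss(one_hot, y)                                  # one constant per bin
rss_binned_slope = rss(np.hstack([one_hot, col]), y)          # per-bin intercept + shared slope
rss_interaction = rss(np.hstack([one_hot, one_hot * col]), y) # per-bin intercept and slope

for name, value in [("linear", rss_linear), ("poly", rss_poly),
                    ("binned", rss_binned), ("binned+slope", rss_binned_slope),
                    ("bin x feature", rss_interaction)]:
    print(f"{name:>14}: RSS = {value:.1f}")
```

Each richer design contains the previous one as a special case, so its training RSS cannot be higher; whether the extra features help on unseen data is a separate question.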

Feature Selection

  • Suppose you have a learning algorithm LA and a set of input attributes {X1, X2, ..., Xp}
  • You expect that LA will only find some subset of the attributes useful.
  • Question: How can we use cross-validation to find a useful subset?
  • Some ideas:
    – Forward selection
    – Backward elimination
    – Mixed selection
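Before turning to the search strategies, here is one way the cross-validation question could be answered for a single candidate subset. This is a sketch under assumptions: the learning algorithm is taken to be ordinary least squares, the helper name `cv_mse` is hypothetical, and the data is simulated so that only attributes 0 and 2 matter.

```python
import numpy as np

def cv_mse(X, y, columns, k=5, seed=0):
    """Mean squared error of a least-squares fit on `columns`, averaged over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)  # everything not in the held-out fold
        A_train = np.column_stack([np.ones(len(train)), X[train][:, columns]])
        A_test = np.column_stack([np.ones(len(fold)), X[fold][:, columns]])
        coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        errors.append(np.mean((y[fold] - A_test @ coef) ** 2))
    return float(np.mean(errors))

# Simulated data: only attributes 0 and 2 influence the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

score_useful = cv_mse(X, y, [0, 2])
score_noise = cv_mse(X, y, [1, 3])
print(f"CV MSE with [0, 2]: {score_useful:.3f}")
print(f"CV MSE with [1, 3]: {score_noise:.3f}")
```

Scoring every possible subset this way quickly becomes infeasible as p grows, which is why the stepwise strategies below search only a small part of the space.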

Forward Selection

  • Begin with the null model: a model that contains an intercept but no predictors
  • Then fit p simple linear regressions and add to the null model the variable that results in the lowest residual sum of squares (RSS), or equivalently the highest R^2
  • Then add to that model the variable that results in the lowest RSS (or highest R^2) for the new two-variable model
  • Continue this approach until some stopping rule is satisfied
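The steps above can be sketched as follows. The stopping rule used here, a fixed `max_features`, is an assumption made for illustration; the helper names and the simulated data are likewise hypothetical.

```python
import numpy as np

def rss(X, y, columns):
    """RSS of a least-squares fit with an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, columns]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def forward_selection(X, y, max_features):
    """Greedily add the variable that lowers RSS the most at each step."""
    selected = []                          # start from the null model
    remaining = list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        best = min(remaining, key=lambda j: rss(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Simulated data: only attributes 0 and 2 influence the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

print(forward_selection(X, y, max_features=2))
```

Because training RSS never increases when a variable is added, a practical stopping rule usually relies on something else, such as a cross-validated error or a significance threshold.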

Backward Elimination

  • Start with all variables in the model
  • For each variable, check how much removing it increases RSS (or decreases R^2), and remove the variable with the least influence, i.e., the least significant variable
  • Fit the new (p-1)-variable model and again remove the least significant variable
  • Continue this procedure until a stopping rule is reached
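A matching sketch of the procedure above, again with an assumed stopping rule (a fixed `min_features`) and simulated data; "least influence" is measured here by the smallest increase in RSS.

```python
import numpy as np

def rss(X, y, columns):
    """RSS of a least-squares fit with an intercept plus the given columns."""
    A = np.column_stack([np.ones(len(y)), X[:, columns]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sum((y - A @ coef) ** 2))

def backward_elimination(X, y, min_features):
    """Repeatedly drop the variable whose removal increases RSS the least."""
    selected = list(range(X.shape[1]))     # start with all variables
    while len(selected) > min_features:
        drop = min(
            selected,
            key=lambda j: rss(X, y, [c for c in selected if c != j]),
        )
        selected.remove(drop)
    return selected

# Simulated data: only attributes 0 and 2 influence the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

print(backward_elimination(X, y, min_features=2))
```

Unlike forward selection, this approach needs the full model to be fittable, so it assumes there are more observations than variables.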

Mixed Selection

  • This is a combination of forward and backward selection
  • We start with no variables in the model and as in forward selection, we add the variable that provides the best fit
  • At times, the significance of variables already in the model can drop as new predictors are added
  • Thus, if at any point the significance of one of the variables in the model falls below a certain threshold, we remove that variable from the model
  • We continue to perform these forward and backward steps until all variables in the model have a sufficiently high significance and all the variables outside the model would have a low significance if added to the model
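One possible reading of this procedure in code is given below. It is a sketch under assumptions: significance is measured by the absolute t-statistic of each coefficient, the entry and removal thresholds (`t_enter`, `t_remove`) are illustrative values standing in for p-value thresholds, and `max_steps` is a safety cap not mentioned in the description.

```python
import numpy as np

def t_stats(X, y, columns):
    """Absolute t-statistics of the non-intercept coefficients of an OLS fit."""
    A = np.column_stack([np.ones(len(y)), X[:, columns]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    sigma2 = resid @ resid / (len(y) - A.shape[1])   # residual variance estimate
    cov = sigma2 * np.linalg.inv(A.T @ A)            # coefficient covariance
    return np.abs(coef[1:] / np.sqrt(np.diag(cov)[1:]))

def mixed_selection(X, y, t_enter=3.0, t_remove=3.0, max_steps=100):
    selected = []
    for _ in range(max_steps):
        changed = False
        # Forward step: add the most significant outside variable, if it clears the bar.
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        if candidates:
            scores = {j: t_stats(X, y, selected + [j])[-1] for j in candidates}
            best = max(scores, key=scores.get)
            if scores[best] > t_enter:
                selected.append(best)
                changed = True
        # Backward step: drop the least significant inside variable, if it fell too low.
        if selected:
            t = t_stats(X, y, selected)
            worst = int(np.argmin(t))
            if t[worst] < t_remove:
                selected.pop(worst)
                changed = True
        if not changed:
            break
    return selected

# Simulated data: only attributes 0 and 2 influence the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

print(sorted(mixed_selection(X, y)))
```

The loop ends when a full pass neither adds nor removes a variable, which corresponds to the final condition in the list above.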

Regularizing Linear Models (Shrinkage methods)

When we have too many parameters and are exposed to the curse of dimensionality, we can resort to dimensionality reduction techniques such as principal component analysis (PCA), transforming the data and discarding the principal components with the smallest eigenvalues. Finding the right number of principal components can be a laborious process. Instead, we can employ shrinkage methods.

Shrinkage methods attempt to shrink the coefficients of the attributes and lead us towards simpler yet effective models. Two widely used shrinkage methods are ridge regression and lasso regression; ridge regression is described here:

  • Ridge regression is similar to linear regression in that the objective is to find the best-fit surface. The difference is in how the best coefficients are found: whereas linear regression minimizes the sum of squared errors (SSE), ridge regression minimizes SSE plus an additional penalty term.

Linear Regression cost function

Ridge Regression with additional term in the cost function
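In standard textbook notation, the two cost functions can be written as shown here. The symbols are assumptions rather than the blog's own: n observations, p attributes, coefficients β, and a penalty weight λ ≥ 0.

```latex
J_{\text{linear}}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2

J_{\text{ridge}}(\beta) = \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2
```

The first sum is the SSE in both cases; the ridge version adds λ times the sum of squared coefficients, and the intercept β_0 is conventionally left out of the penalty.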

  • The additional term acts as a penalty on large-magnitude coefficients. When its weight (the regularization parameter, often written as λ or alpha) is set to a high number, the coefficients are suppressed significantly. When it is set to 0, the cost function becomes the same as the linear regression cost function.
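The effect of the penalty weight can be checked numerically. The sketch below uses the closed-form ridge solution on centered data with NumPy; the function name `ridge_fit`, the parameter name `lam`, and the simulated data are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients: minimize SSE + lam * sum(coef**2).

    The data is centered first so that the intercept is not penalized.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    p = X.shape[1]
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

# Simulated data with known coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([4.0, -3.0, 0.5]) + rng.normal(scale=0.5, size=100)

coef_ols = ridge_fit(X, y, 0.0)        # lam = 0 reduces to ordinary least squares
coef_mild = ridge_fit(X, y, 10.0)
coef_strong = ridge_fit(X, y, 1000.0)  # large lam shrinks coefficients towards 0

for lam, coef in [(0.0, coef_ols), (10.0, coef_mild), (1000.0, coef_strong)]:
    print(f"lam={lam:>6}: coef={np.round(coef, 3)}, norm={np.linalg.norm(coef):.3f}")
```

In practice the penalty weight is usually chosen by cross-validation, and the attributes are scaled beforehand because the penalty depends on the units of each coefficient.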

This brings us to the end of the blog on Feature Selection.


