Building Advanced Solutions Using Data Mining and Classification Models

Hi, my name is Kaushal Kishore, and I work at one of the largest service-based IT organizations as Associate Vice President for ITSM, SIAM & Data Science and Business Analytics. I have spent most of my career (almost 13 years) as a member of the Service Management and Analytics team. While developing visualizations and client-facing volumetric trend reports with deep analysis at my current company, I kept running into the same problem: I felt the urge to know the business more broadly and to understand its dynamics. So instead of sitting back and only playing around with existing tools, I wanted to contribute more to my organization as a frontline resource and add more value.

To solve these problems, I used the CART and Random Forest (RF) data mining classification models. CART and Random Forest are well suited to variables with nonlinear relationships and work well with a large number of predictors. CART can be used for both regression and classification problems. For regression, the best split points are chosen by minimizing the squared or absolute error. Random Forest builds multiple CART trees on “bootstrapped” samples of the data and then combines their predictions. A bootstrap sample is a random sample drawn with replacement. For regression, the combination is usually the average of the predictions from all the CART trees; for classification, it is usually a majority vote. Random Forest generally has better predictive power and accuracy than a single CART model because combining many trees lowers the variance. Unlike a CART model, however, a Random Forest’s rules are not easily interpretable. CART’s advantages include easily interpretable rules and automatic handling of variable selection, missing values, outliers, local effect modelling, variable interactions, and nonlinear relationships. Missing values in a variable are handled by surrogate splits: when the variable used at a split is missing for an observation, CART falls back on a similar split on another variable that is not missing for that observation.
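
To make the ideas above concrete, here is a minimal sketch in plain Python of a regression split chosen by squared error, a bootstrap sample, and an averaged ensemble. It is a simplification and not the model I actually built: each "tree" is a single split on one predictor, the function names and the toy data are made up for illustration, and a real Random Forest also considers a random subset of predictors at each split and grows much deeper trees.

```python
import random
from statistics import mean


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Pick the threshold on one predictor that minimizes total squared error."""
    unique = sorted(set(xs))
    # Candidate thresholds are midpoints between consecutive distinct values.
    candidates = [(a + b) / 2 for a, b in zip(unique, unique[1:])]
    best_t, best_err = None, float("inf")
    for t in candidates:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        err = sse(left) + sse(right)
        if err < best_err:
            best_t, best_err = t, err
    return best_t  # None when every x is identical


def fit_stump(xs, ys):
    """Fit a one-split regression tree: (threshold, left mean, right mean)."""
    t = best_split(xs, ys)
    if t is None:
        overall = mean(ys)
        return (None, overall, overall)
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    return (t, mean(left), mean(right))


def predict_stump(stump, x):
    t, left_mean, right_mean = stump
    if t is None or x <= t:
        return left_mean
    return right_mean


def bootstrap_sample(xs, ys, rng):
    """Draw len(xs) rows at random WITH replacement."""
    idx = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in idx], [ys[i] for i in idx]


def fit_bagged(xs, ys, n_trees=25, seed=0):
    """Fit one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(xs, ys, rng)) for _ in range(n_trees)]


def predict_bagged(stumps, x):
    """Combine the trees by averaging their predictions."""
    return mean([predict_stump(s, x) for s in stumps])


# Toy data (hypothetical): two clearly separated groups.
xs = [1, 2, 3, 4, 10, 11, 12, 13]
ys = [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1]

forest = fit_bagged(xs, ys)
print("best split on full data:", best_split(xs, ys))
print("ensemble prediction at x=2:", predict_bagged(forest, 2))
print("ensemble prediction at x=12:", predict_bagged(forest, 12))
```

Because every tree sees a slightly different bootstrap sample, the individual predictions differ a little, and averaging them smooths out that tree-to-tree variation.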

CART is more sensitive to outliers in the target variable (Y) than in the predictors (X). Outliers in the target may end up isolated in their own terminal nodes, which limits their effect on the rest of the tree. CART is more robust to outliers in the predictors (independent variables) because its splits depend on the ordering of the values, not on their magnitude. My recommendation to lay out the data in a detailed manner was accepted, and it has helped a lot in saving time and improving productivity within a short span. Ultimately, my intention was to treat “the potential customer” as the target (dependent) variable in this data. Throughout the process, I looked at multiple classification metrics (accuracy, recall, precision, F1-score, AUC) and the ROC curve. While doing so, I could also identify the most important factors that help a company bring potential customers and customer satisfaction to the organization.
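
The classification metrics mentioned above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not my project code: the labels, scores, and the 0.5 cut-off are hypothetical, with 1 standing for "potential customer" and 0 for everyone else. AUC is computed here as the share of (positive, negative) pairs in which the positive case receives the higher score, which equals the area under the ROC curve.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a correct ranking.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical data: true labels and model scores for eight prospects.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.7, 0.4, 0.6, 0.1, 0.8, 0.3]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 cut-off

metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(metrics)
print("AUC:", auc)
```

Precision and recall depend on the chosen cut-off, while AUC summarizes how well the scores rank potential customers across all cut-offs, which is why it is useful to look at them together.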

This exercise has given me confidence and learnings with which I have helped my organisation and team, and it has set a benchmark that I aim to break again in the future. This is ongoing learning for me because, in my view, the real fight is only with yourself. You should become better every single day.