Why Will Using CRISP-DM Make You a Better Data Scientist?

What is CRISP-DM?

Contributed by: Trisit Kumar Chatterjee

Before we try to understand why adopting CRISP-DM will make you a better Data Scientist, let's first understand what CRISP-DM is.

Typical analytics projects involve multiple steps such as data cleaning, preparation, modelling and model evaluation. A project may take several months, so it is important to give it a structure.

The structure for analytics problem solving is called the CRISP-DM framework – Cross-Industry Standard Process for Data Mining. It is an open standard process model that describes common approaches used by data mining experts, and it is the most widely used analytics model.

The six different phases of CRISP-DM are as below:

Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment

This model is an idealized sequence of events. In practice, several tasks can be performed in a different order, and it will often be necessary to backtrack to previous tasks and repeat actions. The model does not try to capture all possible routes through the data mining process.

CRISP-DM was conceived in 1996 and became a European Union project under the ESPRIT funding initiative in 1997. The project was led by five companies: Integral Solutions Ltd (ISL), Teradata, Daimler AG, NCR Corporation and OHRA, an insurance company.

Based on current research, CRISP-DM is the most widely used data mining process model, largely because it solved problems that already existed across the data mining industry. One drawback of the model is that it does not cover project management activities. A key reason for its success is that it is industry-, tool- and application-neutral.


CRISP-DM in Detail

1. Business Understanding

Understanding the business requirement is of paramount importance. You must understand the problem clearly to convert it into a well-defined analytics problem. Only then can you lay out a sound strategy to solve it. Otherwise, you'll be investing all your time and energy in solving the wrong problem! This phase involves the following tasks:

Determine the business objectives clearly: The first thing you must do in any project is to find out exactly what you're trying to accomplish. That's less obvious than it sounds. Many data miners have invested time in data analysis, only to find that their management wasn't particularly interested in the issue they were investigating.

Assess the situation: Here you get into more detail on the issues associated with your business goals. You will go deeper into fact-finding, building out a much more detailed explanation of the issues outlined in the business goals task.

Determine the goals of data analysis: Achieving the business objective often requires action from many people, not just the data miner. So now, you must define your little part within the bigger picture. If the business objective is to reduce customer attrition, for example, your data-mining goals might be to identify attrition rates for several customer segments and develop models to predict which customers are at greatest risk.

Produce a project plan: Now you lay down every step that you, the data miner or the data scientist, intend to take until the project is accomplished and the results are presented and reviewed.

2. Data Understanding

The data understanding phase goes hand in hand with the business understanding phase. It focuses on identifying, collecting and scrutinizing the data sets that can help you achieve the project goals. This phase also has four tasks:

Collect initial data: Obtain the required data and (if needed) load it into your analysis tool, e.g., SAS, SPSS, Jupyter Notebook (Python) or R.

Describe data: Examine the data and document its surface properties like data format, number of records, or field identities:
1. Check data volume and examine its gross properties.
2. Check the accessibility and availability of attributes.
3. Identify attribute types, ranges, correlations and identities.
4. Understand the meaning of each attribute and attribute value in business terms.
5. For each attribute, compute basic statistics (e.g., distribution, average, max, min, standard deviation, variance, mode, skewness).
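
The basic statistics in step 5 can be sketched with Python's standard library alone. This is a minimal illustration: the `monthly_spend` column and the `describe` helper are hypothetical names introduced for this example, and a real project would more likely rely on its analysis tool's built-in summary functions.

```python
import statistics

# Hypothetical sample attribute: monthly spend for eight customers.
monthly_spend = [20.0, 35.0, 35.0, 40.0, 55.0, 60.0, 75.0, 80.0]


def describe(values):
    """Return basic descriptive statistics for one numeric attribute."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "variance": statistics.variance(values),  # sample variance
        "std_dev": statistics.stdev(values),      # sample standard deviation
    }


summary = describe(monthly_spend)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note that `statistics.variance` and `statistics.stdev` compute the sample versions; use `pvariance` and `pstdev` if the data covers the whole population.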

Explore data: Find insights from the data. Query it, visualize it, and identify relationships among the data.

Verify data quality: A couple of activities are listed below:

Identify special values and catalogue their meaning.
Does the data cover all the cases required? Does it contain errors, and how common are they?
Identify missing attributes and blank fields, and establish what the missing data means.
Do the meanings of attributes and contained values fit together?
Check spelling of values (e.g., same value but sometimes beginning with a lowercase letter, sometimes with an uppercase letter).
Check the plausibility of values (e.g., flag fields in which all or nearly all records have the same value).
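
Two of the checks above, counting blank fields and spotting values that differ only by letter case, could look roughly like this. The `records` data and both helper functions are hypothetical and only illustrate the idea, assuming that `None` and the empty string both mean "missing".

```python
from collections import Counter, defaultdict

# Hypothetical records; None and "" both stand for a missing value here.
records = [
    {"city": "Paris", "age": 34},
    {"city": "paris", "age": None},
    {"city": "Berlin", "age": 29},
    {"city": "", "age": 41},
]


def missing_counts(rows):
    """Count missing (None or empty-string) values per attribute."""
    counts = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                counts[field] += 1
    return dict(counts)


def case_variants(rows, field):
    """Group spellings of a text attribute that differ only by letter case."""
    groups = defaultdict(set)
    for row in rows:
        value = row[field]
        if isinstance(value, str) and value:
            groups[value.lower()].add(value)
    # Keep only the groups that really have more than one spelling.
    return {key: sorted(spellings) for key, spellings in groups.items()
            if len(spellings) > 1}


print(missing_counts(records))
print(case_variants(records, "city"))
```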

Tasks 2, 3 and 4 (describe, explore and verify) together make up what is commonly called Exploratory Data Analysis.

3. Data Preparation

The objective of this stage, which is often referred to as “data wrangling” or “data munging”, is to develop the final data set(s) for modelling. It covers all activities needed to construct the final dataset from the initial raw data. Data preparation tasks are likely to be performed multiple times and not in any prescribed order. Tasks include table, record and attribute selection as well as transformation and cleaning of data for modelling tools.

It has five tasks:

Select data: Determine which data sets will be used and document reasons for inclusion/exclusion.

Clean data: Data miners very often spend most of their time on this task. Without it, you'll likely fall victim to garbage-in, garbage-out. A common practice in this task is to correct, impute, or remove erroneous values. A couple of common activities are listed below:
1. Correct, remove or ignore noise.
2. Decide how to deal with special values and their meaning.
3. Decide on the aggregation level, the treatment of missing values, etc.
4. Check for outliers.
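
As a rough sketch of the cleaning activities above, the snippet below imputes a missing value and flags outliers. Median imputation and the 1.5 × IQR rule are assumptions made for this example, not choices prescribed by CRISP-DM, and the `ages` data is hypothetical.

```python
import statistics

# Hypothetical attribute with one missing value (None) and one extreme value.
ages = [23, 25, None, 27, 29, 31, 120]


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


filled = impute_median(ages)
outliers = iqr_outliers(filled)
print(filled)
print(outliers)
```

Whether a flagged value is corrected, removed or kept is a decision for the data miner; the code only points at candidates.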

Construct data: Extract new attributes that will be helpful. For example, derive someone’s body mass index from height and weight fields.
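
The body mass index example can be written in a few lines. The field names `height_m` and `weight_kg`, and the assumption that heights are in metres and weights in kilograms, are hypothetical.

```python
def body_mass_index(weight_kg, height_m):
    """BMI = weight in kilograms divided by the square of height in metres."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical records holding only the raw attributes.
people = [
    {"height_m": 1.80, "weight_kg": 81.0},
    {"height_m": 1.60, "weight_kg": 51.2},
]

# Derive the new attribute and store it next to the original fields.
for person in people:
    person["bmi"] = round(
        body_mass_index(person["weight_kg"], person["height_m"]), 1
    )

print(people)
```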

Integrate data: Create new data sets by combining data from multiple sources.
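
A minimal sketch of integration, assuming two hypothetical sources (`customers` and `orders`) that share a `customer_id` key: the orders are first aggregated per customer and then joined onto the customer records.

```python
# Hypothetical source 1: one row per customer.
customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": "corporate"},
]

# Hypothetical source 2: zero or more orders per customer.
orders = [
    {"customer_id": 1, "amount": 120.0},
    {"customer_id": 1, "amount": 80.0},
    {"customer_id": 3, "amount": 50.0},  # no matching customer record
]


def total_by_customer(order_rows):
    """Aggregate order amounts to one total per customer_id."""
    totals = {}
    for row in order_rows:
        key = row["customer_id"]
        totals[key] = totals.get(key, 0.0) + row["amount"]
    return totals


def left_join(customer_rows, totals):
    """Keep every customer; customers without orders get a total of 0.0."""
    return [
        {**row, "total_amount": totals.get(row["customer_id"], 0.0)}
        for row in customer_rows
    ]


combined = left_join(customers, total_by_customer(orders))
print(combined)
```

The order for customer 3 is silently dropped by this join, which is exactly the kind of mismatch worth documenting when sources are combined.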

Format data: Re-format data as necessary. For example, you might convert string values that store numbers to numeric values so that you can perform mathematical operations.
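
The string-to-number conversion mentioned above might be sketched as follows. The `to_number` helper and the sample values are hypothetical, and the code assumes that a comma is a thousands separator, which does not hold in every locale.

```python
def to_number(text):
    """Convert a numeric string to float; return None if it cannot be parsed.

    Assumes "," is a thousands separator (an assumption of this example).
    """
    try:
        return float(text.replace(",", "").strip())
    except (ValueError, AttributeError):
        return None


# Hypothetical raw column in which numbers were stored as strings.
raw_prices = ["19.99", " 1,250 ", "n/a", "42"]
prices = [to_number(p) for p in raw_prices]

# Mathematical operations now work on the values that could be parsed.
total = sum(p for p in prices if p is not None)
print(prices, total)
```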

Overall, data preparation is the phase where you actually do data cleaning, imputation and feature engineering, a process in which you use domain knowledge of your data to create additional relevant features that increase the predictive power of the learning algorithm and make your machine learning models perform better.

4. Modelling

As the first step in modelling, select the actual modelling technique to be used. Even though you may have already selected a tool during the Business Understanding phase, this task refers to the specific modelling technique, e.g., decision-tree or random forest building, or neural network generation with backpropagation. If several techniques are applied, execute this task separately for each of them.

Here you’ll likely build and assess various models based on several different modelling techniques. This phase has four tasks:

Select modelling techniques: Determine which algorithms to try (e.g. regression, ensemble models, neural net).

Generate test design: Depending on your modelling approach, you might need to split the data into training, test, and validation sets.
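
One simple way to produce such a split is to shuffle the records and cut them at fixed proportions. The 70/15/15 ratio, the fixed seed and the `split_dataset` helper are assumptions made for this sketch; time-ordered or highly imbalanced data usually calls for a different design.

```python
import random


def split_dataset(rows, train=0.7, validation=0.15, seed=42):
    """Shuffle rows reproducibly and split into train/validation/test."""
    shuffled = list(rows)                  # copy, so the input is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],        # the remainder is the test set
    )


rows = list(range(100))  # hypothetical stand-in for 100 records
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))
```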

Build model: A couple of common steps in this task:
1. Set initial model parameters and document reasons for choosing those values.
2. Run the selected technique on the input dataset.
3. Post-process data mining results (e.g., editing rules, displaying trees).
4. Record parameter settings used to produce the model.
5. Describe the model, its special features, behavior and interpretation.

Assess model: Generally, multiple models are competing against each other, and the data scientist needs to interpret the model results based on domain knowledge, the pre-defined success criteria, and the test design. The outcome of this step frequently leads to model tuning iterations until, per the CRISP-DM guide, “the best model(s)” are found.
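
Comparing competing models against a pre-defined success criterion can be sketched as below. The labels, the two sets of predictions, the use of accuracy as the metric and the 0.8 threshold are all hypothetical; a real project would take the metric and threshold from the success criteria agreed earlier.

```python
def accuracy(predicted, actual):
    """Share of predictions that match the actual labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical validation labels and predictions from two candidate models.
actual = [1, 0, 1, 1, 0, 1, 0, 0]
candidate_predictions = {
    "decision_tree": [1, 0, 1, 0, 0, 1, 1, 0],
    "random_forest": [1, 0, 1, 1, 0, 1, 1, 0],
}

SUCCESS_THRESHOLD = 0.8  # hypothetical pre-defined success criterion

scores = {name: accuracy(preds, actual)
          for name, preds in candidate_predictions.items()}
approved = {name: s for name, s in scores.items() if s >= SUCCESS_THRESHOLD}
best_model = max(approved, key=approved.get) if approved else None

print(scores)
print(best_model)
```

If no candidate meets the threshold, `best_model` stays `None`, which corresponds to another tuning iteration rather than a forced choice.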

5. Evaluation

In this stage, you thoroughly evaluate the model and review the steps executed to construct it, to ascertain that it properly achieves the business objectives. A key objective is to determine whether some important business issue has not been sufficiently considered. At the end of this phase, a decision on the use of the data mining results should be reached. Whereas the assess model task of the modelling phase focuses on model accuracy and the model's ability to generalize, the evaluation phase looks more broadly at which model best meets the business needs and what to do next. The evaluation phase has three tasks:

Evaluate results: Understand the data mining results and check their impact on the data mining goal. Do the models meet the business success criteria? Which one(s) become approved models for the business? Rank the results according to the business success criteria and check their impact on the initial application goal.

Review process: Summarize the process review (activities that were missed or should be repeated). Was anything overlooked? Were all steps properly executed? Identify failures, misleading steps, possible alternative actions and unexpected paths.

Determine next steps: Analyze the potential for deployment of each result. Estimate the potential for improvement of the current process. Recommend alternative continuations and refine the process plan. It is also very important to make a decision at this stage: based on the results, the process review, and the remaining resources and budget, decide how to proceed to the next stage. Rank the possible actions and select one of them.

6. Deployment

A model is not particularly useful unless the customer can access its results. The knowledge gained needs to be organized and presented in a way that the customer can use. The complexity of this phase varies widely: depending on the requirements, deployment can be as simple as generating a report or as complex as implementing a repeatable data mining process across the enterprise. This final phase has four tasks:

Plan deployment: Develop and document a plan for deploying the model. A couple of common points to consider:
1. How will the knowledge or information be propagated to users? How will the use of the result be monitored or its benefits measured?
2. How will the model or software result be deployed within the organization’s systems? How will its use be monitored and its benefits measured (where applicable)?
3. Identify possible problems when deploying the data mining results.

Plan monitoring and maintenance: Develop a thorough monitoring and maintenance plan to avoid issues during the operational phase (or post-project phase) of a model.
1. What could change in the environment? How will accuracy be monitored?
2. When should the data mining model not be used anymore? What should happen if it could no longer be used? (Update model, new data mining project)
3. Will the business objectives of the use of the model change over time?
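
One possible answer to "How will accuracy be monitored?" is to compare recent accuracy with the accuracy measured at deployment time and raise a flag when it drops too far. The `needs_review` helper, the weekly figures, the three-week window and the 0.05 tolerance are hypothetical choices for this sketch.

```python
def needs_review(accuracy_history, baseline, tolerance=0.05, window=3):
    """Return True when the mean of the last `window` accuracy values
    is more than `tolerance` below the baseline accuracy."""
    if len(accuracy_history) < window:
        return False  # not enough observations to judge yet
    recent = accuracy_history[-window:]
    return sum(recent) / window < baseline - tolerance


# Hypothetical weekly accuracy of a deployed model.
history = [0.91, 0.90, 0.89, 0.84, 0.82, 0.80]

print(needs_review(history[:3], baseline=0.90))  # early weeks
print(needs_review(history, baseline=0.90))      # after the decline
```

A flag like this only signals that the model should be looked at; whether to update it or start a new data mining project remains part of the maintenance plan.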

Produce final report: The project team documents a summary of the project which might include a final presentation of data mining results.

Review project: Conduct a project retrospective about what went well, what could have been better, and how to improve in the future. A couple of common activities are listed below:
1. Interview people involved in the project. Interview end users. What could have been done better? Do they need additional support? Summarize the feedback and write the experience documentation.
2. Analyze the process (what went right or wrong, what was done well and what needs to be improved).
3. Document the specific data mining process (How can results and experience of applying the model be fed back into the process?). Abstract from details to make the experience useful for future projects.

Why does CRISP-DM make you a better Data Scientist?

Now that we have thoroughly drilled down into the different phases of CRISP-DM, let's see what its benefits are:

CRISP-DM provides a uniform framework for:
1. Guidelines
2. Experience documentation
This methodology is cost-effective, as it includes a number of processes for carrying out simple data mining tasks, and these processes are well established across industry.
CRISP-DM encourages best practices and allows projects to be replicated.
This methodology provides a uniform framework for planning and managing a project.
Being a cross-industry standard, CRISP-DM can be implemented in any Data Science project irrespective of its domain.

CRISP-DM has been the de facto industry standard process model for data mining, with an expanding number of applications across a wide array of industries. It is extremely important that every data scientist and data miner understands the different steps of this model.