Perfection is achieved only by making mistakes, and the same holds true when you build models with machine learning algorithms. At the beginning, it is rarely obvious how to proceed, and practitioners are bound to make mistakes, especially newcomers to the domain. Here is a list of the most common mistakes made while working with machine learning algorithms. Hopefully, you will draw valuable insights from this article that you can apply in your own work.
5 Common Machine Learning Errors:
Machine Learning Error 1: Lack of understanding the mathematical aspect of machine learning algorithms
Mathematics is a big part of machine learning: it helps in describing the problem in the most efficient way with the least ambiguity, and in understanding the behaviour of systems and models. Ignoring the mathematical treatment of algorithms can lead to many problems, including but not limited to:
- Adopting a limited interpretation of an algorithm
- Using inefficient optimisation algorithms without knowing the nature of the optimisation problem being solved
Mathematical mastery of algorithms comes with practice. If you are implementing advanced algorithms from scratch, including their internal optimisation routines, then it is important to learn the mathematics behind them.
Machine Learning Error 2: Data Preparation and Sampling
Data cleansing is the most time-consuming part of machine learning projects, taking up to 60% of the time, followed by data ingestion at almost 20%. In total, as much as 80% of the time spent developing machine learning models is consumed working with data, which is enough to establish its importance.
One important aspect of data cleansing is treating missing values in the dataset. Common techniques for filling a column's missing values are its mean, median, or mode, but in some cases these are not the right metrics and we need to look beyond them.
Also, in the case of classification, one needs to consider the class structure of the data set. Introducing a new ‘Undefined’ category can work here, or, better still, a machine learning algorithm can be used to predict the missing values.
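The two imputation ideas above can be sketched in a few lines of pandas; the column names and values here are purely illustrative:

```python
import pandas as pd

# Hypothetical dataset with missing values (names are illustrative)
df = pd.DataFrame({
    "age": [25, 30, None, 40, 35],
    "city": ["Delhi", None, "Mumbai", "Delhi", None],
})

# Numeric column: impute with the median (robust to skewed data)
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: introduce an explicit 'Undefined' category
df["city"] = df["city"].fillna("Undefined")

print(df)
```

Whether the median or an explicit category is appropriate still depends on why the values are missing, which is exactly the judgement call the text warns about.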
Any mistake in choosing an algorithm for treating null values can distort the final results, so splitting the process into individual steps helps reduce this risk. Another good approach is to introduce a combination of the strategy and factory design patterns while working with machine learning algorithms.
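A minimal sketch of what that combination of patterns might look like: each imputation technique is a strategy (an interchangeable function), and a factory picks one per column. All names here are illustrative, not a standard API:

```python
import pandas as pd

# Strategy 1: median imputation for numeric columns
def median_impute(s: pd.Series) -> pd.Series:
    return s.fillna(s.median())

# Strategy 2: an explicit 'Undefined' category for non-numeric columns
def category_impute(s: pd.Series) -> pd.Series:
    return s.fillna("Undefined")

def imputer_factory(series: pd.Series):
    """Factory: choose an imputation strategy based on the column's dtype."""
    if pd.api.types.is_numeric_dtype(series):
        return median_impute
    return category_impute

df = pd.DataFrame({"x": [1.0, None, 3.0], "label": ["a", None, "b"]})
for col in df.columns:
    df[col] = imputer_factory(df[col])(df[col])
```

Because each strategy is isolated, a mistake in one technique stays contained to that step and is easy to swap out.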
Choosing the right features in feature extraction is critical. When one chooses the right features, it ensures:
- Better results
- Flexibility to choose less complex models
- Flexibility to get away with less finely tuned model parameters
Feature extraction directly relates to model selection. No one wants to introduce bias into their models that results in overfitting, so any mistake in feature extraction will directly impact the accuracy of the machine learning algorithms and the overall model.
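As a sketch of choosing the right features, scikit-learn's univariate selection keeps only the columns most associated with the target; the iris dataset and `k=2` here are illustrative choices:

```python
# Univariate feature selection: keep the 2 features with the highest
# ANOVA F-score against the class labels.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)  # (150, 4) -> (150, 2)
```

With fewer, more informative features, a simpler model can often match the accuracy of a complex one, which is the flexibility described above.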
Keeping a record of all the assumptions you make will help in identifying the source of the problem. One can always go back and refer to these assumptions and see what is causing the mistake that has been encountered.
Essentially, there could be two types of sampling errors:
- Using too few samples, which introduces measurable bias into training and testing
- Selecting a non-representative sample from the data set, so the proportions of its characteristics are not preserved
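The second error can often be avoided with a stratified split, which preserves class proportions in both partitions; a sketch using scikit-learn:

```python
# A stratified split keeps each class's share of the data identical
# in the train and test partitions.
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 3 balanced classes, 50 samples each
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(Counter(y_test))  # each class contributes the same number of test samples
```

Without `stratify=y`, a small or unlucky split can over-represent one class and skew every accuracy estimate that follows.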
Machine Learning Error 3: Implementing machine learning algorithms without a strategy
It is said that you can lose yourself in an algorithm. Machine learning is all about algorithms, and each of them is a complex system in itself. Practitioners need to understand the problem statement first, create a strategy for solving it, and then pick the set of algorithms they feel will provide the best results.
Here’s what you can do:
- Swap machine learning algorithms and try them out on your problem
- Tune each one up to a limit, and move on when it does not seem to serve the desired purpose
- Learn more about each algorithm you use, but know when to stop
- Use a systematic approach, design tuning experiments, and automate their execution and analysis
- Stop fiddling with different algorithms and follow a systematic approach
- Focus on the goal and the result the project must deliver, and see what can help achieve the required predictions
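The advice to design tuning experiments and automate them can be sketched with scikit-learn's `GridSearchCV`; the model and parameter grid here are illustrative:

```python
# A systematic, automated tuning experiment instead of manual fiddling:
# every (C, kernel) combination is evaluated with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

The search records a score for every configuration (`grid.cv_results_`), so the analysis step is automated along with the execution, as the list above recommends.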
Machine Learning Error 4: Implementing everything from scratch
Without a doubt, there is a lot to learn when you build machine learning algorithms and models from scratch, but it is not always feasible and you need to know where to draw the line. There are scenarios where you must implement a technique yourself because no suitable algorithm is available. In all other cases, you can fall back on algorithms that are ready and available for your machine learning project.
An algorithm implemented from scratch could have bugs, run slowly, miss edge cases, hog memory, or, worst of all, simply be wrong. What can you use instead?
- A general-purpose library that handles all the edge cases
- Highly optimised libraries that occupy less memory
- A graphical user interface to avoid coding at all
Implementing everything from scratch is a slow and tedious process that can substantially reduce the efficiency and accuracy of a machine learning model, so avoid it unless you have a genuine reason.
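For contrast, here is how little code a well-tested library needs for a task that would take hundreds of lines to implement correctly from scratch; the dataset choice is illustrative:

```python
# A library model handles optimisation, numerical stability, and edge
# cases internally; we only specify the data and a couple of settings.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)  # solver chosen by the library
model.fit(X_train, y_train)
print(round(model.score(X_test, y_test), 3))
```

A hand-rolled gradient-descent version of the same model would need its own convergence checks, regularisation, and numerical safeguards, all of which the library already provides.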
Machine Learning Error 5: Ignoring outliers
The question is not whether one should ignore outliers. It is when outliers can be ignored and when they cannot. Outliers can be an important aspect of the data or can be ignored entirely, depending on the context of the problem at hand.
For example, if you are building a pollution forecast and encounter spikes caused by a sensor error, you can safely ignore them and remove those values from the data.
As for machine learning algorithms, some are more sensitive to outliers than others. AdaBoost assigns tremendous weight to misclassified outliers, whereas a decision tree simply counts an outlier as one misclassification.
Hence, depending on the context, if you decide that outliers are important and cannot be ignored, use algorithms/models that give them adequate importance. On the other hand, if you decide that outliers can be ignored, use algorithms/models that do not give them much weight.
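When the context says outliers can be dropped (as in the sensor-spike example above), a common sketch is interquartile-range filtering; the readings and the conventional 1.5×IQR threshold are illustrative:

```python
# Drop values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], a conventional
# rule of thumb for flagging outliers such as sensor spikes.
import numpy as np

readings = np.array([10.1, 9.8, 10.3, 10.0, 55.0, 9.9, 10.2])  # 55.0 is a spike
q1, q3 = np.percentile(readings, [25, 75])
iqr = q3 - q1
mask = (readings >= q1 - 1.5 * iqr) & (readings <= q3 + 1.5 * iqr)
cleaned = readings[mask]
print(cleaned)
```

If the outliers instead carry signal (fraud cases, rare failures), this filter would be exactly the wrong move, which is why the decision must come before the code.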
If you want to pursue a career in the field of Machine Learning, then upskill with Great Learning’s PG Program in Machine Learning.