Contributed by: Manaspreet Kaur
LinkedIn profile: https://www.linkedin.com/in/manaspreet-kaur-352007102/
Data science, as a skill, has existed for at least half a decade, if not more. It has steadily evolved to include concepts like Machine Learning, Artificial Intelligence, Internet of Things, and more. Today, data science is not a single technology but a wide collection of capabilities and techniques for training machines to assist humans in making better decisions.
It is not hard for any business to understand that it can only do better or innovate when it can predict demand. What is hard is actually doing it. Ironic, isn’t it? You could have an idea, or tons of ideas, about what could make your business successful in the next three or six months, but you don’t know what the world will look like by then. The current COVID-19 situation has taught us exactly that.
In the text that follows, we will look at:
- How data science is useful in the current scenario, when the world is fighting the COVID-19 pandemic (or, how we can convert the crisis into better opportunities)
- How data science will continue to be useful in other scenarios beyond the COVID-19 world
- What prospective data science professionals can do to gain a firm foothold in the industry
From predicting the spread of the virus, taking into account the impact of treatment and protective measures, to optimizing the use of health care resources, data science has found and will continue to find a lot of applications. You might have read about Dr. Fauci, the American physician, who among others has been predicting the spread of COVID-19 in the US and its impact. Before the lockdown was implemented and as the US was approaching its peak, the predicted number of deaths went up to 2 lakh (200,000) at one point. The prediction came down to 1.4 lakh (140,000) this week as safety measures such as social distancing were implemented. In other words, the live data being generated day to day was fed into data science models that were refreshed much faster than ever before.
How did the US make these predictions?
After the WHO made a statement that marked the shift from a ‘flu-like disease’ (as some of our world leaders called it) to the world’s biggest worry, one that three living generations had never seen, the US decided to use data science to fight this pandemic and improve its preparedness.
But the data scientists realized there was a major problem: the sample used for testing. Health officials in the US agreed that the sample needed to be diverse, consisting of both symptomatic and asymptomatic patients. Since the ratio between the two was unknown, they decided to rely on the mortality data published by other countries and work backward. ‘Reverse engineering’, if you will.
So, the US started to work backward. It began with the published number of deaths in other countries and, using Machine Learning, graphs, and NLP, extrapolated the data related to social distancing, population density, demographics, health facilities, etc. to its own population.
That is how, at one time, the total number of deaths in the US came to be estimated at 2 lakh (200,000). The assumed fatality ratio was 2.68%, which is consistent with published data from many countries. If social distancing proves more effective than assumed, these calculations will change. Ultimately, by relying on data science, the US was able to track its response effectively and follow a proactive approach based on facts and data. And it continues to do so.
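The "work backward" idea can be illustrated with simple arithmetic. The sketch below is only an illustration, not the model the US actually used: it assumes the two figures quoted above (an estimate of 2 lakh, i.e. 200,000, deaths and a fatality ratio of 2.68%), and the function name `implied_cases` is made up for this example.

```python
def implied_cases(deaths: float, fatality_ratio: float) -> float:
    """Back out the number of infections implied by a death count.

    The fatality ratio is deaths divided by cases, so cases = deaths / ratio.
    """
    if not 0 < fatality_ratio <= 1:
        raise ValueError("fatality_ratio must be in (0, 1]")
    return deaths / fatality_ratio


# Figures quoted in the text: 200,000 estimated deaths, 2.68% fatality ratio.
estimated_deaths = 200_000
fatality_ratio = 0.0268

cases = implied_cases(estimated_deaths, fatality_ratio)
print(f"Implied infections: about {cases / 1e6:.1f} million")
```

With these inputs the implied number of infections is roughly 7.5 million. Note how sensitive the result is to the ratio: a lower assumed fatality ratio implies more infections for the same number of deaths, which is why the estimates kept changing as new data came in.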
What this story tells us is that data science techniques can be used to understand, analyze, and predict how the virus spreads. It also shows us which measures have been the most effective in preventing or slowing down its further spread. In the short term, this ties in with what is known as flattening the curve, which in turn lessens the burden on healthcare systems. As said above, data science techniques are very diverse, but in the current pandemic a few of them are proving critically useful: Optimization, Machine Learning, Graph Data Management, and NLP (natural language processing). These techniques can work independently or together.
Optimization, in simple terms, means minimizing cost or maximizing the use of resources while taking into account the variables and constraints involved. The COVID-19 situation opens many doors for optimization techniques. They can help hospitals not only schedule or reschedule surgeries but also make the best use of the equipment available in places where demand is much higher than supply. Given a limited set of rooms and a limited number of healthcare professionals, optimization techniques can be used to determine how to minimize the delay before equipment and treatment reach the most critical patients.
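Here is a toy version of that last problem, as a minimal sketch. Everything in it is an assumption made for illustration: the three patients, their severity weights, the three clinicians, and the delays in minutes are invented. It simply tries every possible assignment, which only works for tiny cases; real schedulers would typically rely on a linear or integer programming solver.

```python
from itertools import permutations

# Hypothetical data: severity weight per patient (higher = more critical).
severity = {"patient_a": 5, "patient_b": 3, "patient_c": 1}

# Hypothetical delay in minutes before each clinician could reach each patient.
delay = {
    "clinician_1": {"patient_a": 10, "patient_b": 20, "patient_c": 30},
    "clinician_2": {"patient_a": 25, "patient_b": 10, "patient_c": 15},
    "clinician_3": {"patient_a": 30, "patient_b": 25, "patient_c": 10},
}


def best_assignment(severity, delay):
    """Try every one-to-one assignment and keep the cheapest one.

    Cost = sum of (severity x delay), so delays to critical patients
    are penalised more heavily. Assumes as many clinicians as patients.
    """
    clinicians = list(delay)
    patients = list(severity)
    best_cost, best_plan = None, None
    for order in permutations(patients):
        cost = sum(severity[p] * delay[c][p] for c, p in zip(clinicians, order))
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, dict(zip(clinicians, order))
    return best_plan, best_cost


plan, cost = best_assignment(severity, delay)
print(plan, cost)
```

For this made-up data the cheapest plan sends clinician_1 to patient_a, clinician_2 to patient_b, and clinician_3 to patient_c, with a weighted delay of 90. The objective (severity times delay) and the constraint (one clinician per patient) are exactly the "variables and constraints" that the definition of optimization talks about.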
Machine Learning is a plethora of algorithms that find patterns in data in order to generate predictions. Undoubtedly, the accuracy of ML algorithms depends on the availability of data. COVID-19 has generated a lot of data in a very short time that ML algorithms can use. Recently, a COVID-19 community map was developed that points to the most vulnerable areas, where potential patients could be, based on the outbreaks. It then maps these areas to factors such as access to transportation, socio-economic status, location of nearby hospitals, food sources, etc. to determine the extent of risk. However, we need to be cautious, as this data is as unclean as all the other data sources we have seen or worked with in the past. No government can guarantee 100% accurate data. You shouldn’t be surprised to hear future data scientists say: “Data science is 80% preparing data, 20% complaining about preparing data”.
Graphs are simply nodes and the links (or edges) between them, which describe how the nodes interact with each other. While a lot of researchers are trying to create a vaccine for COVID-19 soon, it is also imperative for them to understand how the vaccine might interact with existing drugs. This becomes more important because COVID-19 puts patients with existing ailments, such as respiratory diseases and diabetes, at a higher risk. The researchers will need to understand the interactions that these vaccines could have with the medications such patients already take. Graphs can be used to represent these interactions.
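To see what "nodes and links" look like in practice, here is a minimal sketch of an interaction graph stored as adjacency sets. The class name `InteractionGraph` and all node names are placeholders invented for this example; they are not real drug-interaction data.

```python
from collections import defaultdict


class InteractionGraph:
    """Undirected graph: nodes are treatments, edges are known interactions."""

    def __init__(self):
        self._neighbours = defaultdict(set)

    def add_interaction(self, a: str, b: str) -> None:
        # Store the edge in both directions because interaction is symmetric.
        self._neighbours[a].add(b)
        self._neighbours[b].add(a)

    def interacts_with(self, node: str) -> set:
        # Use .get so that looking up an unknown node does not create it.
        return set(self._neighbours.get(node, set()))


# Placeholder names only; these are not real interaction data.
graph = InteractionGraph()
graph.add_interaction("candidate_vaccine", "drug_x")
graph.add_interaction("candidate_vaccine", "drug_y")
graph.add_interaction("drug_x", "drug_z")

# Which of this patient's current medications are linked to the vaccine?
patient_medications = {"drug_y", "drug_z"}
flags = graph.interacts_with("candidate_vaccine") & patient_medications
print(flags)
```

Here only drug_y is flagged, because it is the only medication the patient takes that shares an edge with the candidate vaccine. Dedicated graph databases do the same thing at a much larger scale and can also follow longer chains of links.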
NLP uses techniques to analyze textual, audio, and video sources and extract the hidden meaning from heaps of fragments of natural language. In the current situation, NLP could be used to analyze texts from news and social media to detect possible outbreaks, and to gather more information about them, as quickly as possible.
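The simplest possible version of this idea is counting symptom words per location. The sketch below does just that with plain keyword matching, which is far cruder than a real NLP model; the symptom vocabulary, the city names, and the posts are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical symptom vocabulary and hypothetical posts tagged with a city.
SYMPTOM_TERMS = {"fever", "cough", "breathless"}

posts = [
    ("city_a", "High fever and a dry cough since Monday"),
    ("city_a", "Everyone in my office has a cough"),
    ("city_b", "Lovely weather for a walk today"),
    ("city_b", "Slight fever, staying home"),
]


def symptom_mentions(posts, terms):
    """Count symptom-term occurrences per location."""
    counts = Counter()
    for location, text in posts:
        # Lowercase the text and split it into alphabetic tokens.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts[location] += sum(token in terms for token in tokens)
    return counts


mentions = symptom_mentions(posts, SYMPTOM_TERMS)
print(mentions.most_common())
```

In this toy data city_a collects three symptom mentions and city_b one, so city_a would be the place to look at first. A real system would also need to handle negation ("no fever"), sarcasm, and multiple languages, which is where proper NLP techniques come in.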
The essence of all of this is simply that the applications of data science are beyond what this document can contain. From early detection and monitoring to epidemic forecasting, to targeted mitigations and hospital resources optimization, the list goes on. Data science offers very powerful technologies that can help us in the battle against this pandemic. However, the data scientists should not be overconfident about their abilities. They should make sure that the healthcare experts are as much involved as the technology is. At the same time, the data and analytics leaders must ensure that they respond to the situation responsibly, since it is changing with every passing day.
We will now shift gears to understand data and analytics in the world beyond COVID-19. Undoubtedly, the pandemic has left a lot of organizations in a difficult situation. At the same time, it has also taught us how we can leverage the power of data and analytics to address challenges effectively. With everything around us changing, a lot of models that were based on historical data might no longer be valid. This will call for reinforcement learning and create more demand for data analytics than before.
So, what exactly do you need to be a data scientist? The first essential is to know the basics of statistics and machine learning algorithms: algorithms that you have practiced on enough diverse data sets that you have them at hand for the next set of problems you come across. The second is to have a business understanding. Before you start to apply a model, think about what outcome the business wants to achieve. Does it simply want to predict something with very good accuracy (for which a neural network is often the answer), or does it want to understand the impact of each component in the business (for which regression is often the answer)? More than the outcome, do you understand the business as it functions? Do you believe your ML model will add value to the business?
Well, value add could mean a lot of things, but mostly it means lifting profit or revenue through your predictions and prescriptions. Always remember, the business will not care about the details of your model or why you weren’t able to achieve good accuracy; it will be more interested in whether you can prove that your model is valuable. (Well, if all the fancy-sounding names sound fascinating yet unknown to you, there is no better time to get familiar with them than now!)
Talking of statistical analysis and machine learning, let’s dive in a little to understand the minimum viable knowledge we need (because remember, we will never know everything!)
Do you need a Ph.D. in statistics? Certainly not. You must know how to implement cluster analysis or regression analysis, or test a hypothesis, using a language of your choice (data science isn’t about knowing many languages; it is about being able to use one to implement an algorithm. While expertise is for programmers, familiarity is enough for data scientists). At the same time, you must also know which method and which data will give you the information that your business needs.
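As an example of how little code a basic regression analysis needs, here is a minimal sketch of ordinary least squares with one input, written in plain Python. The numbers are made up (advertising spend versus sales in arbitrary units), and `fit_line` is a name chosen for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up numbers: advertising spend vs. sales, both in arbitrary units.
ad_spend = [1, 2, 3, 4]
sales = [3, 5, 7, 9]

slope, intercept = fit_line(ad_spend, sales)
print(slope, intercept)
```

For this data the slope is 2.0 and the intercept is 1.0. The slope is the part the business cares about: it says that, in this toy data, each extra unit of spend goes with two extra units of sales. That readable "impact of each component" is what regression offers and what a more opaque model usually does not.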
Once you are competent with statistics and a few ML algorithms, move on to data visualization. Your business leaders should be able to interpret what you have produced. The outcomes of a good model and a brilliant mind will be neither useful nor translated into business value unless everyone understands them. However, remember that data visualization is not always a Pareto chart or a heat map or something complex. Even a bar graph is a visualization. Put yourself in the shoes of your business and ask whether the visualization conveys the meaning you want it to and whether it provides actionable insights. Also, as with ML, you don’t need to know a lot of visualization tools. You don’t have to excel in programming, you don’t have to excel in tools; you have to Excel in your field!
While you are building Machine Learning models and trying to relate them to your business understanding, you might come across situations where the data says exactly the opposite of what the business leaders say. The ML algorithm might tell you that the x component in your business is not important and should no longer be used, but the business leaders might have had an experience in the past where the x component was very profitable. In such situations, what you will need is soft skills.
You will need to reach a point of negotiation where you can explain to them why you are making a certain point and, if they continue to disagree, explain the outcome it might have. The leaders understand the business better than the data scientists, and they have probably been in the industry longer than us.
Depending on the size of the data you are working with, you might also need to understand the impact of running a model on your laptop. If you are trying to run a random forest with 1,000 trees on a dataset that has 100,000 rows, does your laptop have enough power to not crash? To answer such questions, a basic knowledge of software engineering principles might be helpful.
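A back-of-the-envelope calculation is often enough to answer the laptop question. The sketch below estimates a worst-case memory footprint for the forest mentioned above. It rests on two assumptions that are ours, not measurements: every tree is grown until each leaf holds a single row, and each tree node takes about 64 bytes (the real figure depends on the library and the platform).

```python
def forest_memory_upper_bound_gb(n_trees, n_rows, bytes_per_node=64):
    """Rough upper bound on the memory held by fully grown trees.

    A binary tree with at most n_rows leaves has at most 2 * n_rows - 1 nodes.
    bytes_per_node is an assumed figure, not a measured one.
    """
    max_nodes_per_tree = 2 * n_rows - 1
    return n_trees * max_nodes_per_tree * bytes_per_node / 1e9


estimate = forest_memory_upper_bound_gb(n_trees=1000, n_rows=100_000)
print(f"Worst case: about {estimate:.1f} GB")
```

Under these assumptions the worst case comes to about 12.8 GB, which is more than many laptops can spare. In practice trees are usually smaller, and limiting the tree depth or the minimum leaf size shrinks the footprint considerably, but the exercise shows why it pays to estimate before you press run.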
You would have realized by now that data science isn’t just ML algorithms or statistics. It is equally about the business. While business understanding has always been crucial, it will be more so in the post-COVID-19 world. Every little decision that the business makes will be thought of in relation to COVID-19. But are these skills enough? They are good, but they are not enough. Even after you have mastered the skills (at least to some extent), you will face some challenges across all your machine learning assignments.
Firstly, the understanding of problem definition. Everyone who is somewhat familiar with data science would have heard of the need to understand the problem statement. It might sound like an easy task, but the fact is, it isn’t. “I need to improve my client retention” is not a problem statement; “I need to understand what impacts client retention and have measures to improve it” is one. The COVID-19 situation will make it even more challenging, as every problem statement will now start with “Given the COVID-19 situation, I need to...”. In short, be clear about the problem statement before you get to your systems and start developing code to solve it. Talk to the business, note down some key points, and define the questions you want to answer and the answers you are seeking before you start looking at the data.
Secondly, you need to realize that not everyone using your work is a data person. While the instinct of almost every data scientist is to build complex models, the likelihood of people understanding your work is inversely proportional to the complexity involved. Keep things as simple as you can, so that even non-data people find them intuitive. After all, there are 10 kinds of people in this world: those who understand binary and those who don’t.
Finally, testability. Any solution that you develop should be testable. While testing ML algorithms is easy, testing optimization processes isn’t. In the post-COVID-19 world, this will be more important than before. You will not only need to be sure that you can test your solution, but you will also need to ensure that your solution works in the post-COVID-19 world. While the important aspects of employee retention in a pre-COVID-19 world would have been promotions, growth, etc., the flexibility of working from home will be added to the list now. This is just an example. As mentioned above, COVID-19 has brought in the need to re-look at all the models we’ve built and to add an extra layer to the ones we will build in the future.
In the pandemic situation, it is also imperative to understand how jobs in the industry will be impacted. Let’s face it, the pandemic will hit us all, whether in terms of job cuts, salary cuts, or anything else. It is up to us how we come out of our situation. Whether data scientists will be impacted is more a question of organizational stability than of job profile. Someone in supply chain analytics will be more impacted than someone in telecom analytics.
If you are one of those whose current job got impacted by the pandemic, or whose offer got deferred, or whose interview process is delayed, try to look out for freelance roles. As every industry looks at optimizing costs, freelance roles will be in more demand than before. Companies will prefer to hire someone for 6 months to do a project rather than hire them full-time. Look out for such opportunities and grab them!
The last piece of advice is one from D.J. Patil (google him if you don’t know him yet): “The best thing data scientists have in common is unbelievable curiosity”. Data science is so vast and evolving so quickly that you will never be able to learn everything, so learn what you find interesting. Secondly, always carry an element of doubt about what the ML algorithms say. Remember, they can never replace humans. If you ask an ML model: “If all my friends jumped off a bridge, should I follow them?”, the answer would be yes!