Megan Bloemsma

the Future of Data Science is Software Engineering

The role of the data scientist will change in the next 5 years. The bulk of data science projects cover the same ground: forecasting, and solutions enriched with image or voice recognition. These projects are available out-of-the-box through prebuilt AI such as the Cognitive Services from Microsoft or Google’s AI Building Blocks, with many more appearing on other clouds and marketplaces.
The next step up from prebuilt AI is usually a custom code project written in Python or R. Teams build their own machine learning models, but even here technologies such as AutoML have simplified the process.

The need for full-fledged data science experts will diminish within a couple of years. After all, one can reinvent the wheel only so many times.

As a traditional data scientist myself, I find this a hard pill to swallow. With data scientists being expensive and difficult to find (although this shortage is decreasing each year – source), only a handful of companies have the budget to build a data science team… and with mixed results.

As discussed in my MLops article, ROI for data scientists is difficult to achieve. The main challenges are that models do not make it to production, cannot be standardized and (despite the promise of autonomous machines) require too much manual, expensive labour. From a non-technical perspective, most models are not applicable to the business: they prove hypotheses, but cannot be applied to the company’s product or solution. The models do not add value for the customers, or for the company internally.

One must ask oneself: what is the added value of a data scientist? If a data scientist or team is not delivering a return on investment… then why are they there?


the Citizen Data Scientist

At the same time, big tech companies have coined the term ‘citizen data scientist’: “a person who creates or generates models that use advanced diagnostic analytics or predictive and prescriptive capabilities, but whose primary job function is outside the field of statistics and analytics”. They make use of ‘data science as a service’ and are able to use out-of-the-box AI. They have no knowledge of the workings behind the prebuilt AI, but are able to use the tools to get quick results.

And for a lot of simple (and popular) use cases this works fine. I’m not in favour of letting the citizen data scientist build a machine learning model, but I don’t see an issue with having them apply a pretrained image recognition model to their solution (always taking ethics into account, of course).

When we look at the added value of a data science team, the citizen data scientist can play a role here. On top of that, they are usually cheaper and more knowledgeable about the business use cases… which might even make them more effective in some cases.

So what added value are we left with? Building custom code models that add value to the business and customers, with models that are maintained and monitored to assure quality.

Data scientist 2.0

Data science adds value to a company. But I’m not convinced that traditional data scientists do. Looking at the skill gap, I believe software engineering fills many of the holes.

Adding software engineering skills to the data scientist’s curriculum makes them responsible for more than just the model-making. It requires a shift in mindset and, more importantly, a shift in working: from an academic approach to a practical application of knowledge.

And many software engineering topics are immediately applicable in data science work: analytics, databases, file sharing and synchronization, visualization and lifecycle processes – just to name a few. With MLops the shift is already happening: it makes version control and testing, among other things, the data scientist’s responsibility.
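To make the testing responsibility concrete, here is a minimal sketch of the kind of test a data scientist might now own: a unit test that pins down the behaviour of a feature-engineering step, so a retrain can’t silently change the model’s inputs. The function and the currency format are invented examples, not from any particular project.

```python
# Hypothetical feature-engineering step: parse a raw European-style
# currency string like '€1.234,56' into a float the model can use.
def clean_amount(raw: str) -> float:
    value = raw.strip().lstrip("€$")          # drop currency symbol
    value = value.replace(".", "")            # drop thousands separator
    value = value.replace(",", ".")           # decimal comma -> decimal point
    return float(value)

# The test the data scientist checks into version control alongside the model.
def test_clean_amount():
    assert clean_amount("€1.234,56") == 1234.56
    assert clean_amount("  $99,00") == 99.0

test_clean_amount()
print("feature-engineering tests passed")
```

In practice such tests would live in a test suite (e.g. run by pytest in CI), which is exactly the software engineering habit the article argues data scientists should adopt.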

Focus

Now don’t get me wrong: any department should have focus. The data science team should not be responsible for the entire process, from shaping a use case together with customers to monitoring and patching the productionized model.

But the traditional data scientist and their team currently pick up too little of that responsibility.

And I’m saying this as a traditional data scientist.
Indeed, there is a lot of work involved in setting up hypotheses, cleaning and wrangling the data and creating a model. And we have created roles to deal with each part of this process – from data wrangler to machine learning engineer. But if all this work doesn’t lead to (commercial) value for the company, then we have a legitimacy problem.

I argue that data scientists should be taught software engineering skills, and that a mindset shift is needed. Instead of testing hypotheses and aiming for the highest accuracy, models should deliver business value – even if that means a model only reaches 60% accuracy in some cases.

The end goal changes from “a data scientist’s goal is to draw conclusions from data” to “a data scientist’s goal is to use data to improve experiences, or to add value to customers or the company itself, using models”.

It’s only with this change that we can truly add value to businesses.