In plain English: what does a data scientist do?
By Anshuman Gupta – VP Data Science, MiQ
A blog series explaining some of the concepts, processes and technologies we need to do our jobs – in plain English.
To people who don’t know what it is, data science can seem like magic.
A mystical process where latter-day sorcerers cast bewitching spells that transform croaking, slimy datasets into dashing, handsome insights.
But, while data science is complex and involves a whole bunch of specialist skills and smart technologies, its reputation as this unknowable, cryptic thing isn’t really deserved.
So, let’s bust some myths and find out what data science is – in plain English.
Understanding your world with data
One of the biggest misconceptions about data science is that it’s a tool or a process you can just ‘apply’ or ‘use’ to find a solution to a problem.
But the clue’s right there in the name – it’s a science. It’s an entire discipline, and that’s important to understand.
Just as a traditional scientist applies the scientific method to do things like build rockets and make new medicines, a data scientist uses the data-scientific method to interpret the world in specific ways.
So, if you ask ‘what does a data scientist do?’, it’s a bit like asking ‘what does a scientist do?’ In other words, it very much depends.
Every data scientist will have a bunch of hard skills in mathematics, statistics and computer science. But to do their job, they also need a deep understanding of the principles and concepts of their particular industry. In our case, that’s marketing and advertising, but it might also be finance or medicine or aeronautics or any other industry where numbers matter (which is all of them).
The job of a data scientist is to apply everything they know about the former (their data science skills) to the latter (what they know about their industry), so everyone else in their business can do their jobs in a more data-driven way.
Beyond gut feelings
To give an example from our field, let’s look at demographic segmentation. That’s a marketing concept, based on the understanding that not everyone wants all the same stuff. Your marketing team decides that your key audience is 35-to-50-year-old women, so they make sure all your marketing speaks to people like that.
That’s fine as far as it goes, but bringing a data scientist to the party means you can do two things. First, it means you can prove (or disprove, as the case may be) your hypothesis, so you’re not basing all your marketing on gut feelings.
Second, it means you can be way more granular. There’s likely to be much more that makes your customers your customers than simply their gender and their very approximate age. Maybe where they live, or the other places they shop, or the kind of TV shows they watch are just as important, or maybe even more important, in defining your target audience.
And that’s where a data scientist can step in and help you understand your customers (and in general the world around you) using data.
From the theory to the practice
Okay, but when a data scientist gets into work every day, what do they sit down and do?
Well, it involves crisps. Or more specifically CRISP-DM, the snappy acronym for the Cross-Industry Standard Process for Data Mining.
If you’re a non-data-science person, this is where you might find your mind slipping away because it gets all horrible and technical. But stick with me, it’s not so bad.
CRISP-DM is basically the data science equivalent of the scientific method.
In the scientific method, you start with a hypothesis, devise tests to check that hypothesis, run those tests, then look at the results to see if they prove or disprove it.
It’s the same in data science, but with a couple more steps.
Step one – Understanding the business problem
The first stage is determining what you want to find out and why knowing it will be useful. For us, that might be finding out a brand’s customer lifetime value or working out the difference between customers who complete an online purchase and almost-customers who drop out at the ‘basket’ stage.
Step two – Understanding the data
Next, you work out the data you’ll need to do that. And just like with traditional science, you’re always building on the work other people have done, so you have to look at what else has already been done in the area.
Step three – Preparing the data
Once you’ve worked out the datasets you need, you have to get them ready for analysis. This means cleaning the data (making sure you remove incomplete fields etc) and removing outliers that could potentially distort your analysis. For instance, if 99% of your customers are in the US, you might want to remove the remaining 1% if you want insights that apply to the majority of your audience.
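To make that a bit more concrete, here is a minimal sketch of those two preparation steps in Python. Everything in it is made up for illustration: the customer records, the field names and the two helper functions (drop_incomplete and keep_majority) are assumptions, not part of any real toolkit, and a toy list of six customers obviously won’t have a 99/1 split. The idea is the same though: throw out incomplete records first, then keep only the customers from your majority market.

```python
from collections import Counter

# Hypothetical customer records; None marks a missing field.
customers = [
    {"id": 1, "country": "US", "spend": 120.0},
    {"id": 2, "country": "US", "spend": None},   # incomplete: no spend recorded
    {"id": 3, "country": "US", "spend": 80.0},
    {"id": 4, "country": "FR", "spend": 95.0},   # outside the majority market
    {"id": 5, "country": "US", "spend": 60.0},
    {"id": 6, "country": None, "spend": 40.0},   # incomplete: no country recorded
]

def drop_incomplete(rows):
    """Keep only the rows where every field has a value."""
    return [r for r in rows if all(v is not None for v in r.values())]

def keep_majority(rows, field):
    """Keep only the rows whose value for `field` is the most common one."""
    counts = Counter(r[field] for r in rows)
    top_value, _ = counts.most_common(1)[0]
    return [r for r in rows if r[field] == top_value]

complete = drop_incomplete(customers)          # cleaning step
prepared = keep_majority(complete, "country")  # remove the non-majority customers
print([r["id"] for r in prepared])
```

Whether dropping the minority is the right call depends on the business problem from step one. If the question were about international growth, those are exactly the customers you would want to keep.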
Step four – Model Building
Model building is the heart of the process, where you apply a statistical algorithm to your prepared dataset to find insights.
But what does that mean? Metaphor incoming.
Think of your data like a load of eggs, flour and sugar. An algorithm is basically just a set of instructions. So, in this case a recipe. You apply the recipe to your ingredients, and, hey presto, you get your cake/insights.
And just like a recipe, model building needs refinement. You start out with a pre-built algorithm – your first attempt – then tweak it and tune it to get better and better results each time you bake your cake/model your data.
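If you’d like to see the recipe idea in code, here is a deliberately tiny sketch in Python. The data and the model are assumptions made up for this example: the ‘recipe’ is a single rule (predict a purchase if a visit lasted at least a certain number of minutes), and ‘tuning’ just means trying different values for that number and keeping the one that gets the most predictions right. Real models have far more knobs, but the try-measure-tweak loop looks much the same.

```python
# Hypothetical data: (minutes spent on site, did the visitor buy?)
visits = [
    (1, False), (2, False), (3, False), (4, True),
    (5, True), (6, False), (7, True), (8, True),
]

def predict(minutes, threshold):
    """The 'recipe': predict a purchase if the visit lasted at least `threshold` minutes."""
    return minutes >= threshold

def accuracy(data, threshold):
    """Share of visits where the prediction matches what actually happened."""
    correct = sum(predict(minutes, threshold) == bought for minutes, bought in data)
    return correct / len(data)

# First attempt: a threshold of 1 minute. Then tweak it and measure each time.
results = {t: accuracy(visits, t) for t in range(1, 9)}
best_threshold = max(results, key=results.get)

print(f"first attempt: {results[1]:.3f}")
print(f"best threshold: {best_threshold} minutes ({results[best_threshold]:.3f})")
```

With this made-up data, the first attempt gets half the visits right, and a threshold of 4 minutes does best. Note that this sketch measures accuracy on the same data it was tuned on, which flatters the model; checking it against data it has never seen is a separate step.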
Step five – Evaluating the insights
To check your insights are meaningful, you need to validate them and circle back to the original problem statement you started with.
Once again, it’s useful to refer to traditional science here. Just as scientists running an experiment keep a ‘control’ to compare their results against, data scientists hold back part of their data as a check. In a data science experiment, you’ll often make around 80% of the data ‘training data’ (the stuff you build your models on) and leave the other 20% as ‘testing data’, which the model never sees while it’s being built.
At the end of the process, you can test your insights against the testing data to see if they’re valid.
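Here is a rough sketch of that 80/20 idea in Python, using only the standard library. The dataset, the helper names (split, fit_line, mean_abs_error) and the choice of a simple straight-line model are all assumptions picked for illustration. The point to notice is that the model is built on the training data only, and then judged on the testing data.

```python
import random

# Hypothetical dataset: (ad impressions, purchases) for 20 campaigns.
# The small wiggle (x % 3 - 1) stands in for real-world noise.
data = [(x, 3 * x + 5 + (x % 3 - 1)) for x in range(1, 21)]

def split(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows, then cut it into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_line(rows):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return mean_y - slope * mean_x, slope

def mean_abs_error(rows, intercept, slope):
    """Average size of the gap between predicted and actual values."""
    return sum(abs(y - (intercept + slope * x)) for x, y in rows) / len(rows)

train, test = split(data)
intercept, slope = fit_line(train)                   # build the model on training data only
test_error = mean_abs_error(test, intercept, slope)  # check it on data it has never seen
print(f"train={len(train)} test={len(test)} test_error={test_error:.2f}")
```

If the error on the testing data is much worse than on the training data, that is a sign the model has memorised its ingredients rather than learned the recipe.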
Step six – Put it into practice
If the business problem has been adequately addressed, it’s time to apply the insights in the real world. In our industry, that means passing customer and business insights on to our traders, who can then run better-performing marketing campaigns for our clients. On the other hand, if the accuracy of the model is not as expected, then it’s back to understanding the data and, in some cases, even back to revisiting the business problem. No one said it would be easy, right?
That’s my quick introduction to the world of data science. Of course, there’s loads of stuff we didn’t get into here, especially around the technology we use. The process of analysing massive data sets is where scary terms like artificial intelligence and machine learning start flying around. And that’s without getting into the unstructured world of natural language processing and neural networks.
But really, they’re not all that scary either. Look out for Plain English guides to them coming soon…