Statisticians in the Data Science Era
If you are exploring the hot field of Data Science today, you probably know by now that there is no unique way to define data science. Everyone’s perspective comes from their respective backgrounds. Statisticians think they are data scientists while CS folks think they are.
I will not go into that debate.
Both have truth in their arguments as the typical problems they solve are quite different. There are some exceptions where their workflows overlap. Let me provide some thoughts from my perspective.
I am a statistician by training and I have been in the analytics world for over 15 years. This includes time spent at universities teaching graduate-level statistics courses as well as working in the healthcare industry as a data scientist.
Three types of Statisticians
Like many scientific disciplines, statisticians contribute to the science in two different ways.
First, some contribute theory to the scientific literature to advance knowledge. In this type of work, statisticians focus on developing new methods to solve new problems or on improving existing methods. In statistical science, a new theoretical development can take several years, sometimes decades, to gain popularity among mainstream practitioners. Part of the reason is whether the method is available in the software packages practitioners are most familiar with, though there are other reasons too.
Second, there are applied statisticians and applied researchers, including epidemiologists, who focus on solving problems that need an immediate solution. Most of the time they use already developed statistical methods in their workflow. This is where most of the overlap between applied statisticians and other practitioners of statistics happens.
However, there is no strict boundary between applied and theoretical statisticians. Often, applied statisticians develop new methods to solve an existing problem. Thus, the third kind of statistician is one who falls somewhere in between the theoretical and the applied.
Towards Data Science
There are, however, some philosophical differences between theoretical and applied statisticians. Theoretical researchers often think they are superior to their counterparts, while the applied folks think they are the ones whose contributions have a real impact. Both are wrong, in my honest opinion.
Compared to their theoretical peers, applied statisticians find their views better aligned with the theme of today’s data science. This is perhaps because of the way they were trained during their academic years.
Applied statisticians are trained to use data to draw insights from it. They are trained not only to fit a model and know how it works but also to communicate with their customers in non-technical terms. Intuition takes priority over theoretical derivations. As Jo Hardin points out [1],
.. when teaching, for example, the Neyman-Pearson lemma, the intuition behind how we know what we know (and why it matters) is vastly more fundamental for the students’ future research capabilities than the detailed steps of the proof.
I think it is vital for a student of statistics to learn and think this way to become successful in today’s job market.
Data Scientists vs. Data Engineers
The differences among the many types of data scientists are often confusing, even to statisticians. Sometimes data scientists and data engineers are treated as the same role. In fact, the type of work they do is quite different.
Let me clarify one thing: data scientists do not necessarily design the databases. Data engineers do that. Data scientists analyze the data to drive the business, whereas data engineers develop and maintain the data architecture. Data scientists and statisticians do not build or maintain databases. Most large organizations have a separate team of engineers who develop and maintain the data-science platforms (DBs, Hadoop, etc.).
Do statisticians need to know a bit about databases? Yes, but a basic understanding is good enough. Most of the time all you will do is pull some tables from different sources and join them to create a working data set. For that, you need to understand the basics; you do not have to be a data architect. If you are working in a startup, though, you may be required to understand databases in greater depth, as startups have less manpower.
How big a deal is knowing how to join tables efficiently? Not so big. Anyone with a decent knowledge of data and SQL can learn to do it with some reading and practice. But you have to have the mindset of learning in the first place.
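To give a sense of what "pull some tables and join them" can look like in practice, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The patients and visits tables, their columns, and the build_working_dataset function are hypothetical names made up for illustration; a real project would point at whatever sources the data engineers maintain.

```python
import sqlite3


def build_working_dataset():
    """Join two small, made-up tables into one analysis-ready result."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database

    # Hypothetical source tables (illustrative data only).
    conn.executescript(
        """
        CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, age INTEGER);
        CREATE TABLE visits (
            visit_id INTEGER PRIMARY KEY,
            patient_id INTEGER,
            cost REAL
        );
        INSERT INTO patients VALUES (1, 34), (2, 58), (3, 71);
        INSERT INTO visits VALUES (10, 1, 120.0), (11, 1, 80.0), (12, 2, 300.0);
        """
    )

    # LEFT JOIN keeps patients with no visits; COALESCE turns their
    # missing total into 0.0 instead of NULL.
    rows = conn.execute(
        """
        SELECT p.patient_id,
               p.age,
               COUNT(v.visit_id)          AS n_visits,
               COALESCE(SUM(v.cost), 0.0) AS total_cost
        FROM patients AS p
        LEFT JOIN visits AS v ON v.patient_id = p.patient_id
        GROUP BY p.patient_id, p.age
        ORDER BY p.patient_id
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in build_working_dataset():
        print(row)
```

The one design choice worth noticing is the LEFT JOIN: an inner join would silently drop the patient with no visits, which is the kind of detail a statistician should care about more than the database internals.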
Use of Statistics
In my work, I find statistics invaluable, although most of the time we do not use many advanced techniques. This is true for most organizations that use data for decision-making. High-level analysis such as modeling and machine learning sits at the very top of the application pyramid. You still need to lay the foundation using basic analysis and visualizations.
What the media portrays as data science today (such as deep learning and AI) is perhaps 1% of all the analytics an organization needs. Many large organizations do not hire people for that. They purchase a solution instead, as that is more cost-effective. For the many problems where deep learning is not applicable, they need analytics people with decent statistical literacy.
Data science as it is understood today solves problems that are, for the most part, quite different from the problems that need a statistician’s help. Having been on both sides of the aisle, I can see how they complement each other and how both are relevant.
Let me know what your thoughts are. And please share this article if you find it useful.
[1] Jo Hardin (2017). Expectations and Skills for Undergraduate Students Doing Research in Statistics and Data Science, AMSTAT Newsletter.