Data Scientists Without Data Engineering Skills Will Face the Harsh Truth
Although the role of the data scientist is still evolving, data remains at its core. Setting the right expectations for what you will do as a data scientist is important, and knowing the tools of data engineering will get you ready for the real world.
Photo by Ben White on Unsplash.
You have probably read an article about the difference between a data scientist and a data engineer. I always thought the distinction was clear. Data engineers make the data ready for use, and then data scientists work on that data.
However, my opinion on this distinction has changed dramatically after I started working as a data scientist.
Everything in data science starts with data. Your machine learning model is only as good as the data fed into it. Garbage in, garbage out! A data scientist cannot work magic to create a valuable product without proper data.
The proper data is not always readily available for data scientists. In most cases, it will be the responsibility of the data scientist to convert the raw data to a proper format.
Unless you work for a big tech company that has separate teams of data engineers and data scientists, you should possess the ability and skills to handle some data engineering tasks. These tasks cover a broad range of operations, and I will elaborate on this in the remaining part of the article.
What is the difference anyway?
I would like to state my opinion on the relationship between the job of a data engineer and a data scientist.
A data engineer is a data engineer. A data scientist should be both a data scientist and a data engineer.
It may seem like a debatable statement. However, I would like to emphasize that my opinion was different before I started working as a data scientist. I used to think of data engineers and data scientists as separate entities.
In the remaining part of the article, I will try to explain what I mean when I say that a data scientist should be both a data scientist and a data engineer.
For instance, data engineers perform a set of operations known as ETL (extract, transform, load). It covers the procedures for collecting data from one or more sources, applying some transformations, and then loading the result into a destination such as a database.
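To make the three ETL steps concrete, here is a minimal sketch using only the Python standard library. The CSV content, the column names, and the `customers` table are hypothetical and exist only for illustration; a real pipeline would read from files, APIs, or databases and would need far more validation.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; in practice this would come from a file or an API.
RAW_CSV = """customer_id,signup_date,country
1,2021-01-05,us
2,2021-02-11,DE
3,,us
"""


def extract(text):
    """Extract: parse raw CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows without a signup date and upper-case country codes."""
    return [
        (int(r["customer_id"]), r["signup_date"], r["country"].upper())
        for r in rows
        if r["signup_date"]
    ]


def load(records, conn):
    """Load: write the cleaned records into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customers "
        "(customer_id INTEGER PRIMARY KEY, signup_date TEXT, country TEXT)"
    )
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", records)
    conn.commit()


# An in-memory SQLite database stands in for the destination system.
conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT * FROM customers ORDER BY customer_id").fetchall()
print(rows)
```

Keeping each step in its own function makes it easier to test the transformation logic separately from the source and the destination.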
I would definitely not be surprised if a data scientist is expected to perform ETL operations. Data science is still evolving, and most companies do not have clearly separated data engineer and data scientist roles. As a result, a data scientist should be able to perform some data engineering tasks.
If you expect to only work on running machine learning algorithms with ready-to-use data, you will face the harsh truth soon after you start working as a data scientist.
You may have to write some stored procedures in SQL to preprocess the client data. It is also possible that you receive the client data from a few different sources. It will be your job to extract and combine the data, and then load it into a single destination. In order to write efficient stored procedures, you need extensive SQL skills.
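As a rough illustration of combining client data from two sources with SQL, the sketch below uses SQLite through Python's built-in `sqlite3` module. SQLite does not support stored procedures, so a plain script stands in for one here; the table names, columns, and sample rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Two hypothetical source tables holding overlapping client records.
    CREATE TABLE crm_clients (email TEXT, city TEXT);
    CREATE TABLE billing_clients (email TEXT, city TEXT);
    INSERT INTO crm_clients VALUES
        ('Ann@Example.com ', 'Oslo'),
        ('bob@example.com', 'Lima');
    INSERT INTO billing_clients VALUES
        ('ann@example.com', 'Oslo'),
        ('cy@example.com', 'Rome');

    -- Normalise the emails, then combine both sources into one table.
    -- UNION (unlike UNION ALL) drops rows that become exact duplicates.
    CREATE TABLE clients AS
    SELECT LOWER(TRIM(email)) AS email, city FROM crm_clients
    UNION
    SELECT LOWER(TRIM(email)) AS email, city FROM billing_clients;
    """
)

clients = conn.execute("SELECT email, city FROM clients ORDER BY email").fetchall()
print(clients)
```

In a database that does support stored procedures, the same statements could be wrapped in one so the preprocessing can be rerun whenever new client data arrives.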
The transforming part of ETL procedures involves many data cleaning and manipulation steps. SQL may not be the best choice if you work with large-scale data. Distributed computing is a better alternative in such cases. Therefore, a data scientist should also be familiar with distributed computing.
Your best friend in distributed computing might be Spark. It is an analytics engine used for large-scale data processing. We can distribute both data and computations over clusters to achieve a substantial performance increase.
If you are familiar with Python and SQL, you won’t have a hard time getting used to Spark. You can use Spark features with PySpark, which is a Python API for Spark.
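To give a feel for how close PySpark is to SQL, here is a small sketch of a group-by aggregation. It assumes `pyspark` and a Java runtime are installed; the function name, column names, and sample rows are hypothetical.

```python
def total_per_customer(spark, rows):
    """Sum the amount per customer, like SELECT ... GROUP BY in SQL.

    `spark` is an existing SparkSession; `rows` is a list of
    (customer_id, amount) tuples.
    """
    # Imported lazily so this sketch can be loaded without Spark installed.
    from pyspark.sql import functions as F

    df = spark.createDataFrame(rows, schema=["customer_id", "amount"])
    return (
        df.groupBy("customer_id")
        .agg(F.sum("amount").alias("total_amount"))
        .orderBy("customer_id")
    )


# Usage, assuming pyspark and a Java runtime are available:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()
#   total_per_customer(spark, [(1, 10.0), (1, 5.0), (2, 7.5)]).show()
```

The `local[*]` master runs Spark on a single machine, which is handy for practice. On a real cluster the same function would work unchanged, with only the session configuration being different.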
When it comes to working with clusters, the optimal environment is the cloud. There are various cloud providers, but AWS, Azure, and Google Cloud Platform (GCP) lead the way.
Although the PySpark code is the same for all cloud providers, how you set up the environment and create clusters changes between them. They allow you to create clusters using either scripts or the user interface.
Distributed computing over clusters is a whole different world. It is nothing like doing analysis on your computer. It has very different dynamics. Evaluating cluster performance and choosing the optimal number of workers for a cluster will be your predominant concerns.
Conclusion
Long story short, data processing will be a substantial part of your job as a data scientist. By substantial, I mean more than 80% of your time. Data processing is not just cleaning and manipulating the data. It also involves ETL operations, which are often thought to be the job of a data engineer.
I strongly recommend getting familiar with ETL tools and concepts. It would be of great help if you had a chance to practice them.
It would be a naive assumption to think you will only work on machine learning algorithms as a data scientist. It is an important task too, but it will only consume a small part of your time.
Original. Reposted with permission.