Software Engineering Best Practices for Data Scientists
This is a crash course on how to bridge the gap between data science and software engineering.
By Madison Hunter, Geoscience BSc Undergrad Student
I didn't really understand why people would complain about the code that data scientists produce until I began writing data analysis code myself.
Coming from an education in software development, I felt that I had a good grasp of coding best practices and I was confident in my ability to write clean code. Then it came time for me to write my first data analysis.
You guessed it.
An absence of functions, unclear variable names, spaghetti code, not a single hint of a unit test, and a severe lack of style ensued, leaving me with the equivalent of a dog’s breakfast to try to reason out after I had gotten my code to work.
At last, I understood what all of those engineers were complaining about.
Data science isn’t a field that naturally stems from computer science, and it’s reflected in the wide variety of backgrounds that data scientists hold. Many data scientists don’t even have a degree in computer science, as they often come from other unrelated fields including mathematics, the sciences, engineering, business, medicine, and more.
Therefore, it’s no wonder that data scientists aren’t known for always having the cleanest code.
I’m not saying that data scientists need to be able to write entire complex libraries. Instead, data scientists should be able to produce clean code that can be updated, debugged, and moved into a production environment with few swear words coming from the engineering department.
Writing clean code isn’t just for the sake of others, either. In many companies, the budget isn’t big enough to hire both data scientists and software engineers, which means that data scientists are often responsible for creating production-ready code. Your own sanity is therefore reason enough to develop and maintain clean, easy-to-use, and reusable code.
Many will argue that good code is subjective. However, there are generally four standards that everyone can agree on that define code written using best practices:
- Good code is efficient: you get as much speed and efficiency out of your Python code as you reasonably can, without chasing optimizations that bring no measurable benefit.
- Good code is maintainable: you and others can easily understand, update, and extend it.
- Good code is readable and well-structured: anyone should be able to look at your code and understand what you’re trying to accomplish without having to try too hard.
- Good code is reliable: it behaves as expected and isn’t prone to bugs or random glitches.
Check out these best practices that you can implement to attain the above four standards.
Use descriptive variable names.
I’ve often seen data analysis code written using variable names such as x, or y, or the English wording for any number of mathematical variables.
While this works when writing out mathematical formulas, it doesn’t come across as clearly when you’re reading it in someone’s code.
When it comes to data science, I like to say exactly what my variables are. For example:
null_hypothesis
standard_deviation_paired_differences_population
degrees_of_freedom
observed_frequency
test_statistic_chi_squared
Anyone who has taken a course in statistics will be able to understand what my variables are, and what they are doing in my equations. These basic variables can then be made more descriptive when it comes to specific calculations or uses.
The trick is to try to be as descriptive as the situation allows. Of course, there will be instances when the computations are so complex that it wouldn’t make sense to write such descriptive variables. However, until that time comes, try to be descriptive.
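To make the difference concrete, here is a small sketch of the same paired-difference calculation written twice, once with cryptic names and once with descriptive ones. The measurements are made-up numbers for illustration only, and the variable names are just one reasonable choice, not a standard.

```python
import math
import statistics

# Hypothetical paired measurements (made-up numbers for illustration only).
before = [72.0, 68.5, 75.2, 70.1, 69.8]
after = [70.5, 67.0, 74.8, 68.9, 69.0]

# Cryptic version: correct, but the reader has to decode every name.
d = [a - b for a, b in zip(after, before)]
t = statistics.mean(d) / (statistics.stdev(d) / math.sqrt(len(d)))

# Descriptive version: the same calculation, readable without a legend.
paired_differences = [
    after_value - before_value
    for before_value, after_value in zip(before, after)
]
sample_size = len(paired_differences)
mean_paired_difference = statistics.mean(paired_differences)
standard_deviation_paired_differences = statistics.stdev(paired_differences)
standard_error = standard_deviation_paired_differences / math.sqrt(sample_size)
test_statistic_t = mean_paired_difference / standard_error
degrees_of_freedom = sample_size - 1
```

Both versions compute the same number. The second one simply tells you what each intermediate value is.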
Use functions.
Functions, regardless of your programming methodology (object-oriented, functional, etc.), are vital to keeping your code clean, concise, readable, and DRY (which I’ll talk about later on).
Functions aren’t always intuitive for data scientists without a computer science background because code will run correctly without them.
However, the best programmers are said to be the laziest programmers. Why? Because they write the least amount of code and often write the cleanest code to produce a solution. They work as little as possible to get their code to work and they often produce the most concise solutions. This involves using functions.
Here are some tips on how to write great functions that keep your code clean:
- The function should only do one thing. Not 10. Not 20. Just one thing.
- Functions should be small. Some argue that functions should contain no more than 20 lines of code, but this is an arbitrary number. Try your best to keep your functions short and to the point.
- Functions should be written so you can read them and understand their logic in a top-to-bottom fashion — just like how you read a book.
- Try to keep the number of function arguments to a minimum (fewer than three is preferred). If you need three or more function arguments, ask yourself whether the function is really completing only one task.
- Make sure the logic found within functions is properly indented and that the code for the entire function is properly blocked. This will help others see what code is inside the function, and where the function starts and stops.
- Use descriptive names.
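As a rough sketch of the tips above, here are three small functions that each do one thing, take a single argument, and read from top to bottom. The function names and the sample data are hypothetical, chosen only for this example.

```python
import statistics


def remove_missing_values(values):
    """Return a new list without the None entries."""
    return [value for value in values if value is not None]


def standardize(values):
    """Rescale values to mean 0 and sample standard deviation 1."""
    mean = statistics.mean(values)
    standard_deviation = statistics.stdev(values)
    return [(value - mean) / standard_deviation for value in values]


def prepare_measurements(raw_measurements):
    """Clean the data, then standardize it."""
    cleaned_measurements = remove_missing_values(raw_measurements)
    return standardize(cleaned_measurements)


# Hypothetical input with gaps, for illustration only.
raw_measurements = [4.0, None, 6.0, 8.0, None]
prepared_measurements = prepare_measurements(raw_measurements)
```

Because each step lives in its own function, you can test or replace the cleaning step without touching the standardization step.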
Use comments and write proper supporting documentation.
It can be easy to overlook this step when you’re in the middle of building life-altering models, but the fact remains that if no one else can use them, then how life-altering are they really?
My personal rule of thumb for comments:
- Write a comment at the top of the code file, giving a brief description of the goal of the code.
- Write a comment at the top of each function describing its inputs, outputs, and what logic it's performing.
- Write a comment at the top of any logic that I don’t totally understand so I can better organize my thoughts. This helps during debugging and code refactoring.
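Here is one way those three kinds of comments might look in a Python file. The rainfall scenario, the function name, and the convention that negative readings mark sensor failures are all assumptions invented for this sketch.

```python
"""Estimate the average daily rainfall from a list of gauge readings.

(File-level comment: one or two sentences on the goal of the code.)
"""


def average_daily_rainfall(readings_mm):
    """Return the mean of the valid rainfall readings.

    Inputs:
        readings_mm: list of daily readings in millimetres; a negative
            value marks a failed sensor reading (an assumption made
            for this example).
    Output:
        The mean of the non-negative readings as a float, or None if
        no valid reading exists.
    """
    # Negative values are sensor error codes, not rainfall, so drop
    # them before averaging instead of letting them drag the mean down.
    valid_readings = [reading for reading in readings_mm if reading >= 0]
    if not valid_readings:
        return None
    return sum(valid_readings) / len(valid_readings)
```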
When it comes to writing proper supporting documentation, your documentation doesn’t have to be any more elaborate than a README file. At its most basic, good documentation should include:
- The goal of the code.
- Instructions on how to install and use the code.
- Long-form explanations of the tricky parts of your code, describing what the code does line by line and perhaps even why you elected to write it that way.
- Screenshots to help with debugging and troubleshooting.
- Helpful links to external supporting documentation that can further describe and explain parts of your logic or code.
Use a consistent coding style and become familiar with the syntax conventions for the languages you use.
I’m guilty of not always following the proper conventions for a given language.
After primarily learning to code in C# when I was in university, I became familiar and comfortable with its conventions and I then went on to very lazily apply those same conventions to every other language I came across.
Don’t be like me. Instead, take the time to learn the syntax conventions of each new language, and force yourself to use those conventions properly. This will not only help you immerse yourself in the language, but it will also help you write cleaner code, and will help you communicate with other developers using the same language.
Here’s an example of code that performs the same task, written in two different languages using the appropriate conventions for each:
C#:
// This is a comment in C#. I am going to write some code that prints out a value if the inputted number is correct.
int year = 0;
Console.Write("\nEnter the year that C# became a language: ");
year = Convert.ToInt32(Console.ReadLine());
if (year == 2000)
{
    Console.Write("That is correct!");
}
else
{
    Console.Write("This is incorrect.");
}
Python:
# This is a comment in Python. This code is going to do the same job that the code above just did.
year = int(input("Enter the year that Python became a language: "))
if year == 1989:
    print("That is correct!")
else:
    print("That is incorrect.")
This silly code doesn’t really do much of anything, but it kind of gives you an idea about how different the conventions are for each language.
Use libraries.
Using pre-existing libraries is a huge time saver, especially when it comes to coding data analyses. Python is chock full of libraries that can handle almost any request a data scientist could throw at it. Not only are these libraries already coded for you, but the popular ones have also been extensively tested and are widely used in production.
Check out this article that outlines the top data science libraries and their abilities. I’ve summarized it here, focusing on the seven libraries I’ve found most useful:
NumPy
- Basic array operations: add, multiply, slice, flatten, reshape, index.
- Advanced array operations: stack arrays, split into sections.
- Linear algebra.
Pandas
- Indexing, manipulating, renaming, sorting, and merging data frames.
- Basic CRUD (create, read, update, delete) operations on the columns of a data frame.
- Impute missing values and handle missing data.
- Create histograms or box plots.
Matplotlib
- Most common data visualizations, including line plots, scatter plots, bar charts, histograms, pie charts, stem plots, and spectrograms, to name a few.
TensorFlow
- Voice and sound recognition.
- Sentiment analysis.
- Face recognition.
- Time series.
- Video detection.
Seaborn
- Determine correlations between variables.
- Analyze uni-variate or bi-variate distributions.
- Plot linear regression models for dependent variables.
SciPy
- Common scientific computing, including linear algebra, interpolation, statistics, calculus, and ordinary differential equations.
Scikit-Learn
- Supervised and unsupervised learning algorithms for classification, clustering, regression, dimensionality reduction, model selection, and pre-processing.
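To show the time saving in miniature, the sketch below computes a sample standard deviation by hand and then with a library call. So that it runs without installing anything, it uses Python's built-in statistics module as a stand-in for the libraries listed above; the sample values are made up.

```python
import math
import statistics

# Hypothetical sample, for illustration only.
observed_values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Hand-rolled sample standard deviation: more code to write, test, and maintain.
mean = sum(observed_values) / len(observed_values)
squared_deviations = [(value - mean) ** 2 for value in observed_values]
manual_standard_deviation = math.sqrt(
    sum(squared_deviations) / (len(observed_values) - 1)
)

# Library version: a single call that has already been tested for you.
library_standard_deviation = statistics.stdev(observed_values)
```

The two results agree, but only one of them required you to remember the n - 1 in the denominator.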
Use a good IDE.
Jupyter is not meant for writing production-ready code.
While there are ways to wrangle Jupyter Notebooks into creating production-ready code, I find that it’s just easier to use them for the early stages of data exploration and analysis, and then develop the production-level code in a proper IDE.
IDEs come with simple tools that help keep your code clean, such as code linting, auto-formatting, syntax highlighting, lookup functionality, and error catching. Furthermore, there are tons of extensions for IDEs that can help save you time while writing code.
Some IDEs you can check out include:
- Visual Studio Code (my personal favorite)
- PyCharm
- Atom
- Spyder
Keep your code DRY.
DRY (Don’t Repeat Yourself) is a software engineering best practice that aims to keep your code clean, concise, and to the point.
The goal is to not repeat any code. This means that if you notice that you’re writing the same lines of code over and over, you need to turn that code into a function that you only write once. This function can then be called multiple times.
If you’re having trouble keeping your code DRY, just use my favorite rule:
- If this is the first time, code it.
- If this is the second time, copy it.
- If this is the third time, make it a function or a class.
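A minimal before-and-after sketch of that idea, using a temperature conversion as the repeated logic. The site names, readings, and function name are hypothetical.

```python
# Hypothetical temperature readings in Fahrenheit, for illustration only.
temperatures_site_a_f = [68.0, 77.0, 86.0]
temperatures_site_b_f = [50.0, 59.0]

# Repetitive version: the same conversion is typed out for every data set.
temperatures_site_a_c = [(t - 32) * 5 / 9 for t in temperatures_site_a_f]
temperatures_site_b_c = [(t - 32) * 5 / 9 for t in temperatures_site_b_f]


# DRY version: the conversion is written once and called as often as needed.
def fahrenheit_to_celsius(temperatures_f):
    """Convert a list of Fahrenheit temperatures to Celsius."""
    return [(temperature - 32) * 5 / 9 for temperature in temperatures_f]


dry_site_a_c = fahrenheit_to_celsius(temperatures_site_a_f)
dry_site_b_c = fahrenheit_to_celsius(temperatures_site_b_f)
```

If the formula ever needs to change, the DRY version has exactly one place to fix.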
Write unit tests.
Unit testing is a type of software testing that involves testing individual components in the code. Unit tests are written to ensure that each piece of code performs as expected. The tests are written so that a small section of code executing a particular functionality is tested to ensure that it is completing its job accurately. The “units” that are tested could be individual functions, procedures, or entire objects.
A great example of unit testing would be to test how a function responds upon receiving different argument types. The results of this test help confirm that you’ve planned for the inputs you expect and that the function won’t throw any unexplained errors later on.
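For a Python flavor of the same idea, here is a small sketch using the standard unittest module. The function under test, mean_of_values, is a hypothetical example written for this illustration, and the tests feed it integers, floats, an empty list, and strings.

```python
import unittest


def mean_of_values(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("values must not be empty")
    return sum(values) / len(values)


class MeanOfValuesTests(unittest.TestCase):
    def test_integers(self):
        self.assertEqual(mean_of_values([1, 2, 3]), 2)

    def test_floats(self):
        self.assertAlmostEqual(mean_of_values([0.5, 1.5]), 1.0)

    def test_empty_list_raises_value_error(self):
        with self.assertRaises(ValueError):
            mean_of_values([])

    def test_strings_raise_type_error(self):
        # sum() cannot add strings to its integer start value.
        with self.assertRaises(TypeError):
            mean_of_values(["1", "2"])
```

If you save this in a file whose name starts with test (for example, a hypothetical test_mean.py), running python -m unittest from the same folder should discover and run the four tests.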
Check out this article by Microsoft on how to write unit tests.
While this article only describes how to write unit tests in C#, the lessons and examples apply to any language.
Don’t become obsessed with one-liner solutions.
One-liner solutions are great if you want to impress your friends or if you want to write a viral article on Medium. Who doesn’t want to trade stocks using one line of code?
While it’s impressive to create a solution using one line of code, you’re also opening yourself up to a world of hurt when it comes to debugging and refactoring your code.
Instead, try to keep it to one or two function calls per line, and make sure your logic is easily understandable to anyone reading your code. If you can’t explain what a line of code is doing in simple terms, you’ve probably made that one line too complicated.
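As a small illustration, here is the same calculation written as a dense one-liner and then as a few short steps. The survey responses are made up for this sketch.

```python
# Hypothetical survey responses, for illustration only.
responses = ["  Yes", "no", "YES ", "", "maybe", "No"]

# One-liner: it works, but every step is packed into a single expression.
yes_share_one_liner = sum(1 for r in [x.strip().lower() for x in responses if x.strip()] if r == "yes") / len([x for x in responses if x.strip()])

# Step-by-step version: each line does one thing and can be inspected alone.
cleaned_responses = [response.strip().lower() for response in responses]
non_empty_responses = [response for response in cleaned_responses if response]
yes_count = non_empty_responses.count("yes")
yes_share = yes_count / len(non_empty_responses)
```

Both versions give the same share of "yes" answers, but in the second one you can print any intermediate variable when something looks off.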
It must be said that the number of lines of code that a developer writes is not a good indication of their ability or prowess — don’t listen to anyone who tells you otherwise.
Final thoughts.
This article highlights the most important best practices that data scientists can implement to make their code production-ready.
Your code is your responsibility, so it makes sense to take some time to ensure that you’re writing code that is efficient, simple, and easily understandable. While there are tons of other tips and tricks for writing clean code, the ones listed above are the simplest ones that you can apply starting today.
Bio: Madison Hunter is a Geoscience BSc undergrad student, Software Dev graduate. Madison produces ramblings about data science, the environment, and STEM.
Original. Reposted with permission.