Getting Started Cleaning Data
To achieve quality data, a process needs to happen: data cleaning. Learn more about the various stages of this process.
Data Cleaning is part of the pre-processing stage and is a vital step that needs to be taken before the data mining stage can occur.
Data quality measures how fit a dataset is for its specific purpose and how much it can be trusted for making decisions. It is made up of characteristics such as accuracy, completeness, consistency, validity, and timeliness.
Achieving quality data requires data cleaning. Data cleaning consists of different techniques, chosen according to the type of data and the problems identified in it.
To summarise, incorrect data is either corrected, removed, or imputed.
Irrelevant Data
This is data that is not needed, has no use, or does not apply to the problem you are trying to solve.
For example, if we are analysing data about patients' lifestyle choices and how they affect their BMI, a patient's address or phone number is irrelevant as a variable. However, their smoking status or medical history is important and relevant to solving the problem.
You have to be certain that a piece of data is not important and has no relevance to the problem before you drop it. Otherwise, explore the data further and look for correlations between the variables.
If you are still unsure, ask someone who is an expert in the field or higher up in the company. They may see the relevance of a specific variable that you do not.
Duplicated Data
Duplicates are records that are repeated in the dataset. This normally occurs because:
- A user submits the same answer twice by accident
- A request is made twice, for example because the original contained wrong information or was simply resubmitted
- Data from different sources is combined
Duplicates like these should simply be removed from the data, as they will affect your analysis.
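In pandas, for example, fully duplicated rows can be spotted and dropped in a couple of lines. Here is a minimal sketch with made-up records:

import pandas as pd

# hypothetical records where one row was accidentally submitted twice
df = pd.DataFrame({'patient_id': [1, 2, 2, 3],
                   'smoker': ['yes', 'no', 'no', 'yes']})

print(df.duplicated().sum())  # number of fully duplicated rows -> 1
df = df.drop_duplicates()     # keep only the first occurrence of each row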
Data Type
There are various data types, such as strings, integers, and floats. Making sure each variable is stored as the correct data type is important for your analysis stage. Answers such as ‘True’ or ‘False’ should be stored as a boolean data type, and a patient's name as a string.
This can easily be checked with summary statistics, a step known as data profiling. It gives you a general statistical view of the data, helping you identify missing values, the type of data in each variable, and so on.
If some data values cannot be converted to a specific data type, they should be converted to NaN values or another value, indicating that the value is incorrect and needs to be resolved.
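With pandas, for example, this could look like the following sketch (the column names and values are made up):

import pandas as pd

df = pd.DataFrame({'age': ['34', '29', 'unknown'],
                   'smoker': ['True', 'False', 'True']})

print(df.dtypes)      # profile the current data types
print(df.describe())  # summary statistics for a quick overview

# values that cannot be parsed as numbers become NaN, flagging them for review
df['age'] = pd.to_numeric(df['age'], errors='coerce')

# store the text answers as a proper boolean type
df['smoker'] = df['smoker'].map({'True': True, 'False': False})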
Syntax Errors
Syntax errors are mistakes in how values are written, such as spelling mistakes, stray punctuation, or badly formed values.
White Space
Whitespace consists of characters that are used for spacing and have an "empty" representation. Whitespace at the beginning or end of a value should be removed. For example:
" hello world " => "hello world"
Pad Character
A pad character is a character used purely to fill empty space in a string, creating a uniform length so that everything stays aligned in a dataset. The example below pads a 3-digit value to 6 digits.
123 => 000123
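If the values are stored as strings, pandas' str.zfill() could be used for this kind of zero-padding (a quick sketch):

import pandas as pd

ids = pd.Series(['123', '45', '6789'])
ids = ids.str.zfill(6)  # pads with leading zeros -> '000123', '000045', '006789'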
String Typos
Strings typically have the most mistakes and errors because they can be entered in many different ways. You might type ‘Hello’ or mistakenly type ‘Hllo’. For example, suppose a survey has been conducted to better understand the demographics of a city. If there is no drop-down menu for choosing your gender, the data can come back looking like this:
Gender: Male F Man Female Fem MAle
There are various solutions to this. One way is manually mapping each value to either “Male” or “Female”:
dataframe['Gender:'].map({'Man': 'Male', 'Fem': 'Female', ... })
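A more complete sketch (the column name and spellings are assumptions based on the survey example above) lower-cases the values first so that variants like ‘MAle’ collapse together, then maps every known spelling onto one of the two labels:

import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'F', 'Man', 'Female', 'Fem', 'MAle']})

# put everything in one case so 'MAle' and 'Male' become the same spelling
df['Gender'] = df['Gender'].str.lower()

mapping = {'male': 'Male', 'man': 'Male', 'm': 'Male',
           'female': 'Female', 'fem': 'Female', 'f': 'Female'}

# anything not covered by the mapping becomes NaN, flagging it for review
df['Gender'] = df['Gender'].map(mapping)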
Ways to Clean Data
Let’s look into the different ways you can clean raw data.
1. Standardise
Putting data into the same standardised format helps you clean it and spot errors. For example, making sure all string values are in either lower case or upper case is a good start, as it stops confusion.
Also ensure that the numerical values in a column all use a single unit of measurement. For example, a patient's weight may be recorded in either pounds (lbs) or kilograms (kg); converting them all to the same unit makes the analyst's life a little bit easier.
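A minimal sketch of both ideas in pandas (the column names and example rows are made up):

import pandas as pd

df = pd.DataFrame({'name': ['Alice', 'BOB'],
                   'weight': [150.0, 70.0],
                   'unit': ['lbs', 'kg']})

# standardise the case of string values
df['name'] = df['name'].str.lower()

# convert any weight recorded in pounds to kilograms (1 lb is about 0.4536 kg)
is_lbs = df['unit'] == 'lbs'
df.loc[is_lbs, 'weight'] = df.loc[is_lbs, 'weight'] * 0.453592
df['unit'] = 'kg'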
2. Scaling
Scaling the data means transforming it so that it fits within a specific scale, such as 0-100 or 0-1. Scaling lets us plot, compare, and analyse the data more easily. For example, the number of times a user visits the gym can be converted from a raw count to a percentage, showing how much they made use of the gym over a specific period.
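For the gym example, a sketch could look like this (the visit counts and period length are made up):

import pandas as pd

visits = pd.Series([12, 30, 3], index=['ann', 'ben', 'cai'])
days_in_period = 30

# express raw visit counts as a 0-100 usage percentage for the period
usage_pct = visits / days_in_period * 100  # -> 40.0, 100.0, 10.0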
3. Normalisation
Normalisation is the process of reorganising data so that there is no duplication and each piece of data is stored in one place. The point of normalising your observations is to change them so that they can be described by a normal distribution.
Many people confuse standardising data with normalising data; however, they are not the same. Normalisation typically rescales values into a range of [0, 1], whereas standardisation typically rescales data to have a mean of 0 and a standard deviation of 1.
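The difference is easy to see in code (a sketch on a made-up column):

import pandas as pd

x = pd.Series([50.0, 60.0, 70.0, 90.0])

# normalisation: rescale the values into the range [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# standardisation: rescale to a mean of 0 and a standard deviation of 1
x_std = (x - x.mean()) / x.std()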
4. NaN / Missing Values
It’s hard to avoid missing values, so we have to deal with them somehow; ignoring them just makes the problem worse. There are different ways you can deal with them.
- Drop: The easiest solution is to drop the row or column if the missing values occur at random.
- Impute: The missing value is calculated based on other observations (see the sketch after this list).
- You can use statistical values such as the mean or median. However, these are not guaranteed to be unbiased.
- Another method is linear regression: a line of best fit between two variables in the existing data can be used to fill in the missing value.
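A sketch of the drop and impute options in pandas (the column and values are hypothetical):

import pandas as pd
import numpy as np

df = pd.DataFrame({'bmi': [22.5, np.nan, 31.0, 27.4, np.nan]})

dropped = df.dropna(subset=['bmi'])               # drop rows with a missing BMI
imputed = df.fillna({'bmi': df['bmi'].median()})  # or impute with the median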
Filling in missing values is arguably controversial when the data is used to make important decisions, run statistical analyses, or present facts to the rest of society. Data that is missing can itself be valuable and informative. For example, in a survey, users from a specific age group or religious group may refuse to answer a question. The reason behind the missing values is important and informative to the survey conductor when asking an analyst to find correlations and draw conclusions.
Dropped or imputed missing values are not the same as default values. Flagging them is important and may allow further analysis for current and future purposes.
The Next Steps
After the pre-processing stage, which includes dropping or imputing data, it is important to re-evaluate the data and make sure the cleaning process has not violated any rules or constraints.
Reporting the quality of the data before passing it on or moving to the next stage is as important as the cleaning process itself. There are software tools and libraries that can detect and report these changes, showing whether any rules were violated.
Reporting the errors in the data allows the business to identify why they occurred in the first place, whether the data is still of use, and how similar errors can be avoided in the future.
Conclusion
You may sit there for hours on end trying to clean data, to the point where you get frustrated. However, there is no point in analysing bad data, and band-aid solutions do not deal with the cause of the problem. You have to understand the problem at hand and figure out the best way to fix it.
Nisha Arya is a Data Scientist and freelance Technical writer. She is particularly interested in providing Data Science career advice or tutorials and theory based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.