Common Data Problems (and Solutions)
Let’s have a look into some of the common problems with data and the solutions for them.
Photo by Shubham Dhage via Unsplash
The boom in Data Science is largely due to the large amounts of data now available and the solutions they can provide to our real-life problems. However, when it comes to Data Science, the theory you put into practice is not always the same as the reality you face.
As a Data Scientist, it is very normal to receive large amounts of data that have issues and require heavy data cleaning, careful model design, and model execution. These problems stem from the complexity and scope of the data being used to answer a question: the number of features, errors in the values, the characteristics of the data, and more.
When handling problems with your data, it is vital that the issues are handled correctly and efficiently.
So let’s dive into some of the most common data problems and the solutions for them.
Not Enough Data
The main component for a Data Scientist is Data; it’s part of the title. Without data, the progress of Data Science is limited, which is a problem for a world that is now heavily dependent on it.
If there is not enough data, it becomes a problem because data is an essential element for training algorithms. Limited data can lead to inaccurate and inefficient outputs, costing the company a lot of time and resources. However, there are solutions for generating more data to train your model.
1. Randomly Generate
If you know what kind of values you are looking for, there is a possibility of randomly generating them.
One way to do this is with a pseudo-random number generator (PRNG): an algorithm which generates seemingly random but still reproducible data. Reproducibility is the ability to obtain consistent results using the same data and code as the original study.
You can also use input analysis to understand the distribution of the data, and then imitate or replicate values based on that distribution.
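As a rough sketch of both ideas, the snippet below seeds a PRNG so the output is reproducible, then samples new values from an assumed distribution. The feature name, mean, and standard deviation are illustrative, not taken from any real dataset.

```python
import numpy as np

# Seeding the generator makes the "random" data reproducible.
rng = np.random.default_rng(seed=42)

# Suppose input analysis suggested an existing feature is roughly normal
# with mean 50 and standard deviation 10 (assumed values) -- imitate it:
synthetic_feature = rng.normal(loc=50, scale=10, size=1_000)

# Re-running with the same seed reproduces exactly the same values.
rng_check = np.random.default_rng(seed=42)
assert np.allclose(synthetic_feature, rng_check.normal(loc=50, scale=10, size=1_000))
```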
2. Data Augmentation
Data Augmentation is a strategy used to significantly increase the diversity of data available for training models, without having to collect new data. For example, techniques such as cropping, padding, and horizontal flipping are used heavily to train large neural networks.
However, you need to keep in mind the hypothesis and the task you are trying to solve, to make sure you are generating valid inputs.
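As an illustration, here is a minimal sketch of horizontal flipping, padding, and random cropping written in plain NumPy; the image size, padding amount, and flip probability are arbitrary choices, and in practice you would more likely reach for a library such as torchvision or albumentations.

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in for a real image

def augment(img, rng):
    """Return a randomly flipped, padded, and cropped copy of img (H, W, C)."""
    out = img
    if rng.random() < 0.5:                       # horizontal flip half the time
        out = out[:, ::-1, :]
    out = np.pad(out, ((4, 4), (4, 4), (0, 0)), mode="reflect")  # pad to 40x40
    top, left = rng.integers(0, 9, size=2)       # random 32x32 crop from the padded image
    return out[top:top + 32, left:left + 32, :]

augmented = [augment(image, rng) for _ in range(5)]  # five new training variants of one image
```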
3. Avoid Overfitting
When working with smaller datasets, the likelihood of overfitting naturally increases. As a reminder: overfitting is a modeling error that occurs when a function is too closely aligned to a limited set of data points.
In order to avoid overfitting, you can use techniques such as feature selection, regularisation, or cross-validation.
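For example, the sketch below combines regularisation (a Ridge model) with 5-fold cross-validation on a small synthetic dataset; the dataset size and the alpha value are illustrative only.

```python
# A minimal sketch: regularisation plus cross-validation to keep overfitting in check.
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=10, noise=10.0, random_state=0)

model = Ridge(alpha=1.0)                      # L2 regularisation shrinks the coefficients
scores = cross_val_score(model, X, y, cv=5)   # R^2 score on each of the 5 held-out folds

print(scores.mean(), scores.std())            # stable scores across folds suggest less overfitting
```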
Too Much Data
There is a problem with not having enough data, but there is also a problem with having too much. Having more data does not guarantee that your model will produce accurate outputs.
Factors such as computational power, training time, and the quality of the overall output all come into play. You may realise that using less data allows your model to perform more efficiently and faster. There are things you can take into consideration when working with a lot of data.
1. p-value
According to Wikipedia, the p-value:
"is the probability of obtaining test results at least as extreme as the results actually observed, under the assumption that the null hypothesis is correct."
In layman’s terms, the p-value is the probability that the results from your sample data occurred by chance, so a low p-value is good. Anything below 0.05 (5%) is conventionally considered statistically significant, allowing us to reject the null hypothesis. The p-value also depends on the size of the data being tested: the larger the sample size, the smaller the p-value tends to be. If the sample size is large, there is a higher chance of finding a significant relationship, if one exists, because the impact of random error is reduced as the sample size increases.
If your dataset is so large that almost every effect looks statistically significant, you can take several smaller samples from the data, train your model on each, and check that the results hold consistently across the samples.
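Here is a minimal sketch of that idea, assuming a DataFrame with two numeric columns; the column names, sample sizes, and effect size are made up for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
df = pd.DataFrame({"feature": rng.normal(size=100_000)})
df["target"] = 0.02 * df["feature"] + rng.normal(size=100_000)  # a tiny, possibly trivial effect

# Draw several smaller samples and check that the relationship is significant in each.
p_values = []
for _ in range(10):
    sample = df.sample(n=2_000, random_state=int(rng.integers(1_000_000)))
    _, p = pearsonr(sample["feature"], sample["target"])
    p_values.append(p)

print(p_values)  # consistent p-values across samples are more convincing than one huge-n test
```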
2. Computational Power
Computational power is the ability of a computer to perform a certain task with speed and accuracy. Training a model on a large dataset can become very difficult, as you will need a lot of computational power and memory to process the data. It can require too much time, making the process inefficient as a whole. One solution is:
Stratified Sampling
Stratified sampling is a method that involves the division of data into smaller sub-groups known as strata. In stratified random sampling, or stratification, the strata are formed based on unique attributes or characteristics such as income or age group in census data.
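As a sketch, assuming census-like data with an age_group column (the column names and group sizes are illustrative), pandas can draw the same fraction from each stratum so the smaller sample keeps the original proportions:

```python
import pandas as pd

census = pd.DataFrame({
    "age_group": ["18-29"] * 500 + ["30-49"] * 300 + ["50+"] * 200,
    "income": range(1000),
})

# Sample 10% from each age group (stratum) rather than 10% of the whole table.
stratified = census.groupby("age_group").sample(frac=0.10, random_state=42)

print(census["age_group"].value_counts(normalize=True))
print(stratified["age_group"].value_counts(normalize=True))  # proportions match the original
```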
How Much Data is Considered a Good Amount?
It is difficult to determine what is considered a "good" amount of data, as there is no set rule. It all depends on the type of data, the complexity of the problem, and other factors such as cost. However, below is an approach you can take into consideration when deciding on an effective sample size:
- Samples - Taking sub-samples of the data, ensuring that a certain percentage of each variable is covered. An example of this is what we covered above, Stratified sampling.
However, it all comes down to the type of machine learning problem you are trying to solve. For example:
- Image Classification Problem - This requires tens of thousands of images, in order to create a classifier with accurate outputs.
- Sentiment Analysis Problem - Due to the number of words and phrases, this also requires thousands of example texts. An N-gram model is built by counting the frequency of word sequences that appear in a corpus and then estimating the probabilities (a minimal counting sketch follows this list).
- Regression Problem - Many researchers have suggested having 10 times as many observations as features. For example, if we have three independent variables, then a minimum sample size of 30 would be a good starting point.
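To make the N-gram point concrete, here is a minimal sketch of the counting step for bigrams (n = 2) on a toy corpus; the sentences are made up for illustration.

```python
from collections import Counter

corpus = ["the movie was great", "the movie was boring", "the plot was great"]

bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)                  # counts of single words
    bigram_counts.update(zip(tokens, tokens[1:]))  # counts of two-word sequences

# Estimated probability of "great" following "was":
p = bigram_counts[("was", "great")] / unigram_counts["was"]
print(p)  # 2 of the 3 occurrences of "was" are followed by "great" -> ~0.67
```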
There's Still No Set Rule
There are rough guidelines that people follow, based on researchers and other Data Scientists who have been creating models for years; however, these are not intended to be taken as set-in-stone or golden rules.
Depending on the correlation between different variables, you may need more data, or you may be able to get by with less. During the first few stages of the workflow, you should ask yourself these questions:
- What’s my desired timeframe?
If you are trying to predict what will happen in the next 5 years, having only 1 year’s worth of data is not going to be enough. You will need a minimum of 5 years of data, and you need to ensure that there is not a lot of missing data, in order to produce accurate results.
- What is the granularity of my data?
If I need a year's worth of data, does my data correspond to that? Is my data in days, weeks, or months? A year's worth of data could be 365 data points (days), 52 data points (weeks), or 12 data points (months). All are equally valid; it comes down to the problem at hand and what you require.
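As a small illustration of granularity, assuming daily observations stored in a pandas DataFrame with a DatetimeIndex, the same year of data can be viewed at three different resolutions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
daily = pd.DataFrame(
    {"value": rng.normal(size=365)},                            # one value per day
    index=pd.date_range("2022-01-01", periods=365, freq="D"),
)

weekly = daily.resample("W").mean()    # ~52 points
monthly = daily.resample("MS").mean()  # 12 points

print(len(daily), len(weekly), len(monthly))
```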
Nisha Arya is a Data Scientist and freelance technical writer. She is particularly interested in providing Data Science career advice and tutorials, along with theory-based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence can benefit the longevity of human life. A keen learner, she seeks to broaden her tech knowledge and writing skills, while helping guide others.