Easy Synthetic Data in Python with Faker
Faker is a Python library that generates fake data to supplement or take the place of real world data. See how it can be used for data science.
Image by geralt on Pixabay
Real data, pulled from the real world, is the gold standard for data science, perhaps for obvious reasons. The trick, of course, if being able to find the real world data needed for a project. Sometimes you get lucky the data you need is available at your fingertips, in the form you need, and in the amount you want.
Often, however, real world data isn't enough for a project. In such a case, synthetic data can be used either in place of real data or to augment an insufficiently large dataset. There are lots of ways to artificially manufacture data, some of which are far more complex than others. There are a number of relatively simple options for doing so, however, and Faker for Python is one such solution.
From Faker's documentation:
Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you.
Faker can be installed with pip:
pip install faker
And importing and instantiating an instance of Faker to use is done like this:
from faker import Faker fake = Faker()
Basic Faker Usage
Faker is very straightforward to use. Let's first have a look at a few examples.
For our first trick, let's generate a fake name.
print(fake.name())
Deborah Brooks
How about a few other data types?
# generate an address print(fake.address()) # generate a phone number print(fake.phone_number()) # generate a date print(fake.date()) # generate a social security number print(fake.ssn()) # generate some text print(fake.text())
123 Danielle Forges Suite 506 Stevenborough, RI 36008 968-491-2711 1974-05-27 651-27-9994 Weight where while method. Rock final environmental gas provide. Remember continue sure. Create resource determine fine. Else red across participant. People must interesting spend some us.
Faker Optimization
What if you're interested in locale-specific data? What if, for instance, I'm interested in generating Spanish names of the type one would find in Mexico?
fake = Faker(['es_MX']) for i in range(10): print(fake.name())
Diana Lovato Ing. Ariadna Palacios Salvador de la Fuente Margarita Naranjo Alvaro Prado Melgar Tomás Menchaca Conchita Francisca Velázquez Zedillo Ivonne Ana Luisa Bueno Ramiro Raquel Vélez Urbina Porfirio Esther Irizarry Varela
Generated output can be further optimized by using Faker's use weighting:
The Faker constructor takes a performance-related argument called
use_weighting
. It specifies whether to attempt to have the frequency of values match real-world frequencies (e.g. the English name Gary would be much more frequent than the name Lorimer). Ifuse_weighting
isFalse
, then all items have an equal chance of being selected, and the selection process is much faster. The default isTrue
.
Let's see how this works with some US English examples:
fake = Faker(['en_US'], use_weighting=True) for i in range(20): print(fake.name())
Mary Mckinney Margaret Small Dominic Carter Elizabeth Gibson Kelsey Garcia Chelsea Bradford Robert West Timothy Howe Gary Turner Cynthia Strong Joshua Henry Amanda Jenkins Jacqueline Daniels Catherine Jones Desiree Hodge Shannon Mason DVM Marcia West Dustin Parrish Christopher Rodriguez Brett Webb
Compare this with a similar output without use weighting:
fake = Faker(['en_US'], use_weighting=False) for i in range(20): print(fake.name())
Mr. Benjamin Horton Miss Maria Hardin Tina Good DDS Dr. Terry Barr MD Meredith Mason Roberta Velasquez Mr. Tim Woods V Marilyn Conway Mr. Dwayne Leblanc III Dr. Dan Krause IV Mia Newman DVM Thomas Small Joseph Holmes Dr. Tanner Zhang Alan Dixon Miss Rebecca Davila DVM Joseph Becker MD Dr. Erin Pugh PhD Mr. Ernest Juarez Ross Thompson
Note, for instance, the disproportionate number of doctors in the generated "sample."
A More Comprehensive Example
Let's say we want to generate a number of fake customer data records for an international company, we want to emulate real world distribution of naming, and we want to save this data to a CSV file for later use in a data science task of some sort. To clean up the data, we will also replace the newlines in the generated addresses with commas, and remove the newlines from generated text completely.
Here's a code excerpt that will do just that, exhibiting the power of Faker in a few lines of code.
from faker import Faker import pandas as pd fake = Faker(['en_US', 'en_UK', 'it_IT', 'de_DE', 'fr_FR'], use_weighting=True) customers = {} for i in range(0, 10000): customers[i]={} customers[i]['id'] = i+1 customers[i]['name'] = fake.name() customers[i]['address'] = fake.address().replace('\n', ', ') customers[i]['phone_number'] = fake.phone_number() customers[i]['dob'] = fake.date() customers[i]['note'] = fake.text().replace('\n', ' ') df = pd.DataFrame(customers).T print(df) df.to_csv('customer_data.csv', index=False)
id name ... dob note 0 1 Burkhardt Junck ... 1982-08-11 Across important glass stop. Score include rel... 1 2 Ilja Weihmann ... 1975-03-24 Iusto velit aspernatur nemo. Aliquid ipsum ita... 2 3 Agnolo Tafuri ... 1990-10-03 Aspernatur fugit voluptatibus. Cumque accusant... 3 4 Sig. Lamberto Cutuli ... 1973-01-15 Maiores temporibus beatae. Ipsam non autem ist... 4 5 Marcus Turner ... 2005-12-17 Témoin âge élever loi.\nFatiguer auteur autori... ... ... ... ... ... ... 9995 9996 Miss Alexandra Waters ... 1985-01-20 Commodi omnis assumenda sit ratione non. Commo... 9996 9997 Natasha Harris ... 2003-10-26 Voluptatibus dolore a aspernatur facere. Aliqu... 9997 9998 Adrien Marin ... 1983-05-29 Et unde iure. Reiciendis doloribus dignissimos... 9998 9999 Nermin Heydrich ... 2005-03-29 Plan moitié charge note convenir.\nSang précip... 9999 10000 Samuel Allen ... 2011-09-29 Total gun economy adult as nor. Age late gas p... [10000 rows x 6 columns]
We also have a customer_data.csv file with all of our data for further processing and use as we see fit.
Screenshot of generated customer data CSV file
The type specific data generators above — such as name, address, phone, etc. — are called providers. Learn how to extend its usefulness by enabling Faker to generate specialzied types of data using standard providers and community providers.
See Faker's GitHub repository and documentation for further capabilities and make your own dataset today.
Related:
- Create Synthetic Time-series with Anomaly Signatures in Python
- Teaching AI to Classify Time-series Patterns with Synthetic Data
- 3 Data Acquisition, Annotation, and Augmentation Tools