Topics: AI | Data Science | Data Visualization | Deep Learning | Machine Learning | NLP | Python | R | Statistics

KDnuggets Home » News » 2021 » Nov » Tutorials, Overviews » Easy Synthetic Data in Python with Faker

Easy Synthetic Data in Python with Faker


Faker is a Python library that generates fake data to supplement or take the place of real world data. See how it can be used for data science.



Figure
Image by geralt on Pixabay

 

Real data, pulled from the real world, is the gold standard for data science, perhaps for obvious reasons. The trick, of course, if being able to find the real world data needed for a project. Sometimes you get lucky the data you need is available at your fingertips, in the form you need, and in the amount you want.

Often, however, real world data isn't enough for a project. In such a case, synthetic data can be used either in place of real data or to augment an insufficiently large dataset. There are lots of ways to artificially manufacture data, some of which are far more complex than others. There are a number of relatively simple options for doing so, however, and Faker for Python is one such solution.

From Faker's documentation:

Faker is a Python package that generates fake data for you. Whether you need to bootstrap your database, create good-looking XML documents, fill-in your persistence to stress test it, or anonymize data taken from a production service, Faker is for you.

 

Faker can be installed with pip:

pip install faker


And importing and instantiating an instance of Faker to use is done like this:

from faker import Faker
fake = Faker()


Basic Faker Usage

 
Faker is very straightforward to use. Let's first have a look at a few examples.

For our first trick, let's generate a fake name.

print(fake.name())


Deborah Brooks


How about a few other data types?

# generate an address
print(fake.address())

# generate a phone number
print(fake.phone_number())

# generate a date
print(fake.date())

# generate a social security number
print(fake.ssn())

# generate some text
print(fake.text())


123 Danielle Forges Suite 506
Stevenborough, RI 36008

968-491-2711

1974-05-27

651-27-9994

Weight where while method. Rock final environmental gas provide. Remember continue sure. Create resource determine fine. Else red across participant. People must interesting spend some us.


 

Faker Optimization

 
What if you're interested in locale-specific data? What if, for instance, I'm interested in generating Spanish names of the type one would find in Mexico?

fake = Faker(['es_MX'])
for i in range(10):
    print(fake.name())


Diana Lovato
Ing. Ariadna Palacios
Salvador de la Fuente
Margarita Naranjo
Alvaro Prado Melgar
Tomás Menchaca
Conchita Francisca Velázquez Zedillo
Ivonne Ana Luisa Bueno
Ramiro Raquel Vélez Urbina
Porfirio Esther Irizarry Varela


Generated output can be further optimized by using Faker's use weighting:

The Faker constructor takes a performance-related argument called use_weighting. It specifies whether to attempt to have the frequency of values match real-world frequencies (e.g. the English name Gary would be much more frequent than the name Lorimer). If use_weighting is False, then all items have an equal chance of being selected, and the selection process is much faster. The default is True.

 

Let's see how this works with some US English examples:

fake = Faker(['en_US'], use_weighting=True)
for i in range(20):
    print(fake.name())


Mary Mckinney
Margaret Small
Dominic Carter
Elizabeth Gibson
Kelsey Garcia
Chelsea Bradford
Robert West
Timothy Howe
Gary Turner
Cynthia Strong
Joshua Henry
Amanda Jenkins
Jacqueline Daniels
Catherine Jones
Desiree Hodge
Shannon Mason DVM
Marcia West
Dustin Parrish
Christopher Rodriguez
Brett Webb


Compare this with a similar output without use weighting:

fake = Faker(['en_US'], use_weighting=False)
for i in range(20):
    print(fake.name())


Mr. Benjamin Horton
Miss Maria Hardin
Tina Good DDS
Dr. Terry Barr MD
Meredith Mason
Roberta Velasquez
Mr. Tim Woods V
Marilyn Conway
Mr. Dwayne Leblanc III
Dr. Dan Krause IV
Mia Newman DVM
Thomas Small
Joseph Holmes
Dr. Tanner Zhang
Alan Dixon
Miss Rebecca Davila DVM
Joseph Becker MD
Dr. Erin Pugh PhD
Mr. Ernest Juarez
Ross Thompson


Note, for instance, the disproportionate number of doctors in the generated "sample."

 

A More Comprehensive Example

 
Let's say we want to generate a number of fake customer data records for an international company, we want to emulate real world distribution of naming, and we want to save this data to a CSV file for later use in a data science task of some sort. To clean up the data, we will also replace the newlines in the generated addresses with commas, and remove the newlines from generated text completely.

Here's a code excerpt that will do just that, exhibiting the power of Faker in a few lines of code.

from faker import Faker
import pandas as pd

fake = Faker(['en_US', 'en_UK', 'it_IT', 'de_DE', 'fr_FR'], use_weighting=True)

customers = {}

for i in range(0, 10000):
    customers[i]={}
    customers[i]['id'] = i+1
    customers[i]['name'] = fake.name()
    customers[i]['address'] = fake.address().replace('\n', ', ')
    customers[i]['phone_number'] = fake.phone_number()
    customers[i]['dob'] = fake.date()
    customers[i]['note'] = fake.text().replace('\n', ' ')

df = pd.DataFrame(customers).T
print(df)

df.to_csv('customer_data.csv', index=False)


         id                   name  ...         dob                                               note
0         1        Burkhardt Junck  ...  1982-08-11  Across important glass stop. Score include rel...
1         2          Ilja Weihmann  ...  1975-03-24  Iusto velit aspernatur nemo. Aliquid ipsum ita...
2         3          Agnolo Tafuri  ...  1990-10-03  Aspernatur fugit voluptatibus. Cumque accusant...
3         4   Sig. Lamberto Cutuli  ...  1973-01-15  Maiores temporibus beatae. Ipsam non autem ist...
4         5          Marcus Turner  ...  2005-12-17  Témoin âge élever loi.\nFatiguer auteur autori...
...     ...                    ...  ...         ...                                                ...
9995   9996  Miss Alexandra Waters  ...  1985-01-20  Commodi omnis assumenda sit ratione non. Commo...
9996   9997         Natasha Harris  ...  2003-10-26  Voluptatibus dolore a aspernatur facere. Aliqu...
9997   9998           Adrien Marin  ...  1983-05-29  Et unde iure. Reiciendis doloribus dignissimos...
9998   9999        Nermin Heydrich  ...  2005-03-29  Plan moitié charge note convenir.\nSang précip...
9999  10000           Samuel Allen  ...  2011-09-29  Total gun economy adult as nor. Age late gas p...

[10000 rows x 6 columns]


We also have a customer_data.csv file with all of our data for further processing and use as we see fit.

Figure
Screenshot of generated customer data CSV file

 

The type specific data generators above — such as name, address, phone, etc. — are called providers. Learn how to extend its usefulness by enabling Faker to generate specialzied types of data using standard providers and community providers.

See Faker's GitHub repository and documentation for further capabilities and make your own dataset today.

 
Related:


Sign Up

By subscribing you accept KDnuggets Privacy Policy