Fake It Till You Make It: Generating Realistic Synthetic Customer Datasets

Finding the data you need is hard. So why not fake it?

Image by mcmurryjulie on Pixabay


Being able to create and use synthetic data in projects has become a must-have skill for data scientists.

I have written in the past about using the Python library Faker for creating your own synthetic datasets. Rather than repeat that article, let's treat this as the second in a series on generating synthetic data for your own data science projects. This time around, let's generate some fake customer order data.

If you don't know anything about Faker, how it is used, or what you can do with it, I suggest that you check out the previous article first.


The Plan

The plan is to synthesize a scaled-down version of a set of tables that would be used in the real-world business case of a customer order system.

Aside from items for purchase, let's think about what is called for in such a scenario.

  • Customers - in what is not much of a surprise, if you are going to build a system to track customer orders, you are going to need customers
  • Credit cards - customers need to pay for things, and in our simplified scenario they can only do so with credit cards
  • Orders - an order will consist of a customer, a cost, and a credit card for payment

That's the data we need, so that's the data we will make. After you go through this, you will probably find ways to make it more robust, more detailed, and more like the real world, which you should be able to go ahead and do on your own.


Imports and Helper Functions

Let's get started. First, the imports.

from faker import Faker
import faker.providers.credit_card
import pandas as pd

import random
from random import randint

Next, let's write a few helper functions that will be of use a little later on.

def random_n_digits(n):
    range_start = 10**(n-1)
    range_end = (10**n)-1
    return randint(range_start, range_end)

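def unique_rand(rands, n):
    new_int = random_n_digits(n)
    if new_int in rands:
        # Collision with an existing value: draw again
        return unique_rand(rands, n)
    # New value: record it so later calls can check against it
    rands.append(new_int)
    return new_int, rands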

def generate_cost():
    cost = ''
    digits = randint(1, 4)
    cost += str(random_n_digits(digits))
    cost += '.' + str(random_n_digits(2))
    return cost

The first function, random_n_digits(), generates a random integer that is exactly n digits long. The approach is adapted from a StackOverflow answer.

This will come in handy for identifiers such as customer and order numbers.

The next function, unique_rand(), will be used to ensure that a generated identifier is unique to our system. It takes a list of existing integers and the length of the new integer to be created, uses the previous function to create an integer of that length, and checks it against the list. If the new integer is not already in the list, it is added to the list and returned; otherwise another one is generated.
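
As a point of comparison, here is a minimal sketch of the same idea written with a loop and a set rather than recursion and a list. The name unique_rand_set is hypothetical and not part of the code in this article. A set keeps the membership check fast as the number of stored IDs grows, and the loop avoids any recursion depth concerns.

```python
from random import randint

def random_n_digits(n):
    # Same helper as above: a random integer with exactly n digits
    return randint(10**(n-1), (10**n)-1)

def unique_rand_set(seen, n):
    # Hypothetical variant: keep drawing until an unused value turns up
    while True:
        new_int = random_n_digits(n)
        if new_int not in seen:
            seen.add(new_int)
            return new_int

seen = set()
ids = [unique_rand_set(seen, 3) for _ in range(500)]
```

Note that either version will loop for a long time if you ask for nearly as many IDs as there are n-digit numbers, so leave plenty of headroom when choosing n.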

The final function's utility is given away by its name, generate_cost(). To generate a cost, the function randomly picks an integer between 1 and 4, which becomes the number of digits in the dollar portion of the cost. random_n_digits() is then used to generate an integer of that length. After this, the process is repeated to create a 2 digit integer, which becomes the cents portion of the cost, to the right of the decimal point. The two parts are joined with a decimal point and returned as a string.
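
One thing to be aware of: because the cents come from random_n_digits(2), they always fall between 10 and 99, so a cost such as 12.05 can never appear. If that matters for your project, a sketch like the following draws the cents from 0 to 99 and zero-pads them. The name generate_cost_padded is hypothetical and not from the original code.

```python
from random import randint

def generate_cost_padded():
    # Dollar part: 1 to 4 digits, as in generate_cost()
    digits = randint(1, 4)
    dollars = randint(10**(digits-1), (10**digits)-1)
    # Cents part: 0 to 99, zero-padded to two characters
    cents = randint(0, 99)
    return f'{dollars}.{cents:02d}'

costs = [generate_cost_padded() for _ in range(1000)]
```

Like generate_cost(), this returns a string; convert it with float() or decimal.Decimal if you need to do arithmetic on it.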

Now let's move on to faking it.


Don't worry, even Elaine fakes it.


Creating Customers

With that, let's generate the customers. Our 10,000 customers will include the following attributes:

  • customer ID (cust_id) - generated using the helper functions outlined above
  • customer name (name) - generated using Faker; use_weighting=True means an attempt is made to have the frequency of generated values match real-world frequencies ("Lisa" will be more frequently generated than will "Braelynn"); the locales denote from where names are being generated
  • customer address (address) - generated using Faker
  • customer phone number (phone_number) - generated using Faker
  • customer date of birth (dob) - generated using Faker
  • customer note text field (note) - generated using Faker

The code also stores generated unique customer IDs (cust_ids) as a list in order to compare newly-generated IDs with existing to ensure uniqueness. After this, the dictionary which is used to store the customer data is passed into a new Pandas DataFrame, and ultimately stored to a CSV file.

fake = Faker(['en_US', 'en_UK', 'it_IT', 'de_DE', 'fr_FR'], use_weighting=True)

customers = {}
cust_ids = []

for i in range(0, 10000):
    # Create the record for this customer before filling in its fields
    customers[i] = {}
    customers[i]['cust_id'], cust_ids = unique_rand(cust_ids, 8)
    customers[i]['name'] = fake.name()
    customers[i]['address'] = fake.address().replace('\n', ', ')
    customers[i]['phone_number'] = fake.phone_number()
    customers[i]['dob'] = fake.date()
    customers[i]['note'] = fake.text().replace('\n', ' ')

customer_df = pd.DataFrame(customers).T

customer_df.to_csv('customer_data.csv', index=False)

       cust_id                 name  \
0     52287029            Jay Brown   
1     85688731    Frédérique Martel   
2     95499535      Georges Leclerc   
3     28715621  Christian Carpenter   
4     94472217       Lorraine Watts   
...        ...                  ...   
9995  70168635       Léon Couturier   
9996  10483280       Vincent Nelson   
9997  41868059           Gert Klapp   
9998  28049517    Simonetta Garrone   
9999  26781527      Alessio Camanni   

                                                address       phone_number  \
0     Flat 9, Hart islands, East Elliotchester, DY6N...       0117 4960802   
1             boulevard Chevalier, 93506 BourgeoisBourg  625.665.4731x5846   
2                    43 Poole way, Taylorstad, KW45 0FT         0780881522   
3     Rotonda Olivetti 99, Sandro salentino, 48332 L...      (00960) 04254   
4              Jolanda-Seifert-Allee 113, 91518 Koblenz    +39 353 5623602   
...                                                 ...                ...   
9995  Strada Molesini 3 Appartamento 32, Ariasso ven...   +44(0)1214960433   
9996                         Trubring 86, 28785 Ansbach  +33 5 61 32 08 79   
9997  Strada Casarin 01 Piano 8, Settimo Giovanni ne...    +39 695 7780253   
9998        Marga-Trubin-Straße 2/4, 13495 Feuchtwangen     (07310) 491854   
9999  avenue Susanne Berthelot, 70292 Poirier-sur-Ra...       0114 4960083   

             dob                                               note  
0     2004-08-20  Interroger dormir but remercier atteindre juge...  
1     2009-07-08  Semblable tout désert dominer lutte. Quart mêm...  
2     2021-04-17  Occaecati occaecati temporibus a asperiores di...  
3     1999-05-03  Rem itaque maxime dolor eum omnis. Eligendi qu...  
4     1997-06-02  Doloremque ut illo sunt. Modi non autem conseq...  
...          ...                                                ...  
9995  1981-06-06  Language state white receive soon. Usually tru...  
9996  2020-01-03  Similique quasi eos pariatur consequatur liber...  
9997  2018-10-13  Voluptatum exercitationem omnis rem. Beatae al...  
9998  1983-02-09  Treat vote poor church area discuss carry argu...  
9999  1987-02-06  Go remember center toward real food section. S...  

[10000 rows x 6 columns]
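
Your own output will differ from the above, since every run draws new random values. If you want a repeatable dataset, the random number generators can be seeded first. The sketch below shows this for Python's random module, which drives the ID helpers; Faker has its own Faker.seed() class method for the values it generates. The function make_ids and the seed value 42 are just for illustration.

```python
import random
from random import randint

def random_n_digits(n):
    # Same helper as above: a random integer with exactly n digits
    return randint(10**(n-1), (10**n)-1)

def make_ids(seed, count=5):
    # Seeding the module-level generator makes the draws repeatable
    random.seed(seed)
    return [random_n_digits(8) for _ in range(count)]

first_run = make_ids(42)
second_run = make_ids(42)
```

Both lists come out identical, because each call starts the generator from the same state.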

Creating Credit Cards

Customers need a method to pay for their orders, so let's give them all credit cards.

Actually, in an effort to simplify, we will generate credit cards without assigning them to any particular customer. Instead, we will just match customers and cards for orders. You could modify this with a little ingenuity to assign cards to customers and then ensure that orders were paid for with the proper cards. I'll leave that as an exercise for interested readers.

Below you will find that unique credit card IDs (cc_id) are generated with the same helper functions and the same basic method as the unique customer IDs were. These IDs are artificially short, but go ahead and make them as long as you would like. The rest of the data, including the card number itself, is generated using Faker. The data is then fed into a Pandas DataFrame and saved as a CSV file for later use.

credit_cards = {}
cc_ids = []

for i in range(0, 10000):
    # Create the record for this card before filling in its fields
    credit_cards[i] = {}
    credit_cards[i]['cc_id'], cc_ids = unique_rand(cc_ids, 5)
    credit_cards[i]['type'] = fake.credit_card_provider()
    credit_cards[i]['number'] = fake.credit_card_number()
    credit_cards[i]['ccv'] = fake.credit_card_security_code()
    credit_cards[i]['expire'] = fake.credit_card_expire()

credit_cards_df = pd.DataFrame(credit_cards).T

credit_cards_df.to_csv('credit_card_data.csv', index=False)

      cc_id           type            number  ccv expire
0     33257   JCB 16 digit   213177754612892  121  11/24
1     86707  VISA 16 digit  6573538482942722  042  11/31
2     96668  VISA 16 digit  4780281393619055  671  01/23
3     73749  VISA 16 digit  3520725757002891  319  04/28
4     26342  VISA 13 digit    30141856563149  495  10/29
...     ...            ...               ...  ...    ...
9995  14141  VISA 13 digit     4617204802844  640  04/27
9996  35599        Maestro      639006455203  384  12/21
9997  46479  VISA 16 digit      503885514391  587  08/24
9998  78536  VISA 19 digit     4789890563459  057  07/22
9999  84649     Mastercard  3590096870674031  874  04/31

[10000 rows x 5 columns]
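
For readers who want to try the exercise mentioned above, here is one possible sketch of assigning cards to customers. It uses small stand-in ID lists in place of the real cust_ids and cc_ids, and the names cards_by_customer and pick_order_payment are hypothetical. Every customer gets one card to start with, the remaining cards are handed out at random, and an order can then only be paid with one of that customer's own cards.

```python
import random

# Stand-in IDs for illustration; in the article these would be cust_ids and cc_ids
cust_ids = [10000001, 10000002, 10000003]
cc_ids = [11111, 22222, 33333, 44444, 55555]

# Assumes there are at least as many cards as customers
shuffled = cc_ids[:]
random.shuffle(shuffled)

# Give every customer one card first, then hand out the rest at random
cards_by_customer = {cust: [shuffled.pop()] for cust in cust_ids}
for card in shuffled:
    cards_by_customer[random.choice(cust_ids)].append(card)

def pick_order_payment():
    # Choose a customer, then one of that customer's own cards
    cust = random.choice(cust_ids)
    return cust, random.choice(cards_by_customer[cust])

cust, card = pick_order_payment()
```

In the order-generation step, the customer and card for each order would then come from a single call to pick_order_payment() instead of two independent random choices.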

Creating Orders

Now let's generate ourselves some money.

Order IDs will be unique in the same manner as the previous customer IDs and credit card IDs. We will then link a random customer and a random credit card in an order, and generate a random cost using the third of the three helper functions introduced earlier on.

In what has become a common pipeline, we then create a Pandas DataFrame of the dictionary, and save the data to file as a CSV.

orders = {}
order_ids = []

for i in range(0, 1000):
    # Create the record for this order before filling in its fields
    orders[i] = {}
    orders[i]['order_id'], order_ids = unique_rand(order_ids, 10)
    orders[i]['cust_id'] = random.choice(cust_ids)
    orders[i]['cc_id'] = random.choice(cc_ids)
    orders[i]['cost'] = generate_cost()

orders_df = pd.DataFrame(orders).T

orders_df.to_csv('orders.csv', index=False)

       order_id   cust_id  cc_id     cost
0    9526379779  21484387  95840  6471.85
1    6999189530  90073074  75578     5.31
2    6124881941  84882923  13358   962.21
3    7476579071  91911770  22301    60.82
4    4102308607  60614412  28339  8086.96
..          ...       ...    ...      ...
995  2021016579  42107923  24863  4165.62
996  9279206414  49397693  45436     1.27
997  3378899620  40173623  96470    32.64
998  2222207181  73076539  40697  9701.29
999  1040242247  17749465  66052     9.63

[1000 rows x 4 columns]

The result is that you should have yourself three CSV files constituting the real-world emulation of an actual business process.

What do you do with the synthetic data now? Get creative. You could do some study, learn some new techniques or concepts, or undertake a project. A few more specific ideas include:

  • using Python to create an SQL database out of this data to then practice your SQL skills with
  • performing a data exploration project
  • visualizing some of the synthetic data in interesting ways
  • seeing what kind of data preprocessing you could come up with, such as splitting customer names into first and last, verifying that each customer has a credit card, or ensuring young children aren't able to make purchases
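
To get you started on two of those ideas, here is a rough sketch that loads customers and orders into an in-memory SQLite database and flags orders placed by anyone under 18. The two customer rows reuse values from the sample output above, while the order rows, the table layout, the function name underage_orders, and the cutoff date are all assumptions made for illustration. In practice you would read the rows from the CSV files instead.

```python
import sqlite3
from datetime import date

# Illustrative rows only; in practice these would be read from the CSV files
customers = [
    (52287029, 'Jay Brown', '2004-08-20'),
    (95499535, 'Georges Leclerc', '2021-04-17'),
]
orders = [
    (9000000001, 52287029, '60.82'),
    (9000000002, 95499535, '5.31'),
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT, dob TEXT)')
conn.execute('CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER, cost TEXT)')
conn.executemany('INSERT INTO customers VALUES (?, ?, ?)', customers)
conn.executemany('INSERT INTO orders VALUES (?, ?, ?)', orders)

def underage_orders(conn, as_of, min_age=18):
    # ISO dates sort correctly as strings, so the cutoff can be a plain string.
    # Assumes as_of is not 29 February, which replace() would reject.
    cutoff = as_of.replace(year=as_of.year - min_age).isoformat()
    query = '''
        SELECT o.order_id, c.name, c.dob
        FROM orders AS o
        JOIN customers AS c ON c.cust_id = o.cust_id
        WHERE c.dob > ?
        ORDER BY o.order_id
    '''
    return conn.execute(query, (cutoff,)).fetchall()

flagged = underage_orders(conn, as_of=date(2024, 1, 1))
```

With these rows, only the order belonging to the customer born in 2021 is flagged. From here you can add the credit card table and practice more involved joins.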

And just remember: keep on faking it.

Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.