Synthetic Data Generation: A must-have skill for new data scientists

A brief rundown of methods/packages/ideas to generate synthetic data for self-driven data science projects and deep diving into machine learning methods.



Data generation with arbitrary symbolic expressions

 
While the aforementioned functions are a great starting point, the user has no easy control over the underlying mechanics of the data generation, and the regression output is not a definitive function of the inputs; it is truly random. While this may be sufficient for many problems, one often needs a controllable way to generate these problems from a well-defined function (involving linear, nonlinear, rational, or even transcendental terms).

For example, we may want to evaluate the efficacy of various kernelized SVM classifiers on datasets with increasingly complex separators (linear to non-linear), or to demonstrate the limitations of linear models on regression datasets generated by rational or transcendental functions. This is difficult to do with the scikit-learn functions alone.

Moreover, a user may want to simply input a symbolic expression as the generating function (or as the logical separator for a classification task). There is no easy way to do so using only scikit-learn’s utilities, and one has to write a new function for each new instance of the experiment.

For solving the problem of symbolic expression input, one can easily take advantage of the amazing Python package SymPy, which allows comprehension, rendering, and evaluation of symbolic mathematical expressions up to a fairly high level of sophistication.
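As a minimal sketch of that idea, SymPy can parse a string into a symbolic expression and compile it into a vectorized NumPy function. The expression string and variable names below are illustrative choices, not code from the author's repository.

```python
import numpy as np
import sympy as sp

# Parse a user-supplied string into a symbolic expression.
# The string itself is an illustrative choice.
expr = sp.sympify("x**2 * sin(x)")

# Recover the free symbols so the caller does not have to declare them.
variables = sorted(expr.free_symbols, key=lambda s: s.name)

# Compile the expression into a vectorized function backed by NumPy.
f = sp.lambdify(variables, expr, modules="numpy")

# Evaluate the compiled function on a small grid of inputs.
x_values = np.linspace(0.0, np.pi, 5)
y_values = f(x_values)

print(expr)
print(y_values)
```

Note that `sympify` evaluates the string it is given, so it should only be used on input you trust.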

In one of my previous articles, I laid out in detail how one can build upon the SymPy library to create functions that are similar to those available in scikit-learn but can generate regression and classification datasets from symbolic expressions of a high degree of complexity. Check out that article here and my GitHub repository for the actual code.

Random regression and classification problem generation with symbolic expression
We describe how using SymPy, we can set up random sample generators for polynomial (and nonlinear) regression and…towardsdatascience.com

For example, we can have a symbolic expression as a product of a square term (x²) and a sinusoidal term like sin(x) and create a randomized regression dataset out of that.


Fig: Randomized regression dataset with symbolic expression: x².sin(x)
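A rough sketch of how such a dataset could be produced is shown here. The function name `make_symbolic_regression` and all of its parameters are hypothetical, chosen for illustration; this is not the API from the author's repository. It assumes features are sampled uniformly and Gaussian noise is added to the evaluated expression.

```python
import numpy as np
import sympy as sp


def make_symbolic_regression(expr_str, n_samples=100, low=-5.0, high=5.0,
                             noise_sd=0.5, seed=None):
    """Hypothetical helper: sample features uniformly and evaluate expr_str on them.

    Returns X with shape (n_samples, n_features) and y with shape (n_samples,).
    """
    rng = np.random.default_rng(seed)
    expr = sp.sympify(expr_str)
    # Sort symbols by name so that the column order is deterministic.
    symbols = sorted(expr.free_symbols, key=lambda s: s.name)
    f = sp.lambdify(symbols, expr, modules="numpy")

    X = rng.uniform(low, high, size=(n_samples, len(symbols)))
    # Evaluate the deterministic part, then add Gaussian noise.
    y_clean = np.asarray(f(*X.T), dtype=float)
    y = y_clean + rng.normal(0.0, noise_sd, size=n_samples)
    return X, y


# The expression from the text: a square term times a sinusoidal term.
X, y = make_symbolic_regression("x**2 * sin(x)", n_samples=200, seed=0)
print(X.shape, y.shape)
```

Setting `noise_sd=0.0` returns the exact function values, which is handy for checking a model against a known ground truth.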

Or, one can generate a dataset with a non-linear classification boundary for testing a neural network algorithm. Note, in the figure below, how the user can input a symbolic expression m='x1**2-x2**2' and generate this dataset.


Fig: Classification samples with non-linear separator.
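The same approach can be sketched for classification. The helper below is hypothetical (its name, parameters, and labeling rule are assumptions for illustration, not the author's implementation): it labels a point as class 1 when the expression evaluates to a positive number and class 0 otherwise, with an optional fraction of flipped labels to simulate noise.

```python
import numpy as np
import sympy as sp


def make_symbolic_classification(expr_str, n_samples=100, low=-5.0, high=5.0,
                                 flip_frac=0.0, seed=None):
    """Hypothetical helper: label uniformly sampled points by the sign of expr_str."""
    rng = np.random.default_rng(seed)
    expr = sp.sympify(expr_str)
    # Sort symbols by name so that the column order is deterministic.
    symbols = sorted(expr.free_symbols, key=lambda s: s.name)
    f = sp.lambdify(symbols, expr, modules="numpy")

    X = rng.uniform(low, high, size=(n_samples, len(symbols)))
    # Assumed rule: class 1 where the expression is positive, class 0 otherwise.
    y = (np.asarray(f(*X.T), dtype=float) > 0).astype(int)

    # Optionally flip a fraction of the labels to simulate label noise.
    n_flip = int(flip_frac * n_samples)
    if n_flip:
        idx = rng.choice(n_samples, size=n_flip, replace=False)
        y[idx] = 1 - y[idx]
    return X, y


# The separator expression quoted in the text.
X, y = make_symbolic_classification("x1**2 - x2**2", n_samples=500, seed=0)
print(X.shape, np.bincount(y))
```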

 

Categorical data generation using “pydbgen” library

 
While many high-quality real-life datasets are available on the web for trying out cool machine learning techniques, from my personal experience, I found that the same is not true when it comes to learning SQL.

For data science expertise, a basic familiarity with SQL is almost as important as knowing how to write code in Python or R. But access to a large enough database with real categorical data (such as name, age, credit card, SSN, address, birthday, etc.) is not nearly as common as access to toy datasets on Kaggle that are specifically designed or curated for machine learning tasks.

Apart from beginners in data science, even seasoned software testers may find it useful to have a simple tool that, with a few lines of code, generates arbitrarily large datasets with random (fake) yet meaningful entries.

Enter pydbgen. Read the docs here.

It is a lightweight, pure-Python library that generates random but useful entries (e.g. name, address, credit card number, date, time, company name, job title, license plate number, etc.) and saves them in a Pandas DataFrame object, as a SQLite table in a database file, or in an MS Excel file.
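To make the idea concrete without relying on pydbgen's own API, here is a standard-library-only sketch of the same pattern: generate random but plausible-looking records and store them in a SQLite table that can be queried with SQL. Every name in it (the vocabularies, `fake_record`, `build_table`, the `people` table) is made up for illustration.

```python
import random
import sqlite3

# Tiny illustrative vocabularies; a real generator would draw on far larger lists.
FIRST_NAMES = ["Alice", "Bob", "Carmen", "Deepak", "Elena"]
LAST_NAMES = ["Nguyen", "Okafor", "Smith", "Tanaka", "Weber"]
CITIES = ["Austin", "Denver", "Miami", "Portland", "Seattle"]


def fake_record(rng):
    """Return one (name, city, phone) tuple of random, plausible-looking values."""
    name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
    city = rng.choice(CITIES)
    # Use the 555-01xx range, which is conventionally set aside for fictional numbers.
    phone = f"555-01{rng.randint(0, 99):02d}"
    return name, city, phone


def build_table(conn, n_rows, seed=None):
    """Create the hypothetical 'people' table and fill it with n_rows fake records."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS people (name TEXT, city TEXT, phone TEXT)"
    )
    conn.executemany(
        "INSERT INTO people VALUES (?, ?, ?)",
        (fake_record(rng) for _ in range(n_rows)),
    )
    conn.commit()


# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")
build_table(conn, n_rows=1000, seed=42)

# The table can now be used for SQL practice, e.g. a simple aggregation.
rows_per_city = conn.execute(
    "SELECT city, COUNT(*) FROM people GROUP BY city ORDER BY city"
).fetchall()
print(rows_per_city)
```

Passing a file path instead of `":memory:"` to `sqlite3.connect` would persist the table in a database file.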

Introducing pydbgen: A random dataframe/database table generator
A lightweight Python package for generating random database/dataframe to use in data science, learning SQL, machine…towardsdatascience.com

You can read the article above for more details. Here, I will just show a couple of simple data generation examples with screenshots:


Fig: Generate random names using pydbgen library.

Generate a few international phone numbers:


Fig: Generate random phone numbers using pydbgen library.

Generate a full data frame with random entries of name, address, SSN, etc.:


Fig: Generate full dataframe with random entries using pydbgen library.

 

Summary and conclusion

 
We discussed how critical access to high-quality datasets is for one’s journey into the exciting world of data science and machine learning. Often, the paucity of sufficiently flexible and rich datasets limits one’s ability to dive deep into the inner workings of a machine learning or statistical modeling technique and leaves the understanding superficial.

Synthetic datasets can help immensely in this regard and there are some ready-made functions available to try this route. However, sometimes it is desirable to be able to generate synthetic data based on complex nonlinear symbolic input, and we discussed one such method.

Furthermore, we discussed an exciting Python library that can generate random, realistic datasets for database skill practice and analysis tasks.

The goal of this article was to show that young data scientists need not be bogged down by the unavailability of suitable datasets. Instead, they should search for, or devise for themselves, programmatic solutions to create synthetic data for their learning purposes.

Along the way, they may learn many new skills and open new doors to opportunities.

 
If you have any questions or ideas to share, please contact the author at tirthajyoti[AT]gmail.com. Also, you can check the author’s GitHub repositories for other fun code snippets in Python, R, or MATLAB and machine learning resources. If you are, like me, passionate about machine learning/data science, please feel free to add me on LinkedIn or follow me on Twitter.

 
Bio: Tirthajyoti Sarkar is a semiconductor technologist, machine learning/data science zealot, Ph.D. in EE, blogger and writer.

Original. Reposted with permission.
