DataLang: A New Programming Language for Data Scientists… Created by ChatGPT?

I recently tasked ChatGPT-4 with coming up with a new programming language appropriate for data scientists in their day-to-day tasks. Let's look at the results, and the process of getting there.



Image created by Author with Midjourney

 

This article provides an overview of a project that I gave ChatGPT to run with: the creation of a new data science-oriented programming language. The details are all spelled out below, but for reasons that may become evident in subsequent reading, I wanted to give ChatGPT the opportunity to introduce the language in a captivating way. And it's a doozy. So have a read of this first, and then we can catch up on the other side.

The world of data science is about to experience a seismic shift with the advent of a groundbreaking programming language, custom-built for data scientists by data scientists. In today's post, we'll take you behind the scenes of the inception, development, and realization of this revolutionary language: DataLang.

Imagine a language meticulously crafted to streamline your data science tasks, with a razor-sharp focus on usability, efficiency, and collaboration. A language designed to break down barriers and propel your data-driven insights to unprecedented heights. DataLang is the answer to the prayers of data scientists and analysts who yearn for a more powerful toolset, one tailored exclusively for their unique needs and challenges.

In this riveting tell-all, we will explore the conception of DataLang and its journey to becoming the most sought-after language in the data science domain. Join us as we uncover the exclusive insights, unanticipated challenges, and pivotal moments that brought DataLang to life, narrated by the very team that made it possible.

But beware - this is no ordinary blog post. We'll reveal hidden gems and mind-bending features that promise to intrigue, engage, and leave you craving more. The revelation of DataLang is just the beginning, and trust us when we say, you won't want to miss a single line of code.

So, brace yourselves for a thrilling ride into the future of data science, and let DataLang guide your way. Let the journey begin!

 

At least we didn't start this off by blatantly overselling the concept...

 

The Initial Design

 

I recently tasked ChatGPT-4 with coming up with a new programming language appropriate for data scientists in their day-to-day tasks. Here is the initial prompt used to set the stage for its new assignment:

You are a seasoned computer scientist working on systems programming. A data science team has approached you to derive a new programming language designed and optimized for data-oriented tasks, making it perfectly suitable for data scientists to use in their day to day job. Provide a high-level description of this language; how it would differ from other popular data science languages such as Python, R, and SQL; what features would be most important to such a language; issues that a language might have; and anything else you feel would be of importance. Keep in mind that its standard libraries and built-in functionality should provide the most support possible for day to day data science operations. Write a short program to demonstrate the features, syntax, and functionality of this language as you perceive it.


Language Name: DataLang

High-Level Description

DataLang is a language designed specifically for data-oriented tasks and optimized for performance and ease of use in data science applications. It combines the best features of Python, R, and SQL, along with unique features designed to streamline data science workflows.

Key Differentiating Features

Native data structures: DataLang will have built-in data structures like data frames and tensors, which are commonly used in data science tasks. These structures will have integrated functionality for handling missing values, schema enforcement, and data type coercion.

Lazy evaluation: To improve performance, DataLang will use lazy evaluation for data manipulation tasks, where operations are only executed when necessary. This can save memory and computation time, especially when working with large datasets.

Querying capabilities: Inspired by SQL, DataLang will have built-in querying capabilities that allow users to filter, sort, join, and aggregate data without needing an external library.

Integrated visualization: DataLang will have a built-in visualization library that supports various chart types and is easily customizable. The visualization capabilities will integrate seamlessly with the native data structures.

Functional and parallel programming: DataLang will encourage functional programming practices and support parallel processing for better performance on large-scale data tasks.

Extensive libraries: DataLang will have extensive standard libraries focused on statistical analysis, machine learning, deep learning, natural language processing, and time-series analysis.

Interoperability: DataLang will have built-in support for interoperability with other popular languages like Python and R, allowing users to leverage existing libraries and codebases.

Possible Issues

Adoption: As a new language, DataLang may face challenges in attracting users and building a strong community around it.

Compatibility: Ensuring compatibility with existing tools, libraries, and platforms will be crucial for the success of DataLang.

Performance optimization: Balancing ease of use and performance optimization will be an ongoing challenge.
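
The lazy evaluation item in the feature list above is easier to picture with a small example. What follows is a minimal Python sketch, not DataLang code: the LazyFrame class, its filter, with_column, and collect methods, and the sample rows are all hypothetical names chosen for illustration. It records each operation and runs nothing until a result is requested.

```python
from typing import Callable, Iterable, Iterator

Row = dict[str, float]
Step = Callable[[Iterator[Row]], Iterator[Row]]


class LazyFrame:
    """Hypothetical toy pipeline: steps are recorded, not run, until collect()."""

    def __init__(self, rows: Iterable[Row], steps: tuple[Step, ...] = ()) -> None:
        self._rows = rows
        self._steps = steps

    def _extend(self, step: Step) -> "LazyFrame":
        # Return a new pipeline instead of mutating this one.
        return LazyFrame(self._rows, self._steps + (step,))

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyFrame":
        return self._extend(lambda rows: (r for r in rows if predicate(r)))

    def with_column(self, name: str, fn: Callable[[Row], float]) -> "LazyFrame":
        return self._extend(lambda rows: ({**r, name: fn(r)} for r in rows))

    def collect(self) -> list[Row]:
        # Only here are the recorded steps chained together and executed.
        rows: Iterator[Row] = iter(self._rows)
        for step in self._steps:
            rows = step(rows)
        return list(rows)


inspected: list[float] = []  # records which rows the predicate has looked at


def is_large(row: Row) -> bool:
    inspected.append(row["value"])
    return row["value"] > 10


pipeline = (
    LazyFrame([{"value": 5.0}, {"value": 20.0}, {"value": 30.0}])
    .filter(is_large)
    .with_column("doubled", lambda r: r["value"] * 2)
)
inspected_before_collect = list(inspected)  # still empty: nothing has run yet
result = pipeline.collect()
```

Building the pipeline touches no rows; the predicate only runs once collect() is called, which is the behaviour the feature description is aiming at.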

 

Questioning Some Design Choices

 
I had the following immediate questions for ChatGPT.

1. Why the decision for a functional paradigm?
2. What would the machine learning and statistical modeling libraries look like and how would they be structured?
3. Can you provide a more exhaustive list of data structures it would support? Would it support the creation of new data structures in any way? If so, how?
4. Does the language run a risk of trying to be too many things for too many people by including diverse functionality in standard libraries such as machine learning and data visualization?

 

The decision for the functional paradigm: Functional programming is chosen because it promotes immutability, statelessness, and the use of pure functions, which are advantageous in data science workflows. These properties make it easier to reason about the code, test, and parallelize operations, which are essential when working with large datasets or computationally intensive tasks like machine learning.

Machine learning and statistical modeling libraries in DataLang would follow a modular and consistent API, with separate modules for different tasks like linear regression, classification, clustering, and deep learning. The design would be inspired by popular libraries like scikit-learn, TensorFlow, and PyTorch, but with optimizations and simplifications for data science workflows.

DataLang would support the following data structures natively:

DataFrame: A tabular data structure with labeled columns, similar to a spreadsheet or SQL table.
Series: A one-dimensional array-like data structure, akin to a single column of a DataFrame.
Tensor: A multidimensional array-like data structure, used for more complex data manipulation, especially in deep learning.
Graph: A data structure to represent relationships between entities in the form of nodes and edges.
TimeSeries: A specialized DataFrame for handling time-stamped data.

DataLang would allow the creation of new data structures by providing an extensible and customizable class system. Users can define their data structures by subclassing existing ones or creating entirely new classes, utilizing built-in data structures and methods for their specific use cases.

The language might run a risk of trying to be too many things for too many people, but this risk can be mitigated by focusing on providing essential and widely-used functionality in the standard libraries while maintaining a modular design. This would allow users to choose and utilize only the modules they need, minimizing the overhead and complexity.
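
The argument for the functional paradigm above (immutability, pure functions, easier parallelization) can be sketched in plain Python. This is an illustration under assumptions rather than anything DataLang specifies: the Reading record and the to_fahrenheit function are hypothetical, and a thread pool is used only to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Reading:
    """Hypothetical immutable record; attributes cannot be reassigned."""
    sensor: str
    value: float
    unit: str = "C"


def to_fahrenheit(reading: Reading) -> Reading:
    """Pure function: builds a new Reading and leaves its input untouched."""
    return replace(reading, value=reading.value * 9 / 5 + 32, unit="F")


readings = (Reading("a", 0.0), Reading("b", 100.0))

# Because to_fahrenheit has no side effects, the sequential and the
# concurrent map must produce the same result in the same order.
sequential = list(map(to_fahrenheit, readings))
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel = list(pool.map(to_fahrenheit, readings))
```

Since no function mutates shared state, swapping map for pool.map needs no locks and no other changes, which is the property the response is pointing to.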

 

I am not bothering to share the code snippet requests and provided responses integrated into the above conversation. The reason for this is that the syntax changed in noticeable ways during our back and forth.

When I criticized ChatGPT about the combination of functional and object-oriented paradigms related to its initial variable assignment syntax and lack of distinction between those which were mutable and immutable, ChatGPT responded as follows:

In DataLang, we can support both mutable and immutable variables by introducing two keywords: let and const. The let keyword will be used for declaring mutable variables, while the const keyword will be used for declaring immutable variables. Using these keywords, developers can clearly define the intended behavior of a variable, promoting safer and more predictable code.

Here's an example of how the updated syntax would look like:

# Mutable variable declaration
let mutable_var = 5

# Immutable variable declaration
const immutable_var = 10

# Attempting to change the value of an immutable variable
immutable_var = 20  # This would raise a compilation error or runtime exception

 

This was evidence that ChatGPT was open to constructive criticism when it came to technical design of DataLang.
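
To make the let and const rule concrete, here is a minimal Python sketch of how a runtime might enforce it. The Scope class and its method names are hypothetical, and raising an error on redeclaration is my own assumption; ChatGPT's description only says that reassigning a const should fail.

```python
class Scope:
    """Hypothetical variable scope mimicking the described let/const rules."""

    def __init__(self) -> None:
        self._values: dict[str, object] = {}
        self._constants: set[str] = set()

    def _declare(self, name: str, value: object) -> None:
        if name in self._values:
            # Assumption: redeclaring a name in the same scope is an error.
            raise NameError(f"'{name}' is already declared")
        self._values[name] = value

    def let(self, name: str, value: object) -> None:
        """Declare a mutable variable."""
        self._declare(name, value)

    def const(self, name: str, value: object) -> None:
        """Declare an immutable variable."""
        self._declare(name, value)
        self._constants.add(name)

    def assign(self, name: str, value: object) -> None:
        """Reassign an existing variable, rejecting const targets."""
        if name not in self._values:
            raise NameError(f"'{name}' is not declared")
        if name in self._constants:
            raise TypeError(f"cannot reassign const '{name}'")
        self._values[name] = value

    def get(self, name: str) -> object:
        return self._values[name]


scope = Scope()
scope.let("mutable_var", 5)
scope.const("immutable_var", 10)
scope.assign("mutable_var", 6)  # allowed: declared with let
```

Calling scope.assign("immutable_var", 20) raises a TypeError, which corresponds to the "runtime exception" option in the response; a compiler could make the same check before the program ever runs.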

 

Further Changes

 
I had the following questions and comments for ChatGPT regarding further design decisions.

With a data-focused language, it only makes sense that we should be able to enforce data typing or, at the very least, data hinting. I wondered what ChatGPT thought of this.

Should we allow type hinting or type setting? If so, please provide some code to demonstrate. If not, please explain why.

 

Yes, allowing type hinting or type setting in DataLang would be beneficial. Type hinting helps improve code readability and maintainability by making the variable types explicit. It also allows for better error checking, either at compile-time or during runtime, reducing the likelihood of type-related bugs.

In DataLang, we can use a similar syntax to TypeScript or Python's type hinting with a colon and the type.

By supporting type hinting in DataLang, we can help developers write more robust and self-documenting code, improving the overall development experience for data science tasks.
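
For readers who have not used the colon-style hints the response refers to, this is roughly what they look like in Python, along with one way to check them at runtime. The enforce_hints decorator and the column_mean function are hypothetical helpers written for this sketch. Python itself does not enforce hints, so the check here is opt-in and only handles plain class annotations such as int or list.

```python
import functools
import inspect
from typing import get_type_hints


def enforce_hints(func):
    """Hypothetical decorator: check arguments against simple class hints."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # Only plain classes are checked; complex hints are skipped.
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@enforce_hints
def column_mean(values: list, precision: int = 2) -> float:
    return round(sum(values) / len(values), precision)


row_count: int = 4  # variable annotation: name, colon, type
average = column_mean([1, 2, 3, 4])
```

Passing precision="2" raises a TypeError at call time, which is the runtime half of the "compile-time or during runtime" error checking mentioned above.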

 

After seeing an example of a class in DataLang, I had the following comment.

I don't like the "def __init__" syntax. Can we replace it with something that explicitly states that this is a constructor? Please update the code above to reflect this change.

 

Certainly, we can replace the def __init__ syntax with a more explicit keyword like constructor to define the constructor for a class. This will make the code more readable and easier to understand.

With the constructor keyword, the code becomes more explicit and easier to understand, enhancing the developer experience when working with custom classes and data structures in DataLang.

 

Once I saw the updated class code, I had the following concern, relating back to the variable assignment issue above. The initial class code example also did not have any getter or setter methods, and so I wanted to ensure that ChatGPT was taking an acceptable approach to class variable access.

The self.name = name assignment (and others like it) don't seem to follow the let and const methods of setting variables. Is this problematic? I think a uniform approach would be better. This could set immutables for class instances as well. Also, how would updates to object variables be accomplished? Please update the above code to reflect my initial concern, and then add some code to show how an update to a class variable would be accomplished.

 

You are correct that having a uniform approach for assigning variables would be beneficial. To maintain consistency with the let and const keywords, we can declare instance variables using these keywords inside the constructor. For updating object variables, we can use setter methods.
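
The pattern described here (fixed instance variables plus setter methods for the ones that may change) has a close analogue in Python properties. The sketch below is plain Python, not DataLang. The name attribute comes from the self.name = name example in my question, while the Person class and the age attribute with its validation rule are hypothetical.

```python
class Person:
    """Hypothetical class: name behaves like const, age behaves like let."""

    def __init__(self, name: str, age: int) -> None:
        self._name = name  # no setter is defined, so it is read-only
        self._age = age    # a setter is defined, so it can be updated

    @property
    def name(self) -> str:
        return self._name

    @property
    def age(self) -> int:
        return self._age

    @age.setter
    def age(self, value: int) -> None:
        # The setter is the single place where updates are validated.
        if value < 0:
            raise ValueError("age must be non-negative")
        self._age = value


person = Person("Ada", 36)
person.age = 37  # goes through the setter
```

Assigning to person.name raises an AttributeError because no setter exists for it, which is the kind of instance-level immutability I was asking about.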

 

Below is an example code snippet taking everything thus far into consideration. Interestingly, prior to this snippet, ChatGPT was identifying snippets as a variety of different programming languages. This is the first time that ChatGPT has identified generated code as 'datalang':

 

DataLang code snippet generated by ChatGPT, and identified by ChatGPT as 'datalang' code

 

Further Code Examples

 
I then wanted some more code to see how things looked in action.

Can you provide some sample code to create a new data structure?

 

# Import the built-in DataFrame
import datalang.datastructures as dds

# Create the custom data structure
class CustomDataFrame(dds.DataFrame):
    constructor(self, data: dict):
        super().constructor(data)

    def custom_summary(self) -> str:
        let num_columns: int = len(self.columns)
        let num_rows: int = len(self)
        return f"CustomDataFrame with {num_columns} columns and {num_rows} rows."

# Instantiate the custom data structure
let data_dict: dict = {"A": [1, 2, 3], "B": [4, 5, 6]}
let custom_df: CustomDataFrame = CustomDataFrame(data_dict)

# Use the custom method
print(custom_df.custom_summary())

 

Please write code that would create a data ingestor class and perform relevant tasks.

 

import datalang.io as dio
import datalang.datastructures as dds

class DataIngestor:
    # Supported file types
    const SUPPORTED_FILE_TYPES: list[str] = ["csv", "json"]

    def read_data(self, file_path: str) -> dds.DataFrame:
        # Extract file extension
        file_ext = file_path.split(".")[-1].lower()

        if file_ext not in self.SUPPORTED_FILE_TYPES:
            raise ValueError(f"Unsupported file type: {file_ext}")

        if file_ext == "csv":
            return dio.read_csv(file_path)
        elif file_ext == "json":
            return dio.read_json(file_path)

# Instantiate the DataIngestor class
data_ingestor = DataIngestor()

# Read CSV and JSON files
csv_data = data_ingestor.read_data("data.csv")
json_data = data_ingestor.read_data("data.json")
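
For comparison, roughly the same ingestor can be written today in Python using only the standard library. This is a sketch under assumptions: rows come back as lists of dictionaries rather than a DataFrame, the JSON file is assumed to hold an array of objects, and the temporary files exist only so the example is self-contained.

```python
import csv
import json
import tempfile
from pathlib import Path


class DataIngestor:
    """Standard-library analogue of the DataLang ingestor above."""

    SUPPORTED_FILE_TYPES = ("csv", "json")

    def read_data(self, file_path: str) -> list[dict]:
        # Extract the extension without the leading dot, lower-cased.
        file_ext = Path(file_path).suffix.lstrip(".").lower()
        if file_ext not in self.SUPPORTED_FILE_TYPES:
            raise ValueError(f"Unsupported file type: {file_ext}")

        with open(file_path, newline="", encoding="utf-8") as handle:
            if file_ext == "csv":
                # csv.DictReader yields one dict per row; values stay strings.
                return list(csv.DictReader(handle))
            # Assumption: the JSON document is an array of objects.
            return json.load(handle)


# Write two small sample files so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "data.csv"
    csv_path.write_text("a,b\n1,2\n", encoding="utf-8")
    json_path = Path(tmp) / "data.json"
    json_path.write_text(json.dumps([{"a": 1, "b": 2}]), encoding="utf-8")

    data_ingestor = DataIngestor()
    csv_data = data_ingestor.read_data(str(csv_path))
    json_data = data_ingestor.read_data(str(json_path))
```

Note that the CSV values come back as strings while the JSON values keep their types, which is exactly the kind of type coercion a built-in DataFrame would be expected to handle.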

 

Write a DataLang script to read in a file named "dataset.csv", split it into train and test sets, train a naive Bayes model, test it on the test data, and graph some results, only using DataLang's syntax, commands, and internal libraries.

 

import datalang.io
import datalang.data
import datalang.ml
import datalang.viz

# Read the dataset
let dataset: DataFrame = io.read_csv("dataset.csv")

# Split the dataset into train and test sets
let (train_data, test_data): (DataFrame, DataFrame) = data.train_test_split(dataset, ratio=0.8)

# Train a Naive Bayes model
let model: NaiveBayes = ml.NaiveBayes()
model.train(train_data)

# Test the model on the test data
let predictions: Array

 

I think we can agree this is a fairly succinct syntax for such tasks.
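
To give a sense of what the library calls in that script would have to do under the hood, here is a minimal pure-Python sketch of the same workflow. Several things are my assumptions rather than anything ChatGPT specified: the Gaussian variant of naive Bayes, the synthetic two-cluster data standing in for dataset.csv, the 1e-9 variance floor, and the fixed random seeds. The plotting step is left out because the standard library has no charting module.

```python
import math
import random
from collections import defaultdict
from statistics import mean, pvariance

Sample = tuple[list[float], str]  # (features, label)


def train_test_split(rows: list[Sample], ratio: float = 0.8, seed: int = 0):
    """Shuffle a copy of the rows and cut it at the given ratio."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


class GaussianNaiveBayes:
    """Assumed variant: each feature is modelled as a per-class Gaussian."""

    def train(self, rows: list[Sample]) -> None:
        by_label: dict[str, list[list[float]]] = defaultdict(list)
        for features, label in rows:
            by_label[label].append(features)
        self.priors = {lbl: len(items) / len(rows) for lbl, items in by_label.items()}
        # Per class, store (mean, variance) for every feature column.
        self.stats = {
            lbl: [(mean(col), pvariance(col) + 1e-9) for col in zip(*items)]
            for lbl, items in by_label.items()
        }

    def predict(self, features: list[float]) -> str:
        def log_posterior(label: str) -> float:
            score = math.log(self.priors[label])
            for x, (mu, var) in zip(features, self.stats[label]):
                score += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
            return score

        return max(self.priors, key=log_posterior)


# Synthetic stand-in for dataset.csv: two well-separated clusters.
rng = random.Random(42)
dataset: list[Sample] = (
    [([rng.gauss(0, 1), rng.gauss(0, 1)], "low") for _ in range(50)]
    + [([rng.gauss(5, 1), rng.gauss(5, 1)], "high") for _ in range(50)]
)

train_data, test_data = train_test_split(dataset, ratio=0.8)
model = GaussianNaiveBayes()
model.train(train_data)
predictions = [model.predict(features) for features, _ in test_data]
accuracy = sum(p == label for p, (_, label) in zip(predictions, test_data)) / len(test_data)
```

Everything this sketch spells out by hand is what the DataLang version hides behind a handful of calls such as data.train_test_split and ml.NaiveBayes.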

 

But Wait... There's More!

 
I figured data scientists would need a "getting started" tutorial, so why not get ChatGPT to write one?

How about writing a getting started for data scientists looking to learn DataLang?

 

I immediately amended this to be nicely formatted as a professional-looking HTML document, which you can find here.

That's all fine and dandy, but what about researchers who might be looking for a more technical overview of the language, like what you might find in a research paper. Guess who I thought would be a good candidate to write such a paper?

First I asked ChatGPT to write an outline of just such a paper, and I then had it generate the actual paper, section by section. But what good is a text version of a research paper? "Hey, ChatGPT, regenerate that same report using LaTeX."

You can read the full DataLang paper here. Is it any good? I'll let you judge for yourself.

Will DataLang need a logo? All great programming languages have an identifiable visual representation, so why should DataLang be any different?

Designing a logo for DataLang requires a combination of creativity and symbolism that represents the essence of the programming language. Here's a suggestion for a logo concept:

Base: Utilize a stylized "D" and "L" for DataLang, which could be combined or overlapped in a visually appealing and unique way.

Data representation: Integrate a graphical element that represents data, such as a bar chart, line chart, or scatter plot. This can be subtle, but it should be recognizable.

Code aspect: Include a code-like element in the design, such as curly braces or angle brackets, to emphasize the programming language aspect of DataLang.

Color scheme: Use a combination of colors that are associated with technology and innovation, such as shades of blue and green, or a more sophisticated color palette with greyscale and a pop of color for emphasis.

Keep in mind that this is just a suggestion, and a professional designer or design team can further develop the idea and create a final, polished logo that captures the spirit and essence of DataLang.

 

These seem like solid suggestions. I didn't craft a prompt around any of this to generate a logo, but I'm sure I could have had ChatGPT give it a try.

 

Why Stop There?

 
Well, we (and by "we" I mean ChatGPT) designed a language and some relevant artefacts surrounding its potential implementation. Why give up now?

Create a plan to build a DataLang compiler using the Python SLY library.

 

Install the SLY library: First, make sure you have the SLY library installed. You can install it using pip:

pip install sly

Define the lexer: Create a Python file named datalang_lexer.py and import the Lexer class from the sly module. Define a class named DataLangLexer that inherits from sly.Lexer. Inside this class, define your lexer rules as class attributes using regular expressions. For each token, also define a method that will be called when the lexer matches the token in the input.

Define the parser: Create another Python file named datalang_parser.py and import the Parser class from the sly module. Define a class named DataLangParser that inherits from sly.Parser. Inside this class, define your parser rules as methods. These methods will have names derived from your non-terminal symbols and will contain the appropriate production rules.

Define the abstract syntax tree (AST): Create another Python file named datalang_ast.py. In this file, define classes for each AST node type, such as expressions, statements, declarations, etc. These classes will be used to build the AST when parsing the DataLang code.

Modify the parser to build the AST: Update the parser methods in datalang_parser.py to build and return AST nodes using the classes you defined in datalang_ast.py.

Create an interpreter or code generator: Depending on your goal (interpreting the code or generating code for another language), create a Python file named datalang_interpreter.py or datalang_codegen.py. Implement a visitor pattern or another suitable pattern to walk the AST and either interpret the code or generate target code.

Create a driver program: Create a Python file named datalang_compiler.py that ties everything together. In this file, instantiate the lexer and parser classes and use them to lex and parse the DataLang source code. Then, use the interpreter or code generator to process the AST.

Run the compiler: Finally, run the datalang_compiler.py script with a DataLang source file as input, and observe the output or the generated code.

These are the high-level steps to build a compiler using the Python SLY library. Each step requires a deep understanding of DataLang's syntax and semantics, as well as how to leverage the SLY library effectively.
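
The plan above can be compressed into a toy pipeline that shows how the pieces fit together. To keep it runnable without extra installs, this sketch uses Python's built-in re module in place of SLY, so it is not the SLY API; with SLY, the tokenize and parse functions would become Lexer and Parser subclasses. The grammar is my own simplification: it covers only let, const, and plain reassignment of numeric literals, and all function and class names are hypothetical.

```python
import re
from dataclasses import dataclass

# Step "define the lexer": token patterns, tried in order.
TOKEN_SPEC = [
    ("COMMENT", r"#.*"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("NAME", r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("SKIP", r"[ \t]+"),
    ("MISMATCH", r"."),
]
TOKEN_RE = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))
KEYWORDS = {"let", "const"}


@dataclass(frozen=True)
class Token:
    kind: str
    text: str


def tokenize(line: str) -> list[Token]:
    tokens = []
    for match in TOKEN_RE.finditer(line):
        kind, text = match.lastgroup, match.group()
        if kind in ("SKIP", "COMMENT"):
            continue
        if kind == "MISMATCH":
            raise SyntaxError(f"unexpected character {text!r}")
        if kind == "NAME" and text in KEYWORDS:
            kind = text.upper()  # keywords become LET / CONST tokens
        tokens.append(Token(kind, text))
    return tokens


# Step "define the AST": one node type per statement form.
@dataclass(frozen=True)
class Declaration:
    name: str
    value: float
    mutable: bool


@dataclass(frozen=True)
class Assignment:
    name: str
    value: float


def to_number(text: str) -> float:
    return float(text) if "." in text else int(text)


# Step "define the parser": map token sequences to AST nodes.
def parse(line: str):
    tokens = tokenize(line)
    kinds = [token.kind for token in tokens]
    if kinds in (["LET", "NAME", "ASSIGN", "NUMBER"], ["CONST", "NAME", "ASSIGN", "NUMBER"]):
        return Declaration(tokens[1].text, to_number(tokens[3].text), kinds[0] == "LET")
    if kinds == ["NAME", "ASSIGN", "NUMBER"]:
        return Assignment(tokens[0].text, to_number(tokens[2].text))
    raise SyntaxError(f"cannot parse: {line!r}")


# Step "create a code generator": walk the nodes and emit Python source.
def compile_to_python(source: str) -> str:
    constants: set[str] = set()
    output = []
    for number, line in enumerate(source.splitlines(), start=1):
        if not tokenize(line):
            continue  # blank or comment-only line
        node = parse(line)
        if isinstance(node, Assignment) and node.name in constants:
            raise SyntaxError(f"line {number}: cannot reassign const '{node.name}'")
        if isinstance(node, Declaration) and not node.mutable:
            constants.add(node.name)
        output.append(f"{node.name} = {node.value!r}")
    return "\n".join(output)


program = """
# Mutable variable declaration
let mutable_var = 5
const immutable_var = 10
mutable_var = 6
"""
python_source = compile_to_python(program)
```

Appending the line immutable_var = 20 to the program makes compile_to_python raise a SyntaxError, which is the compile-time flavour of the const check discussed earlier.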

 

This is beyond the scope of what I intended when I started this endeavour a couple of hours ago. (Oh yeah, did I mention this has only taken a couple of hours so far? In grad school I did an entire 3 credit independent study solely dedicated to creating a programming language and implementing its compiler, and it took me 2+ months to get this far.) Who knows though? Maybe I'll revisit this at some time for a laugh.

I did, however, have ChatGPT create the grammar in the form of a SLY lexer, which you can find here.

 

Final Thoughts

 
Does data science need another programming language, especially one that doesn't have an immediate selling point over the other commonplace languages at the moment? No. Is DataLang that language, even if another was needed? Also, no.

But this was an interesting way to test out ChatGPT-4's capabilities and have a little bit of fun. I have a viable pipeline for ChatGPT to be able to generate some technical code and accompanying artefacts, so if I come up with a better (read: useful) idea, maybe I'll give it another shot.

I hope you found something useful here.

 
 
Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.