My AI Plays Piano for Me

Training an RNN with a Combined Loss Function.

By Kathrin Melcher, Data Scientist at KNIME

How about generating some Schumann-inspired music? Is music generation even something that a deep neural network can do? The goal of this project is to artificially generate piano music.

Neural networks are often used to solve either a classification or a regression problem. Music, on the other hand, is a sequence of notes, each note played for a given duration. The notes can be interpreted as classes and their durations are numerical values. Therefore generating music is equivalent to generating a sequence of both: a class and a numerical value. This means that the neural network must produce two outputs at each time step: one note, aka a classification problem, and its corresponding duration, aka a regression problem.

So let’s find out how we can define, train and apply a recurrent neural network (RNN) that produces multiple outputs by using a combined loss function to train the model.

Networks with multiple outputs are also used, for example, for object detection in computer vision. There the outputs are the class of the object as well as the positions of the corners of the bounding box.

Recurrent Neural Networks (RNNs) are a family of neural networks that are especially powerful when working on sequential data. As music is a sequence of notes, it makes sense to use an RNN for generating piano music. In case you are new to RNNs and LSTMs, I recommend reading the following two blog posts: “Once Upon A Time …” by LSTM Network and “Understanding LSTM Networks”.

The first step is to access some music files and to find a reasonable numerical representation for music.


Convert Music into Numbers

Robert Schumann was one of the greatest composers of the Romantic era. I’m going to use his music as the starting point for our artificial music generation. A set of his piano pieces is available on GitHub as midi files.

One of the biggest challenges for me was to find a way to convert the midi files into a numerical representation and back again. The numerical representation is needed to train the network, and the midi files are needed to share the final music with you (keep reading to the end to listen to some “Schumann-inspired AI music”).

At first sight, music is simply a sequence of notes with different durations. But this is only half of the truth, as sometimes a note starts while the previous one is still being played. Therefore a third feature is needed to encode music, which is called offset. The offset measures the time between the beginning of the musical piece and the start of the associated note.

As the RNN is already handling the sequential structure of the input data, the offset doesn’t fit as an absolute measure. So instead of using the absolute offset, I decided to use the relative offset, which is the difference between the current and the previous note’s offset. This allows the network to start a note while another note is still playing and to encode chords.
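As a minimal sketch of this conversion, the snippet below turns absolute offsets into relative offsets in plain Python. The function name and the example values are illustrative assumptions, and so is the choice to treat the first note's relative offset as its absolute offset.

```python
def to_relative_offsets(absolute_offsets):
    """Return the difference between each offset and the previous one.

    Assumption of this sketch: the first note has no predecessor,
    so its relative offset is simply its absolute offset.
    """
    relative = []
    previous = 0.0
    for offset in absolute_offsets:
        relative.append(offset - previous)
        previous = offset
    return relative


# Two notes starting together at 0.0 (a chord), then notes at 0.5 and 1.5.
absolute = [0.0, 0.0, 0.5, 1.5]
relative = to_relative_offsets(absolute)
print(relative)  # [0.0, 0.0, 0.5, 1.0]
```

A relative offset of 0 means the note starts together with the previous one, which is how chords show up in this encoding.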

So to summarize, the following three features can be used to represent a piano piece:

  • The notes
  • their durations and
  • their relative offset

To extract these three features from the midi files, I use the Python library Music21, which was developed by researchers at MIT and its community. This step is performed with KNIME Analytics Platform using the Python integration.

The raw extracted data is then cleaned and preprocessed via a KNIME workflow. E.g., notes are encoded via an index based on a dictionary with values between 1 and 78. See the Appendix for more information about the data preparation workflow.

Note. During training the index encoding of the notes is converted into a one-hot encoding by the Keras Network Learner node.
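To illustrate what a one-hot encoding of an index looks like, here is a small plain-Python sketch. It is not the Keras Network Learner node's implementation; the function name, the toy vocabulary size of 4, and the zero-based positions are assumptions made for the example.

```python
def one_hot(index, num_classes):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    if not 0 <= index < num_classes:
        raise ValueError(f"index {index} is outside 0..{num_classes - 1}")
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


# Toy example: three notes from a hypothetical vocabulary of 4 classes.
note_indices = [2, 0, 3]
encoded = [one_hot(i, num_classes=4) for i in note_indices]
for row in encoded:
    print(row)
```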

Figure 1 shows you a subset of the numerical representation of a song after the preprocessing. The first row, for example, tells us that the note encoded with the integer value 22 is played for 0.25 of a quarter note and that it starts at the same time as the previous note, because its relative offset is 0.

Figure 1: Numerical representation of a music piece encoded as a sequence of notes with their durations and relative offset.


Next, we need to think about a suitable RNN architecture, as this impacts the further data preprocessing.


Find a Suitable RNN Architecture

I decided to go for a many-to-many RNN architecture. That is a network that during training is fed with both input and target sequences. During the deployment it is then sufficient to provide just the start token to generate new music.

Another option could be to train the network with a many-to-one approach. In this case during deployment a start sequence is necessary, e.g. the first 100 notes.

Let’s build the network structure step by step and start with predicting only the notes. Predicting the next note in a sequence is a multiclass classification problem. Therefore an RNN layer, e.g. an LSTM layer, followed by a Softmax layer is a simple and suitable network architecture.

Figure 2 shows this possible network structure in the rolled and unrolled representation, where st represents the start token, n₁, n₂, … the sequence of original notes, and n̂₁, n̂₂, … the sequence of generated probability distributions based on which the next note is picked.

Note. The start token st is only used for sequence snippets that include the beginning of a song.

Figure 2: Rolled and unrolled representation of an RNN that can learn to predict a sequence of notes.


At each step, the Softmax layer uses the current output from the LSTM layer to produce the probability distribution for the next note in the output sequence.

The network structure could be extended by adding more hidden layers between the LSTM Layer and the Softmax layer, e.g. a second LSTM Layer or some hidden dense layer. This could improve the performance of the network.

To train the network, an input sequence [st, n₁, n₂, …] and a target sequence [n₁, n₂, n₃, …] are necessary. The target sequence is just the input sequence shifted by one note. This means that during training the next true note is used as input for the next step rather than a note picked based on the probability distribution from the previous step. This training approach is called teacher forcing.
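The shift between input and target sequence can be sketched in a few lines of Python. The start token index 0 and the function name are hypothetical choices for this illustration, not values taken from the workflow.

```python
START_TOKEN = 0  # hypothetical index reserved for the start token


def make_teacher_forcing_pair(notes):
    """Build an (input, target) pair where the target is the input shifted by one."""
    inputs = [START_TOKEN] + notes[:-1]
    targets = list(notes)
    return inputs, targets


notes = [22, 31, 31, 40]
inputs, targets = make_teacher_forcing_pair(notes)
print(inputs)   # [0, 22, 31, 31]
print(targets)  # [22, 31, 31, 40]
```

At every position the input holds the true previous note, so the network never sees its own (possibly wrong) predictions during training.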

To train the network the loss function categorical cross entropy can be used. For an RNN it is the sum over the categorical cross entropy losses at each step.

Let’s now add the duration and the relative offset as additional inputs and outputs to the network.

Both the duration and the relative offset are non-negative numerical values. Therefore their outputs can be created by a dense layer with the activation function ReLU. Figure 3 shows you the updated network structure in the unrolled representation. Here a three-dimensional sequence [(st_d, st_n, st_o), (d₁, n₁, o₁), (d₂, n₂, o₂), …] feeds the network and a three-dimensional sequence is generated.

Figure 3: Unrolled representation of an RNN that learns to predict a sequence of notes with their durations and relative offset.


At each step the LSTM unit now has three inputs: a note, its duration, and its offset difference. Then the output of the LSTM layer at each time step is processed by three additional layers: a Softmax layer to generate the probability distribution based on which the next note is picked, a ReLU layer to predict its duration, and another ReLU layer to predict its relative offset.

To train the network to predict all three features, the loss function needs to take into account the categorical cross entropy loss for the notes as well as the mean squared error between the predicted and true values for duration and offset difference. Therefore the loss function becomes the sum of the three losses at each step.
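As a rough illustration of such a combined loss, the sketch below adds up the categorical cross entropy for the note and the squared errors for duration and offset difference at each step. All function names and the toy values are assumptions; with a single value per step the mean squared error reduces to the squared error, and a deep learning framework may average over steps instead of summing, which only changes the scale.

```python
import math


def categorical_cross_entropy(true_index, predicted_probs):
    """Negative log-probability assigned to the true class."""
    return -math.log(predicted_probs[true_index])


def squared_error(true_value, predicted_value):
    """Squared difference between a true and a predicted value."""
    return (true_value - predicted_value) ** 2


def combined_loss(steps):
    """Sum the three per-step losses over all steps of a sequence."""
    total = 0.0
    for step in steps:
        total += categorical_cross_entropy(step["note"], step["note_probs"])
        total += squared_error(step["duration"], step["pred_duration"])
        total += squared_error(step["offset"], step["pred_offset"])
    return total


# One toy step: the true note is class 1, predicted with probability 0.5.
steps = [{
    "note": 1, "note_probs": [0.25, 0.5, 0.25],
    "duration": 0.5, "pred_duration": 1.0,
    "offset": 0.0, "pred_offset": 0.5,
}]
loss = combined_loss(steps)
print(round(loss, 4))  # 1.1931
```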

Finally to train the network, three input and three target sequences are needed. Again the target sequences are the input sequences shifted by one step.

The lower part in the preparation workflow described in the Appendix creates the input and target sequences, using a sequence length of 100 steps.

Let’s now see how this network structure can be built in KNIME Analytics Platform.


Define the RNN Structure

Figure 4 shows the training workflow. The brown Keras nodes define the network architecture.

Figure 4: This workflow defines the recurrent neural network architecture and trains it using the prepared input and output sequences.


The workflow has five Keras Input Layer nodes:

  • One for the input sequence of notes.
    (Input shape = “?, <number of notes>”)

  • One for the input sequence of durations.
    (Input shape = “?, 1”)

  • One for the input sequence of offset differences.
    (Input shape = “?, 1”)

  • And two inputs for the initial hidden states of the LSTM units. Defining the initial states as inputs is optional, but makes it easier to use the model during deployment.
    (Input shape = “<Number of units used in the LSTM layer>”)

Note 1. The “?” is used to allow different sequence lengths as input. During training a sequence length of 100 has been used.
Note 2. Notice the size of the input for the note sequence. While just one number defines the duration and the offset difference, the index-based encoding of the note is converted to a one-hot encoding by the Keras Network Learner node.

The first three inputs are combined using the two Keras Concatenate Layer nodes.

The Keras Concatenate Layer node concatenates two inputs along a specified axis. The inputs must be of the same shape except for the concatenation axis.
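What the two concatenations do to the data can be pictured with plain Python lists. This is only a stand-in for the Keras Concatenate Layer node: the helper name, the toy vocabulary of 3 notes, and the example values are assumptions.

```python
def concatenate_last_axis(first, second):
    """Join two sequences step by step along the feature axis.

    Both sequences must have the same number of steps; only the
    number of features per step may differ.
    """
    if len(first) != len(second):
        raise ValueError("sequences must have the same number of steps")
    return [a + b for a, b in zip(first, second)]


# Two time steps: one-hot notes (3 classes), durations, offset differences.
notes = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
durations = [[0.25], [1.0]]
offsets = [[0.0], [0.5]]

# Mirrors the two concatenation nodes: notes + durations, then + offsets.
notes_and_durations = concatenate_last_axis(notes, durations)
features = concatenate_last_axis(notes_and_durations, offsets)
print(features)  # [[0.0, 1.0, 0.0, 0.25, 0.0], [1.0, 0.0, 0.0, 1.0, 0.5]]
```

Each step ends up with <number of notes> + 2 features, which is what the LSTM layer receives.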

The concatenated tensor is then fed into the Keras LSTM Layer node. It is used with the following setting options:

  • Units: 256
  • Return sequences: Activated
  • Return state: Activated

It is important to activate the Return sequences and Return state checkboxes, as we need the intermediate output states during the training process and the cell state in the deployment.

After the LSTM layer, the network splits into three sub networks, one for each feature. Each subnetwork uses the output of the LSTM node as input to a hidden layer with activation ReLU, implemented via a Keras Dense Layer node.

Finally each subnetwork gets its output layer with its own activation function.

  • Softmax with 79 units for the notes, as we have 79 different notes
  • ReLU with one unit for the duration
  • ReLU with one unit for the offset difference
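To make the two activation functions concrete, here is a small sketch of what they compute for a single time step. The input values are made up, and the dense layers that produce them are left out.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    highest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def relu(value):
    """Clip negative values to zero."""
    return max(0.0, value)


# Toy pre-activation values for one time step (3 note classes only).
note_probs = softmax([2.0, 1.0, 0.1])
duration = relu(-0.3)
offset = relu(0.5)

print(sum(note_probs))   # approximately 1.0
print(duration, offset)  # 0.0 0.5
```

Softmax guarantees a valid probability distribution over the notes, while ReLU guarantees that the predicted duration and offset difference are never negative.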

Tip. Define a “Name Prefix” in the output layers to make it easier to recognize the layers when referring to them within learner and executor nodes.

Finally the three subnetworks are combined using two Keras Collect Layer nodes. The node just collects the different outputs, so all three outputs can be used by the Keras Network Learner node.


Train the Network

The next step is the definition of the combined loss function and the training parameters. If a network has multiple output layers, the loss function for each output layer can be defined in the second tab, the one named Target Data, of the configuration window of the Keras Network Learner node (Figure 5).

Figure 5: Target Data Tab of the Keras Network Learner node.


The third tab, named Options, of the Keras Network Learner node allows you to define the training settings. We used:

  • Epochs: 100
  • Batch Size: 100
  • Shuffle training data before each epoch: activated (this is to avoid overfitting)
  • Optimizer: Adam with default settings

As a last step, the trained network is saved in .h5 format using the Keras Network Writer node.


Let the Network Play Piano

The tension rises. We have trained a model. What will the generated music sound like? Let’s get the Keras model to play some music for us.

Figure 6 shows you the deployment workflow. The key idea is to use the trained network in a recursive loop, where in each iteration the last predicted note, the last predicted duration, the last predicted relative offset as well as the last output hidden states are used as input to make the next predictions.

Figure 6: This workflow uses the trained Keras model to generate piano music and converts it into a midi file.


The workflow starts with reading the Keras model. Next it uses the Keras Set Output Layers node to define the 5 outputs needed as input in each iteration.

Afterwards the network is converted into a TensorFlow model. This has the advantage that the model can be executed using Java instead of Python, which speeds up the execution.

Next the initial inputs are created. This means the start token for the notes, duration and offset difference as well as the initial state vectors.

Then the network is applied a first time using the TensorFlow Network Executor node. Based on the output probability distribution of the softmax layer, the next note is picked. Then for each feature, i.e. the note, the duration and the relative offset, a collection cell is created that contains only the first predicted value. The collection cells are used to collect the results from the different iterations.

Then the recursive loop starts. In the first iteration the outputs from the first TensorFlow Network Executor node are used as input for the TensorFlow Network Executor node within the loop body. Then again a note is picked based on the softmax output and the collection cells are updated. Now in each iteration the network is applied based on the output of the previous iteration.
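The idea of the recursive loop can be sketched in plain Python. The function `dummy_model_step` is a hypothetical stand-in for the trained network, and the names, the integer state, and the fixed duration and offset values are assumptions. The sketch also folds the first network application into the loop instead of running it separately.

```python
import random


def pick_note(probabilities, rng):
    """Sample a note index according to the softmax probabilities."""
    indices = list(range(len(probabilities)))
    return rng.choices(indices, weights=probabilities, k=1)[0]


def generate(model_step, start_input, start_state, num_steps, seed=0):
    """Feed each prediction and state back in as the next input."""
    rng = random.Random(seed)
    notes, durations, offsets = [], [], []
    step_input, state = start_input, start_state
    for _ in range(num_steps):
        probs, duration, offset, state = model_step(step_input, state)
        note = pick_note(probs, rng)
        notes.append(note)
        durations.append(duration)
        offsets.append(offset)
        step_input = (note, duration, offset)  # becomes the next input
    return notes, durations, offsets


def dummy_model_step(step_input, state):
    """Hypothetical stand-in for the network: cycles through 3 notes."""
    state = state + 1
    probs = [0.0, 0.0, 0.0]
    probs[state % 3] = 1.0
    return probs, 0.5, 0.25, state


notes, durations, offsets = generate(
    dummy_model_step, start_input=(0, 0.0, 0.0), start_state=0, num_steps=4
)
print(notes)  # [1, 2, 0, 1]
```

With a real model the probabilities are not one-hot, so sampling with a different seed produces a different piece of music.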

After the loop, the generated music is post-processed by the metanode “Post-processing”. This metanode converts the relative offset into the absolute offset using the Moving Aggregation node and converts the predicted indexes into notes by using the dictionary created during the preprocessing.
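Both post-processing steps can be illustrated with a short Python sketch. The cumulative sum plays the role of the Moving Aggregation node here, and the dictionary entries and example values are hypothetical.

```python
from itertools import accumulate

# Relative offsets as generated by the network (toy values).
relative_offsets = [0.0, 0.0, 0.5, 1.0]

# A cumulative sum turns relative offsets back into absolute offsets.
absolute_offsets = list(accumulate(relative_offsets))
print(absolute_offsets)  # [0.0, 0.0, 0.5, 1.5]

# Hypothetical excerpt of the index-to-note dictionary.
index_to_note = {1: "C4", 2: "E4", 3: "G4"}
predicted_indices = [1, 2, 3, 1]
notes = [index_to_note[index] for index in predicted_indices]
print(notes)  # ['C4', 'E4', 'G4', 'C4']
```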

Next the three created sequences are used by the Python Snippet node to convert and save the generated music in a .midi file.

To listen to our generated music inside of KNIME Analytics Platform we can use the Audio Viewer node.

Figure 7: Workflow to read and play the midi files generated by the Keras model.


And here is also some sampled music for you!

Download this workflow group including the four introduced workflows, data, and a trained model to:

  • Just generate some more music using the trained model
  • Retrain the existing structure on some other data
  • Or use the data and try out another network architecture (e.g. make the architecture more complex by adding some hidden layers, change the number of hidden units, use stacked LSTMs, increase the number of epochs, etc.)

The workflow and the blog post were inspired by the following blog post “Music Generation with LSTM”.


Appendix: Data Preparation

The workflow in Figure 8 reads and prepares the data to train the model.

In Step 1 the workflow creates a table with the paths to all available midi files using the List Files/Folders node. Afterwards each midi file is processed by the Python Source node using the library Music21 to extract the sequence of notes, durations and offsets.

Figure 8: The preparation workflow performs the preprocessing of the midi files in 3 steps: read, clean, and create sequences.


Next in Step 2, the chords are converted to notes, creating one row for each note of a chord with the same offset and duration (Unpivoting node), and a dictionary for the index encoding of the notes is created.
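A plain-Python sketch of this step could look as follows. It assumes, purely for illustration, that a chord arrives as a dot-separated string of note names; the column names and the alphabetical ordering of the dictionary are assumptions as well.

```python
def unpivot_chords(rows, separator="."):
    """Create one row per note, copying the chord's offset and duration."""
    note_rows = []
    for row in rows:
        for note in row["element"].split(separator):
            note_rows.append({
                "note": note,
                "offset": row["offset"],
                "duration": row["duration"],
            })
    return note_rows


# Toy input: a C major chord followed by a single note.
rows = [
    {"element": "C4.E4.G4", "offset": 0.0, "duration": 1.0},
    {"element": "A4", "offset": 1.0, "duration": 0.5},
]
note_rows = unpivot_chords(rows)

# Index encoding: every distinct note gets an integer, starting at 1.
vocabulary = sorted({row["note"] for row in note_rows})
note_to_index = {note: index for index, note in enumerate(vocabulary, start=1)}
print(note_to_index)  # {'A4': 1, 'C4': 2, 'E4': 3, 'G4': 4}
```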

Afterwards one song after the other is processed using a loop. For each song the data is first preprocessed by

  • converting the absolute offset into the relative offset
  • converting the string duration values into numbers.
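For the duration conversion, a sketch in Python might use the standard `fractions` module. It assumes that the string durations are either decimals such as "0.25" or fractions such as "1/3"; the actual format in the workflow may differ.

```python
from fractions import Fraction


def parse_duration(text):
    """Convert a duration string such as '0.25' or '1/3' into a float."""
    return float(Fraction(text))


print(parse_duration("0.25"))  # 0.25
print(parse_duration("2"))     # 2.0
print(parse_duration("1/3"))
```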

Next, in Step 3, the sequences for each feature are created by using the Lag Column node. Then the columns are resorted and the input and target sequences are created.
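The effect of lagging a column can be imitated with a sliding window in Python. This is only a sketch: the window length is shortened from the workflow's 100 steps to fit a toy sequence, and the step size of 1 and the function name are assumptions.

```python
def make_windows(values, length):
    """Return all overlapping windows of the given length."""
    return [values[i:i + length] for i in range(len(values) - length + 1)]


sequence = [1, 2, 3, 4, 5]

# Each window holds one extra element so that it can be split into an
# input sequence and a target sequence shifted by one step.
windows = make_windows(sequence, length=4)
inputs = [window[:-1] for window in windows]
targets = [window[1:] for window in windows]

print(inputs)   # [[1, 2, 3], [2, 3, 4]]
print(targets)  # [[2, 3, 4], [3, 4, 5]]
```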

Bio: Kathrin Melcher is a data scientist at KNIME.

As first published in Low Code for Advanced Data Science.

Original. Reposted with permission.