Generating Text with RNNs in 4 Lines of Code
Want to generate text with little trouble, and without building and tuning a neural network yourself? Let's check out a project which allows you to "easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code."
Generating text is one of those projects that seems like a lot of fun to machine learning and NLP beginners, but one which is also pretty daunting. Or, at least it was for me.
Thankfully, there are all sorts of great materials online for learning how RNNs can be used for generating text, ranging from the theoretical to the technically in-depth to those decidedly focused on the practical. There are also some very good posts which cover it all and are now considered canon in this space. All of these materials share one thing in particular: at some point along the way, you have to build and tune an RNN to do the work.
While this is obviously a worthwhile undertaking, especially for the sake of learning, what if you are OK with a much higher level of abstraction, whatever your reason may be? What if you are a data scientist who needs a building block in the form of an RNN text generator to plug into your project? Or, what if, as a newcomer, you simply want to get your hands a bit -- but not too -- dirty, as a means of testing the water or as motivation to dig down further?
In that vein, let's take a look at textgenrnn, a project which allows you to "easily train your own text-generating neural network of any size and complexity on any text dataset with a few lines of code." textgenrnn is authored by Max Woolf, an Associate Data Scientist at BuzzFeed, and former Apple Software QA Engineer.
textgenrnn is built on top of Keras and TensorFlow, and can be used to generate both character and word level text (character level is the default). The network architecture uses attention-weighting and skip-embedding for accelerated training and improved quality, and allows for the tuning of a number of hyperparameters, such as RNN size, RNN layers, and the inclusion of bidirectional RNNs. You can read more about textgenrnn and its features and architecture at its GitHub repo or in this introductory blog post.
Since the "Hello, World!" for text generation (at least, in my mind) seems to be generating Trump tweets, let's go with that. textgenrnn's default pretrained model can be trained on new texts easily, though you can also use textgenrnn to train a new model (just add new_model=True to any of its train functions). Since we want to see how quickly we can get generating tweets, let's build on the pretrained model.
Acquiring the Data
I grabbed a selection of Donald Trump's tweets from Trump Twitter Archive, a site which makes querying and downloading tweets from the President painless. The selection covers Jan 1, 2014 through Jun 11, 2018 (yesterday, at time of writing), so it includes tweets from both before and after his inauguration as President of the United States. I grabbed only the text of the tweets in that date range, since I don't care about any of the metadata, and saved it to a text file I appropriately called trump-tweets.txt.
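If you end up with the tweet text in memory rather than in a ready-made file, a small helper can write it out. This is only a sketch: the function names are mine, the sample tweets are made up, and the one-tweet-per-line layout is my assumption about what the training file should look like, so check it against the textgenrnn documentation before relying on it.

```python
def to_training_lines(tweets):
    """Flatten each tweet onto a single line and drop empty ones."""
    lines = []
    for tweet in tweets:
        # Collapse newlines and runs of whitespace inside a tweet.
        flat = " ".join(tweet.split())
        if flat:
            lines.append(flat)
    return lines


def write_training_file(tweets, path="trump-tweets.txt"):
    """Write the flattened tweets to a UTF-8 text file, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in to_training_lines(tweets):
            handle.write(line + "\n")


# Hypothetical sample; real input would be the downloaded tweet text.
sample = ["Thank you!\nGreat crowd.", "   ", "Enjoy!"]
print(to_training_lines(sample))
```

Flattening matters because a tweet with an embedded line break would otherwise be split into two separate training texts under the one-per-line assumption.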
Training the Model
Let's see how uncomplicated it is to generate text with textgenrnn. The following 4 lines are all we need to import the library, create a text generation object, train the model on the trump-tweets.txt file for 10 epochs, and then generate some sample tweets.
from textgenrnn import textgenrnn
textgen = textgenrnn()
textgen.train_from_file('trump-tweets.txt', num_epochs=10)
textgen.generate(5)
After about 30 minutes, here's what's generated (on the 10th epoch):
My @FoxNews will be self finally complaining about me that so he is a great day and companies and is starting to report the president in safety and more than any mention of the bail of the underaches to the construction and freedom and efforts the politicians and expensive meetings should have bee

The world will be interviewed on @foxandfriends at 7:30pm. Enjoy!

.@JebBush and Fake News Media is a major place in the White House in the service and sense where the people of the debate and his show of many people who is a great press considering the GREAT job on the way to the U.S. A the best and people in the biggest! Thank you!

New Hampshire Trump Int'l Hotel Leadership Barrier Lou Clinton is a forever person politically record supporters have really beginning in the media on the heart of the bad and women who have been succeeded and before you can also work the people are there a time strong and send out the world with

Join me in Maryland at 7:00 A.M. and happened to the WALL and be true the longer of the same sign into the Fake News Media will be a great honor to serve that the Republican Party will be a great legal rate the media with the Best Republican Party and the American people that will be the bill by a
Leaving politics aside, and given that we are only using ~12K tweets for training in a mere 10 epochs, these generated tweets are not... terrible. Want to raise the temperature (the textgenrnn default is 0.5) to get some more creative tweets? Let's try it out:
“Via-can see this Democrats were the opening at GREAT ENSUS CALL! .@GovSeptorald Taster is got to that the subcent Vote waiting them. @Calkers Major President Obama will listen for the disaster! Grateful and South Carolina so his real ability and much better-- or big crisis on many signing! It is absolutely dumbers for well tonight. Love us in the great inherition of fast. With bill of badly to forget the greatest puppet at my wedds. No Turnberry is "bigger.” - Al
Well, that's less convincing. How about a lower temperature, which gives more conservative output that the model is more confident in:
The Fake News Media is a great people of the president was a great people of the many people who would be a great people of the president was a big crowd of the statement of the media is a great people of the people of the statement of the people of the people of the world with the statement of th

Thank you @TrumpTowerNY #Trump2016 https://t.co/25551R58350

Thank you for your support! #Trump2016 https://t.co/7eN53P55c

The people of the U.S. has been a great people of the presidential country is a great time and the best thing that the people of the statement of the media is the people of the state of the best thing that the people of the statement of the statement of the problem in the problem and success and t

Thank you @TheBrodyFile tonight at 8:00 A.M. Enjoy!
Well now, some of these are seemingly more legible.
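To get a feel for why temperature changes the output this way, here is a minimal sketch of one common formulation: divide the log-probabilities by the temperature, then renormalize. The function name and the example probabilities are made up for illustration, and this is not meant to reproduce textgenrnn's internals.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature.

    Assumes every probability is greater than zero.
    """
    # Divide log-probabilities by the temperature, then renormalize.
    scaled = [math.log(p) / temperature for p in probs]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical next-character probabilities from a model.
probs = [0.6, 0.3, 0.1]
cool = apply_temperature(probs, 0.2)  # sharper: favors the top choice
warm = apply_temperature(probs, 1.5)  # flatter: rarer choices gain weight
print(cool)
print(warm)
```

A low temperature piles almost all of the probability onto the likeliest character, which fits the repetitive "great people of the people" output, while a high temperature spreads it out and lets unlikely characters through.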
Of course, these results aren't perfect. There are all sorts of other things we could have tried, and the good news is that, if you don't want to implement your own solution, textgenrnn can be used to do many of them (again, see the GitHub repo):
- Train our own model from scratch
- Train with more sample data for a greater number of iterations
- Tune other hyperparameters
- Preprocess the data a bit (at the very least to eliminate the fake URLs)
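As an example of the last item, the made-up links in the generated tweets come from real links in the training data, so one option is to strip URLs before training. The following is a rough sketch using only the standard library; the function names and the regular expression are my own choices, and the pattern is deliberately simple rather than a complete URL matcher.

```python
import re

# Matches http(s) links up to the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_urls(text):
    """Remove URLs and collapse the whitespace they leave behind."""
    without_urls = URL_PATTERN.sub("", text)
    return " ".join(without_urls.split())


def clean_tweets(tweets):
    """Strip URLs from each tweet and drop tweets that end up empty."""
    cleaned = (strip_urls(tweet) for tweet in tweets)
    return [tweet for tweet in cleaned if tweet]


# Sample input taken from the generated tweets shown earlier.
print(strip_urls("Thank you for your support! #Trump2016 https://t.co/7eN53P55c"))
```

Running clean_tweets over the tweet text before writing the training file would mean the model never sees a link in the first place.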
Kind of fun. I'm interested in seeing how a default textgenrnn model performs out-of-the-box against a custom, well-tuned model. Maybe something for next time.