Generating English Pronoun Questions Using Neural Coreference Resolution
This post will introduce a practical method for generating English pronoun questions from any story or article. Learn how to take an additional step toward computationally understanding language.
By Ramsri Goutham, NLP, AI Freelancer
In this post we will see how to generate English pronoun questions from any story or article. This is one step towards automatically generating English language learning worksheets.
The input to our program will be a small story like the following:
Scientists know many things about the Sun. They know how old it is. The Sun is more than 4½ billion years old. It is also a star that is the centre of our solar system. They also know the Sun’s size.
The output from the program will be a set of pronoun questions:
- What does “They” refer to in the sentence — “They know how old it is.”? Ans : Scientists
- What does “They” refer to in the sentence — “They also know the Sun’s size.”? Ans : Scientists
- What does “It” refer to in the sentence — “It is also a star that is the centre of our solar system.”? Ans : The sun
Let's see how we can achieve this using the NeuralCoref library from Hugging Face.
All the code and the Jupyter notebook are available in the GitHub repository:
ramsrigouthamg/Neuralcoref_generate_english_pronoun_questions
First, install the necessary libraries in the Jupyter notebook:
!pip install neuralcoref
!pip install spacy==2.1.0
!python3 -m spacy download en
Initialize the neural coreference library:
# Load your usual spaCy model (one of spaCy's English models)
import spacy
nlp = spacy.load('en')
import re
from nltk.tokenize import sent_tokenize

# Add NeuralCoref to spaCy's pipeline
import neuralcoref
neuralcoref.add_to_pipe(nlp, greedyness=0.5, max_dist=50, blacklist=False)
Initialize with some sample text and resolve coreferences with the library:
text = "Scientists know many things about the Sun. They know how old it is. The Sun is more than 4½ billion years old. It is also a star that is the centre of our solar system. They also know the Sun’s size."
doc = nlp(text)
clusters = doc._.coref_clusters
print("clusters ", clusters)
print("\n\n")
resolved_coref = doc._.coref_resolved
print("Resolved by NeuralCoref: \n")
print(resolved_coref)
The print output from clusters will be:
[Scientists: [Scientists, They, They], the Sun: [the Sun, The Sun, Sun, It]]
Each cluster lists an entity followed by every mention that refers to it, including any pronouns.
When you let the library resolve coreferences (replace pronouns with their nouns), the resolved output is:
Scientists know many things about the Sun. Scientists know how old it is. the Sun the Sun is more than 4½ billion years old. the Sun is also a star that is the centre of our solar system. Scientists also know the Sun’s size.
Because of resolution errors, and perhaps some indexing errors in the library logic, the output is sometimes odd. For example, "the Sun the Sun" appears in the third sentence.
We will write our own custom resolving function and generate English pronoun questions in the process.
First, we choose only a subset of pronouns to be replaced and write an auxiliary function that returns the sentence containing a word, given that word's index.
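The code for this step is not reproduced in this post, so here is a minimal sketch of what it could look like. The pronoun list, the names `PRONOUNS_TO_REPLACE`, `sentence_spans` and `sentence_at`, and the regex-based sentence splitter are all assumptions made for illustration; the notebook imports NLTK's `sent_tokenize`, which would likely be more robust than this naive splitter. The index here is a character offset, which a spaCy mention exposes as `start_char`.

```python
import re

# Hypothetical subset of pronouns to quiz on; adjust as needed.
PRONOUNS_TO_REPLACE = {"he", "she", "it", "they", "him", "her", "them"}

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# This will mis-split abbreviations such as "Dr. Smith".
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentence_spans(text):
    """Return (start, end) character offsets for each sentence in text."""
    spans = []
    start = 0
    for match in _SENTENCE_END.finditer(text):
        spans.append((start, match.start()))
        start = match.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def sentence_at(text, char_index):
    """Return the sentence containing the character at char_index, or None."""
    for start, end in sentence_spans(text):
        if start <= char_index < end:
            return text[start:end]
    return None
```

For a pronoun mention, calling `sentence_at(text, mention.start_char)` would give the sentence to quote in the question.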
Then we write our custom function to resolve only those coreferences from the cluster list that are pronouns and are not the same as the original entities.
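The function itself is not reproduced in this post, so the following is only a sketch of the idea under stated assumptions. It works on plain data rather than spaCy objects: `clusters` is assumed to be a list of `(main_text, [(start_char, end_char), ...])` pairs, and `get_sentence` is any callable that maps a character index to its sentence, such as the auxiliary function described earlier. The names `PRONOUNS` and `resolve_pronouns` are hypothetical.

```python
# Hypothetical pronoun subset; adjust as needed.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them"}

# With NeuralCoref, the clusters argument could be built roughly like this
# (assumes each cluster exposes .main and .mentions as spaCy spans):
#   clusters = [(c.main.text, [(m.start_char, m.end_char) for m in c.mentions])
#               for c in doc._.coref_clusters]


def resolve_pronouns(text, clusters, get_sentence):
    """Replace pronoun mentions with their entity and collect questions.

    Returns (resolved_text, questions); each question is a
    (sentence, pronoun, answer) tuple.
    """
    replacements = []
    for main_text, mentions in clusters:
        for start, end in mentions:
            mention = text[start:end]
            # Keep only pronouns that differ from the entity itself.
            if mention.lower() not in PRONOUNS:
                continue
            if mention.lower() == main_text.lower():
                continue
            # Upper-case the first letter if the pronoun was capitalised.
            if mention[0].isupper():
                answer = main_text[0].upper() + main_text[1:]
            else:
                answer = main_text
            replacements.append((start, end, mention, answer))

    replacements.sort()
    questions = [(get_sentence(text, start), mention, answer)
                 for start, end, mention, answer in replacements]

    # Apply replacements right to left so earlier offsets stay valid.
    resolved = text
    for start, end, _, answer in reversed(replacements):
        resolved = resolved[:start] + answer + resolved[end:]
    return resolved, questions
```

Each question tuple uses the `(sentence, pronoun, answer)` layout that the printing code further down indexes as `question[0]`, `question[1]` and `question[2]`. This sketch upper-cases only the first letter of the entity, so it yields "The Sun" where the output shown in this post has "The sun".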
With our custom coreference function, the resolved output for the initial text is:
Scientists know many things about the Sun. Scientists know how old it is. The Sun is more than 4½ billion years old. The sun is also a star that is the centre of our solar system. Scientists also know the Sun’s size.
Finally, we print out the questions:
print("\nQuestions generated :")
print("[Note: There might be a few answer errors because of the errors in the coreference algorithm itself] \n")
for index, question in enumerate(questions):
    print('%d) What does "%s" refer to in the sentence - "%s"?' % (index + 1, question[1], question[0].strip()))
    print("Ans : %s\n" % (question[2]))
The output is:
Questions generated :
[Note: There might be a few answer errors because of the errors in the coreference algorithm itself]
1) What does "They" refer to in the sentence - "They know how old it is."?
Ans : Scientists
2) What does "They" refer to in the sentence - "They also know the Sun’s size."?
Ans : Scientists
3) What does "It" refer to in the sentence - "It is also a star that is the centre of our solar system."?
Ans : The sun
Happy coding! For any questions, or just to say hi, reach out to me at ramsrigouthamg@gmail.com
Bio: Ramsri Goutham is an NLP and AI Freelancer. He is currently building an AI-Assisted tool for educators to make the process of creating assessments faster and better.
Original. Reposted with permission.