Innovating versus Doing: NLP and CORD19
How I learned to trust the process and find value in the road most traveled.
By Austin Eovito, Data Scientist at IBM
As data scientists, we are often forced to invest a significant portion of our time in cleaning data, pre-processing it, doing feature discovery, and so on. This checklist varies with the industry, the scope of the role, and the goal of the technical endeavor. Once data curation and maintenance are done, data scientists often find themselves at their next hurdle, model selection: which model aligns most directly with our goal (or key performance indicators, KPIs), which is most feasible, and a suite of other questions. We are skipping over some portions of the data science development cycle, but for this blog these two portions will suffice to make our main point: how much time should data scientists spend innovating, versus doing? This blog focuses on doing rather than purely innovating, the idea being that DOING lends itself to innovation more so than innovation lends itself to doing. I will approach this by utilizing a data scientist's best friend, Google, to apply topic modeling to the CORD19 5-20-2020 dataset from the Allen Institute. The data can be found here.
Every week it feels like there is a new State-of-the-Art algorithm, with an accompanying dataset, academic paper, use-case, and preliminary results. Technical folks often find themselves consuming these reports, diagnosing what is good, what can be done better, what direction the work leads in, and how it relates to their current work. This is one part of the innovation cycle.
The next part is to stress-test the new algorithm, apply it, and see whether it is adopted in industry. I am sure you have seen this discussed via hype cycle graphics, data-science Dunning-Kruger charts, and technical coverage from media outlets. On the flip side, if you look at Kaggle, specifically at submissions for the CORD19 dataset, you can see that the algorithms and techniques applied across a large portion of the code-bases overlap strongly. The question then becomes: why replicate work? To demonstrate ability, to move the progress bar forward, or to understand the technical underpinnings yourself? Each data scientist must ask themselves these questions as they undertake their work. Why am I doing this, towards what goal, and what does this goal serve in the larger picture?
With this backdrop, I will explain this idea of innovating versus doing using a Topic Modeling notebook you can find here. I began to put this idea together with the initial release of the CORD19 dataset on 03-13-20. At that point, little public progress had been made towards understanding COVID19, and the world was beginning to change. I thought that the best use of my time would be to do something unique, perhaps even original. So, I read a few academic papers, namely on language models like BERT and OpenAI’s GPT2, and decided that would be a great starting point. This was my first error. Rather than spending any time data-munging, I went straight into modeling. And guess what, I got results! Whether they were good or not is not the point of this blog; instead, focus on the fact that I chose to immediately apply models to data without understanding my own goal, the data, or, most importantly, that this was probably an ineffective way of going about the task. But behold, I achieved the output below from GPT2, and immediately thought to myself, ‘WOW, I just need to fine-tune some more, and I will get some pretty cool things to show people!’ Rather than facilitating understanding for myself or for others, I found myself focused on the clout I would receive for utilizing such a famous algorithm. See below for the GPT2 output [trained on approximately 20,000 abstracts] from April 13, 2020:
[‘services, the government is a major provider of medicines for all citizens; this is an area that is highly sensitive to globalisation and to a growing number of high-speed technologies and applications. To be an international trade or development hub, the WHO should take seriously this opportunity. We should not treat other countries as an obstacle to globalisation. A globalisation that requires the development of high-quality, effective and sustainable healthcare would also need to address the real needs of the non-human animals. It must be seen in that the current outbreak in Wuhan's Zhuhai Province in China is one of the most complex challenges to social control of the animal population in Wuhan, the number of infected animals at the time are estimated to be about 200,000, and the number of non-human animals is estimated to be about 400,000. All these numbers are based on the assumption that the animal rights of non-human animals are not strictly based on the rights and behaviour of non-human animals but only on the rights and behaviour of them. To that, the non-human animals should be treated with respect and must be taken as one of the most important non-human animals in the system.Human resources, especially in Africa and particularly in the Middle East, should also be given priority. We cannot be satisfied over the lack of resources for health care in the developing countries with the high number of infections in all Africa, where many diseases exist. However, it should be noted that the use of health care in the Middle East is currently almost non-existent. In that regard, the situation on the ground in all the Middle East, especially in the Arabian peninsula, is very severe and needs to be dealt with with the utmost care. The problem here is in particular the treatment of the suffering animals in the Middle East. These are not the kinds of situations that are frequently mentioned in the news about the Middle East. 
While it has emerged that the most severe cases in the Middle East are those in the Middle East, the issue in the Middle East cannot be under-defined as a whole because in the world around the world, the Middle East can be a problem. According to the WHO estimate of around 16% of human cases are considered non-human, and the most severe cases (at least for most of the Middle East) are those from China  (Figure ). That is, about 5-10% of human cases may occur from the Middle East, and of these, only around 5% are non-human, and only around 9 to 10% are humans (Figure ). In such cases, non-human animals may occur.There are more than 6 million non-human animals in the Middle East. The majority of non-human animals are domesticated animals. According to the WHO estimate, as the proportion of non-human animals from the Middle East is around 6.5% of human ones and around 3% are non-human animals, it is expected that only 3.3% of human animals are domesticated animals, and the majority are considered non-human ( Figure ). In some locations where the epidemic is still going on and there are cases of infection in the Middle East, but in other places where the epidemic is still going on, the overwhelming proportion of non-human animals is non-human animals. In some places, in order to avoid the outbreak, it is necessary to provide some facilities for animals during the outbreak.In our report, the study is based on the study in France that took data from all of China's population in Wuhan with a population of 3.7 million. It was reported that in total, about 12.09% of Wuhan's population were non-human animals  . Therefore, the amount of non-human animals that are in the Middle East should be considered as a high priority for conservation and healthcare management.Humanities such as environmental health sciences (EHS), medicine and education are not considered an essential piece of medicine, as they are not covered in our report so much as in the study in Wuhan. 
However, they can provide useful services in a very limited way. Besides, they provide a great deal of information. Moreover, they can be utilized for both human and non-human needs. We mentioned in the report that in particular, when non-human animals are in care and are being cared for individually or on individual basis, they are more reliable and can reduce the spread and suffering of some diseases like Ebola. In this regard, the fact that the amount of animals in the Middle East and South America is still small compared to the WHO estimate of around 10% and is the result of the fact that the non-human animals may be one of the most common diseases in these regions.The situation in Nigeria is the most serious. The population of Nigeria is 1.8 million, and the population has a high health status, which makes it more than a safe place to stay. In our report, the population of Nigeria is 7.9 million .’]
As you can see, the results are not great. I ignored many of the aspects that make this model not only produce great results but work well in general! My training files did not respect the format of the documents I was trying to generate, I took shortcuts in pre-processing, and ultimately, I never questioned my intentions or goals (I just wanted to generate cool abstracts). I iterated on this model quite a few times before I realized that I was tackling the issue from the wrong angle. So, off to Google I went to investigate how others got GPT2 to work for them, and I ended up at this great resource that used GPT2 to create poetry.
This blog stood out to me: not only did the author accomplish their goal, but it was a solid goal that left readers better off in their own endeavors. At this point, I revisited the CORD19 challenge and looked at the tasks associated with it, specifically at the objectives attached to the dataset. All 9 of the initial tasks were in some way associated with Topic Modeling. Interesting, I thought: Topic Modeling is a well-researched area (LDA built on pLSI, and the original LDA paper was published in 2003), so why would they need public submissions? And then it hit me: Google ‘LDA with Python’. There are hundreds, if not thousands, of tutorials, examples, and blogs that cover this topic. They differ in the level at which they are written, in their scope, and in their relevance to the task at hand. I ended up visiting around 50 different blogs, looked at Kaggle submissions, read papers, and Googled more than I thought I could. By the end of this process, I was, at best, unenthusiastic about continuing. What was the point of continuing when all of my goals had already been accomplished, and accomplished well, by others?
The goal of a well-written technical blog is twofold: to introduce the reader to a new technology, opinion, or point of view, and to empower them to learn more about, replicate, or extend the subject matter of the blog. So, how INNOVATIVE would it be if I re-used or augmented these codebases with minute changes? Not very. However, what if I collected these resources, created a single document reflecting these sources, linked to them, and utilized it for my original goal? More succinctly, what if I took advantage of what had already been done and, instead of doubting my originality, leaned into it and extended the blogs I was using? What if I did basic topic modeling to facilitate understanding of the data, and then used GPT2 to generate new papers based on the TOPIC, instead of the entire corpus [our corpus at time of writing is ~1.8GB]? And that is indeed what I did. The topics below are from an LDA run with 10 topics [the code includes a hyperparameter tuning section at the end, which I omitted due to compute time]. Even with an elementary run, we obtained interesting (not SOTA) results. Here are the 10 topics (Gensim does not produce topic names; I named each topic based on the words included in the topic, which you can see in the linked git):
Topic 0: Public Health/Systems and Research
Topic 1: Biological Papers
Topic 2: Genomic Research/Analysis and Detection
Topic 3: Epidemiology
Topic 4: Patient Treatment [we have German words here, indicating further cleaning is needed as previously addressed]
Topic 5: Patient Treatment [Interesting that we have overlap based on language]
Topic 6: Disease Expression and Effects
Topic 7: Vaccines and Response
Topic 8: Infection/Symptoms
Topic 9: Non-human Transmission/Studies
We can alter our output topics by tuning hyperparameters, altering the preprocessing techniques, or simply choosing a different number of topics. However, this example directly informs us of the pain points I addressed earlier when I trained GPT2 on the entire corpus. To procure more accurate and relevant results, it would be best to create separate training files based on different topics (this will be addressed in the next blog). So, with this new information, my priorities shifted.
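As a rough illustration of the per-topic split described above, the sketch below groups abstracts by their most probable topic and builds one training text per topic. The function names, the toy data, and the use of GPT2's `<|endoftext|>` token as a document separator are my assumptions for illustration; the actual approach is left to the next blog.

```python
from collections import defaultdict

# Assumption: GPT-2's end-of-text token is used to separate documents.
EOT = "<|endoftext|>"


def dominant_topic(topic_probs):
    """Return the topic id with the highest probability from (topic_id, prob) pairs."""
    return max(topic_probs, key=lambda pair: pair[1])[0]


def training_text_by_topic(abstracts, doc_topics):
    """Group abstracts by dominant topic and join each group into one training text."""
    groups = defaultdict(list)
    for text, probs in zip(abstracts, doc_topics):
        cleaned = " ".join(text.split())  # collapse stray whitespace and newlines
        if cleaned and probs:
            groups[dominant_topic(probs)].append(cleaned)
    # Every abstract is followed by the separator on its own line.
    return {
        topic: "".join(doc + "\n" + EOT + "\n" for doc in docs)
        for topic, docs in groups.items()
    }


# Hypothetical abstracts and per-document topic distributions.
abstracts = [
    "Vaccine trial  results.",
    "Hospital care\nfor patients.",
    "Antibody response study.",
]
doc_topics = [
    [(0, 0.9), (1, 0.1)],
    [(0, 0.2), (1, 0.8)],
    [(0, 0.7), (1, 0.3)],
]

files = training_text_by_topic(abstracts, doc_topics)
for topic, text in sorted(files.items()):
    print(f"topic {topic}: {text.count(EOT)} abstracts")
```

Assigning each abstract to a single dominant topic is a simplification; abstracts that straddle two topics could instead be written to both files.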
I decided it would be in my best interest to synthesize what I was doing, extend it, and then utilize it to facilitate my original goal of creating new, factually sound papers based on our corpus. This of course glosses over issues such as limited data and overlap between topics. Instead, I decided to just push through. If you look through the code accompanying this blog, you can see where I documented the resources and tools I was using and how each fits into the larger picture. The idea is that someone could take my code and fine-tune it to their needs and their data [like a language model]. So that is what I did.
This blog serves as a recap of some of my struggles, some small wins I had, and how you, the reader, can punish yourself less for not creating GPT36, and instead focus on what is relevant to you, on the topics you care about, for the people and things you care about. Because at the end of the day, why innovate if you don’t leave people with the ability to do more than they could before? So, with that, I hope this blog highlights some of the development roadblocks I encountered, clarifies my goal, and explains how the codebase referenced here will be extended.
The next blog in this series will focus on the GPT2 portion of the code, which will be in another notebook for viewers to use. Look for that here in July.