Exclusive: OpenAI summarizes KDnuggets

OpenAI has recently done amazing work summarizing full-length books. We have asked OpenAI to summarize two recent KDnuggets posts, and the results have a very human-like quality. Only the last line betrays the inhuman intelligence at work.



OpenAI recently published important work on the alignment problem: the problem of ensuring that general-purpose AI and machine learning systems align with human intentions. The "Paperclip Maximizer" is a famous example of alignment gone wrong.

To test scalable alignment methods, OpenAI trained a model to summarize entire books, as described in their blog on KDnuggets: Scaling human oversight of AI systems for difficult tasks – OpenAI approach.

OpenAI Summarizes KDnuggets

The OpenAI model works by first summarizing small sections of a book, then summarizing those summaries into a higher-level summary, and so on. The results were pretty amazing, so we asked OpenAI to summarize two top KDnuggets blogs from last year; the summaries are given below. Judge for yourself how good the summaries are, and see if you can spot a big mistake in the summary of blog 2 (but don't stop reading KDnuggets until OpenAI achieves full AGI capabilities :).

How it works
Each blog post is split into sections (2 sections for the first blog post, 5 for the second), and each section is summarized. The section summaries are then concatenated and summarized into an overall summary. The model sees text only, so it doesn’t see the images/charts in the posts.

Note that the OpenAI model was trained to summarize books, so it thinks the blog author is a “narrator”. For longer texts like books, there would be more than one round of summarizing summaries.
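
To make the split-summarize-combine flow concrete, here is a minimal Python sketch. It is an illustration only, not OpenAI's implementation: the names `split_into_sections`, `first_sentence`, and `summarize_post` are hypothetical, the split by blank-line-separated paragraphs is an assumption, and `first_sentence` is a trivial placeholder that stands in for the real summarization model.

```python
from typing import Callable, List, Tuple


def split_into_sections(text: str, n_sections: int) -> List[str]:
    """Split text into up to n_sections contiguous groups of paragraphs.

    Assumption: paragraphs are separated by blank lines.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    if not paragraphs:
        return []
    n_sections = max(1, min(n_sections, len(paragraphs)))
    size, extra = divmod(len(paragraphs), n_sections)
    sections, start = [], 0
    for i in range(n_sections):
        # The first `extra` sections take one additional paragraph.
        end = start + size + (1 if i < extra else 0)
        sections.append("\n\n".join(paragraphs[start:end]))
        start = end
    return sections


def first_sentence(text: str) -> str:
    """Placeholder summarizer: keep only the first sentence.

    This is NOT the OpenAI model; swap in a real summarizer here.
    """
    head = text.strip().split(". ")[0].rstrip(".")
    return head + "." if head else ""


def summarize_post(
    text: str,
    n_sections: int,
    summarize: Callable[[str], str] = first_sentence,
) -> Tuple[str, List[str]]:
    """Summarize each section, then summarize the concatenated summaries."""
    section_summaries = [summarize(s) for s in split_into_sections(text, n_sections)]
    if not section_summaries:
        return "", []
    overall = summarize(" ".join(section_summaries))
    return overall, section_summaries


if __name__ == "__main__":
    post = "Alpha one. Alpha two.\n\nBeta one. Beta two.\n\nGamma one. Gamma two."
    overall, sections = summarize_post(post, n_sections=2)
    print("Overall:", overall)
    for i, s in enumerate(sections, start=1):
        print(f"Section {i}:", s)
```

This sketch does a single round of summarizing summaries, which matches the blog-post case described above. For a book-length text, the same step would be repeated on the summaries until one overall summary remains.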

KDnuggets Blog 1: If I had to start learning Data Science again, how would I do it?

Below is the text generated by OpenAI, slightly reformatted for readability.


Overall summary

The narrator suggests 5 Kaggle micro-courses to start with, starting with Python, Pandas, Data Visualization, Intro to Machine Learning, and Intermediate Machine Learning. Once you've completed these courses, it's time to start practicing what you've learned by participating in Kaggle competitions.

The narrator suggests reading Data Science from Scratch to understand the foundations of the algorithms you're using, and watching Machine Learning by Andrew Ng to learn more about the mathematical approach to machine learning. Finally, watch Deep Learning by Andrew Ng to learn about deep learning.

Section 1

The narrator describes how they would start learning data science all over again. They explain that they prefer to start with practical examples and move to more abstract concepts.

The narrator suggests 5 Kaggle micro-courses to start with. They suggest starting with Python, Pandas, Data Visualization, Intro to Machine Learning, and Intermediate Machine Learning. They explain that these courses are not linear and that it's ok to go back and forth between them to refresh concepts.

Section 2

The narrator suggests that once you've completed the first 5 courses, it's time to start practicing what you've learned by participating in Kaggle competitions. They suggest starting with the Titanic competition to practice classification and relevant metrics, and the Housing Prices competition to practice regression models.

The narrator suggests reading Data Science from Scratch to understand the foundations of the algorithms you're using.

Next, they suggest watching Machine Learning by Andrew Ng to learn more about the mathematical approach to machine learning.

Finally, they suggest watching Deep Learning by Andrew Ng to learn about deep learning. At this point, it's up to the individual


KDnuggets Blog 2: Know What Employers are Expecting for a Data Scientist Role in 2020.

Below is the text generated by OpenAI, slightly reformatted for readability.


Overall summary

The narrator actively looks for a job change to data science, without any formal education in AI/Machine Learning.

He scrapes data from Naukri.com, a job portal, to find out what skills companies are looking for, what experience level they desire, and what companies are actively hiring.

He finds that companies are looking for experienced candidates with 5-10 years of experience. Most of the vacancies are still termed as Data scientists, followed by Senior Data scientist and Lead Data scientist. Employers are expecting candidates to have knowledge of big data technologies due to the huge rise in the amount of data recorded every day.

The narrator concludes that you don't have to have a formal education in AI/Machine Learning to be a data scientist.

Section 1

The narrator actively started looking for a job change to data science, without any formal education in AI/Machine Learning. He started learning it out of interest and enrolled in many MOOCs.

When searching for jobs, he found that companies were looking for candidates with a wide range of skills, including multiple Big Data technologies, cloud computing, and data visualization tools.

He decides to analyze the skills and trends most sought in the industry right now. He scrapes data from the job portal to find out what skills companies are looking for, what experience level they desire, and what companies are actively hiring.

Section 2

The narrator scrapes data from Naukri.com, a job portal, to find out what skills companies are looking for, what experience level they desire, and what companies are actively hiring.

He will scrape the following information for each job: Role, Company name, Experience, Location, and Key Skills.

He iterates over the first 50 pages of the website, each containing 20 results. For each job, he scrapes the role, company, location, experience, and key skills.

Section 3

The narrator cleans the data by handling missing values, duplicate data, and tokenizing the locations and skills columns.

He analyzes the data to find out which location offers more openings, what companies are actively recruiting, and what is the most desired experience level.

He finds that companies are looking for experienced candidates, with 5-10 years of experience. There are more openings for experienced candidates than freshers. Companies usually don't recruit freshers from job portals, they will directly recruit them from campus recruitment. Freshers can always opt to work for startups to gain the necessary experience.

Section 4

The narrator looks into the most frequently mentioned roles in the data. Most of the vacancies are still termed as Data scientists, followed by Senior Data scientist and Lead Data scientist.

The narrator then looks into the skills that companies are looking for. Machine learning is the most important skill to have for a data scientist. Data mining and Data analysis are the key activities that every data scientist has to go through. Employers are expecting candidates to have knowledge of big data technologies due to the huge rise in the amount of data recorded every day.

Section 5

The narrator looks into the programming languages that are in demand in the industry. Python and R are the most popular languages, followed by SAS and C++.

He then looks into the deep learning frameworks that are in demand. TensorFlow is the most popular framework, followed by Keras.

He then looks into the big data technologies that are in demand. Spark and Hadoop are the most popular ones.

Finally, he looks into the data visualization tools that are in demand. Tableau is the most popular one, followed by Microsoft's Power BI.

The narrator concludes that you don't have to.


GP: The summary is amazingly human-like; only the last sentence shows a lack of understanding, because it does not specify what it is that "you don't have to" do.
The original closing paragraphs of the blog Know What Employers are Expecting for a Data Scientist Role in 2020, reproduced below, show what it is that "you don't have to" do.

Do you really have to match all the skills mentioned in this post to get a job?

Well not really, there are few tools in the list which are easy to pick up on job if you are strong with your fundamentals. Having said that if you are just looking for a job having these skills on your resume will probably help you land an interview.

If you are good with all the mentioned must-have skills for a data scientist then the best approach should be to start attending the interviews and meanwhile try to fill the gaps in your understanding and learn the tools/technologies you feel will give you an edge over other candidates.

To be fair to OpenAI, this mistake is only in the intermediate summary. The overall summary of Blog 2 above correctly closes with "you don't have to have a formal education in AI/Machine Learning to be a data scientist."
