The ICLR Experiment: Deep Learning Pioneers Take on Scientific Publishing

With ICLR, deep learning's premier conference, neural network pioneers Yann LeCun and Yoshua Bengio have undertaken a grand experiment in academic publishing. Embracing a radical level of transparency and unprecedented public participation, they've created an opportunity not only to find and vet the best papers, but also to gather data about the publication process itself. Last week, conference organizers announced 2016's accepted papers. In ICLR's spirit of experimentation, I'll try to analyze the results, taking a look at what worked and what didn't. Full disclosure: I am the author of both an accepted and a rejected paper; hopefully this leaves me with a sufficiently broad perspective and a sufficiently muted bias.

First, some background. Scientific publication typically follows a familiar model. Even in the rapidly evolving field of computer science, the review process is profoundly homogeneous. Authors distill the findings of their research into a manuscript. They submit it to an appropriate journal. An area chair appoints anonymous reviewers. Sometimes the authors' identities are also hidden from the reviewers; in these cases the review is said to be double-blind. When only the reviewers are anonymous, the review is single-blind. After one or more rounds of review and rebuttal, the paper is accepted or rejected, and all parties move on.

[Image: Yoshua Bengio and Yann LeCun, 2007]

Even in computer science, where the pace of research has accelerated, we've clung tightly to this old model. In a small but significant change, the premier venues for computer science research are now often conferences, rather than journals. With their shorter publication cycles (typically one round of review, taking mere months), conferences are better suited to the faster pace of research. Increasingly, journals are venues for expanded versions of conference papers, with gentler introductions, clearer derivations, and more expansive experimental results.

Additionally, computer scientists have come to embrace arXiv.org, a public-access repository of papers. Many significant journals and conferences in computer science allow dual submission with arXiv.org, even though it explicitly undermines the anonymity of double-blind reviews. In a previous article, I explored the incredible benefits and also the dangers of arXiv.org and the 24-7 research cycle. The deep learning community has relied especially heavily on arXiv.org, as some papers have spurred entire subfields of research before even reaching publication.

ICLR's Experimental Publication Model

In 2012, against this historical backdrop, Yann LeCun and Yoshua Bengio created the International Conference on Learning Representations (ICLR). Informally, it is often described as "the deep learning conference". The connection between "deep learning" and "representation learning" is that deep neural networks jointly learn to transform raw data into useful representations and to classify examples into categories, while traditional machine learning methods focus only on the classification part of this pipeline. To my understanding, the objectives of the conference were two-fold. First, while deep learning had been gaining acceptance at established machine learning venues such as NIPS and ICML, no dedicated deep learning conference existed. Second, as stated clearly on Yann LeCun's website, the founders sought to experiment with the model of publication, proposing an alternative better suited to the times. Given the precarious status of anonymity in a generation raised on arXiv.org, LeCun and Bengio went all out, fully embracing transparency. LeCun had previously run the Snowbird Learning Workshop annually from 1986 to 2012, and in addition to its conference track, ICLR continues to feature a workshop track for works in progress, carrying on the Snowbird tradition.
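To make that distinction concrete, here is a minimal sketch (in PyTorch, my own illustration rather than code from ICLR or any paper discussed here) of a network whose feature-extracting layers and classifier are trained jointly, end to end, so that the representation itself is learned rather than hand-engineered.

```python
import torch
import torch.nn as nn

class RepresentationClassifier(nn.Module):
    """Toy model: an encoder that learns a representation, plus a linear classifier."""

    def __init__(self, input_dim: int, hidden_dim: int, num_classes: int):
        super().__init__()
        # Layers that transform raw inputs into a learned representation.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # A simple classifier that operates on the learned representation.
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.encoder(x)        # learned representation
        return self.classifier(features)  # class scores

model = RepresentationClassifier(input_dim=784, hidden_dim=256, num_classes=10)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on random data: the gradient flows through
# both the classifier and the encoder, so the representation adapts too.
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In a traditional pipeline, the encoder step would instead be a fixed set of hand-designed features, and only the final classifier would be trained.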

Here's how the process, led by Senior Program Chair Hugo Larochelle, worked this year:

  1. Authors post papers to arXiv.org the day they submit (November 12th). Submissions to the conference through its web page include a link to the arXiv entry, and these arXiv manuscripts must use the conference template. One consequence is that, afterwards, the entire deep learning community can see which papers were accepted and which were rejected.
  2. The submitted papers are all listed on CMT, Microsoft's online tool for managing conference submissions. All registered users can view all submitted papers.
  3. The general public is free to leave comments on papers (though not anonymously). These comments include an optional star rating.
  4. Anonymous reviewers are assigned to each paper. Their reviews, including both critical feedback and numerical scores (on a 10-point scale), are posted on CMT.
  5. An unusually long period of rebuttal and response (by conference standards) ensues, leading ultimately to a decision (on February 4th). According to LeCun, one motivation for this lengthier process is to reject fewer papers for superficial reasons (like missing citations, problems with writing style, or fixable bugs in notation) by giving authors a chance to improve their drafts throughout the process.

What Worked:

Going through the process twice (both for the first time), I found it both fascinating and educational. In this section I'll present the positive takeaways.

Amazing Data

Arguably the most amazing aspect of the ICLR process is the data it produces. The first peer reviews I ever read were the rejections that accompanied my very first paper submission. I learned from the feedback, improved the draft, and was fortunate that the paper was accepted on resubmission. Then I set to work on my next paper, and ultimately submitted it a year later. At that point, I had read a grand total of six peer reviews.

I imagine this is typical. Reviews are sensitive. Even when a conference accepts your paper, your work is torn apart and its weaknesses exposed. Most people aren't eager to share these reviews with new acquaintances, and it's awkward for a junior researcher to ask a senior researcher to read their rejection reviews. A notable exception is UCSD Professor Julian McAuley (http://cseweb.ucsd.edu/~jmcauley/), who posts all of his reviews online. These include both positive and negative reviews, but only for published papers.

So most researchers are forced to be reinforcement learners. We learn by interaction. Each data point concerning the publication process comes at the cost of 3-6 months of toil. Even as a third-year PhD student, I had read at most 10-20 reviews prior to submitting to ICLR. ICLR has given the community an incredible gift: the richest dataset of reviews (to my knowledge) in the history of academic publication. Suddenly, young researchers can read through nearly 900 reviews. What is too incremental? What makes an experiment unconvincing? These questions previously required years of experience to answer; now they require hours of careful reading.

High Quality Reviews

Of course, high quantities of low-quality data wouldn't justify excitement. Fortunately, the reviews were generally of very high quality. The conference's tight focus on deep learning resulted in highly knowledgeable reviewers who generally knew the relevant literature and provided incisive feedback. I submitted two papers to the conference. The first explored classifying medical diagnoses from time series of sensor readings and clinical observations in the pediatric ICU, using LSTM recurrent neural networks (http://arxiv.org/abs/1511.03677). The paper gained acceptance, but only after the draft was improved, thanks in large part to the critical feedback from the anonymous reviewers. They pointed us to relevant prior work and suggested interesting follow-up experiments.

The second paper investigated recurrent neural networks for generating text conditionally at the character level (http://arxiv.org/abs/1511.03683). While the paper was not accepted, the criticism was of high quality. Our reviewers suggested prudent quantitative evaluations and pointed us to relevant prior work.

Opportunity for Rejected Papers to Have Influence

Another strength of ICLR is that rejected papers can nonetheless influence the field. In this year's conference, several papers have proved influential despite missing the cut for inclusion. Absent arXiv, the authors might have been denied credit for their contributions, and the community might have been denied the contributions themselves. ICLR's model gives history an opportunity to recognize work, even when reviewers reject it. Similarly, many papers in past years have proved valuable assets to the deep learning community despite missing out on selection.

Areas for Improvement:

For the above reasons, I'm grateful to Drs. LeCun and Bengio for their audacity in departing from the status quo in publication. To be complete, it's also worth examining the parts of the process that didn't work as well.

Star Ratings Are Tacky

Probably the most questionable aspect of this year's conference was the star ratings accompanying public comments. Star ratings are used to rank. We are conditioned every day to ignore items with low star ratings. We shun Netflix movies with low star ratings (except for kung fu flicks). I can't remember purchasing an Amazon product rated below 4 stars. It's nice to get feedback from the public, but public commenters (potentially with conflicts of interest) shouldn't have the right to bias the official reviewers' feelings towards a paper by saddling it with a star rating. They should at least have to win over the reviewers with their prose.

More to the point, star ratings feel both unscientific and belittling. Even the numerical scores that normally accompany official reviews make me uncomfortable, but at least those scores have clear meanings (clear reject, marginally below the acceptance threshold, marginally above, solid paper, accept, etc.). In contrast, the star ratings were offered to commenters without any guidance.

Awkwardness on the Comment Boards

On multiple occasions I was asked to comment on other people's papers. Obviously I did not (I didn't comment on any papers). I am sure I'm not the only one who encountered this. Academic publication shouldn't be a popularity contest. Presumably, solicited comments would be strictly positive. If such comments have influence, they provide an incentive to campaign for a paper's acceptance. More generally, the public commenting process felt awkward. Comments were infrequent, and the vast majority of papers received none.

Unfortunately, I'm not sure how to avoid these problems while keeping public comments. Anonymous public comments present obvious problems: they would likely devolve into Reddit-style trolling. On the other hand, with non-anonymous comments, many researchers are reluctant to provide candid feedback. This results in the present situation, in which only a small number of comments go up, written by an even smaller number of commenters. Non-anonymous commenting also poses a problem regarding prominent commenters. Do famous commenters carry more clout with reviewers? Should they?

Of course, public commenting also has its charms. Many public commenters helped fill in missing references. Some pointed out parts of papers where explanations might be unclear or additional experiments might be helpful, leading to improved drafts. I don't know how to reconcile all of these issues. Public comments are a fascinating part of the ICLR process, but one still rough around the edges, warranting further experimentation.

A Case for Blind Submissions (Sometimes)

While ICLR's model is fascinating, double-blind reviews still have a place. This year's ICLR conference appears to be run with integrity, with papers receiving high-quality critical reviews regardless of the celebrity of their authors. However, absent double-blind conferences to compare against, it might eventually grow difficult to gauge the objectivity of single-blind conferences.

Conclusions

In all, ICLR offers a refreshing, cavalier model in the otherwise homogeneous landscape of academic publication. ICLR addresses the modern publication environment, acknowledging and embracing arXiv.org, while many other conferences ignore the issue, carrying on with the (often false) pretense of blind submissions. I believe that ICLR will improve all other conferences by producing data on both what works and what doesn't. Less adventurous organizers can still learn from ICLR's experiments. Further, ICLR can accelerate the career development of many young machine learning researchers, providing them with an unprecedented opportunity to learn from both positive and negative reviews of hundreds of papers, where previously they would sink months at a time into acquiring each new data point. Hopefully, ICLR will continue both to improve and to experiment. But please, kill the star ratings.

Zachary Chase Lipton is a PhD student in the Computer Science and Engineering department at the University of California, San Diego. Funded by the Division of Biomedical Informatics, he is interested in both theoretical foundations and applications of machine learning. In addition to his work at UCSD, he has interned at Microsoft Research Labs and as a Machine Learning Scientist at Amazon, is a Contributing Editor at KDnuggets, and has signed on as an author at Manning Publications.
