CS Conference Simulation
Serving as Program Committee Chair of OOPSLA 2013 has been an enlightening experience, and I’d like to share one more insight I’ve gained (I wrote about another one here). Specifically, I want to focus on the acceptance rate.
The acceptance rate of CS conferences tends to gravitate around the magic 25% number, sometimes a little more, sometimes a little less. Why is this? The fact is even more puzzling because, as far as I can remember, I never heard a PC Chair say explicitly “We’re shooting for 25% acceptance rate.” It just happens. Are we all predisposed to accept 1 out of 4 papers submitted to any conference? But if so, and given that reviewers often disagree on their scores, how do we reach that magic ballpark figure every time, even with differences of opinion?
In order to understand this phenomenon, here’s an 8-line Python program that simulates reviewers’ scores. My model is very simple; I assume two main things: (1) reviews are independent of each other and independent of the papers; and (2) reviewers are 3 times more likely to give Bs and Cs than to give As and Ds. Here is the code:
import string
import random
choices = 'ABBBCCCD'
scores = '\n'.join(''.join(sorted(random.choice(choices) for x in range(3))) for y in range(200))
f = open('papers.csv', 'w')
f.write('Scores\n')
f.write(scores)
f.close()
This model generates a score distribution that is very similar to the scores I saw for OOPSLA this year. The program creates a CSV file that you can open in Excel and order in whatever way you like. If you don’t want to run it, here’s a sample output. If you run it, you can play with the several parameters there, and make different assumptions.
OK, so what does this show? First, it shows that, under this model, papers scored AAx or ABx, the papers that have the highest chances of being accepted, account for roughly 22.5% of submissions, just under 25%, statistically speaking. This is already a strong hint for the 25% puzzle. But not all of these papers are accepted, so the story doesn’t end there. Papers with at least one A account for about 33% of submissions. These are the papers that tend to gain the attention of Program Chairs and to be discussed at the PC meetings. Papers without an A may or may not make it to the discussion, depending on a number of factors, the most important one being the limited time of those meetings (2 days at most).
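These fractions are easy to check with a larger sample than the 200-paper draw above. The sketch below reuses the same generative rule; the exact probabilities under this model are 121/512 ≈ 23.6% for AAx/ABx and 169/512 ≈ 33.0% for at-least-one-A, and a single 200-paper draw fluctuates a few points around those.

```python
import random

random.seed(1)  # reproducible sample
choices = 'ABBBCCCD'  # same reviewer model as the script above
N = 100_000

papers = [''.join(sorted(random.choice(choices) for _ in range(3)))
          for _ in range(N)]

top = sum(p.startswith(('AA', 'AB')) for p in papers) / N  # AAx or ABx
any_a = sum('A' in p for p in papers) / N                  # at least one A

print(f'AAx/ABx: {top:.1%}   at least one A: {any_a:.1%}')
```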
Here lies perhaps one of the most important constraints of the situation. I’ve heard from others that one can’t meaningfully discuss more than 100 papers in these 2-day meetings, and even that is a rush. For a 200-submission conference, the line at exactly the middle tends to fall squarely among the B-range papers, papers that have some combination of Bs and Cs but no As. So many times, PC Chairs move the line up to discuss fewer papers and give more time to the papers that are discussed. How far up? Well, it depends on each concrete submission set and each PC Chair.
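The arithmetic behind that squeeze is easy to check. Under the model, about a third of papers get at least one A, so for 200 submissions that is only about 66 papers; filling a 100-paper discussion list then forces roughly 34 no-A papers onto it. The numbers below are expectations, not a simulation:

```python
# Expected composition of a 200-submission pool under the 'ABBBCCCD' model.
n_papers = 200
p_any_a = 1 - (7 / 8) ** 3           # P(at least one A), about 33%
with_a = round(n_papers * p_any_a)   # about 66 papers

print(f'papers with at least one A: ~{with_a}')
print(f'no-A papers needed to fill a 100-paper discussion list: ~{100 - with_a}')
```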
PC Chairs, like journal editors, play a very important role here. They are the ones who decide which papers are discussed and which papers are rejected without face time discussion. So, here’s a first interesting question: if you were a PC Chair for one of these simulated score sets, and you had 12 hours to discuss papers, where would you draw the line(s)?
Let’s say some number between 33% and 40% of submissions are discussed. Some percentage of those don’t survive the discussion and end up being rejected. The papers that don’t have at least one A are particularly at risk, because no one feels particularly inclined to defend them against the attacks, and no paper is immune to criticism. Some papers that have an A are also at risk, especially those where the other 2 reviewers gave them a C or lower.
Again, the PC Chair has an enormous influence here too. (S)he can project a climate of tolerance or of strictness, and everything in between. The behavior of the PC Chair influences the probabilities of acceptance/rejection of the papers that are discussed.
I didn’t simulate the final decision part of the process; I leave it as an exercise to the reader! But taking all of this stochastic process into consideration, and assuming a PC Chair somewhere in the middle of the road, we end up at around 20% for the papers that survive all the hurdles!
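For readers who take up the exercise, here is one possible sketch of the decision stage. The survival probabilities are pure guesses of mine, not derived from any real PC data: only papers with an A among their two best scores get discussed, and stronger score pairs survive the discussion more often.

```python
import random

random.seed(2)
choices = 'ABBBCCCD'
N = 100_000

# Hypothetical survival odds for discussed papers, keyed on the two best
# scores; papers with no A up front are rejected without discussion.
survival = {'AA': 0.9, 'AB': 0.7, 'AC': 0.3, 'AD': 0.1}

accepted = 0
for _ in range(N):
    paper = ''.join(sorted(random.choice(choices) for _ in range(3)))
    if random.random() < survival.get(paper[:2], 0.0):
        accepted += 1

print(f'acceptance rate: {accepted / N:.1%}')
```

With these particular guesses the rate comes out close to 20%, but that is a property of my made-up parameters; other plausible survival odds shift it a few points either way.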
OK, now that you have played with a simulated OOPSLA, we can raise a few more questions. Here are some:
Does my guess about the likelihood of A/B/C/D reflect reality? I don’t know; I only have 1 data point, but the model is pretty close to that data point. And if so, is this a consequence of the quality of the papers themselves or of the predisposition of reviewers? In my model, it’s a consequence of the reviewers: the papers don’t matter for the choice of score. [UPDATE 2013-05-28 My model does not predict the scores of individual papers; it simply models the distribution of scores. In reality, the scores of papers are a combination of the papers themselves and the predisposition of reviewers to give A/B/C/D (a Naive Bayes classification model, thanks to Hitesh for pointing it out).] The features of the papers that lead to the score distribution don’t matter for this analysis. What this analysis shows is that if reviewers behaved differently (e.g. gave more positive than negative reviews), we would end up with a different score distribution, and very likely with a different acceptance rate.
What happens when conferences grow? ICSE, for example, has been getting 400+ submissions in recent years. Its acceptance rate is really low, under 20%. Assuming my behavioral model of reviewers, 1/3 of that (i.e. papers with “at least one A”) is about 133, which is already above the 100-paper discussion threshold. So what happens in those PC meetings? (ICSE has recognized this problem and is moving to a new model with a 2-layer Program Committee, though I’m still not sure how it will work.)
Is the f2f PC meeting imposing an artificial constraint on the process? What would happen if we eliminated that serial interaction and moved to online discussions? A couple of data points here: in the last couple of years, OOPSLA has done that for the subset of papers whose authors are members of the Program Committee. Interestingly, this hasn’t affected the acceptance rate of those papers in any way: again around 20% for them too. For large conferences like ICSE, though, this might make a difference.
What would happen if, instead of 3 reviewers per paper, we had 5? OK, well, that would likely be unfeasible in practice, because of the enormous reviewing load. But ignoring that, would having 5 reviewers per paper make a difference? With respect to the score distribution, it certainly would: in the scenario of complete independence that my simple (simplistic) program assumes, about 1/2 of the papers would now be likely to get at least one A. That would give them a ticket to the discussion table, and would prevent the others without an A from being discussed, for lack of time. But would they survive the discussion? My guess is that the end result, statistically speaking, would be about the same, because the number of papers that have a majority of positive scores (As and Bs) in that scenario is, again, around 1/3. Given that some of the papers with mostly positive reviews don’t survive, we’d be gravitating towards 20% again.
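The at-least-one-A fraction for 5 reviewers is a one-liner to verify under the same model (the function name is mine):

```python
# Probability that a paper gets at least one A, under the 'ABBBCCCD'
# model (P(A) = 1/8), for a given number of reviewers.
def at_least_one_a(n_reviewers, p_a=1/8):
    return 1 - (1 - p_a) ** n_reviewers

print(f'3 reviewers: {at_least_one_a(3):.1%}')  # ~33%
print(f'5 reviewers: {at_least_one_a(5):.1%}')  # ~49%, about half
```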
What would happen if reviewers reviewed fewer papers? Here’s my hypothesis: I suspect that instead of being 3x more likely to give Bs and Cs than As and Ds, reviewers would be 4x more likely to do so, because they wouldn’t have a good reference point for ‘average’; everyone would be a lot more cautious in general. So the ‘ABBBCCCD’ choices become ‘ABBBBCCCCD’. This would reduce the number of papers with an A, and increase the number of papers in the middle. (My simplistic model predicts what happened with the papers reviewed by the ERC quite accurately, which is scary…) Conversely, if reviewers reviewed more papers, this might have the opposite effect: reviewers might become more assertive about their positives and negatives. So something like ‘ABBCCD’. If we use that in the model, we end up with about half of the papers having an A, but also with an increase in the number of ‘clashes’ between reviewers, A-D papers. (Does that happen at POPL? I don’t know. The reviewing load is really high there, so my hypothesis could be put to the test.)
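Both hypothesized reviewer pools can be plugged straight into the same generative rule. This sketch compares them on the two quantities discussed above, papers with at least one A and A-D clashes; the pool strings are from the text, everything else is my own scaffolding:

```python
import random

random.seed(3)
N = 100_000
pools = ['ABBBCCCD', 'ABBBBCCCCD', 'ABBCCD']  # baseline, cautious, assertive

results = {}
for pool in pools:
    papers = [[random.choice(pool) for _ in range(3)] for _ in range(N)]
    results[pool] = (
        sum('A' in p for p in papers) / N,               # at least one A
        sum('A' in p and 'D' in p for p in papers) / N,  # A-D clashes
    )

for pool, (any_a, clash) in results.items():
    print(f'{pool:>10}: at least one A {any_a:.1%}, A-D clashes {clash:.1%}')
```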
I’m not entirely sure what to make of all of this. I wanted to understand the gravitation towards 25%, and the generative model in those 8 lines of code sheds some light on it by simulating reviewers’ behavior in specific what-if scenarios. Models, obviously, are always approximations of reality. I’m not sure the community wants to change anything; maybe everyone is happy with the 25-ish% acceptance rates. But if we wanted to change something (that’s a big IF), this model makes it clear that there are a few concrete parameters that affect the dynamics of the process.