So, there’s a project from the University of Arizona that’s trying to investigate CS papers and see how reproducible they are.
It’s a cool project. They made a pretty picture, to be sure! The problem is, their pretty picture is… pretty misleading when it comes to build failures.
There are many more like this. I took a cursory look through their results, and here are a couple of categories:
Path setup failures:
- This one, which obviously just has to do with the Qt setup.
- Postgres path not provided properly
Miscellaneous student errors:
- This one, this one, and this one all appear to be a text file encoding issue. (student1 really messed up these results.)
- This one seems to need GCC’s hash_map extension (__gnu_cxx), yet is being run through Visual Studio (see the sketch after this list).
- Boost failed to build?!
- Hardware not available — counted as a build failure!?
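For context on that hash_map failure: hash_map was never part of standard C++. GCC shipped it as an extension in <ext/hash_map> under the __gnu_cxx namespace, while Visual Studio shipped its own version in <hash_map> under stdext, so code written against one toolchain simply won’t compile on the other. Below is a minimal sketch of the kind of portability shim the project presumably lacked; the portable namespace and everything else here is my own illustration, not code from the study.

```cpp
// Portability shim for the pre-C++11 hash_map extension (illustrative only).
// GCC: <ext/hash_map>, namespace __gnu_cxx.
// Visual Studio: <hash_map>, namespace stdext.
#include <string>

#if defined(__GNUC__)
#  include <ext/hash_map>
namespace portable { using __gnu_cxx::hash_map; }
#elif defined(_MSC_VER)
#  include <hash_map>
namespace portable { using stdext::hash_map; }
#endif

int main() {
    // Code written against only one vendor's extension fails to build
    // on the other toolchain; the shim above papers over the difference.
    portable::hash_map<int, std::string> notes;
    notes[42] = "builds under both GCC and Visual Studio";
    return 0;
}
```

(C++11’s std::unordered_map eventually replaced both vendor extensions, which is the real fix today; the point is that this is a known portability wrinkle, not an unbuildable system.)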
I’m a little sad that this is what they decided to put out. Their definition of “reasonable effort” is:
In our experiments we instructed the students to spend no more than 30 minutes on building the systems. In many cases this involved installing additional libraries and compilers, editing makefiles, etc. The students were also instructed to be liberal in their evaluations, and, if in doubt, mark systems as buildable.
Yet, as you can see above, it appears nobody actually looked at what the students said. They were taken at their word, and they were far from liberal!
Moreover, I disagree that an undergrad given 30 minutes is a good check, especially considering the quality of the above results.
Shaming researchers through bad methodology like this isn't going to get more people to put out code.
I love the idea of this investigation, but I don’t love the methodology.
Edited to Add: One thing that I did love from this project was their proposal in the technical report:
Our proposal is therefore much more modest. Rather than forcing authors to make their systems available, we propose instead that every article be required to specify the level of reproducibility a reader or reviewer should expect.
This is a worthwhile baby step in and of itself, and should probably be integrated into paper classification systems. This would eventually provide evolutionary pressure towards reproducibility.