For a number of years, we have been developing a protocol for comprehensively documenting all the steps of data management and analysis that go into an empirical research paper. We teach this protocol every semester to undergraduates writing research papers in our introductory statistics classes, and students writing empirical senior theses use our protocol to document their work with statistical data. The protocol specifies a set of electronic files—including data files, computer command files, and metadata—that students assemble as they conduct their research, and then submit along with their papers or theses.

This website contains information about various aspects of the protocol: the principles that underlie it, the nuts and bolts of using it to document a statistical study, what we have learned from introducing the protocol to undergraduates, and our outreach efforts to encourage other instructors of statistical methods to incorporate instruction in good data management and documentation practices into their courses.

We use the term "soup-to-nuts" to describe our protocol because it aims to achieve "complete replicability." The guiding principle is that the documentation of a statistical study should be complete and transparent enough to allow an interested third party to easily and exactly reproduce every step of data management and analysis that led from the original data files to the results reported in the paper. By contrast, most academic journals that maintain online repositories of documentation for the empirical studies they publish require only that authors submit the processed data used for analysis (after cleaning, merging, etc.), along with computer code that generates the results in the paper from those processed data. This standard of "partial replicability" leaves the steps that transform the original data files into the files used for analysis entirely undocumented.
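
To make the distinction concrete, here is a minimal sketch, written in Python with pandas, of the two stages that a complete-replicability archive scripts from end to end. The protocol itself is software-agnostic, and all file, folder, and variable names below are hypothetical. Under the partial-replicability standard, only Stage 2 (together with the processed data file it reads) would be archived; Stage 1 would go undocumented.

    import numpy as np
    import pandas as pd

    # --- Stage 1: data processing (documented only under complete
    # replicability). Read the original data files exactly as they
    # were first obtained.
    survey = pd.read_csv("original-data/survey_raw.csv")
    incomes = pd.read_csv("original-data/incomes_raw.csv")

    # Clean, merge, and define the new variables used in the analysis.
    survey = survey.dropna(subset=["respondent_id"])
    merged = survey.merge(incomes, on="respondent_id", how="inner")
    merged["log_income"] = np.log(merged["income"])

    # Save the final dataset actually used for analysis.
    merged.to_csv("analysis-data/analysis_data.csv", index=False)

    # --- Stage 2: analysis (all that partial replicability requires).
    # Regenerate the reported results from the processed data.
    data = pd.read_csv("analysis-data/analysis_data.csv")
    print(data.groupby("region")["log_income"].describe())  # e.g., Table 1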

Complete details of the protocol can be found in the "Instructions for Students" document posted on this website under Article Appendixes. The main elements of the documentation specified by the protocol include:

  • All the original data files from which any data used in the study were extracted. These files are preserved in the form in which they were first obtained, before they were cleaned or modified in any way. (In some cases, it is also necessary to save metadata with coding or other information explaining how to interpret the data in the original files.)
  • Metadata files, containing additional information a user would need to understand the data, such as variable definitions, units of measurement, coding schemes and sampling methods.
  • A set of computer command files, written in the language of whatever statistical software is being used, containing all the instructions needed to access the data in the original files, to process those data as necessary (cleaning, merging, defining new variables, etc.) to construct the final dataset(s) used for analysis, and finally to generate all the statistical results (numbers reported in the text, tables, and figures) presented in the paper.
  • A read-me file that gives instructions for using the data and command files to replicate the results reported in the paper.
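
As an illustration, the files specified above might be assembled in a folder organized along the following lines. The folder and file names here are purely illustrative; the "Instructions for Students" document specifies the actual naming and organization the protocol requires.

    replication-documentation/
        read-me.txt         instructions for replicating the results
        original-data/      data files exactly as first obtained
        metadata/           codebooks, variable definitions, sampling notes
        command-files/      scripts for processing the data and generating results
        analysis-data/      final dataset(s) constructed by the command files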

Who Are We?

Richard Ball and Norm Medeiros, both of Haverford College, began collaborating on the work described on this website more than a decade ago. Richard Ball is an associate professor of economics, and Norm Medeiros is the associate librarian.

We would welcome questions and comments about any aspect of the work described on this website. Please feel free to email us at TIER@Haverford.edu.

Richard Ball
Associate Professor of Economics
Haverford College

Norm Medeiros
Associate Librarian
Haverford College