Reproducible science provides the
critical standard by which published results are judged and central findings are
either validated or refuted.
Reproducibility also allows
others to build upon existing work and use it to test new ideas and develop
methods. Advances over the years have resulted in the development of complex
methodologies that allow us to collect ever-increasing amounts of data.
While repeating expensive studies
to validate findings is often difficult, a whole host of other reasons have
contributed to the problem of reproducibility. One such reason has been the
lack of detailed access to the underlying data and statistical code used for
analysis, access that would give others the opportunity to verify findings. In an
era rife with costly retractions, scientists have an increasing burden to be
more transparent in order to maintain their credibility. While post-publication
sharing of data and code is on the rise, driven in part by funder mandates and
journal requirements, access to such research outputs is still not very
common.
By sharing detailed and versioned
copies of one’s data and code, researchers can not only ensure that reviewers
can make well-informed decisions, but also provide opportunities for such
artifacts to be repurposed and brought to bear on new research questions.
Opening up access to the data and software, not just the final publication, is
one of the goals of the open science movement. Such sharing can lower barriers and
serve as a powerful catalyst to accelerate progress. In an era of limited
funding, there is a need to leverage existing data and code to the fullest
extent to solve both applied and basic problems. This requires that scientists
share their research artifacts more openly, with reasonable licenses that
encourage fair use while providing credit to original authors.
Besides overcoming the social
challenges around sharing, researchers can also leverage existing technologies to
increase reproducibility. All scientists use version control in one form or
another at various stages of their research projects, from the data collection
all the way to manuscript preparation. This process is often informal and haphazard,
with multiple revisions of papers, code, and datasets saved as duplicate
copies with uninformative file names (e.g., draft 1.doc, draft 2.doc). As authors
receive new data and feedback from peers and collaborators, maintaining those
versions and merging changes can result in an unmanageable proliferation of files.
One solution to these problems would be to use a formal version control system
(VCS); such systems have long been used in the software industry to manage code.
A key feature common to all types
of VCS is the ability to save versions of files during development along with
informative comments, which are referred to as commit messages.
Every change and its accompanying
notes are stored independently of the files, which obviates the need for
duplicate copies. Commits serve as checkpoints to which individual files or an
entire project can be safely reverted when necessary. Most traditional VCS
are centralized, which means that they require a connection to a central server
that maintains the master copy. Users with appropriate privileges can check
out copies, make changes, and upload them back to the server.
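To make the idea of commits as checkpoints concrete, the following is a minimal sketch in Python. It is a toy model, not how Git or any real VCS is implemented: the ToyRepo and Commit classes, the file name analysis.R, and the author name are all hypothetical, and the history is held in memory rather than on disk. It only illustrates that each commit records a snapshot together with a message, separately from the working files, so that a file can later be restored without keeping duplicate copies by hand.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    """One checkpoint: a snapshot of the tracked files plus a commit message."""
    commit_id: str
    parent: Optional[str]      # id of the previous commit, None for the first one
    author: str
    message: str
    snapshot: Dict[str, str]   # file name -> file contents at this checkpoint


class ToyRepo:
    """Hypothetical in-memory stand-in for a repository (illustration only)."""

    def __init__(self) -> None:
        self.files: Dict[str, str] = {}        # the working copy being edited
        self.commits: Dict[str, Commit] = {}   # history, kept apart from the files
        self.head: Optional[str] = None        # id of the most recent commit

    def commit(self, author: str, message: str) -> str:
        """Save the current state of all files as a new checkpoint."""
        snapshot = dict(self.files)
        payload = repr((self.head, author, message, sorted(snapshot.items())))
        commit_id = hashlib.sha1(payload.encode("utf-8")).hexdigest()[:10]
        self.commits[commit_id] = Commit(commit_id, self.head, author, message, snapshot)
        self.head = commit_id
        return commit_id

    def revert_file(self, name: str, commit_id: str) -> None:
        """Restore one file in the working copy to its state at a given checkpoint."""
        self.files[name] = self.commits[commit_id].snapshot[name]

    def log(self) -> List[Commit]:
        """Return the history, newest commit first, by following parent links."""
        history = []
        current = self.head
        while current is not None:
            entry = self.commits[current]
            history.append(entry)
            current = entry.parent
        return history


# Example usage with made-up content.
repo = ToyRepo()
repo.files["analysis.R"] = "model <- lm(y ~ x)"
first = repo.commit("A. Researcher", "Fit initial linear model")

repo.files["analysis.R"] = "model <- lm(y ~ x + z)"
second = repo.commit("A. Researcher", "Add covariate z after reviewer feedback")

# Go back to the first checkpoint for this one file; no draft copies are needed.
repo.revert_file("analysis.R", first)
messages = [entry.message for entry in repo.log()]
print(messages)
```

Real systems store changes far more efficiently than the whole-file snapshots used here; the sketch is only meant to show why informative commit messages and a recorded history can replace a folder of numbered drafts.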
Among the suite of version
control systems currently available, Git stands out in particular because it
offers features that make it desirable for managing artifacts of scientific
research. The most compelling feature of Git is its decentralized and
distributed nature. Every copy of a Git repository can serve either as the
server (a central point for synchronizing changes) or as a client.
This ensures that there is no
single point of failure. Authors can work asynchronously without being
connected to a central server and synchronize their changes when possible. This
is particularly useful when working from remote field sites where internet
connections are often slow or non-existent. Unlike centralized VCS, every copy of a
Git repository carries a complete history of all changes, including authorship,
which can be viewed and searched by anyone. This feature allows new authors to
build from any stage of a versioned project. Git also has a small footprint and
nearly all operations occur locally.
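The distributed workflow described above can be sketched with another toy model in Python. Everything here is an assumption made for illustration: the ToyCopy class, its clone and pull methods, and the author names are hypothetical and do not mirror Git's actual storage format or commands. The sketch shows only the core idea that every copy holds the complete history, that work can be committed while offline, and that copies are synchronized later.

```python
import copy
import hashlib
from typing import Dict, List, Optional, Tuple

# A commit record is (parent id, author, message); its id is derived from the record.
CommitRecord = Tuple[Optional[str], str, str]


class ToyCopy:
    """Hypothetical stand-in for one copy of a repository (illustration only)."""

    def __init__(self) -> None:
        self.history: Dict[str, CommitRecord] = {}
        self.head: Optional[str] = None

    def commit(self, author: str, message: str) -> str:
        """Record a change locally; no connection to any other copy is needed."""
        record = (self.head, author, message)
        commit_id = hashlib.sha1(repr(record).encode("utf-8")).hexdigest()[:10]
        self.history[commit_id] = record
        self.head = commit_id
        return commit_id

    def clone(self) -> "ToyCopy":
        """Make an independent copy that carries the complete history."""
        return copy.deepcopy(self)

    def ancestors(self, commit_id: Optional[str]) -> List[str]:
        """Return commit ids from the given commit back to the first one."""
        chain = []
        while commit_id is not None:
            chain.append(commit_id)
            commit_id = self.history[commit_id][0]
        return chain

    def pull(self, other: "ToyCopy") -> bool:
        """Fetch missing commits from another copy and fast-forward if possible."""
        self.history.update(other.history)
        if other.head is None or other.head in self.ancestors(self.head):
            return True                      # already up to date
        if self.head is None or self.head in self.ancestors(other.head):
            self.head = other.head           # fast-forward to the newer history
            return True
        return False                         # diverged; a real VCS would merge here


# Example usage with made-up authors and messages.
office = ToyCopy()
office.commit("Ana", "Add sampling protocol")

field_copy = office.clone()                          # taken to a remote field site
field_copy.commit("Ben", "Record plot 1 counts")     # committed while offline
field_copy.commit("Ben", "Record plot 2 counts")

synced = office.pull(field_copy)                     # synchronize once back online
authors = [office.history[cid][1] for cid in office.ancestors(office.head)]
print(synced, authors)
```

Either copy could play the role of the "server" in this sketch, since pull works in both directions. Handling histories that have diverged is deliberately left out, because merging is beyond what a short example can show faithfully.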
By using a formal VCS,
researchers can not only increase their own productivity but also make it easier for
others to fully understand, use, and build upon their contributions. In the
rest of the paper I describe how Git can be used to manage common science
outputs and move on to describing larger use-cases and benefits of this
workflow. Readers should note that I do not aim to provide a comprehensive
review of version control systems or even Git itself. There are also other
comparable alternatives, such as Mercurial and Bazaar, which provide many of the
features described below. My goal here is to broadly outline some of the advantages
of using one such system and how it can benefit individual researchers,
collaborative efforts, and the wider research community.
http://www.scfbm.org/content/8/1/7