Saturday, May 25, 2013

Git can facilitate greater Reproducibility n Transparency - Karthik Ram


Reproducible science provides the critical standard by which published results are judged and central findings are either validated or refuted . 
Reproducibility also allows others to build upon existing work and use it to test new ideas and develop methods. Advances over the years have resulted in the development of complex methodologies that allow us to collect ever increasing amounts of data.

While repeating expensive studies to validate findings is often difficult, a whole host of other reasons have contributed to the problem of reproducibility. One such reason has been the lack of detailed access to under- lying data and statistical code used for analysis, which can provide opportunities for others to verify findings. In an era rife with costly retractions, scientists have an increasing burden to be more transparent in order to maintain their credibility. While post publication sharing of data and code is on the rise, driven in part by funder mandates and journal requirements , access to such research outputs is still not very common 


Some examples of free git repository are:



By sharing detailed and versioned copies of one’s data and code researchers can not only ensure that reviewers can make well-informed decisions, but also provide opportunities for such artifacts to be repurposed and brought to bear on new research questions. Opening up access to the data and software, not just the final publication, is one of goals of the open science movement. Such sharing can lower barriers and serve as a powerful catalyst to accelerate progress. In the era of limited funding, there is a need to leverage existing data and code to the fullest extent to solve both applied and basic problems. This requires that scientists share their research artifacts more openly, with reasonable licenses that encourage fair use while providing credit to original authors .
Besides overcoming social challenges to these issues, existing technologies can also be leveraged to increase reproducibility. All scientists use version control in one form or another at various stages of their research projects, from the data collection all the way to manuscript preparation. This process is often informal and haphazard; where multiple revisions of papers, code, and datasets are saved as duplicate copies with uninformative file names (e.g. draft 1.doc, draft 2.doc). As authors receive new data and feedback from peers and collaborators, maintaining those versions and merging changes can result in an unmanageable proliferation of files. One solution to these problems would be to use a formal Version Control System (VCS), which have long been used in the software industry to manage code.
A key feature common to all types of VCS is that ability save versions of files during development along with informative comments which are referred to as commit messages.
Every change and accompanying notes are stored independent of the files, which obviates the need for duplicate copies. Commits serve as checkpoints where individual files or an entire project can be safely reverted to when necessary. Most traditional VCS are centralized which means that they require a connection to a central server which maintains the master copy. Users with appropriate privileges can check out copies, make changes, and upload them back to the server.
Among the suite of version control systems currently available, Git stands out in particular because it offers features that make it desirable for managing artifacts of scientific research. The most compelling feature of Git is its decentralized and distributed nature. Every copy of a Git repository can serve either as the server (a central point for synchronizing changes) or as a client.

This ensures that there is no single point of failure. Authors can work asynchronously without being connected to a central server and synchronize their changes when possible. This is particularly useful when working from remote field sites where internet connections are often slow or non-existent. Unlike other VCS, every copy of a Git repository carries a complete history of all changes, including authorship, which can be viewed and searched by anyone. This feature allows new authors to build from any stage of a versioned project. Git also has a small footprint and nearly all operations occur locally.
By using a formal VCS, researchers can not only increase their own productivity but also make it for others to fully understand, use, and build upon their contributions. In the rest of the paper I describe how Git can be used to manage common science outputs and move on to describing larger use-cases and benefits of this workflow. Readers should note that I do not aim to provide a comprehensive review of version control systems or even Git itself. There are also other comparable alternatives such as Mercurial and Bazaar which provide many of the features described below. My goal here is to broadly outline some of advantages of using one such system and how it can benefit individual researchers, collaborative efforts, and the wider research community.

http://www.scfbm.org/content/8/1/7


No comments: