Discuss what we will consider a GPF

From Geoscience Paper of the Future
Revision as of 15:35, 9 April 2015


Science is Changing

Opening Science

Reproducibility

Scientific articles describe computational methods informally, often requiring significant effort from others to reproduce and reuse them. Reproducibility is a cornerstone of the scientific method, so it is important that reproducibility be possible not just in principle but in practice, in terms of the time and effort required of both the original team and the reproducers. The reproducibility process can be so costly that it has been referred to as “forensic” research [Baggerly and Coombes 2009]. Studies have shown that reproducibility is not achievable from the article itself, even when datasets are published [Bell et al 2009; Ioannidis et al 2009]. Retractions of publications do occur, more often than is desirable [The Scientist 2010]. A recent editorial proposed tracking the “retraction index” of scientific journals to indicate the proportion of published articles that are later found problematic [Fang and Casadevall 2011]. Publishers themselves are asking the community to end “black box” science that cannot be easily reproduced [Nature 2006]. The impact of this issue extends well beyond scientific research circles. Clinical trials based on erroneous results pose significant threats to patients [Hutson 2010]. The validity of scientific research methods has been called into question [Lehrer 2010]. The public has neutral to low trust in scientists on important topics such as flu pandemics, depression drugs, and the causes of autism [Scientific American 2010]. Pharmaceutical companies have reported millions of dollars in losses due to irreproducible results that initially seemed promising and worth investing in [Naik 2011].

Computational reproducibility is a relatively modern concept. The Stanford Exploration Project, led by Jon Claerbout, published an electronic book containing a dissertation and other articles from their geosciences lab [Claerbout and Karrenbach 1992]. Papers were accompanied by zipped files with the code that could be used to reproduce the results, and a methodology was developed to create and manage all these objects that continues today with the Madagascar software [Schwab et al 2000]. Advocates of reproducibility have sprung up over the years in many disciplines, from signal processing [Vandewalle et al 2009] to psychology [Spies et al 2012]. Organized community efforts include reproducibility tracks at conferences [Manolescu et al 2008; Bonnet et al 2011; Wilson et al 2012], reproducibility editors in journals [Diggle and Zeger 2009], and numerous community workshops and forums (e.g., [Bourne et al 2012]). Active research in this area addresses a range of topics including copyright [Stodden 2009], privacy [Baker et al 2010], and social [Yong 2012] and validation issues [Guo 2012].

Scientific publications could be extended to incorporate computational workflows, just as many already include data [Bourne 2010]; without access to the source code behind a paper, reproducibility has been shown to be elusive [Hothorn and Leisch 2011]. This would make scientific results more easily reproducible, because articles would contain not just a textual description of the computational process but also a workflow that, as a computational artifact, could be inspected and automatically re-executed. Some systems exist that augment publications with scripts or workflows, such as Weaver for LaTeX [Leisch 2002; Falcon 2007] and GenePattern for MS Word [Mesirov 2010]. Many scientific workflow systems now include the ability to publish provenance records (including Kepler, Taverna, VisTrails, Pegasus, Triana, and Wings). The Open Provenance Model was developed by the scientific workflow community and is extensively used [Moreau et al 2011]. Repositories of shared workflows enable scientists to reuse workflows published by others and facilitate reproducibility [De Roure et al 2009]. An alternative is the publication of workflows as open web objects using semantic web technologies [Missier et al 2010; Garijo and Gil 2011].
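To make the idea of a machine-readable provenance record concrete, here is a minimal sketch in the spirit of the W3C PROV-JSON serialization, built with only the Python standard library. It is not tied to any of the workflow systems named above, and the identifiers (`ex:raw_data`, `ex:clean_data`, `ex:filtering`) are hypothetical placeholders.

```python
import json

# A minimal, PROV-JSON-style provenance record for one workflow step:
# an activity (outlier filtering) that used one entity (raw data) and
# generated another (cleaned data). All names are illustrative.
record = {
    "prefix": {"ex": "http://example.org/"},
    "entity": {
        "ex:raw_data":   {"prov:label": "Raw sensor measurements"},
        "ex:clean_data": {"prov:label": "Quality-controlled measurements"},
    },
    "activity": {
        "ex:filtering": {"prov:label": "Outlier filtering step"},
    },
    "used": {
        "_:u1": {"prov:activity": "ex:filtering", "prov:entity": "ex:raw_data"},
    },
    "wasGeneratedBy": {
        "_:g1": {"prov:activity": "ex:filtering", "prov:entity": "ex:clean_data"},
    },
}

# Serializing the record makes it an inspectable web object that can be
# published alongside the paper and, in principle, re-executed or audited.
serialized = json.dumps(record, indent=2)
```

A record like this is what a workflow system would emit automatically; the point is that, unlike a prose methods section, it can be parsed and traversed by software.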

New Frameworks to Create a New Generation of Scientific Articles

Several frameworks have been developed to document scientific articles so that they are more useful to researchers than a simple PDF. These include the IPython Notebook, Weaver (for R), and others.

Elsevier has invested in several initiatives in this direction. They carried out an Executable Papers Challenge. They introduced a new article type, the software paper. They also publish "articles of the future" in different disciplines (see, e.g., their paleontology example), in which the figures are interactive and can easily be downloaded for slide presentations, the citations are hyperlinked, and so on. These efforts are complementary to what we are trying to do here.

The Case of the Tuberculosis Drugome

This is a case in which the work from a previously published paper was reproduced using a workflow system, with the data and software made explicit and published as linked open data in RDF (i.e., as accessible Web objects in the Semantic Web). The data were assigned DOIs, as was the workflow.
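Once a dataset or workflow has a DOI, it becomes citable by machines as well as people: the doi.org resolver supports content negotiation, so a client can request a formatted citation instead of a landing page. The sketch below only constructs such a request (it is never sent), and the DOI shown is a hypothetical placeholder, not the Drugome dataset's actual identifier.

```python
from urllib.request import Request

# DOIs resolve through https://doi.org/. With content negotiation, a client
# can ask registries such as DataCite and Crossref for a ready-made citation.
doi = "10.1234/example-dataset"  # hypothetical DOI, for illustration only
req = Request(
    "https://doi.org/" + doi,
    headers={"Accept": "text/x-bibliography; style=apa"},
)
# urllib.request.urlopen(req) would return an APA-formatted citation string
# for DOIs registered with an agency that supports content negotiation.
```

This is one reason the GPF effort emphasizes DOIs over plain URLs: the identifier carries its own citation metadata.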

Looking at the Future

The Vision

In the future, scientists will use radically new tools to generate papers. As scientists do their work, those tools will document the work and all the associated digital objects (data, software, etc.) so that, when it comes time to publish a paper, everything will already be documented and easy to include. Today, several research tools exist for working in this way, but they are not routinely used and do not always fit scientists' research workflows.

In the future, publishers will accept submissions that do not just contain PDF but also data, software, and other digital objects relevant to the research. Today, many journals accept datasets together with papers, some journals accept software and software papers, but no journal includes the full details of the data, software, workflow, and visualizations of a paper.

In the future, readers of papers will be able to interact with the paper document: modify its figures to explore the data, reproduce the results, or run the method on new data. Today, readers simply get a static paper, and even if the data are available they have to download and analyze them themselves.

In the future, data producers and software developers will get credit for the work that they do, because all publications that build on their work will acknowledge it through citations. Today, there is limited credit and reward for those who create data and software.

What is a Geoscience Paper of the Future?

A paper is one thing (think of it as a larger wrapper, with a conceptual framework) as opposed to the smaller bits (code, datasets, individual figures) that are updated along the way (e.g., get associated with your ORCID). We do not want to fall into the stigma of least publishable units. We also recognize that there are different types of publications (letter, full paper, etc.) for different-sized contributions.

A GPF paper includes:

  • data: documented, deposited in a public repository, with a license specified (open if possible), and cited with DOIs
  • software: documented, deposited in a public repository, with a license specified (open source if possible), and cited with DOIs
  • provenance: explicitly documented as a workflow sketch, a formal workflow, or a provenance record (in PROV or a similar standard), possibly deposited in a shared repository and given a DOI
  • figures/visualizations: generated by explicit code (if possible) as the result of a workflow or provenance record. (The published figure may be a "prettified" version of the one the workflow produces.)
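The last two items on this checklist can be combined in practice: a figure generated by explicit code can carry a small provenance "sidecar" recording the script that made it, the DOI of the input data, and a checksum of the exact bytes used. The sketch below uses only the standard library; the file names, the inline data, the DOI, and the record layout are all hypothetical placeholders, not a prescribed GPF format.

```python
import hashlib
import json

# Stand-in for the contents of a deposited data file (hypothetical values).
input_data = b"depth_m,temp_c\n10,14.2\n20,11.7\n30,9.5\n"

# The "figure": here just the (depth, temperature) series a plotting
# script would draw, derived entirely by explicit code from the data.
rows = [line.split(",") for line in input_data.decode().splitlines()[1:]]
series = [(float(d), float(t)) for d, t in rows]

# Provenance sidecar to publish alongside the figure. The script name and
# DOI are illustrative; the checksum ties the figure to the exact input bytes.
sidecar = {
    "figure": "figure1_temperature_profile",
    "generated_by": "make_figure1.py",       # hypothetical script name
    "input_doi": "10.1234/example-dataset",  # hypothetical DOI
    "input_sha256": hashlib.sha256(input_data).hexdigest(),
}
sidecar_json = json.dumps(sidecar, indent=2)
```

With a sidecar like this, a reader who re-runs the script on the DOI-cited data can verify, via the checksum, that they started from the same input the authors did.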

Not all GPF papers will be able to satisfy all of these requirements. For example, some collaborators may not want to release the data, or some of the software. In those cases, the paper will explain the issues encountered in attempting these releases and the challenges they pose for the future of open, reproducible publications.