This page discusses what we will consider a Geoscience Paper of the Future (GPF).

Science is Changing: Making Research Accessible

Opening Science

The US Office of Science and Technology Policy released a memorandum requiring all agencies to develop plans for releasing data and other results from federally funded scientific research [Holdren 2013]. The memo emphasizes that scientific data has tremendous value, so much so that it is declared "an asset for progress" [Holdren 2013]. Other governments are pursuing similar initiatives. As a result, federal funding agencies are developing new programs to make research results widely available [NSF 2013].

Although scientists cite many reasons why sharing data is difficult [Tenopir et al 2011], there are many good reasons to share it. Sharing data keeps us honest and improves peer review and mutual validation [Corbyn 2012; Krugman 2013; Baggerly and Coombes 2009]. Another reason is the increased reputation and citation of research whose datasets are made available [Piwowar et al 2007]. Similar arguments can be made for making software available [CITE].

Reproducibility

Scientific articles describe computational methods informally, often requiring significant effort from others to reproduce and reuse them. Reproducibility is a cornerstone of the scientific method, so it is important that reproducibility be possible not just in principle but in practice, at a reasonable cost in time and effort both for the original team and for those reproducing the work. The reproduction process can be so costly that it has been referred to as “forensic” research [Baggerly and Coombes 2009]. Studies have shown that reproducibility is not achievable from the article itself, even when datasets are published [Bell et al 2009; Ioannidis et al 2009]. Retractions of publications do occur, more often than is desirable [Scientist 2010]. A recent editorial proposed tracking the “retraction index” of scientific journals to indicate the proportion of published articles that are later found problematic [Fang and Casadevall 2011]. Publishers themselves are asking the community to end “black box” science that cannot be easily reproduced [Nature 2006]. The impact of this issue extends well beyond scientific research circles. Clinical trials based on erroneous results pose significant threats to patients [Hutson 2010]. The validity of scientific research methods has been called into question [Lehrer 2010]. The public has neutral to low trust in scientists on important topics such as flu pandemics, depression drugs, and the causes of autism [Scientific American 2010]. Pharmaceutical companies have reported millions of dollars in losses due to irreproducible results that initially seemed promising and worth the investment [Naik 2011].

Computational reproducibility is a relatively modern concept. The Stanford Exploration Project led by Jon Claerbout published an electronic book containing a dissertation and other articles from their geosciences lab [Claerbout and Karrenbach 1992]. Papers were accompanied by zipped files with the code that could be used to reproduce the results, and a methodology was developed to create and manage all these objects that continues today with the Madagascar software [Schwab et al 2000]. Advocates of reproducibility have sprung up over the years in many disciplines, from signal processing [Vandewalle et al 2009] to psychology [Spies et al 2012]. Organized community efforts include reproducibility tracks at conferences [Manolescu et al 2008; Bonnet et al 2011; Wilson et al 2012], reproducibility editors in journals [Diggle and Zeger 2009], and numerous community workshops and forums (e.g., [Bourne et al 2011]). Active research in this area is addressing a range of topics including copyright [Stodden 2009], privacy [Baker et al 2010], social issues [Yong 2012], and validation [Guo 2012].

Scientific publications could be extended to incorporate computational workflows, just as many already include data [Bourne 2010]. Without access to the source code behind a paper, reproducibility has been shown to be elusive [Hothorn and Leisch 2011]. Including workflows would make scientific results more easily reproducible, because articles would contain not just a textual description of the computational process but also a workflow that, as a computational artifact, can be inspected and automatically re-executed. Some systems already augment publications with scripts or workflows, such as Sweave and weaver for LaTeX documents [Leisch 2002; Falcon 2007] and GenePattern for MS Word [Mesirov 2010]. Many scientific workflow systems now include the ability to publish provenance records (including Kepler, Taverna, VisTrails, Pegasus, Triana, and Wings). The Open Provenance Model was developed by the scientific workflow community and is extensively used [Moreau et al 2011]. Repositories of shared workflows enable scientists to reuse workflows published by others and facilitate reproducibility [De Roure et al 2009]. An alternative is the publication of workflows as open web objects using semantic web technologies [Missier et al 2010; Garijo and Gil 2011].

New Frameworks to Create a New Generation of Scientific Articles

Several frameworks have been developed to document scientific articles so that they are more useful to researchers than a static PDF. These include the IPython Notebook, weaver for R [Falcon 2007], and others.
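As a minimal sketch of the notebook approach (assuming the Python nbformat package; the analysis itself is a hypothetical placeholder), an executable document that interleaves narrative and code could be assembled programmatically as follows:

  # Minimal sketch: build a small executable, notebook-style document.
  # Assumes the nbformat package; the analysis is a hypothetical placeholder.
  import nbformat as nbf

  nb = nbf.v4.new_notebook()
  nb.cells = [
      nbf.v4.new_markdown_cell("## Method\nCompute summary statistics for the sample."),
      nbf.v4.new_code_cell(
          "import statistics\n"
          "samples = [1.2, 3.4, 2.1, 2.8]   # placeholder data\n"
          "print(statistics.mean(samples), statistics.stdev(samples))"
      ),
  ]

  nbf.write(nb, "analysis.ipynb")   # the notebook can be executed, shared, and archived with the paper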

Elsevier has invested in several initiatives in this direction. It ran an Executable Papers Challenge, introduced a new article type for software papers, and publishes "articles of the future" in different disciplines (see this paleontology example), where the figures are interactive and can be easily downloaded for slide presentations, the citations are hyperlinked, and so on. These efforts are complementary to what we are trying to do here.

The Case of the Tuberculosis Drugome

This is a case in which the work reported in a previously published paper was reproduced using a workflow system, with the data and software made explicit and published as linked open data in RDF (i.e., as accessible Web objects in the Semantic Web). The datasets were assigned DOIs, as was the workflow.
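As a minimal sketch of this kind of publication (not the actual Drugome release; it assumes a recent version of the Python rdflib package, and all URIs and DOIs are placeholders), a dataset and the workflow run that used it could be described as linked open data like this:

  # Minimal sketch: describe a dataset and a workflow run as linked open data.
  # Assumes a recent rdflib (which ships the DCTERMS and PROV namespaces);
  # all URIs and DOIs below are hypothetical placeholders.
  from rdflib import Graph, URIRef, Literal, Namespace
  from rdflib.namespace import DCTERMS, RDF, PROV

  EX = Namespace("http://example.org/tb-drugome/")
  g = Graph()
  g.bind("dcterms", DCTERMS)
  g.bind("prov", PROV)

  dataset = URIRef("https://doi.org/10.0000/example-dataset")   # placeholder DOI
  workflow = EX["drugome-workflow"]
  run = EX["workflow-run"]

  g.add((dataset, RDF.type, PROV.Entity))
  g.add((dataset, DCTERMS.title, Literal("Input protein structures (placeholder)")))
  g.add((workflow, RDF.type, PROV.Plan))
  g.add((workflow, DCTERMS.identifier, Literal("doi:10.0000/example-workflow")))  # placeholder
  g.add((run, RDF.type, PROV.Activity))
  g.add((run, PROV.used, dataset))      # the run read the dataset
  g.add((run, PROV.used, workflow))     # ... following the published workflow definition

  print(g.serialize(format="turtle"))   # RDF that can be posted as a Web-accessible object

Serializing the description as Turtle (or JSON-LD) makes each object resolvable on the Web, which is what allows the data and the workflow to be cited and inspected independently of the paper.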

Looking at the Future

The Vision

In the future, scientists will use radically new tools to generate papers. As scientists do their work, those tools will capture the work and all the associated digital objects (data, software, etc.) so that, when it comes time to publish a paper, everything is already documented and can easily be included. Today, several research tools exist for working in this way, but they are not routinely used and do not always fit scientists' research workflows.

In the future, publishers will accept submissions that contain not just a PDF but also data, software, and other digital objects relevant to the research. Today, many journals accept datasets together with papers, and some journals accept software and software papers, but no journal includes the full details of the data, software, workflow, and visualizations of a paper.

In the future, readers will be able to interact with the paper document, modify its figures to explore the data, reproduce the results, and run the methods on new data. Today, readers simply get a static paper, and even if the data are available they have to download and analyze them on their own.

In the future, data producers and software developers will get credit for the work that they do, because all publications that build on their work will acknowledge it through citations. Today, there is limited credit and reward for those who create data and software.

What is a Geoscience Paper of the Future?

A paper is one thing (think of it as a larger wrapper with a conceptual framework), as opposed to the smaller pieces (code, datasets, individual figures) that are updated along the way (and, e.g., get associated with your ORCID). We do not want to get into the stigma of least publishable units. We also recognize that there are different types of publications (letter, full paper, etc.) for different-sized contributions.

A GPF paper includes:

  • data: documented, described in a public repository, with a specified license (open if possible), and cited with DOIs
  • software: documented, available in a public repository, with a specified license (open source if possible), and cited with DOIs
  • provenance: explicitly documented as a workflow sketch, a formal workflow, or a provenance record (in PROV or a similar standard), possibly deposited in a shared repository and given a DOI (see the sketch after this list)
  • figures/visualizations: generated by explicit code (where possible) as the result of a workflow or provenance record. {The published figures may be "prettified" versions of the automatically generated ones.}
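As a minimal sketch of such a provenance record in the W3C PROV model (assuming the Python prov package; all identifiers below are hypothetical placeholders):

  # Minimal sketch: a machine-readable provenance record in the W3C PROV model.
  # Assumes the Python "prov" package; all identifiers are hypothetical placeholders.
  from prov.model import ProvDocument

  doc = ProvDocument()
  doc.add_namespace("ex", "http://example.org/gpf-paper/")

  dataset = doc.entity("ex:input-dataset")      # the data deposited in a repository
  script = doc.entity("ex:analysis-script-v1")  # the software, also archived and cited
  run = doc.activity("ex:analysis-run")
  figure = doc.entity("ex:figure-2")

  doc.used(run, dataset)                        # the run read the dataset
  doc.used(run, script)                         # ... and executed the script
  doc.wasGeneratedBy(figure, run)               # the figure is an output of the run

  print(doc.get_provn())                        # human-readable PROV-N view
  with open("provenance.json", "w") as f:
      f.write(doc.serialize())                  # PROV-JSON, suitable to deposit with the paper

A record like this could be deposited alongside the data and software and given its own DOI.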

Not all GPF papers will be able to satisfy all of these requirements. For example, some collaborators may not want to release the data, or some of the software. In those cases, the papers will explain the issues encountered in attempting these releases and the challenges they pose for the future of open, reproducible publications.

Discussion

It is possible that small or very large contributions are not well captured by current publishing paradigms.

For example, nanopublications are a possible way to report advances in a research process that may not merit a full publication but are nevertheless useful to share with the community. A challenge here is the stigma attached to publishing units that are very small.

Alternatively, a very large piece of research, or work with many parts, may be better suited to a GPF-style publication.

Perhaps the concept of a 'paper' is better reflected in the concept of a 'wrapper', or a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

Figures

Do we want to regenerate exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors should include clear delineations of provenance. The goal is to assure that readers can regenerate the figures using documented workflows, data, and codes.

An important note (Allen, Sandra) is that figures are frequently generated by code, scripts, etc., yet the actual figure is finalized by the user... Is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If the point about explicit code is important, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.
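Where a figure is script-generated, a minimal sketch of a self-contained figure script (using numpy and matplotlib; the data here are synthetic placeholders rather than real results) shows how every step, including the prettification, can live in code that readers can rerun:

  # Minimal sketch: a figure generated entirely by a script so it can be regenerated exactly.
  # The data are synthetic placeholders, not real results.
  import numpy as np
  import matplotlib.pyplot as plt

  rng = np.random.default_rng(42)               # fixed seed: reruns give the identical figure
  x = np.linspace(0.0, 10.0, 200)
  y = np.sin(x) + 0.1 * rng.standard_normal(x.size)

  fig, ax = plt.subplots(figsize=(4, 3))
  ax.plot(x, y, linewidth=1, color="black")
  ax.set_xlabel("time (arbitrary units)")
  ax.set_ylabel("signal (arbitrary units)")
  ax.set_title("Figure 1 (regenerated from this script)")
  fig.tight_layout()
  fig.savefig("figure1.pdf")                    # every prettification step is captured in code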

What to Document: Timing, Intermediate Processes, and Failed Experiments

When should we document and what are the bounds on what we document?

For example, should data and workflows for 'failed' experiments be included? Should dataset DOIs be assigned before there are results from using them?

How much of one's experimental history should be included? The experimental process often leads nowhere. Should all the failed experiments be documented? Should there be a DOI for the results of the successful experiment, and another for the failed trials?

Good practices may include documenting and sharing data once there is a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared. For 'failed' experiments, one strategy could be to bundle the intermediate datasets under a single DOI together with a more general discussion of the process and methods.


References

[Baggerly and Coombes 2009] Baggerly, K. A. and Coombes, K. R. “Deriving Chemosensitivity from Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology.” Annals of Applied Statistics, 3(4), pp. 1309-1334, 2009. Available from http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.aoas/1267453942

[Baker et al 2010] “Transparency and reproducibility in data analysis: the Prostate Cancer Prevention Trial.” Stuart G. Baker, Amy K. Drake, Paul Pinsky, Howard L. Parnes, Barnett S. Kramer. Biostatistics, 11(3), 2010.

[Bell et al 2009] “A HUPO test sample study reveals common problems in mass spectrometry–based proteomics.” Bell AW, Deutsch EW, Au CE, Kearney RE, Beavis R, Sechi S, Nilsson T, Bergeron JJ, and the Human Proteome Organization (HUPO) Test Sample Working Group. Nature Methods, 6(6), 2009. Available from http://www.nature.com/nmeth/journal/v6/n6/full/nmeth.1333.html

[Bonnet et al 2011] “Repeatability and workability evaluation of SIGMOD 2011.” Philippe Bonnet, Stefan Manegold, Matias Bjørling, Wei Cao, Javier Gonzalez, Joel Granados, Nancy Hall, Stratos Idreos, Milena Ivanova, Ryan Johnson, David Koop, Tim Kraska, René Müller, Dan Olteanu, Paolo Papotti, Christine Reilly, Dimitris Tsirogiannis, Cong Yu, Juliana Freire, Dennis Shasha:. SIGMOD Record 40(2): 45-48, 2011.

[Bourne 2010] Bourne, P. “What Do I Want from the Publisher of the Future?” PLoS Computational Biology, 2010. Available from http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1000787

[Bourne et al 2011] “Improving Future Research Communication and e-Scholarship.” Phil E. Bourne, Tim Clark, Robert Dale, Anita de Waard, Ivan Herman, Eduard Hovy, and David Shotton (Eds). The FORCE 11 Manifesto, available from http://www.force11.org.

[Callahan et al 2006] “Managing the Evolution of Dataflows with VisTrails.” Steven P. Callahan, Juliana Freire, Emanuele Santos, Carlos E. Scheidegger, Claudio T. Silva and Huy T. Vo. Proceedings of IEEE Workshop on Workflow and Data Flow for Scientific Applications (SciFlow), 2006.

[Claerbout and Karrenbach 1992] “Electronic documents give reproducible research a new meaning.” Jon Claerbout and Martin Karrenbach. 62nd Annual International Meeting of the Society of Exploration Geophysicists, Expanded Abstracts, pp. 601-604, 1992. Available from http://sepwww.stanford.edu/doku.php?id=sep:research:reproducible:seg92

[De Roure et al 2009] De Roure, D.; Goble, C.; Stevens, R. “The design and realisation of the myExperiment Virtual Research Environment for social sharing of workflows.” Future Generation Computer Systems, 25, pp. 561-567, 2009.

[Diggle and Zeger 2009] “Reproducible research and Biostatistics.” Peter J. Diggle and Scott L. Zeger. Biostatistics 10(3), 2009.

[Falcon 2007] Falcon, S. “Caching code chunks in dynamic documents: The weaver package.” Computational Statistics, 24(2), 2007. Available from http://www.springerlink.com/content/55411257n1473414/

[Fang and Casadevall 2011] Fang, F.C., and Casadevall, A. “Retracted Science and the Retraction Index.” Infection and Immunity, 2011. doi:10.1128/IAI.05661-11

[Freire and Silva 2012] “Making Computations and Publications Reproducible with VisTrails.” Juliana Freire and Claudio Silva. Computing in Science and Engineering 14(4): 18-25, 2012.

[Gil et al 2007a] “Examining the Challenges of Scientific Workflows,” Yolanda Gil, Ewa Deelman, Mark Ellisman, Thomas Fahringer, Geoffrey Fox, Dennis Gannon, Carole Goble, Miron Livny, Luc Moreau, and Jim Myers. IEEE Computer, vol. 40, no. 12, pp. 24-32, December, 2007. http://www.computer.org/portal/web/csdl/doi/10.1109/MC.2007.421 (preprint available at http://www.isi.edu/~gil/papers/computer-NSFworkflows07.pdf)

[Guo 2012] “CDE: A Tool For Creating Portable Experimental Software Packages.” Philip J. Guo. Computing in Science and Engineering: Special Issue on Software for Reproducible Computational Science, Jul/Aug 2012.

[Hothorn and Leisch 2011] “Case Studies in Reproducibility.” Torsten Hothorn and Friedrich Leisch. Briefings in Bioinformatics, 12(3), 2011. Available from http://bib.oxfordjournals.org/content/12/3/288

[Hutson 2010] Hutson, S. “Data Handling Errors Spur Debate Over Clinical Trial,” Nature Medicine, 16(6), 2010. Available from http://www.nature.com/nm/journal/v16/n6/full/nm0610-618a.html

[Ioannidis et al 2009] Ioannidis J.P., Allison D.B., Ball C.A., Coulibaly I, Cui X., Culhane A.C., Falchi M, Furlanello C., Game L., Jurman G., Mangion J., Mehta T., Nitzberg M., Page G.P., Petretto E., van Noort V. ”Repeatability of Published Microarray Gene Expression Analyses.” Nature Genetics, 41(2), 2009. Available from http://www.nature.com/ng/journal/v41/n2/full/ng.295.html

[Lehrer 2010] Lehrer, J. “The Truth Wears Off: Is There Something Wrong with the Scientific Method?” The New Yorker, December 13, 2010. Available from http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer

[Leisch 2002] Leisch, F. “Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis”, Proceedings of Computational Statistics, 2002. Preprint available from http://www.statistik.lmu.de/~leisch/Sweave/Sweave-compstat2002.pdf

[Manegold et al 2010] “Repeatability & workability evaluation of SIGMOD 2009.” Manegold, S, Manolescu I, Afanasiev L, Feng J, Gou G, Hadjieleftheriou M, Harizopoulos S, Kalnis P, Karanasos K, Laurent D, Lupu M, Onose N, Ré C, Sans V, Senellart P, Wu T, Shasha D. SIGMOD Record 38, 2010. Available from http://www.sigmod.org/sigmod/record/issues/0909/p40.open.repeatability2009.pdf

[Manolescu et al 2008] “The repeatability experiment of SIGMOD 2008.” Ioana Manolescu, Loredana Afanasiev, Andrei Arion, Jens Dittrich, Stefan Manegold, Neoklis Polyzotis, Karl Schnaitter, Pierre Senellart, Spyros Zoupanos, Dennis Shasha. ACM SIGMOD Record 37(1), 2008. Available from http://portal.acm.org/citation.cfm?id=1374780.1374791&coll=&dl=&idx=J689&part=newsletter&WantType=Newsletters&title=ACM%20SIGMOD%20Record

[Mesirov 2010] Mesirov, J. P. “Accessible Reproducible Research.” Science, 327:415, 2010. Available from http://www.sciencemag.org/cgi/rapidpdf/327/5964/415?ijkey=WzYHd6g6IBNeQ&keytype=ref&siteid=sci

[Missier et al 2010] Missier, P., Sahoo, S. S., Zhao, J., Goble, C., and Sheth, A. “Janus: from Workflows to Semantic Provenance and Linked Open Data.” Provenance and Annotation of Data and Processes: Third International Provenance and Annotation Workshop (IPAW 2010), Troy, NY, USA, June 15-16, 2010, Revised Selected Papers, LNCS 6378, pp. 129-141. Available from http://www.mygrid.org.uk/files/presentations/SP-IPAW10.pdf

[Moreau and Ludaescher 2007] Moreau, L. and B. Ludaescher, editors. Special Issue on the First Provenance Challenge, volume 20. Wiley, April 2007.

[Moreau et al 2011] Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P., Kwasnikowska, N., Miles, S., Missier, P., Myers, J., Plale, B., Simmhan, Y., Stephan, E., and denBussche, J. V. “The Open Provenance Model Core Specification (v1.1).” Future Generation Computer Systems, 27(6), 2011. Preprint available from http://www.bibbase.org/cache/www.isi.edu__7Egil_publications.bib/moreau-etal-fgcs11.html

[Naik 2011] “Scientists' Elusive Goal: Reproducing Study Results.” Gautam Naik. The Wall Street Journal, December 2, 2011.

[Nature 2006] Nature Editorial. “Illuminating the Black Box.” Nature, 442(7098), 2006. Available from http://www.nature.com/nature/journal/v442/n7098/full/442001a.html

[Schwab et al 2000] “Making Scientific computations reproducible.” Schwab, M.; Karrenbach, N.; Claerbout, J. Computing in Science & Engineering, 2(6), pp.61-67, Nov.-Dec. 2000. Available from http://sep.stanford.edu/lib/exe/fetch.php?id=sep%3Aresearch%3Areproducible&cache=cache&media=sep:research:reproducible:cip.pdf

[Scientific American 2010] Scientific American. “In Science We Trust: Poll Results on How you Feel about Science” Scientific American, October 2010. Available from http://www.scientificamerican.com/article.cfm?id=in-science-we-trust-poll

[Scientist 2010] The Scientist. “Top Retractions of 2010.” The Scientist, December 16, 2010. Available from http://www.the-scientist.com/news/display/57864/

[Spies et al 2012] “The reproducibility of psychological science.” Jeffrey Spies et al. Report of the Open Science Collaboration. Available from openscienceframework.org/reproducibility/

[Vandewalle et al 2009] “What, why and how of reproducible research in signal processing.” P. Vandewalle, J. Kovačević and M. Vetterli. IEEE Signal Processing Magazine, May 2009.

[Stodden 2009] "The Legal Framework for Reproducible Research in the Sciences: Licensing and Copyright", Victoria Stodden. IEEE Computing in Science and Engineering, 11(1), January 2009.

[Wilson et al 2012] “RepliCHI SIG – from a panel to a new submission venue for replication.” Max L. Wilson, Wendy Mackay, Ed H. Chi, Michael S Bernstein, Jeffrey Nichols. ACM SIGCHI, 2012.

[Yong 2012] “Replication studies: Bad copy.” Ed Yong. Nature 485, 298–300, 17 May 2012. Available from doi:10.1038/485298a

More References

[Tenopir et al 2011] "Data Sharing by Scientists: Practices and Perceptions." Carol Tenopir, Suzie Allard, Kimberly Douglass, Arsev Umur Aydinoglu, Lei Wu, Eleanor Read, Maribeth Manoff, and Mike Frame. PLoS ONE 6(6): e21101. doi:10.1371/journal.pone.0021101

[Corbyn 2012] "Misconduct is the main cause of life-sciences retractions." Zoë Corbyn, Nature, 1 October 2012.

[Krugman 2013] "The Excel Depression." Paul Krugman, The New York Times, April 19 2013.

[Baggerly and Coombes 2009] "Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology." Keith A. Baggerly and Kevin R. Coombes. Annals of Applied Statistics, Volume 3, Number 4, pp. 1309-1334, 2009.

[Holdren 2013] "Increasing Public Access to the Results of Scientific Research." John Holdren, Memorandum of the US Office of Science and Technology, 22 February 2013. Available from https://petitions.whitehouse.gov/response/increasing-public-access-results-scientific-research.

[NSF 2013] “National Science Foundation Collaborates with Federal Partners to Plan for Comprehensive Public Access to Research Results”, NSF Press Release 13-030. Available from http://www.nsf.gov/news/news_summ.jsp?org=NSF&cntn_id=127043.

[Piwowar et al 2007] "Sharing Detailed Research Data Is Associated with Increased Citation Rate." Heather A. Piwowar, Roger S. Day, Douglas B. Fridsma. PLoS ONE 2(3): e308. doi:10.1371/journal.pone.0000308