Geoscience Paper of the Future - User contributions [en]

Main Page

2015-06-18T20:02:18Z

Allen: /* Roster of GeoSoft GPF Papers */

== Overview ==

The GeoSoft '''Geoscience Paper of the Future (GPF)''' activity aims to demonstrate how papers will be published in the future, going beyond a PDF format and including software, datasets, and workflow all published in open and accessible ways that make the paper transparent, reproducible, and machine indexable. We refer to such a paper as a geoscience paper of the future, or GPF for short.

The papers will be submitted to a [http://onlinelibrary.wiley.com/journal/10.1002/%28ISSN%292333-5084/homepage/call_for_papers.htm Special Issue on GeoScience Papers of the Future] of the [http://sites.agu.org AGU] [http://agupubs.onlinelibrary.wiley.com/agu/journal/10.1002/(ISSN)2333-5084/ Earth and Space Science] journal.

We are extending the GeoSoft GPF activity to the broader community as a [http://www.geosoft-earthcube.org/gpf/ Geoscience Papers of the Future Initiative] to offer training and support to other potential authors of GPFs.

== Quick Links ==
* [[Hold_regular_telecons#Call_Time_and_Access_Codes | Telecon information]]
** Next telecon and agenda: [[Hold_regular_telecons#Telecon_Friday_June_12.2C_2015 | Friday June 12]]
* [[Document_GPF_activities | Task descriptions and training materials]]
* [[Plan_overall_timeline | Overall timeline]]

* [[Tips on How to Use Wikis]]

== Roster of GeoSoft GPF Papers ==

{| class="wikitable" style="color:black; background-color:#ffffcc;" cellpadding="10"
|style="width: 10%" |'''Name'''
|style="width: 20%" |'''Affiliation'''
|style="width: 15%" |'''Research Area'''
|style="width: 40%" |'''Tentative Title'''
|style="width: 10%" |'''Submission Category'''
|-
| Cedric David
| NASA Jet Propulsion Laboratory
| Hydrology and river modeling
| [[Document_GPF_activities_by_Cedric_David | Going beyond triple-checking, allowing for peace of mind in community model development]]
| Technical Report
|-
| Ibrahim Demir et al
| University of Iowa
| Hydrology
| [[Document_GPF_activities_by_Ibrahim_Demir | Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System]]
| Research Article
|-
| Wally Fulweiler et al
| Boston University
| Coastal marine ecosystems and biogeochemistry
| [[Document_GPF_activities_by_Wally_Fulweiler | What can we learn from a decade of directly measured sediment di-nitrogen gas fluxes?]]
| Technical Report
|-
| Bakinam Essawy, Jon Goodall et al
| University of Virginia
| Hydrology
| [[Document_GPF_activities_by_Jon_Goodall | Post-processing Workflows Using Data Grids to Support Hydrologic Modeling]]
| Research Article
|-
|Leif Karlstrom & Lay Kuan Loh
| University of Oregon & Carnegie Mellon University
| Volcanology and fluid mechanics
| [[Document_GPF_activities_by_Leif_Karlstrom | Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking]]
| Research Article
|-
| Kyo Lee et al
| NASA Jet Propulsion Laboratory
| Regional climate model evaluation
| [[Document_GPF_activities_by_Kyo_Lee | Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System]]
| Research Article
|-
| Heath Mills et al
| University of Houston Clear Lake
| Marine geomicrobiology
| [[Document_GPF_activities_by_Heath_Mills | Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses]]
| Research Article
|-
| Ji-Hyun Oh
| NASA Jet Propulsion Laboratory
| Tropical meteorology
| [[Document_GPF_activities_by_Ji-Hyun_Oh | Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation]]
| Technical Report
|-
| Suzanne Pierce et al
| Texas Advanced Computing Center, The University of Texas Austin
| Hydrogeology, decision support
| [[Document_GPF_activities_by_Suzanne_Pierce | MCSDSS: An accessible platform and application to enable data fusion and interactive visualization for the geosciences]]
| Research Article
|-
| Allen Pope
| NSIDC
| Glaciology, remote sensing
| [[Document_GPF_activities_by_Allen_Pope | Reproducibly Estimating and Evaluating Supraglacial Lake Depth with Landsat 8 and other Multispectral Sensors]]
| Technical Report
|-
| Mimi Tzeng et al
| Dauphin Island Sea Lab
| Physical Oceanography
| [[Document_GPF_activities_by_Mimi_Tzeng | Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)]]
| Technical Report
|-
| Sandra Villamizar et al
| University of California Merced
| River ecohydrology
| [[Document_GPF_activities_by_Sandra_Villamizar | Producing long-term series of whole-stream metabolism using readily available data]]
| Technical Report
|-
| Xuan Yu et al
| University of Delaware
| Hydrology
| [[Document_GPF_activities_by_Xuan_Yu | Learn integrated modeling of coupled surface and subsurface hydrology from scratch]]
| Technical Report
|}

== Acknowledgments ==

This activity is organized by the [http://www.geosoft-earthcube.org GeoSoft project] as part of the [http://www.earthcube.org EarthCube initiative] of the US National Science Foundation with awards ICER-1343800 and ICER-1440323.

Document domain characteristics by Allen Pope

2015-06-18T04:08:29Z

Allen:

[[Category:Task]]
 Details on how to do this task: [[Document domain characteristics]] 
With CSDMS: the things that I am measuring are at an odd interface between the surface of glaciers and the lakes themselves (which is distinct from “meltwater” because meltwater can be in both lakes and streams, and I’m only talking about the lakes). So, although the CSDMS structures make sense, some discretion and parsing is still necessary, and I’m not 100% certain about it.

Also, a fuller mapping of variables could definitely be done (as opposed to specific inputs and outputs), but because it isn’t passing models right now (it is an algorithm applied to satellite imagery), it is something to keep in mind but not strictly necessary.


{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=75|
StartDate=2015-04-18|
TargetDate=2015-05-01|
Type=Low}}

Document domain characteristics by Allen Pope

2015-06-18T04:05:30Z

Allen:

[[Category:Task]]
 Details on how to do this task: [[Document domain characteristics]] 
With CSDMS: the things that I am measuring are at an odd interface between the surface of glaciers and the lakes themselves (which is distinct from “meltwater” because meltwater can be in both lakes and streams, and I’m only talking about the lakes). So, although the CSDMS structures make sense, some discretion and parsing is still necessary, and I’m not 100% certain about it.


{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=75|
StartDate=2015-04-18|
TargetDate=2015-05-01|
Type=Low}}

Document domain characteristics by Allen Pope

2015-06-18T04:05:10Z

Allen: Set PropertyValue: Progress = 75

[[Category:Task]]
 Details on how to do this task: [[Document domain characteristics]] 


{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=75|
StartDate=2015-04-18|
TargetDate=2015-05-01|
Type=Low}}

Prepare the article for publication by Allen Pope

2015-06-17T23:30:21Z

Allen: Set PropertyValue: Progress = 50

[[Category:Task]]
 Details on how to do this task: [[Prepare the article for publication]] 


{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=50|
StartDate=2015-05-16|
TargetDate=2015-05-29|
Type=Low}}

Document software by specifying metadata by Allen Pope

2015-06-17T23:30:08Z

Allen: Set PropertyValue: Progress = 100

[[Category:Task]]

Documented in GeoSoft: http://www.geosoft-earthcube.org/portal/#browse/Software-1iukxlzb6dy5s

Found largely easy to do, although somewhat redundant. Also, not complete because will need to go back to add citations, etc.


{{#set:|
Progress=100}}

Document software by specifying metadata by Allen Pope

2015-06-17T23:30:01Z

Allen:

Make software accessible by Allen Pope

2015-06-17T22:59:41Z

Allen: Set PropertyValue: Progress = 100

[[Category:Task]]
 Details on how to do this task: [[Make software accessible]] 

I successfully downloaded GitHub for Mac - this was a little challenge because I'm running 10.8 (because of other software I use) and GitHub isn't natively supporting below 10.9 now.
But, I found a link for an older version that lets me do what I need to: https://web.archive.org/web/20140704072852/https://mac.github.com/

This made it easy to copy my already-organized code to a repo (where I added author, licenses, readme, etc.)

Next step will be to get a DOI for this - once I have a little more info on publications to put into the Readme.
One thing that is slightly unclear is whether I will be able to update things in the release (e.g. Workflow, etc.) before sharing it.
https://guides.github.com/activities/citable-code/



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=100|
StartDate=2015-04-04|
TargetDate=2015-04-17|
Type=Low}}

Make software accessible by Allen Pope

2015-04-06T23:59:10Z

Allen: Set PropertyValue: Progress = 50

[[Category:Task]]
 Details on how to do this task: [[Make software accessible]] 

I successfully downloaded GitHub for Mac - this was a little challenge because I'm running 10.8 (because of other software I use) and GitHub isn't natively supporting below 10.9 now.
But, I found a link for an older version that lets me do what I need to: https://web.archive.org/web/20140704072852/https://mac.github.com/

This made it easy to copy my already-organized code to a repo (where I added author, licenses, readme, etc.)

Next step will be to get a DOI for this - once I have a little more info on publications to put into the Readme.
One thing that is slightly unclear is whether I will be able to update things in the release (e.g. Workflow, etc.) before sharing it.
https://guides.github.com/activities/citable-code/



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=50|
StartDate=2015-04-04|
TargetDate=2015-04-17|
Type=Low}}

Make software accessible by Allen Pope

2015-04-06T23:58:45Z

Allen:

[[Category:Task]]
 Details on how to do this task: [[Make software accessible]] 

I successfully downloaded GitHub for Mac - this was a little challenge because I'm running 10.8 (because of other software I use) and GitHub isn't natively supporting below 10.9 now.
But, I found a link for an older version that lets me do what I need to: https://web.archive.org/web/20140704072852/https://mac.github.com/

This made it easy to copy my already-organized code to a repo (where I added author, licenses, readme, etc.)

Next step will be to get a DOI for this - once I have a little more info on publications to put into the Readme.
One thing that is slightly unclear is whether I will be able to update things in the release (e.g. Workflow, etc.) before sharing it.
https://guides.github.com/activities/citable-code/



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=0|
StartDate=2015-04-04|
TargetDate=2015-04-17|
Type=Low}}

Allen Pope should make software executable by others

2015-04-06T20:06:03Z

Allen: Set PropertyValue: Progress = 100

[[Category:Task]]
 Details on how to do this task: [[Make software executable by others]] 

Between READMEs, the Workflow diagram, and Comments, all code should now be sufficiently clear for other people to execute my software.
I made explicit the order of scripts to be called, the purpose of the code, any paths called for files I/O, any other functions/scripts called within the code, and any parameters used in the code (thresholds, etc.)

Software used (including versions, toolboxes, etc.) is documented in the overall Readme file.

This task made me realize how many things were hard-coded into the software.
The unwieldy issue here is the large file structure that is in place at this point (the end of the study) - scripts call from all over the place, and I know now for the future to create folders & populate them from within the code so that it can be more repeatable. As it is, there are so many manual calls for folders that there were really too many to reasonably make parameters.

Instead - as with some models I have used - I have defined these things up front in the script. Yes, it requires some user interaction, but after that initial section everything should run smoothly. If these haven't been set correctly, it will throw an error and the user will then have to go fix it.
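
The "define things up front, fail early" approach described above could look roughly like the following. This is only a minimal Python sketch, not the code used in the study: the names <code>CONFIG</code> and <code>prepare_paths</code> and both folder paths are hypothetical placeholders. It checks the user-set input folder before any processing starts and creates the output folder from within the code.

```python
from pathlib import Path

# Hypothetical settings block: every name and path here is illustrative,
# not taken from the original scripts.
CONFIG = {
    "input_dir": Path("data/landsat_scenes"),   # must already exist
    "output_dir": Path("results/lake_depth"),   # created if missing
}


def prepare_paths(config):
    """Check user-set input paths and create output folders before processing."""
    input_dir = Path(config["input_dir"])
    if not input_dir.is_dir():
        # Fail early with a message that says which setting to fix.
        raise FileNotFoundError(
            f"input_dir does not exist: {input_dir} "
            "(edit CONFIG at the top of the script)"
        )
    output_dir = Path(config["output_dir"])
    # Create the output folder from within the code, so no manual setup is needed.
    output_dir.mkdir(parents=True, exist_ok=True)
    return input_dir, output_dir
```

Calling <code>prepare_paths(CONFIG)</code> as the first step of a script would surface a wrong path immediately rather than partway through a run.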

ALSO - some of my code is text files to just copy and run in the command line. These are by necessity hard-coded. I have now made this explicit with some commenting. I don't currently have the knowledge to automate these or make paths parameters.
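
One possible way to turn copy-and-paste command lines into parameterized ones is sketched below in Python. The actual commands are not shown in these notes, so the <code>gdal_translate</code> call is only a stand-in, and <code>COMMAND_TEMPLATE</code>, <code>build_command</code>, and <code>as_shell_text</code> are hypothetical names introduced for illustration.

```python
import shlex

# Hypothetical template: the original command-line snippets are not shown,
# so this gdal_translate call only stands in for them.
COMMAND_TEMPLATE = ["gdal_translate", "-of", "GTiff", "{src}", "{dst}"]


def build_command(template, **params):
    """Fill the placeholders in a command template with user-supplied values."""
    return [part.format(**params) for part in template]


def as_shell_text(command):
    """Render the argument list as a quoted line that can be pasted into a shell."""
    return " ".join(shlex.quote(part) for part in command)
```

The list returned by <code>build_command</code> could be passed to <code>subprocess.run(cmd, check=True)</code> to execute it directly, or rendered with <code>as_shell_text</code> to keep the copy-and-paste workflow while removing the hard-coded paths.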

So - my software is executable by others, but it will take just a little more work and stresses the importance of thinking of these things BEFORE you start coding!

Now - I "only" have to share my code (edit README) to do so, and Get DOIs for it to put into the workflow. That process may be circular, but I hope not. If so, I'll just need to put the figure in another repository or something...



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=100|
StartDate=2015-03-21|
TargetDate=2015-04-03|
Type=Low}}

Allen Pope should make software executable by others

2015-04-06T20:05:55Z

Allen:

[[Category:Task]]
 Details on how to do this task: [[Make software executable by others]] 

Between READMEs, the Workflow diagram, and Comments, all code should now be sufficiently clear for other people to execute my software.
I made explicit the order of scripts to be called, the purpose of the code, any paths called for files I/O, any other functions/scripts called within the code, and any parameters used in the code (thresholds, etc.)

Software used (including versions, toolboxes, etc.) is documented in the overall Readme file.

This task made me realize how many things were hard-coded into the software.
The unwieldy issue here is the large file structure that is in place at this point (the end of the study) - scripts call from all over the place, and I know now for the future to create folders & populate them from within the code so that it can be more repeatable. As it is, there are so many manual calls for folders that there were really too many to reasonably make parameters.

Instead - as with some models I have used - I have defined these things up front in the script. Yes, it requires some user interaction, but after that initial section everything should run smoothly. If these haven't been set correctly, it will throw an error and the user will then have to go fix it.

ALSO - some of my code is text files to just copy and run in the command line. These are by necessity hard-coded. I have now made this explicit with some commenting. I don't currently have the knowledge to automate these or make paths parameters.

So - my software is executable by others, but it will take just a little more work and stresses the importance of thinking of these things BEFORE you start coding!

Now - I "only" have to share my code (edit README) to do so, and Get DOIs for it to put into the workflow. That process may be circular, but I hope not. If so, I'll just need to put the figure in another repository or something...



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=75|
StartDate=2015-03-21|
TargetDate=2015-04-03|
Type=Low}}

Allen Pope should make software executable by others

2015-04-06T20:01:35Z

Allen:

[[Category:Task]]
 Details on how to do this task: [[Make software executable by others]] 

Between READMEs, the Workflow diagram, and Comments, all code should now be sufficiently clear for other people to execute my software.
I made explicit the order of scripts to be called, the purpose of the code, any paths called for files I/O, any other functions/scripts called within the code, and any parameters used in the code (thresholds, etc.)

Software used (including versions, toolboxes, etc.) is documented in the overall Readme file.

This task made me realize how many things were hard-coded into the software.
The unwieldy issue here is the large file structure that is in place at this point (the end of the study) - scripts call from all over the place, and I know now for the future to create folders & populate them from within the code so that it can be more repeatable. As it is, there are so many manual calls for folders that there were really too many to reasonably make parameters.

Instead - as with some models I have used - I have defined these things up front in the script. Yes, it requires some user interaction, but after that initial section everything should run smoothly. If these haven't been set correctly, it will throw an error and the user will then have to go fix it.

ALSO - some of my code is text files to just copy and run in the command line. These are by necessity hard-coded. I have now made this explicit with some commenting. I don't currently have the knowledge to automate these or make paths parameters.

So - my software is executable by others, but it will take just a little more work and stresses the importance of thinking of these things BEFORE you start coding!



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=75|
StartDate=2015-03-21|
TargetDate=2015-04-03|
Type=Low}}

Allen Pope should make software executable by others

2015-04-06T20:00:13Z

Allen: Set PropertyValue: Progress = 75

[[Category:Task]]
 Details on how to do this task: [[Make software executable by others]] 

Between READMEs, the Workflow diagram, and Comments, all code should now be sufficiently clear for other people to execute my software.
I made explicit the order of scripts to be called, the purpose of the code, any paths called for files I/O, any other functions/scripts called within the code, and any parameters used in the code (thresholds, etc.)

This task made me realize how many things were hard-coded into the software.
The unwieldy issue here is the large file structure that is in place at this point (the end of the study) - scripts call from all over the place, and I know now for the future to create folders & populate them from within the code so that it can be more repeatable. As it is, there are so many manual calls for folders that there were really too many to reasonably make parameters.

Instead - as with some models I have used - I have defined these things up front in the script. Yes, it requires some user interaction, but after that initial section everything should run smoothly. If these haven't been set correctly, it will throw an error and the user will then have to go fix it.

ALSO - some of my code is text files to just copy and run in the command line. These are by necessity hard-coded. I have now made this explicit with some commenting. I don't currently have the knowledge to automate these or make paths parameters.

So - my software is executable by others, but it will take just a little more work and stresses the importance of thinking of these things BEFORE you start coding!



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=75|
StartDate=2015-03-21|
TargetDate=2015-04-03|
Type=Low}}

Allen Pope should make software executable by others

2015-04-06T19:58:42Z

Allen:

[[Category:Task]]
 Details on how to do this task: [[Make software executable by others]] 

Between READMEs, the Workflow diagram, and Comments, all code should now be sufficiently clear for other people to execute my software.
I made explicit the order of scripts to be called, the purpose of the code, any paths called for files I/O, any other functions/scripts called within the code, and any parameters used in the code (thresholds, etc.)

This task made me realize how many things were hard-coded into the software.
The unwieldy issue here is the large file structure that is in place at this point (the end of the study) - scripts call from all over the place, and I know now for the future to create folders & populate them from within the code so that it can be more repeatable. As it is, there are so many manual calls for folders that there were really too many to reasonably make parameters.

Instead - as with some models I have used - I have defined these things up front in the script. Yes, it requires some user interaction, but after that initial section everything should run smoothly. If these haven't been set correctly, it will throw an error and the user will then have to go fix it.

ALSO - some of my code is text files to just copy and run in the command line. These are by necessity hard-coded. I have now made this explicit with some commenting. I don't currently have the knowledge to automate these or make paths parameters.

So - my software is executable by others, but it will take just a little more work and stresses the importance of thinking of these things BEFORE you start coding!



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=0|
StartDate=2015-03-21|
TargetDate=2015-04-03|
Type=Low}}

Allen Pope should make software executable by others

2015-04-06T19:49:16Z

Allen:

[[Category:Task]]
 Details on how to do this task: [[Make software executable by others]] 

This task made me realize how many things were hard-coded into the software.
The unwieldy issue here is the large file structure that is in place - scripts call from all over the place, and I know now for the future to create folders & populate them from within the code so that it can be more repeatable. As it is, there are so many manual calls for folders that there were really too many to reasonably make parameters.

Instead - as with some models I have used - I have defined these things up front in the script. Yes, it requires some user interaction, but after that initial section everything should run smoothly. If these haven't been set correctly, it will throw an error and the user will then have to go fix it.

ALSO - some of my code is text files to just copy and run in the command line. These are by necessity hard-coded. I have now made this explicit with some commenting. I don't currently have the knowledge to automate these or make paths parameters.

So - my software is executable by others, but it will take just a little more work and stresses the importance of thinking of these things BEFORE you start coding!



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=0|
StartDate=2015-03-21|
TargetDate=2015-04-03|
Type=Low}}

Allen Pope should make software executable by others

2015-04-06T19:46:32Z

Allen:

[[Category:Task]]
 Details on how to do this task: [[Make software executable by others]] 

This task made me realize how many things were hard-coded into the software.
The unwieldy issue here is the large file structure that is in place - scripts call from all over the place, and I know now for the future to create folders & populate them from within the code so that it can be more repeatable. As it is, there are so many manual calls for folders that there were really too many to reasonably make parameters.

Instead - as with some models I have used - I have defined these things up front in the script. Yes, it requires some user interaction, but after that initial section everything should run smoothly. If these haven't been set correctly, it will throw an error and the user will then have to go fix it.

So - my software is executable by others, but it will take just a little more work and stresses the importance of thinking of these things BEFORE you start coding!



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=0|
StartDate=2015-03-21|
TargetDate=2015-04-03|
Type=Low}}

Document provenance of results by Allen Pope

2015-04-06T19:06:36Z

Allen: Set PropertyValue: Progress = 100

[[Category:Task]]
 Details on how to do this task: [[Document the provenance of the results]] 

I looked through the workflow tools. They look good in theory, but with code already done, it was too much of a hassle. Also, not used by my community so not as helpful. Might also be important that MATLAB is proprietary and didn't look like it was supported? Not sure about gdal.
So, I have chosen to create a fairly detailed workflow diagram instead (I chose to do this in Illustrator).
[https://www.dropbox.com/s/io0fl64y9f2ns5o/GPF_Workflow.ai?dl=0]

This was a good exercise to align datasets and code. It made sure that I knew all the bits of code I needed (was good to do before sharing code - now that is ready to do, too) as well as all the data.
Gave me an appreciation for the complexity behind what is otherwise a fairly simple description.

I think it will potentially make it easier for others to use.
It also made me realize my code is probably not terribly efficient. Something to work on better in the future, not necessarily now.
I also realized that putting together this structure helped me change the way I think about blocks of code, etc - which will be helpful for sharing code later.
It made me think of better ways to structure my code (where to put parameters / how to comment) and how to make it more automated, less hard-coded to particular filenames/landsat scenes.

Consider who did what - and it's a matter of scale (number of authors, processes, etc.)

As a not-formally-trained coder, this made me realize more best practice in terms of creating code. (see above)

One thing which isn't 100% reproducible is the figures. For the map - the GIS is not something I can code up.
For the other plots, I provide a way to get to the point of the figure where possible - but I then have taken it through plot.ly and Illustrator to get to what is in the paper.
I would change how I did this - save the exact code that got me to a figure as much as possible. As well as share the plot.ly itself as an easier way to share things.
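
One way to "save the exact code that got me to a figure" is to write a small record next to each figure at the moment it is produced. The following is a minimal Python sketch under stated assumptions, not part of the published workflow: the function names, the sidecar file suffix, and the record fields are all hypothetical choices. It stores the script name, the parameters, and a checksum of each input file.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, so the exact input can be identified later."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_figure_record(figure_path, script_name, input_files, parameters):
    """Write a JSON sidecar next to a figure describing how it was produced."""
    record = {
        "figure": str(figure_path),
        "script": script_name,
        # Map each input file to its checksum (hypothetical record layout).
        "inputs": {str(p): file_sha256(p) for p in input_files},
        "parameters": parameters,
    }
    sidecar = Path(str(figure_path) + ".provenance.json")
    sidecar.write_text(json.dumps(record, indent=2, sort_keys=True))
    return sidecar
```

Manual steps taken afterwards (for example in plot.ly or Illustrator) would still need a short written note, since they happen outside the code.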

Will need to describe in text, too. Break down each sub-section of the workflow, and potentially sketch it out even more broadly at the time of writing.
Potentially make more modular figures, too

Also - need to add code DOIs as appropriate.



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=100|
StartDate=2015-03-07|
TargetDate=2015-03-20|
Type=Low}}

Document provenance of results by Allen Pope

2015-04-06T19:06:27Z

Allen:

[[Category:Task]]
 Details on how to do this task: [[Document the provenance of the results]] 

I looked through the workflow tools. They look good in theory, but with code already done, it was too much of a hassle. Also, not used by my community so not as helpful. Might also be important that MATLAB is proprietary and didn't look like it was supported? Not sure about gdal.
So, I have chosen to create a fairly detailed workflow diagram instead (I chose to do this in Illustrator).
[https://www.dropbox.com/s/io0fl64y9f2ns5o/GPF_Workflow.ai?dl=0]

This was a good exercise to align datasets and code. It made sure that I knew all the bits of code I needed (was good to do before sharing code - now that is ready to do, too) as well as all the data.
Gave me an appreciation for the complexity behind what is otherwise a fairly simple description.

I think it will potentially make it easier for others to use.
It also made me realize my code is probably not terribly efficient. Something to work on better in the future, not necessarily now.
I also realized that putting together this structure helped me change the way I think about blocks of code, etc - which will be helpful for sharing code later.
It made me think of better ways to structure my code (where to put parameters / how to comment) and how to make it more automated, less hard-coded to particular filenames/landsat scenes.

Consider who did what - and it's a matter of scale (number of authors, processes, etc.)

As a not-formally-trained coder, this made me realize more best practice in terms of creating code. (see above)

One thing which isn't 100% reproducible is the figures. For the map - the GIS is not something I can code up.
For the other plots, I provide a way to get to the point of the figure where possible - but I then have taken it through plot.ly and Illustrator to get to what is in the paper.
I would change how I did this - save the exact code that got me to a figure as much as possible. As well as share the plot.ly itself as an easier way to share things.

Will need to describe in text, too. Break down each sub-section of the workflow, and potentially sketch it out even more broadly at the time of writing.
Potentially make more modular figures, too

Also - need to add code DOIs as appropriate.



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=90|
StartDate=2015-03-07|
TargetDate=2015-03-20|
Type=Low}}

Document provenance of results by Allen Pope

2015-04-06T04:17:18Z

Allen:

[[Category:Task]]
 Details on how to do this task: [[Document the provenance of the results]] 

I looked through the workflow tools. They look good in theory, but with code already done, it was too much of a hassle. Also, not used by my community so not as helpful. Might also be important that MATLAB is proprietary and didn't look like it was supported? Not sure about gdal.
So, I have chosen to create a fairly detailed workflow diagram instead (I chose to do this in Illustrator).
[https://www.dropbox.com/s/io0fl64y9f2ns5o/GPF_Workflow.ai?dl=0]

This was a good exercise to align datasets and code. It made sure that I knew all the bits of code I needed (was good to do before sharing code - now that is ready to do, too) as well as all the data.
Gave me an appreciation for the complexity behind what is otherwise a fairly simple description.

I think it will potentially make it easier for others to use.
It also made me realize my code is probably not terribly efficient. Something to work on better in the future, not necessarily now.
I also realized that putting together this structure helped me change the way I think about blocks of code, etc - which will be helpful for sharing code later.
It made me think of better ways to structure my code (where to put parameters / how to comment) and how to make it more automated, less hard-coded to particular filenames/landsat scenes.

Consider who did what - and it's a matter of scale (number of authors, processes, etc.)

As a not-formally-trained coder, this made me realize more best practice in terms of creating code. (see above)

One thing which isn't 100% reproducible is the figures. For the map - the GIS is not something I can code up.
For the other plots, I provide a way to get to the point of the figure where possible - but I then have taken it through plot.ly and Illustrator to get to what is in the paper.
I would change how I did this - save the exact code that got me to a figure as much as possible. As well as share the plot.ly itself as an easier way to share things.

Will need to describe in text, too. Break down each sub-section of the workflow, and potentially sketch it out even more broadly at the time of writing.
Potentially make more modular figures, too



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Allen_Pope|
Progress=90|
StartDate=2015-03-07|
TargetDate=2015-03-20|
Type=Low}}

Develop proposal for special issue

2015-04-03T21:43:54Z

Allen: /* Papers to be included */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by need to fully document and make research accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nanopublications are a possible way to reflect advances in a research process that may not merit a full publication but are still useful to share with the community. A challenge here is the stigma attached to publishing units that are considered too small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps the concept of a 'paper' can be better reflected in the concept of a 'wrapper', or a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to reproduce exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using documented workflows, data, and code. An important note (Allen, Sandra) is that figures are frequently generated by code, scripts, etc., yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of your experimental history does one include? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate Processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Papers have been broadly categorized according to their main "Challenges" - including '''Reproducibility (i.e., documenting and reproducing previously published results), Dark Code (i.e., describing and sharing code integral to the presented results), Sharing Big Data (i.e. making available large datasets), and Transferability (i.e., updating a previously-used method to a new version of software, etc.).'''

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (Is the article based on a previously published article? Is it new content? If previously published, please provide a pointer to the published article and specify what percentage of the work presented will be new.)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on a single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Sharing Big Data. Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
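
As a rough illustration of the automated comparison described in the abstract above (and not RAPID's actual test suite), a Python sketch could check newly simulated values against archived reference values within a tolerance. The function name, the tolerances, and the sample discharge values are all hypothetical.

```python
import math


def compare_to_reference(new, reference, rel_tol=1e-6, abs_tol=1e-9):
    """Return (index, new, reference) for every value that differs beyond tolerance."""
    if len(new) != len(reference):
        raise ValueError("series lengths differ")
    return [
        (i, n, r)
        for i, (n, r) in enumerate(zip(new, reference))
        if not math.isclose(n, r, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Hypothetical discharge values (m^3/s) from a published run and an updated run.
reference_run = [12.50, 13.10, 15.75]
updated_run = [12.50, 13.10, 15.80]

# An empty list means the updated code reproduced the reference within tolerance.
mismatches = compare_to_reference(updated_run, reference_run)
```

A check of this kind can be rerun after every code change, so that the developer and outside contributors get the same pass/fail answer without manual inspection.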

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds on interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed study analyzing widely used methods for representing hydrological networks in relational databases, and benchmarking common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa using a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and storage of network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where there are poor boundaries between clusters and much ambiguity in cluster assignments. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on work by [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al. 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression, and Mars.
* '''Challenge:''' Reproducibility (i.e., Quantifying clustering); We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against man-made and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) is based on a separate paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States region. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating climate models. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Sharing Big Data, Dark Code; sharing big data, better documenting source code, and encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center, respectively. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' Reproducibility, Dark Code; This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures for generating the figures, such as how and where to download data and how to transform the data format for use as input to my code.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]], John Gentle, and Daniel Noll (Texas Advanced Computing Center and Jackson School of Geosciences, The University of Texas at Austin; US Department of Energy)

* '''Keywords of research area:''' Decision Support Systems, Hydrogeology, Participatory Modeling, Data Fusion
* '''Tentative title:''' MCSDSS: An accessible platform and application to enable data fusion and interactive visualization for the Geosciences
* '''Short abstract:''' The MCSDSS application is an advanced example of interactive design that can lead to data fusion for science visualization, decision support applications, and education. What sets the tool apart is its firm underpinning in data, innovative new forms of interface design, and the reusable platform. A key advance is the creation of a framework that can be used to feed new data, videos, maps, images, or formats of information into the application with relative ease.

* '''Challenge:''' Reproducibility, Dark Code; Fully document a new software application and framework using example case study data and tutorials; Creation of an interface that enables non-programmers to build out interactive visualizations for their data
* '''Relationship to other publications:''' This article is new content. The proof-of-concept idea was developed with DOE funding for a student competition and resulted in an initial implementation that was reported in the DOE competition report and in a master's thesis by co-author Daniel Noll.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:''' mid- to late June 2015

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has a lot of potential.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths from the red and panchromatic bands. With this best method, preliminary investigation of seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array, and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30° 05.410' N, 88° 12.694' W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' Dark Code, Reproducibility; My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data that I processed from this dataset; however, all of the figures in that paper are composites that also include other data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but, since 2010, has been going through a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Reproducibility; Dark Code; Document new software/applications. This set of scripts was written to meet the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest for understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic predictions for effective management of watersheds, very limited resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of a hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproducibility; Reproduce published simulations with the latest version of an existing model. Benchmarking modeling application for numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information provided (e.g., the data is from third parties and not openly available), and in that case the authors should document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.
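
As one hedged illustration of what a machine-readable provenance record for a single figure might contain, the Python sketch below ties a figure to the script and dataset identifiers that produced it. The record layout, the script name, and the dataset identifiers are placeholders invented for the example; the special issue does not prescribe this format.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Fingerprint the bytes of an input so a reader can confirm they have the same file."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(figure, script, script_bytes, dataset_ids):
    """Assemble a minimal provenance record for one figure (layout is an assumption)."""
    return {
        "figure": figure,
        "generated_by": {"script": script, "sha256": sha256_of(script_bytes)},
        "datasets": sorted(dataset_ids),
    }


record = provenance_record(
    figure="Figure 3",
    script="make_fig3.py",            # hypothetical script name
    script_bytes=b"print('plot')\n",  # stand-in for the script's real contents
    dataset_ids=["doi:10.xxxx/example-b", "doi:10.xxxx/example-a"],  # placeholder identifiers
)
print(json.dumps(record, indent=2))
```

Storing a record like this next to each figure gives reviewers one place to find the software, its exact version, and the datasets behind a result.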

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: Sept 15, 2015
* Decisions out to authors: Sept 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T21:37:42Z

Allen: /* Papers to be included */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by need to fully document and make research accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nano-publications are a possible way to reflect advances in a research process that may not merit a full publication but are still useful to share with the community. A challenge here is that there is a stigma against publishing units that are very small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps the concept of a 'paper' can be better reflected in the concept of a 'wrapper' or a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to reproduce exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using documented workflows, data, and code. An important note (Allen, Sandra) is that frequently figures are generated by code, scripts, etc. yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization software packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If it is an important point about explicit code, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of your experimental history does one include? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared. For 'failed' experiments, by contrast, one strategy could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Papers have been broadly categorized according to their main "Challenges" - including '''"Reproducibility," "Dark Code," "Sharing Big Data," and "Transferability."'''

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (Is the article based on a previously published article? Is it new content? If previously published, please provide a pointer to the published article and specify what percentage of the work presented will be new.)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on one single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of a software package with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Sharing Big Data. Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds in interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa utilizing a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where there are poor boundaries between clusters and much ambiguity in cluster assignments. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on work by [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al. 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression and Mars.
* '''Challenge:''' Reproducibility (i.e., quantifying clustering); We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against man-made and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) is based on a different paper we are preparing and planning to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction and insolation) and their covariability over the conterminous United States region. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating any climate model. In this paper, we provide a clear and easy-to-follow workflow of RCMES to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Sharing Big Data, Dark Code; Sharing big data, better documenting source code, encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center, respectively. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of the MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.
* '''Challenge:''' Reproducibility, Dark Code; This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures related to generating the figures, such as how and where to download data, how to transform the format of the data to be used as an input for my codes, and so on.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]], John Gentle, and Daniel Noll (Texas Advanced Computing Center and Jackson School of Geosciences, The University of Texas at Austin; US Department of Energy)

* '''Keywords of research area:''' Decision Support Systems, Hydrogeology, Participatory Modeling, Data Fusion
* '''Tentative title:''' MCSDSS: An accessible platform and application to enable data fusion and interactive visualization for the Geosciences
* '''Short abstract:''' The MCSDSS application is an advanced example of interactive design that can lead to data fusion for science visualization, decision support applications, and education. What sets the tool apart is its firm underpinning in data, innovative new forms of interface design, and the reusable platform. A key advance is the creation of a framework that can be used to feed new data, videos, maps, images, or formats of information into the application with relative ease.
* '''Challenge:''' Reproducibility, Dark Code; Fully document a new software application and framework using example case study data and tutorials; Creation of an interface that enables non-programmers to build out interactive visualizations for their data
* '''Relationship to other publications:''' This article is new content. The proof-of-concept idea was developed with DOE funding for a student competition and resulted in an initial implementation that was reported in the DOE competition report and in a master's thesis by co-author Daniel Noll.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:''' mid- to late June 2015

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has a lot of potential.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of single-band depths from the red and panchromatic bands. With this best method, preliminary investigation of seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring, and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30° 05.410' N, 88° 12.694' W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' Dark Code, Reproducibility; My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data from this dataset, that I processed, however all of the figures in the paper are composites that also include other data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but has, since 2010, been going through a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.
* '''Challenge:''' Reproducibility; Dark Code; Document new software/applications. This set of scripts was written out of the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest in understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic predictions for effective watershed management, very few resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow in this paper, we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communications between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproducibility; Reproduce published simulations by an existing model with the latest version. Benchmarking modeling application for numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g., the data come from third parties and are not openly available); in that case, the authors would document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: Sept 15, 2015
* Decisions out to authors: Sept 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T21:37:20Z

Allen: /* Background: Why a Special Issue on Geoscience Papers of the Future? */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by need to fully document and make research accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nano-publications are a possible way to reflect advances in a research process that may not merit a full publication but are still useful to share with the community. A challenge here is the stigma attached to publishing units that are considered too small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps, the concept of a 'paper' can be better reflected in the concept of a 'wrapper' or a collection of materials and resources. The purpose is to assure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to reproduce exactly the same figure automatically? Figures in the paper may be clean versions of images generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to assure that readers may regenerate the figures using documented workflows, data, and codes. An important note (Allen, Sandra) is that frequently figures are generated by code, scripts, etc. yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If it is an important point about explicit code, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of your experimental history does one include? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge''' (including "Reproducibility," "Dark Code," "Sharing Big Data," and "Transferability")
* '''Relationship to other publications''' (is the article based on a previously published article? is it new content? IF PREVIOUSLY PUBLISHED, PLS PROVIDE A POINTER TO THE PUBLISHED ARTICLE AND SPECIFY WHAT PERCENTAGE OF THE WORK PRESENTED WILL BE NEW)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on one single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously-published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Sharing Big Data. Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds on interactive map environments using server-side processing. With increasing resolution of hydrological networks, optimized methods for storage of network representation in databases, and efficient queries and actions on the river network structure become critical. This paper presents a detailed study on analysis of widely used methods for representing hydrological networks in relational databases, and benchmarking common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa utilizing a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements on query times and storage of network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where there are poor boundaries between clusters and much ambiguity in cluster assignments. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on work by [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression and Mars.
* '''Challenge:''' Reproducibility (i.e., Quantifying clustering); We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against man-made and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) would be based off a different paper we are preparing and planning to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States region. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating climate models. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Big Data Sharing, Dark Code; Sharing big data, better documenting source codes, encouraging climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center, respectively. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of the MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for the momentum budget analysis to understand the respective contributions of various processes involved in the wind evolution associated with the MJO using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' Reproducibility, Dark Code; This paper will cover how to reproduce two key figures from the paper that I recently submitted to Journal of Atmospheric Science. This will include detailed procedures related to generating the figures, such as how and where to download data, how to transform the format of the data to be used as an input for my codes, and so on.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to Journal of Atmospheric Science.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]], John Gentle, and Daniel Noll (Texas Advanced Computing Center and Jackson School of Geosciences, The University of Texas at Austin; US Department of Energy)

* '''Keywords of research area:''' Decision Support Systems, Hydrogeology, Participatory Modeling, Data Fusion
* '''Tentative title:''' MCSDSS: An accessible platform and application to enable data fusion and interactive visualization for the Geosciences
* '''Short abstract:''' The MCSDSS application is an advanced example of interactive design that can lead to data fusion for science visualization, decision support applications, and education. What sets the tool apart is its firm underpinning in data, innovative new forms of interface design, and the reusable platform. A key advance is the creation of a framework that can be used to feed new data, videos, maps, images, or other formats of information into the application with relative ease.

* '''Challenge:''' Reproducibility, Dark Code; Fully document a new software application and framework using example case study data and tutorials; Creation of an interface that enables non-programmers to build out interactive visualizations for their data
* '''Relationship to other publications:''' This article is new content. The proof-of-concept idea was developed with DOE funding for a student competition and resulted in an initial implementation that was reported in the DOE competition report and in a master's thesis by co-author Daniel Noll.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:''' mid- to late June 2015

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has a lot of potential.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of single band depths by red and panchromatic bands. With this best method, preliminary investigation of seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30 05.410'N 88 12.694'W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' Dark Code, Reproducibility; My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data from this dataset that I processed; however, all of the figures in the paper are composites that also include other data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole stream metabolism. This information is key in understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but, since 2010, has been going through a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Reproducibility; Dark Code; Document new software/applications. This set of scripts was written out of the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest in understanding not only intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic predictions for effective management of watersheds, very limited resources of the model implementation have been made accessible to a large group of model users. The users have to invest a significant amount of time and effort to reproduce and understand the workflow of hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow in this paper, we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communications between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproducibility; Reproduce published simulations by an existing model with the latest version. Benchmarking a modeling application for a numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g. the data is from third parties and not openly available, etc), and in that case they would document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: Sept 15, 2015
* Decisions out to authors: Sept 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T21:36:51Z

Allen: /* The challenges of creating GPFs */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by need to fully document and make research accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nano-publications are a possible way to reflect advances in a research process that may not merit a full publication but are still useful to share with the community. A challenge here is the stigma attached to publishing units that are very small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps the concept of a 'paper' can be better reflected in the concept of a 'wrapper' or a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future. Papers have been broadly categorized according to their main "Challenges" - including '''"Reproducibility," "Dark Code," "Sharing Big Data," and "Transferability."'''

'''Figure discussions''': Do we want to reproduce exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using documented workflows, data, and code. An important note (Allen, Sandra) is that figures are frequently generated by code, scripts, etc., yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of your experimental history does one include? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process/methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge''' (including "Reproducibility," "Dark Code," "Sharing Big Data," and "Transferability")
* '''Relationship to other publications''' (is the article based on a previously published article? is it new content? IF PREVIOUSLY PUBLISHED, PLS PROVIDE A POINTER TO THE PUBLISHED ARTICLE AND SPECIFY WHAT PERCENTAGE OF THE WORK PRESENTED WILL BE NEW)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on one single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously-published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Sharing Big Data. Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds on interactive map environments using server-side processing. With increasing resolution of hydrological networks, optimized methods for storing the network representation in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa, utilizing a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where there are poor boundaries between clusters and much ambiguity in cluster assignments. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on work by [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression and Mars.
* '''Challenge:''' Reproducibility (i.e., quantifying clustering); We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against man-made and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) is based on a different paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States region. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating any climate model. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Sharing Big Data, Dark Code; Sharing big data, better documenting source code, and encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and for investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' Reproducibility, Dark Code; This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures related to generating the figures, such as how/where to download data, how to transform the format of the data to be used as an input for my codes, and so on.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]], John Gentle, and Daniel Noll (Texas Advanced Computing Center and Jackson School of Geosciences, The University of Texas at Austin; US Department of Energy)

* '''Keywords of research area:''' Decision Support Systems, Hydrogeology, Participatory Modeling, Data Fusion
* '''Tentative title:''' MCSDSS: An accessible platform and application to enable data fusion and interactive visualization for the Geosciences
* '''Short abstract:''' The MCSDSS application is an advanced example of interactive design that can lead to data fusion for science visualization, decision support applications, and education. What sets the tool apart is its firm underpinning in data, innovative new forms of interface design, and the reusable platform. A key advance is the creation of a framework that can be used to feed new data, videos, maps, images, or formats of information into the application with relative ease.

* '''Challenge:''' Reproducibility, Dark Code; Fully document a new software application and framework using example case study data and tutorials; Creation of an interface that enables non-programmers to build out interactive visualizations for their data
* '''Relationship to other publications:''' This article is new content. The proof-of-concept idea was developed with DOE funding for a student competition and resulted in an initial implementation that was reported in the DOE competition report and in a master's thesis by co-author Daniel Noll.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:''' mid- to late June 2015

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has a lot of potential.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths from the red and panchromatic bands. With this best method, preliminary investigation of seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30° 05.410' N, 88° 12.694' W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' Dark Code, Reproducibility; My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data from this dataset that I processed; however, all of the figures in the paper are composites that also include other data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but has, since 2010, been going through a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Reproducibility; Dark Code; Document new software/applications. This set of scripts was written out of the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest for understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic prediction for effective watershed management, very few resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of a hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproducibility; Reproduce published simulations of an existing model with the latest version. Benchmarking modeling application for numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g., the data is from third parties and not openly available); in that case, the authors should document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: September 15, 2015
* Decisions out to authors: September 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T21:36:03Z

Allen: /* The challenges of creating GPFs */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by need to fully document and make research accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nano-publications are a possible way to share advances in a research process that may not merit a full publication but are still useful to the community. A challenge here is the stigma attached to publishing units that are considered too small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps the concept of a 'paper' can be better reflected in the concept of a 'wrapper', or a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future. Papers have been broadly categorized according to their main "Challenges" - including "Reproducibility," "Dark Code," "Sharing Big Data," and "Transferability."

'''Figure discussions''': Do we want to reproduce exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using documented workflows, data, and code. An important note (Allen, Sandra) is that figures are frequently generated by code, scripts, etc., yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If it is an important point about explicit code, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of one's experimental history should one include? (Ibrahim). The experimental process often leads nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge''' (including "Reproducibility," "Dark Code," "Sharing Big Data," and "Transferability")
* '''Relationship to other publications''' (is the article based on a previously published article? is it new content? If previously published, please provide a pointer to the published article and specify what percentage of the work presented will be new)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on one single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Sharing Big Data. Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
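
The automated comparison described in this abstract could look roughly like the sketch below. It is only an illustration under stated assumptions: the function name <code>compare_series</code>, the tolerances, and the discharge values are hypothetical and are not taken from RAPID or from the paper.

```python
import math


def compare_series(reference, candidate, rel_tol=1e-6, abs_tol=1e-9):
    """Return the indices where candidate differs from reference beyond tolerance.

    Hypothetical helper: the tolerances are illustrative defaults, not values
    used by any particular model.
    """
    if len(reference) != len(candidate):
        raise ValueError("series have different lengths")
    return [
        i
        for i, (ref, new) in enumerate(zip(reference, candidate))
        if not math.isclose(ref, new, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Made-up discharge values standing in for a published run and for the
# output of an updated model version.
published = [12.5, 13.1, 15.8, 14.2]
updated = [12.5, 13.1000001, 15.8, 14.9]

mismatches = compare_series(published, updated)
if mismatches:
    print("Results differ from the published run at indices:", mismatches)
else:
    print("Updated version reproduces the published run within tolerance.")
```

A check of this kind can be run after every code change, so that a contributor learns immediately whether an addition still reproduces earlier results.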

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds in interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa utilizing a 90m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where boundaries between clusters are poorly defined and cluster assignments are highly ambiguous. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on work by [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression and Mars.
* '''Challenge:''' Reproducibility (i.e., quantifying clustering); We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against synthetic and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) would be based on a different paper we are preparing and planning to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States region. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating climate models. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Big Data Sharing, Dark Code; Sharing big data, better documenting source code, and encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' Reproducibility, Dark Code; This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures for generating the figures, such as how and where to download the data and how to transform the data format for use as input to my code.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]], John Gentle, and Daniel Noll (Texas Advanced Computing Center and Jackson School of Geosciences, The University of Texas at Austin; US Department of Energy)

* '''Keywords of research area:''' Decision Support Systems, Hydrogeology, Participatory Modeling, Data Fusion
* '''Tentative title:''' MCSDSS: An accessible platform and application to enable data fusion and interactive visualization for the Geosciences
* '''Short abstract:''' The MCSDSS application is an advanced example of interactive design that can lead to data fusion for science visualization, decision support applications, and education. What sets the tool apart is its firm underpinning in data, innovative new forms of interface design, and the reusable platform. A key advance is the creation of a framework that can be used to feed new data, videos, maps, images, or other formats of information into the application with relative ease.

* '''Challenge:''' Reproducibility, Dark Code; Fully document a new software application and framework using example case study data and tutorials; Creation of an interface that enables non-programmers to build out interactive visualizations for their data
* '''Relationship to other publications:''' This article is new content. The proof-of-concept idea was developed with DOE funding for a student competition and resulted in an initial implementation that was reported in the DOE competition report and in a master's thesis by co-author Daniel Noll.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:''' mid- to late June 2015

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has a lot of potential.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths from the red and panchromatic bands. With this best method, preliminary investigation of seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30 05.410'N 88 12.694'W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' Dark Code, Reproducibility; My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data I processed from this dataset; however, all of the figures in that paper are composites that also include other data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but, since 2010, has been undergoing a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Reproducibility; Dark Code; Document new software/applications. This set of scripts was written to meet the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest in understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic predictions for effective management of watersheds, very few resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of a hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproducibility; Reproduce published simulations with the latest version of an existing model. Benchmarking modeling application for numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Some papers will have good reasons for limiting the information they provide (e.g., the data come from third parties and are not openly available); in that case, the authors would document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.
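
As one possible reading of these criteria, the sketch below assembles a small provenance record for a single figure, tying it to a dataset and a script by identifier and checksum. The record layout, the function names, and the placeholder DOIs are assumptions made for illustration only; the special issue does not prescribe this format.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(figure_id, dataset_bytes, script_bytes, dataset_id, software_id):
    """Build a minimal, hypothetical provenance record for one figure."""
    return {
        "figure": figure_id,
        "dataset": {"identifier": dataset_id, "sha256": sha256_of(dataset_bytes)},
        "software": {"identifier": software_id, "sha256": sha256_of(script_bytes)},
    }


# In-memory stand-ins for a data file and a plotting script; the identifiers
# are placeholders, not real DOIs.
record = provenance_record(
    "Figure 1",
    dataset_bytes=b"time,depth\n0,1.2\n",
    script_bytes=b"print('plot')\n",
    dataset_id="doi:10.0000/example-dataset",
    software_id="doi:10.0000/example-code",
)
print(json.dumps(record, indent=2))
```

Checksums let a reader confirm that the files retrieved from a repository are the same ones used to produce the figure.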

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: Sept 15, 2015
* Decisions out to authors: Sept 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:56:00Z

Allen: /* Papers to be included */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by the need to fully document research and make it accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nanopublications are a possible way to reflect advances in a research process that may not merit a full publication but are still useful to share with the community. A challenge here is the stigma attached to publishing units that are considered too small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF-style publication.

Perhaps the concept of a 'paper' can be better reflected in the concept of a 'wrapper' or a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to regenerate exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using documented workflows, data, and code. An important note (Allen, Sandra) is that figures are frequently generated by code, scripts, etc., yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of the experimental history should one include? (Ibrahim). The experimental process often leads nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge''' (including "Reproducibility," "Dark Code," "Sharing Big Data," and "Transferability")
* '''Relationship to other publications''' (is the article based on a previously published article? is it new content? If previously published, please provide a pointer to the published article and specify what percentage of the work presented will be new)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience often proceeds incrementally over many years. Endeavors that generally start on a single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
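
As an illustration of the automated comparison described in the abstract above, the following is a minimal sketch in Python. It is not the testing tool used for RAPID; the function name <code>compare_series</code>, the tolerances, and the sample values are all hypothetical and chosen only to show the idea of checking updated outputs against previously published ones.

```python
import math


def compare_series(reference, candidate, rel_tol=1e-6, abs_tol=1e-9):
    """Return (index, reference value, candidate value) for every mismatch.

    Hypothetical helper: tolerances are illustrative assumptions,
    not values taken from any published test suite.
    """
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate differ in length")
    return [
        (i, r, c)
        for i, (r, c) in enumerate(zip(reference, candidate))
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Made-up numbers standing in for published and re-computed model output.
published = [12.5, 30.1, 7.25]
updated = [12.5, 30.1000000001, 7.25]

mismatches = compare_series(published, updated)
if mismatches:
    print(f"{len(mismatches)} value(s) differ from the published results")
else:
    print("updated version reproduces the published results")
```

Returning the list of mismatches, rather than a single pass/fail flag, lets a contributor see which values drifted before deciding whether a change is acceptable.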

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds on interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing the network representation in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa using a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and in storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect a complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where there are poor boundaries between clusters and much ambiguity in cluster assignments. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on work by [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression and Mars.
* '''Challenge:''' Reproducibility (i.e., Quantifying clustering); We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against man-made and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) would be based on a different paper we are preparing and planning to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating climate models. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Big Data Sharing, Dark Code; Sharing big data, better documenting source code, encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' Reproducibility, Dark Code; This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures related to generating the figures, such as how and where to download data, how to transform the format of the data to be used as an input for my codes, and so on.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]] and John Gentle (Texas Advanced Computing Center and Jackson School of Geosciences, The University of Texas at Austi

* '''Keywords of research area:''' Decision Support Systems, Hydrogeology, Participatory Modeling, Data Fusion
* '''Tentative title:''' [[
* '''Short abstract:'''

* '''Challenge:''' Reproducibility, Dark Code; Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has a lot of potential.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of single-band depths from the red and panchromatic bands. With this best method, preliminary investigation of seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring, and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30 05.410'N 88 12.694'W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' Dark Code, Reproducibility; My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data from this dataset that I processed; however, all of the figures in the paper are composites that also include other data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''
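
The mooring position in the abstract above is given in degrees and decimal minutes. A small sketch of the conversion to signed decimal degrees, which many mapping and analysis tools expect, might look like the following. The function name <code>ddm_to_decimal</code> is hypothetical and is not part of the FOCAL processing code.

```python
def ddm_to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and decimal minutes to signed decimal degrees.

    Hypothetical helper for illustration; south and west are negative.
    """
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError("hemisphere must be one of N, S, E, W")
    value = degrees + minutes / 60.0
    return -value if hemisphere in ("S", "W") else value


# Mooring position as quoted in the abstract: 30 05.410'N 88 12.694'W
lat = ddm_to_decimal(30, 5.410, "N")
lon = ddm_to_decimal(88, 12.694, "W")
print(f"{lat:.5f}, {lon:.5f}")
```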

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but, since 2010, has been undergoing a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Reproducibility; Dark Code; Document new software/applications. This set of scripts was written to meet the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest for understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic predictions for effective watershed management, very few resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of a hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproducibility; Reproduce published simulations from an existing model with the latest version. Benchmarking modeling application for numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Some papers will have good reasons for limiting the information provided (e.g., the data come from third parties and are not openly available); in that case, the authors should document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc.), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: September 15, 2015
* Decisions out to authors: September 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:55:00Z

Allen: /* Papers to be included */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by the need to fully document research and make it accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nanopublications are a possible way to reflect advances in a research process that may not merit a full publication but are still useful to share with the community. A challenge here is the stigma attached to publishing units that are considered too small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF-style publication.

Perhaps the concept of a 'paper' can be better reflected in the concept of a 'wrapper' or a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to reproduce exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to assure that readers may regenerate the figures using documented workflows, data, and codes. An important note (Allen, Sandra) is that frequently figures are generated by code, scripts, etc. yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If it is an important point about explicit code, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of your experimental history does one include? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge''' (including "Reproducibility," "Dark Code," "Sharing Big Data," ...)
* '''Relationship to other publications''' (Is the article based on a previously published article? Is it new content? If previously published, please provide a pointer to the published article and specify what percentage of the work presented will be new.)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on one single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
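
The automated comparison described in the [David 2015] abstract can be illustrated with a minimal sketch. This is not the testing tool used for RAPID; the function name <code>compare_outputs</code>, the tolerances, and the discharge values below are hypothetical and chosen only to show the idea of checking an updated run against previously published results.

```python
import math


def compare_outputs(reference, candidate, rel_tol=1e-6, abs_tol=1e-9):
    """Return (index, reference, candidate) for every value outside tolerance."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate have different lengths")
    mismatches = []
    for i, (ref, new) in enumerate(zip(reference, candidate)):
        if not math.isclose(ref, new, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, ref, new))
    return mismatches


# Hypothetical discharge values (m^3/s), not taken from any publication.
published = [12.40, 15.10, 9.87]
updated_ok = [12.40, 15.10000001, 9.87]   # differs only by round-off
updated_bad = [12.40, 15.60, 9.87]        # one value has really changed

print(compare_outputs(published, updated_ok))   # []
print(compare_outputs(published, updated_bad))  # [(1, 15.1, 15.6)]
```

An empty list means the updated version reproduces the reference within tolerance; a non-empty list points to the values that need to be checked before a change is accepted.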

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds in interactive map environments using server-side processing. With increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa utilizing a 90m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and storage of network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where there are poor boundaries between clusters and much ambiguity in cluster assignments. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on work from [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression and Mars.
* '''Challenge:''' Reproducibility (i.e., Quantifying clustering); We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against man-made and real world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) would be based off a different paper we are preparing and planning to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction and insolation) and their covariability over the conterminous United States region. Regional Climate Model Evaluation System (RCMES), a suite of public database and open-source software package, provides both observational datasets and data processors useful for evaluating any climate models. In this paper, we provide a clear and easy-to-follow workflow of RCMES to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Big Data Sharing, Dark Code; Sharing big data, better documenting source codes, encouraging climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]] Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center, respectively. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of the MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for the momentum budget analysis to understand the respective contributions of various processes involved in the wind evolution associated with the MJO using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' Reproducibility, Dark Code; This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures related to generating the figures, such as how and where to download data, how to transform the format of the data to be used as an input for my codes, and so on.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne A Pierce]] and John Gentle (Texas Advanced Computing Center and Jackson School of Geosciences, The University of Texas at Austi

* '''Keywords of research area:''' Decision Support Systems, Hydrogeology, Participatory Modeling, Data Fusion
* '''Tentative title:''' [[
* '''Short abstract:'''

* '''Challenge:''' Reproducibility, Dark Code; Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has a lot of potential.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of single-band depths from the red and panchromatic bands. With this best method, preliminary investigation of seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring, and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30° 05.410' N, 88° 12.694' W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' Dark Code, Reproducibility; My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data from this dataset that I processed; however, all of the figures in the paper are composites that also include other data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''
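
As a small worked example for the [Tzeng 2015] entry, the mooring position is given in degrees and decimal minutes, while many processing tools expect signed decimal degrees. The sketch below shows the conversion; the function name <code>ddm_to_decimal</code> is hypothetical and is not part of the FOCAL processing code.

```python
def ddm_to_decimal(degrees, minutes, hemisphere):
    """Convert degrees and decimal minutes to signed decimal degrees."""
    value = degrees + minutes / 60.0
    # South and West are negative by the usual convention.
    return -value if hemisphere in ("S", "W") else value


# Mooring position quoted in the abstract: 30 05.410'N 88 12.694'W
lat = ddm_to_decimal(30, 5.410, "N")
lon = ddm_to_decimal(88, 12.694, "W")
print(round(lat, 5), round(lon, 5))  # 30.09017 -88.21157
```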

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole stream metabolism. This information is key in understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but, since 2010, has been going through a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Reproducibility; Dark Code; Document new software/applications. This set of scripts was written to meet the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest in understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic predictions for effective watershed management, very limited resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communications between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproducibility; Reproduce published simulations of an existing model with its latest version. Benchmarking modeling application for numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g. the data is from third parties and not openly available, etc), and in that case they would document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.
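
As an illustration only, the three criteria above could be captured in a small machine-readable record attached to each figure. The sketch below is an assumption about what such a record might look like, not a format required by the special issue; the field names, the helper <code>provenance_record</code>, the placeholder identifiers, and the example command are all hypothetical.

```python
import hashlib
import json


def provenance_record(figure, dataset_id, dataset_bytes, software_id, command):
    """Bundle dataset, software, and workflow information for one figure."""
    return {
        "figure": figure,
        "dataset": {
            "identifier": dataset_id,
            # A checksum lets a reader confirm they hold the same data.
            "sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        },
        "software": {"identifier": software_id},
        "command": command,
    }


# Placeholder identifiers and data; real entries would use assigned DOIs.
example_data = b"station,depth_m\nA,1.5\n"
record = provenance_record(
    figure="Figure 2",
    dataset_id="doi:10.0000/example-dataset",
    dataset_bytes=example_data,
    software_id="doi:10.0000/example-software",
    command="python make_figure2.py --input lakes.csv",
)
print(json.dumps(record, indent=2))
```

Storing a record like this alongside each figure gives reviewers one place to look for the dataset identifier, the software identifier, and the step that produced the result.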

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: Sept 15, 2015
* Decisions out to authors: Sept 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:54:20Z

Allen: /* [Yu and Bhatt 2015] */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by need to fully document and make research accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nano-publications are a possible way to reflect advances in a research process that may not merit a full publication but are still useful to share with the community. A challenge here is the stigma attached to publishing units that are considered too small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps, the concept of a 'paper' can be better reflected in the concept of a 'wrapper' or a collection of materials and resources. The purpose is to assure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to reproduce exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to assure that readers may regenerate the figures using documented workflows, data, and codes. An important note (Allen, Sandra) is that frequently figures are generated by code, scripts, etc. yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If it is an important point about explicit code, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of one's experimental history should be included? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (Is the article based on a previously published article? Is it new content? If previously published, please provide a pointer to the published article and specify what percentage of the work presented will be new.)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on a single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
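
The automated comparison described in the [David 2015] abstract can be illustrated with a minimal sketch. This is not RAPID's actual testing code: the function names (<code>compare_series</code>, <code>check_reproduces</code>), the tolerance values, and the sample discharge numbers are all hypothetical, chosen only to show the general idea of checking an updated model run against previously published values within a numerical tolerance.

```python
import math


def compare_series(reference, candidate, rel_tol=1e-6, abs_tol=1e-9):
    """Return (index, reference_value, candidate_value) for each mismatch.

    The tolerances are illustrative assumptions, not values from the paper.
    """
    if len(reference) != len(candidate):
        raise ValueError("series lengths differ")
    return [
        (i, r, c)
        for i, (r, c) in enumerate(zip(reference, candidate))
        if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


def check_reproduces(reference, candidate, **tolerances):
    """Return (ok, mismatches), where ok is True if no value disagrees."""
    mismatches = compare_series(reference, candidate, **tolerances)
    return len(mismatches) == 0, mismatches


if __name__ == "__main__":
    # Hypothetical discharge values: a published run and a rerun
    # with an updated version of the model.
    published = [12.5, 13.1, 15.8, 14.2]
    rerun = [12.5, 13.1, 15.8000001, 14.2]
    ok, diffs = check_reproduces(published, rerun)
    print("reproduced" if ok else f"{len(diffs)} mismatches: {diffs}")
```

Returning the list of mismatches, rather than only a pass/fail flag, lets a contributor see which values changed after an update and by how much.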

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds in interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa using a 90 m DEM and 600,000 network nodes. The results indicate that the representation methods provide massive improvements in query times and in storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect a complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where there are poor boundaries between clusters and much ambiguity in cluster assignments. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on work by [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression and Mars.
* '''Challenge:''' Reproducibility (i.e., quantifying clustering); We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against synthetic and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) is based on a different paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating climate models. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers that evaluate North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Big Data Sharing, Dark Code; Sharing big data, better documenting source code, encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined, robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses from the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' Reproducibility, Dark Code; This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures for generating the figures, such as how and where to download the data and how to transform the data into the input format my codes require.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne A Pierce]] and John Gentle (Texas Advanced Computing Center and Jackson School of Geosciences, The University of Texas at Austin)

* '''Keywords of research area:''' Decision Support Systems, Hydrogeology, Participatory Modeling, Data Fusion
* '''Tentative title:''' [[
* '''Short abstract:'''

* '''Challenge:''' Reproducibility, Dark Code; Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has a lot of potential.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as to quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths from the red and panchromatic bands. With this best method, preliminary investigation of the seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography of Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array, and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30° 05.410' N, 88° 12.694' W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' Dark Code, Reproducibility; My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data I processed from this dataset; however, all of the figures in that paper are composites that also include data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquiring and processing river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but has, since 2010, been going through a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Reproducibility; Dark Code; Document new software/applications. This set of scripts was written out of the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest in understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic predictions for effective management of watersheds, very limited resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of the hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproducibility; Reproduce published simulations of an existing model using its latest version. Benchmark the modeling application against a numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g. the data is from third parties and not openly available, etc), and in that case they would document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.
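
As an illustration of how the three criteria above might be checked mechanically, the following is a minimal sketch of a per-figure provenance record and a completeness check. The record layout, the field names (<code>description</code>, <code>identifier</code>, <code>repository</code>, <code>workflow</code>), the function names, and all identifiers and URLs shown are assumptions made for this example only; the special issue does not prescribe any particular format.

```python
import json

# Fields that each dataset and software entry is expected to carry,
# mirroring the review criteria (description, unique identifier, repository).
REQUIRED_KEYS = ("description", "identifier", "repository")


def missing_fields(entry):
    """Return the required keys that are absent or empty in one entry."""
    return [key for key in REQUIRED_KEYS if not entry.get(key)]


def check_record(record):
    """Return a list of human-readable problems found in a provenance record."""
    problems = []
    for kind in ("datasets", "software"):
        for i, entry in enumerate(record.get(kind, [])):
            for key in missing_fields(entry):
                problems.append(f"{kind}[{i}] is missing '{key}'")
    if not record.get("workflow"):
        problems.append("no workflow steps recorded")
    return problems


# Hypothetical record for one figure; every value below is a placeholder.
example_record = {
    "figure": "Figure 1",
    "datasets": [
        {
            "description": "Example input time series",
            "identifier": "doi:10.0000/example-dataset",
            "repository": "https://example.org/data",
        }
    ],
    "software": [
        {
            "description": "Example plotting script",
            "identifier": "doi:10.0000/example-software",
            "repository": "https://example.org/code",
        }
    ],
    "workflow": ["download data", "run model", "plot results"],
}

if __name__ == "__main__":
    print(json.dumps(example_record, indent=2))
    print(check_record(example_record))
```

A paper with good reasons for withholding some information would document those reasons rather than fill in the corresponding fields, so a check like this is an aid to reviewers and not an acceptance test.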

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: September 15, 2015
* Decisions out to authors: September 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:53:56Z

Allen: /* [Villamizar 2015] */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by the need to fully document research and make it accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nano-publications are a possible way to share advances in a research process that may not merit a full publication but are still useful to the community. A challenge here is the stigma attached to publishing units that are very small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps the concept of a 'paper' is better reflected in the concept of a 'wrapper', that is, a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to regenerate exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using documented workflows, data, and codes. An important note (Allen, Sandra) is that frequently figures are generated by code, scripts, etc. yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of one's experimental history should be included? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (Is the article based on a previously published article? Is it new content? If previously published, please provide a pointer to the published article and specify what percentage of the work presented will be new.)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on a single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
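
The automated comparison described in the [David 2015] abstract could look roughly like the sketch below. It is only an illustration and not RAPID's actual test code: the function name <code>compare_series</code>, the tolerances, and the discharge values are all hypothetical, and a real test would read the published and newly computed results from files.

```python
import math


def compare_series(reference, candidate, rel_tol=1e-6, abs_tol=1e-9):
    """Return (index, reference, candidate) for every value that disagrees.

    The tolerances are illustrative assumptions, not values from the paper.
    """
    if len(reference) != len(candidate):
        raise ValueError("series differ in length")
    return [
        (i, ref, new)
        for i, (ref, new) in enumerate(zip(reference, candidate))
        if not math.isclose(ref, new, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Hypothetical discharge values standing in for a published simulation
# and for the output of an updated version of the model.
published = [12.40, 15.10, 18.75, 16.20]
updated = [12.40, 15.10, 18.75, 16.20]

mismatches = compare_series(published, updated)
print("reproduced" if not mismatches else f"{len(mismatches)} mismatches")
```

Running such a check after every code change is one way a contributor could show that an addition has not altered previously published results.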

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds in interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing the network representation in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa, using a 90 m DEM and 600,000 network nodes. The results indicate that the representation methods provide massive improvements in query times and in storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''
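
To make the [Demir 2015] topic concrete, the sketch below shows one possible relational representation of a river network, an adjacency list in which each node stores its downstream neighbour, together with a recursive query that collects everything upstream of an outlet (the core of a watershed delineation). The abstract does not say which representations the paper benchmarks, so the table name <code>node</code>, the column names, and the six-node network are assumptions made purely for illustration.

```python
import sqlite3


def build_example():
    """Create a tiny in-memory river network (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, downstream_id INTEGER)"
    )
    # The index lets the recursive step find upstream neighbours quickly.
    conn.execute("CREATE INDEX idx_node_downstream ON node (downstream_id)")
    conn.executemany(
        "INSERT INTO node VALUES (?, ?)",
        [(1, None), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)],
    )
    return conn


def upstream_nodes(conn, outlet_id):
    """Return the outlet and every node that drains to it, sorted by id."""
    rows = conn.execute(
        """
        WITH RECURSIVE upstream(id) AS (
            SELECT ?
            UNION
            SELECT n.id
            FROM node AS n
            JOIN upstream AS u ON n.downstream_id = u.id
        )
        SELECT id FROM upstream ORDER BY id
        """,
        (outlet_id,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_example()
print(upstream_nodes(conn, 2))
```

An adjacency list is simple to store and update, but each delineation walks the network level by level; alternative encodings trade extra storage or update cost for faster queries, which is the kind of trade-off a benchmark would measure.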

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect a complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where boundaries between clusters are poor and cluster assignments are highly ambiguous. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on work by [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test its ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression and Mars.
* '''Challenge:''' Reproducibility (i.e., quantifying clustering); We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against man-made and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) is based on a different paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction and insolation) and their covariability over the conterminous United States region. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating climate models. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Big Data Sharing, Dark Code; Sharing big data, better documenting source code, encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]] Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' Reproducibility, Dark Code; This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures for generating the figures, such as how and where to download data, how to transform the format of the data to be used as input for my codes, and so on.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne A Pierce]] and John Gentle (Texas Advanced Computing Center and Jackson School of Geosciences, The University of Texas at Austi

* '''Keywords of research area:''' Decision Support Systems, Hydrogeology, Participatory Modeling, Data Fusion
* '''Tentative title:''' [[
* '''Short abstract:'''

* '''Challenge:''' Reproducibility, Dark Code; Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has considerable potential for lake depth estimation.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths from the red and panchromatic bands. With this best method, preliminary investigation of seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring, and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30° 05.410' N, 88° 12.694' W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' Dark Code, Reproducibility; My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data that I processed from this dataset; however, all of the figures in that paper are composites that also include other data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but, since 2010, has been undergoing a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Reproducibility; Dark Code; Document new software/applications. This set of scripts was written out of the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest in understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic prediction for effective watershed management, very few resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of a hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproduce published simulations with the latest version of an existing model. Benchmark modeling applications for a numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g., the data are from third parties and not openly available), and in that case the authors should document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc.), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: September 15, 2015
* Decisions out to authors: September 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:53:34Z

Allen: /* [Tzeng 2015] */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by the need to fully document research and make it accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nanopublications are a possible way to reflect advances in a research process that may not merit a full publication but are useful to share with the community. A challenge here is the stigma attached to publishing units that are very small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps the concept of a 'paper' can be better reflected in the concept of a 'wrapper', or a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to regenerate exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using documented workflows, data, and code. An important note (Allen, Sandra) is that figures are frequently generated by code, scripts, etc., yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of the experimental history should one include? (Ibrahim) The experimental process often leads nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document, and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign DOIs to datasets before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process/methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (Is the article based on a previously published article? Is it new content? If previously published, please provide a pointer to the published article and specify what percentage of the work presented will be new.)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on a single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
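
The automated comparison described in the abstract above could be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the RAPID test suite: the function name <code>compare_to_reference</code>, the tolerance values, and the sample discharge values are all hypothetical.

```python
import math


def compare_to_reference(new_values, reference_values, rel_tol=1e-6, abs_tol=1e-9):
    """Compare new model output against previously published reference output.

    Returns a tuple (passed, max_abs_diff, n_mismatches).
    The tolerances are illustrative assumptions, not values from any paper.
    """
    if len(new_values) != len(reference_values):
        raise ValueError("new and reference outputs differ in length")

    max_abs_diff = 0.0
    n_mismatches = 0
    for new, ref in zip(new_values, reference_values):
        # Track the largest absolute difference for reporting.
        max_abs_diff = max(max_abs_diff, abs(new - ref))
        # Count values that fall outside the accepted tolerance.
        if not math.isclose(new, ref, rel_tol=rel_tol, abs_tol=abs_tol):
            n_mismatches += 1
    return n_mismatches == 0, max_abs_diff, n_mismatches


# Made-up discharge values standing in for a published simulation.
reference = [12.5, 40.0, 103.2]
unchanged = [12.5, 40.0, 103.2]  # output of an update that preserves results
modified = [12.5, 40.0, 104.0]   # output of an update that changes one value

result_unchanged = compare_to_reference(unchanged, reference)
result_modified = compare_to_reference(modified, reference)
print(result_unchanged)
print(result_modified)
```

A check of this kind could be run after every code change, so that a contributor sees immediately whether a published result is still reproduced.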

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds on interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa, using a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and in storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect a complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where there are poor boundaries between clusters and much ambiguity in cluster assignments. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on work by [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al. 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression and Mars.
* '''Challenge:''' Reproducibility (i.e., Quantifying clustering); We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against man-made and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) is based on a separate paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction and insolation) and their covariability over the conterminous United States region. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating climate models. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Big Data Sharing, Dark Code; Sharing big data, better documenting source code, encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center, respectively. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' Reproducibility, Dark Code; This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures related to generating the figures, such as how and where to download data, how to transform the format of the data to be used as input for my codes, and so on.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]] and John Gentle (Texas Advanced Computing Center and Jackson School of Geosciences, The University of Texas at Austin)

* '''Keywords of research area:''' Hydrogeology, Risk
* '''Tentative title:''' [[
* '''Short abstract:'''

* '''Challenge:''' Reproducibility, Dark Code; Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has considerable potential for estimating lake depth.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as to quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths from the red and panchromatic bands. With this best method, preliminary investigation of seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring, and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30 05.410'N 88 12.694'W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' Dark Code, Reproducibility; My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data from this dataset that I processed; however, all of the figures in the paper are composites that also include other data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but, since 2010, has been going through a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Document new software/applications. This set of scripts was written out of the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest in understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic predictions for effective watershed management, very few resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of a hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproduce published simulations of an existing model using its latest version. Benchmark the modeling application against a numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g. the data is from third parties and not openly available), in which case the authors should document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.
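
As one possible illustration of the third criterion, a provenance record for a figure could be a small machine-readable structure that ties the figure to the exact dataset and script that produced it. The sketch below is hypothetical: the function <code>build_provenance_record</code>, the field names, and the placeholder identifiers are assumptions made for illustration and are not a format required by the special issue.

```python
import hashlib
import json


def sha256_hex(content: bytes) -> str:
    """Return the SHA-256 checksum of some content as a hex string."""
    return hashlib.sha256(content).hexdigest()


def build_provenance_record(figure_id, dataset_id, data_bytes, software_id, script_bytes):
    """Build a simple provenance record for one figure.

    The field names here are illustrative assumptions, not a standard schema.
    """
    return {
        "figure": figure_id,
        "dataset": {"identifier": dataset_id, "sha256": sha256_hex(data_bytes)},
        "software": {"identifier": software_id, "sha256": sha256_hex(script_bytes)},
    }


# Placeholder inputs; a real record would read the actual data and script files
# and use the real unique identifiers assigned by the repositories.
record = build_provenance_record(
    figure_id="Figure 1",
    dataset_id="placeholder-dataset-identifier",
    data_bytes=b"time,value\n0,1.0\n1,2.0\n",
    software_id="placeholder-software-identifier",
    script_bytes=b"print('make figure 1')\n",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing checksums alongside the identifiers lets a reader confirm that the files they downloaded are the ones used to make the figure.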

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: Sept 15, 2015
* Decisions out to authors: Sept 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:52:55Z

Allen: /* [Pierce 2015] */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by need to fully document and make research accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nanopublications are a possible way to reflect advances in a research process that may not merit a full publication but are useful to share with the community. A challenge here is that there is a stigma attached to publishing units that are very small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps the concept of a 'paper' can be better reflected in the concept of a 'wrapper', or a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to regenerate exactly the same figure automatically? Figures in the paper may be cleaned-up versions of images generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using the documented workflows, data, and code. An important note (Allen, Sandra) is that figures are frequently generated by code, scripts, etc., yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of your experimental history does one include? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (is the article based on a previously published article? is it new content? IF PREVIOUSLY PUBLISHED, PLS PROVIDE A POINTER TO THE PUBLISHED ARTICLE AND SPECIFY WHAT PERCENTAGE OF THE WORK PRESENTED WILL BE NEW)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on a single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds on interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa, using a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and in storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''
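
To make the idea of querying a river network stored in a relational database concrete, here is a minimal sketch using Python's built-in <code>sqlite3</code> module. The abstract does not name the representation methods studied, so the adjacency-list layout (each node stores the id of its downstream node), the table and column names, and the toy network are all assumptions for illustration only. Delineating a watershed then amounts to collecting every node upstream of a chosen outlet, done here with a recursive query.

```python
import sqlite3

# In-memory database holding a toy river network (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, downstream_id INTEGER)")
# Index the lookup column so "who drains into X?" does not scan the table.
conn.execute("CREATE INDEX idx_node_downstream ON node (downstream_id)")

# Node 1 is the outlet; 2 and 3 drain to 1; 4 and 5 drain to 2; 6 drains to 3.
conn.executemany(
    "INSERT INTO node (id, downstream_id) VALUES (?, ?)",
    [(1, None), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)],
)


def upstream_nodes(connection, outlet_id):
    """Return the sorted ids of the outlet and every node draining to it."""
    rows = connection.execute(
        """
        WITH RECURSIVE upstream(id) AS (
            SELECT id FROM node WHERE id = ?
            UNION ALL
            SELECT n.id FROM node AS n
            JOIN upstream AS u ON n.downstream_id = u.id
        )
        SELECT id FROM upstream ORDER BY id
        """,
        (outlet_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(upstream_nodes(conn, 2))  # [2, 4, 5]
print(upstream_nodes(conn, 1))  # [1, 2, 3, 4, 5, 6]
```

An adjacency list is only one possible layout; comparing it against alternatives on query time and storage is the kind of benchmarking the abstract describes.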

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where there are poor boundaries between clusters and much ambiguity in cluster assignments. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on work from [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression and Mars.
* '''Challenge:''' Reproducibility (i.e., Quantifying clustering); We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against synthetic and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) is based on a different paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating any climate model. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Big Data Sharing, Dark Code; Sharing big data, better documenting source code, and encouraging the climate science community to use RCMES.
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a robust combined genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center, respectively. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' Reproducibility, Dark Code; This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures for generating the figures, such as how and where to download the data and how to transform the data into the input format required by my codes.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]] and John Gentle (Texas Advanced Computing Center and Jackson School of Geosciences, The University of Texas at Austi

* '''Keywords of research area:''' Hydrogeology, Risk
* '''Tentative title:''' [[
* '''Short abstract:'''

* '''Challenge:''' Reproducibility, Dark Code; Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has a lot of potential.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths from the red and panchromatic bands. With this best method, preliminary investigation of seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30 05.410'N 88 12.694'W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data that I processed from this dataset; however, all of the figures in that paper are composites that also include data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but has been undergoing a restoration effort since 2010. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Document new software/applications. This set of scripts was written out of the need to generate daily estimates of metabolic rates over long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest in understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic predictions for effective management of watersheds, very limited resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproduce published simulations by an existing model with the latest version. Benchmarking modeling application for numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g. the data is from third parties and not openly available, etc), and in that case they would document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.
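
As a rough illustration of how an author might self-check a submission against the three criteria above, the sketch below represents a paper's documentation as a plain Python dictionary and reports what is missing. The field names, the structure, and the example entries are hypothetical and are not part of any official submission format.

```python
# Hypothetical required fields per criterion, mirroring the list above.
REQUIRED_FIELDS = {
    "datasets": ("description", "identifier", "repository"),
    "software": ("description", "identifier", "repository"),
    "provenance": ("result", "workflow"),
}


def missing_documentation(paper):
    """Return a list of documentation items that are absent or empty."""
    problems = []
    for section, fields in REQUIRED_FIELDS.items():
        entries = paper.get(section, [])
        if not entries:
            problems.append(f"{section}: no entries")
            continue
        for index, entry in enumerate(entries):
            for field in fields:
                if not entry.get(field):
                    problems.append(f"{section}[{index}].{field}")
    return problems


# Made-up example: the software entry lacks an identifier and no
# provenance has been recorded yet.
example_paper = {
    "datasets": [
        {
            "description": "Example field measurements",
            "identifier": "doi:10.0000/example",
            "repository": "example-repository",
        }
    ],
    "software": [
        {
            "description": "Example processing scripts",
            "identifier": "",
            "repository": "example-repository",
        }
    ],
    "provenance": [],
}

report = missing_documentation(example_paper)
print(report)  # ['software[0].identifier', 'provenance: no entries']
```

A paper with a good reason for limiting information, as noted above, would record that reason rather than leave the item blank.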

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: Sept 15, 2015
* Decisions out to authors: Sept 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:52:34Z

Allen: /* [Oh 2015] */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by need to fully document and make research accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nano-publications are a possible way to reflect advances in a research process that may not merit a full publication but are still useful to share with the community. A challenge here is the stigma attached to publishing units that are very small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps, the concept of a 'paper' can be better reflected in the concept of a 'wrapper' or a collection of materials and resources. The purpose is to assure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to reproduce exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using documented workflows, data, and code. An important note (Allen, Sandra) is that figures are frequently generated by code, scripts, etc., yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of your experimental history does one include? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (is the article based on a previously published article? is it new content? IF PREVIOUSLY PUBLISHED, PLS PROVIDE A POINTER TO THE PUBLISHED ARTICLE AND SPECIFY WHAT PERCENTAGE OF THE WORK PRESENTED WILL BE NEW)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on a single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
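
The automated comparison described in the abstract can be sketched roughly as follows: values from an updated model run are checked against reference values from a previously published simulation, within a tolerance. This is not RAPID's actual test suite; the function name, the tolerances, and the sample discharge values are hypothetical and serve only to illustrate the idea.

```python
import math


def compare_to_reference(new_values, reference_values, rel_tol=1e-6, abs_tol=1e-9):
    """Return the indices where the new run differs from the reference run.

    Hypothetical helper: the name and default tolerances are assumptions
    for illustration, not part of RAPID.
    """
    if len(new_values) != len(reference_values):
        raise ValueError("runs have different lengths")
    return [
        index
        for index, (new, ref) in enumerate(zip(new_values, reference_values))
        if not math.isclose(new, ref, rel_tol=rel_tol, abs_tol=abs_tol)
    ]


# Made-up discharge values (m^3/s) standing in for a published simulation.
reference = [12.5, 13.1, 15.8, 14.2]
updated_ok = [12.5, 13.1, 15.8, 14.2]    # update reproduces the reference
updated_bad = [12.5, 13.1, 16.4, 14.2]   # update changes the third value

mismatches_ok = compare_to_reference(updated_ok, reference)
mismatches_bad = compare_to_reference(updated_bad, reference)
print(mismatches_ok)   # []
print(mismatches_bad)  # [2]
```

Run automatically after every code change, a check of this kind lets any contributor confirm that earlier published results are still reproduced.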

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds on interactive map environments using server-side processing. With increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa utilizing a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect a complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where boundaries between clusters are poorly defined and cluster assignments are often ambiguous. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on the work of [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al. 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test its ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, the Afar Depression, and Mars.
* '''Challenge:''' Reproducibility (i.e., quantifying clustering); we plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of these factors affects the performance of our algorithm. We test it against synthetic and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) will be based on a different paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating any climate model. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers that evaluate North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Big Data Sharing, Dark Code; sharing big data, better documenting source code, and encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined, robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and for investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' Reproducibility, Dark Code; this paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include the detailed procedures for generating the figures, such as how and where to download the data and how to transform the data into the input format that my code expects.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]]<sup>1,2</sup>, [[John Gentle]]<sup>1</sup>, [[Daniel Noll]]<sup>2,3</sup>
** <sup>1</sup> Texas Advanced Computing Center
** <sup>2</sup> Jackson School of Geosciences, The University of Texas at Austin
** <sup>3</sup> International Fellows, US Department of Energy

* '''Keywords of research area:''' Hydrogeology, Risk
* '''Tentative title:'''
* '''Short abstract:'''

* '''Challenge:''' Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has considerable potential for this application.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as to quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths from the red and panchromatic bands. With this best method, a preliminary investigation of the seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015
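
The abstract above does not give the depth equation. As an illustration only, one physically based single-band form that appears in the supraglacial lake literature is z = [ln(A_d - R_inf) - ln(R_w - R_inf)] / g, where R_w is the observed water reflectance, A_d the lake-bottom albedo, R_inf the reflectance of optically deep water, and g a band-dependent attenuation term. The sketch below assumes that form and averages the red and panchromatic depths as the abstract describes. The function names and all parameter values are hypothetical and are not taken from the paper.

```python
import math


def single_band_depth(r_w, a_d, r_inf, g):
    """Depth (m) from one band, assuming the physically based form
    z = [ln(A_d - R_inf) - ln(R_w - R_inf)] / g.

    r_w   -- observed reflectance of the lake pixel
    a_d   -- lake-bottom albedo
    r_inf -- reflectance of optically deep water
    g     -- band-dependent attenuation term (1/m)
    """
    if not (r_inf < r_w <= a_d):
        raise ValueError("expected r_inf < r_w <= a_d")
    if g <= 0:
        raise ValueError("g must be positive")
    return (math.log(a_d - r_inf) - math.log(r_w - r_inf)) / g


def mean_depth(depths):
    """Average several single-band depth estimates for one pixel."""
    return sum(depths) / len(depths)


# Hypothetical per-band parameters, for illustration only.
red = {"r_w": 0.20, "a_d": 0.60, "r_inf": 0.04, "g": 0.80}
pan = {"r_w": 0.30, "a_d": 0.65, "r_inf": 0.05, "g": 0.40}

depth = mean_depth([single_band_depth(**red), single_band_depth(**pan)])
print(f"averaged depth estimate: {depth:.2f} m")
```

In this form a pixel whose reflectance equals the bottom albedo has zero depth, and the estimate grows as the pixel darkens toward the deep-water reflectance.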

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography of Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array, and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30° 05.410' N, 88° 12.694' W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data I processed from this dataset; however, all of the figures in that paper are composites that also include data from elsewhere that I had nothing to do with (and it would not be feasible to obtain those other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquiring and processing river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years by intensive human intervention but has, since 2010, been undergoing a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Document new software/applications. This set of scripts was written to meet the need to generate daily estimates of metabolic rates over long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest for understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic prediction for effective watershed management, very few resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of a hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaboration in geoscience research.
* '''Challenge:''' Reproduce published simulations by an existing model with the latest version. Benchmarking modeling application for numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g., the data come from third parties and are not openly available); in that case, the authors should document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc.), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: Sept 15, 2015
* Decisions out to authors: Sept 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:52:00Z

Allen: /* [Mills 2015] */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by need to fully document and make research accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nanopublications are a possible way to reflect advances in a research process that may not merit a full publication but are useful to share with the community. A challenge here is the stigma attached to publishing units that are very small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps, the concept of a 'paper' can be better reflected in the concept of a 'wrapper' or a collection of materials and resources. The purpose is to assure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to regenerate exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to assure that readers may regenerate the figures using documented workflows, data, and codes. An important note (Allen, Sandra) is that frequently figures are generated by code, scripts, etc. yet the actual figure is finalized with user.....

Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of your experimental history does one include? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document, and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (is the article based on a previously published article? is it new content? IF PREVIOUSLY PUBLISHED, PLS PROVIDE A POINTER TO THE PUBLISHED ARTICLE AND SPECIFY WHAT PERCENTAGE OF THE WORK PRESENTED WILL BE NEW)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on a single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
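
The automated comparison described in this abstract can be sketched in a few lines. The example below is a minimal illustration, not the RAPID test suite: the function name, the tolerances, and the discharge values are all hypothetical. It flags every value in a new model run that differs from a stored reference result by more than a chosen tolerance.

```python
import math


def compare_to_reference(new_values, reference_values, rel_tol=1e-6, abs_tol=1e-9):
    """Return (index, new, reference) for every value outside tolerance.

    An empty list means the updated code reproduces the reference run.
    """
    if len(new_values) != len(reference_values):
        raise ValueError("result and reference have different lengths")
    mismatches = []
    for i, (new, ref) in enumerate(zip(new_values, reference_values)):
        if not math.isclose(new, ref, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, new, ref))
    return mismatches


# Hypothetical discharge values (m^3/s) for four river reaches.
reference = [12.50, 13.10, 15.75, 14.20]  # stored from a published experiment
unchanged = [12.50, 13.10, 15.75, 14.20]  # run after a harmless code change
regressed = [12.50, 13.10, 15.95, 14.20]  # run after a change that alters results

print(compare_to_reference(unchanged, reference))  # []
print(compare_to_reference(regressed, reference))  # [(2, 15.95, 15.75)]
```

Run after every code change, a check of this kind lets a contributor show that an addition has not altered previously published results, or pinpoint which values it did alter.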

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds in interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa, using a 90 m DEM and 600,000 network nodes. The results indicate that the representation methods provide substantial improvements in query times and in storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''
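
The abstract does not name the representation methods it compares. As a point of reference only, the sketch below assumes the simplest relational representation, an adjacency list in which each node stores the id of its downstream neighbor, and uses a recursive query to collect every node that drains to a chosen outlet, which is the core query behind watershed delineation. The table, column, and function names and the six-node network are hypothetical.

```python
import sqlite3


def build_network(edges):
    """Load (node_id, downstream_id) pairs into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, downstream_id INTEGER)")
    # The index lets each recursion step find upstream neighbors without a full scan.
    conn.execute("CREATE INDEX idx_node_downstream ON node (downstream_id)")
    conn.executemany("INSERT INTO node (id, downstream_id) VALUES (?, ?)", edges)
    return conn


def upstream_nodes(conn, outlet_id):
    """Return the outlet and every node that drains to it, sorted by id."""
    rows = conn.execute(
        """
        WITH RECURSIVE upstream(id) AS (
            SELECT ?
            UNION
            SELECT node.id
            FROM node JOIN upstream ON node.downstream_id = upstream.id
        )
        SELECT id FROM upstream ORDER BY id
        """,
        (outlet_id,),
    ).fetchall()
    return [row[0] for row in rows]


# Hypothetical network: node 1 is the outlet; 2 and 3 drain to 1,
# 4 and 5 drain to 2, and 6 drains to 3.
edges = [(1, None), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)]
conn = build_network(edges)

print(upstream_nodes(conn, 2))  # [2, 4, 5]
print(upstream_nodes(conn, 1))  # [1, 2, 3, 4, 5, 6]
```

With an adjacency list the recursion takes one step per level of the network, so query cost grows with watershed depth. That cost is the kind of behavior a benchmark of alternative representations would be expected to measure.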

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect a complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where boundaries between clusters are poorly defined and cluster assignments are often ambiguous. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on the work of [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al. 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test its ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, the Afar Depression, and Mars.
* '''Challenge:''' Reproducibility (i.e., quantifying clustering); we plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of these factors affects the performance of our algorithm. We test it against synthetic and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) will be based on a different paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating any climate model. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Big Data Sharing, Dark Code; sharing big data, better documenting source code, and encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Challenge:''' Reproducibility; Dark Code
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses from the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures for generating the figures, such as how and where to download the data and how to transform the data format for use as input to my code.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]]<sup>1,2</sup>, [[John Gentle]]<sup>1</sup>, [[Daniel Noll]]<sup>2,3</sup>
** <sup>1</sup> Texas Advanced Computing Center
** <sup>2</sup> Jackson School of Geosciences, The University of Texas at Austin
** <sup>3</sup> International Fellows, US Department of Energy

* '''Keywords of research area:''' Hydrogeology, Risk
* '''Tentative title:'''
* '''Short abstract:'''

* '''Challenge:''' Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has considerable potential for this application.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as to quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths derived from the red and panchromatic bands. With this best method, a preliminary investigation of the seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array, and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30° 05.410' N, 88° 12.694' W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data that I processed from this dataset; however, all of the figures in that paper are composites that also include other data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquiring and processing river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but has been undergoing a restoration effort since 2010. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Document new software/applications. This set of scripts was written out of the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest for understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic prediction for effective watershed management, very few resources on model implementation have been made accessible to the broad community of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of a hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaboration in geosciences research.
* '''Challenge:''' Reproduce published simulations with the latest version of an existing model. Benchmark the modeling application against a numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g., the data are from third parties and not openly available), and in that case the authors should document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: September 15, 2015
* Decisions out to authors: September 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:51:22Z

Allen: /* [Lee 2015] */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by the need to fully document research and make it accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

Small or very large contributions may not be well captured in the current publishing paradigms.

For example, nanopublications are a possible way to reflect advances in a research process that may not merit a full publication but are still useful to share with the community. A challenge here is the stigma attached to publishing units of work that are perceived as too small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps the concept of a 'paper' can be better reflected in the concept of a 'wrapper', or a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to regenerate exactly the same figure automatically? Figures in the paper may be cleaned-up versions of images generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using the documented workflows, data, and code. An important note (Allen, Sandra) is that figures are frequently generated by code, scripts, etc., yet the actual figure is finalized with user..... Mimi's point: is it really worth belaboring how the prettified version of the figure is made? If it is, both visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of one's experimental history should be included? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate Processes'''''
When should we document, and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that a good practice is to document and share data once there is a clear understanding of which outcomes are worth reporting. For example, successful experiments should have clear, clean data documented and shared. For 'failed' experiments, one strategy could be to bundle the intermediate datasets under one DOI, accompanied by a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (Is the article based on a previously published article? Is it new content? If previously published, please provide a pointer to the published article and specify what percentage of the work presented will be new.)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on a single researcher's own machine evolve over time into software that is often much larger than initially anticipated. Looking back at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster the survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of a software package with those of previous studies can not only save the main developer's own time, but can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds in interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa, utilizing a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and in storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect a complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where boundaries between clusters are poorly defined and cluster assignments are often ambiguous. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on the work of [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al. 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions to test its ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, the Afar Depression, and Mars.
* '''Challenge:''' Reproducibility (i.e., quantifying clustering); we plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of these factors affects the performance of our algorithm. We test it against synthetic and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases studied in the paper (Cascades Volcanic Range) will be based on a different paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating any climate model. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Big Data Sharing, Dark Code; sharing big data, better documenting source code, and encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:'''End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:'''Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:'''
* '''Challenge:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses from the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures for generating the figures, such as how and where to download the data and how to transform the data into the input format that my codes expect.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]]<sup>1,2</sup>, [[John Gentle]]<sup>1</sup>, [[Daniel Noll]]<sup>2,3</sup>
** <sup>1</sup> Texas Advanced Computing Center
** <sup>2</sup> Jackson School of Geosciences, The University of Texas at Austin
** <sup>3</sup> International Fellows, US Department of Energy

* '''Keywords of research area:''' Hydrogeology, Risk
* '''Tentative title:'''
* '''Short abstract:'''

* '''Challenge:''' Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has a lot of potential.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as to quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths from the red and panchromatic bands. With this best method, a preliminary investigation of the seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015
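
As a rough illustration of the single-band approach mentioned in the abstract above, the sketch below assumes the commonly used physically based relation z = [ln(A_d - R_inf) - ln(R_w - R_inf)] / g, where R_w is the reflectance of a lake pixel, A_d the lake-bottom albedo, R_inf the reflectance of optically deep water, and g an attenuation-related coefficient. The abstract does not state this formula, so it is an assumption here, and the function names and all numeric values are hypothetical placeholders rather than values from the paper.

```python
import math


def single_band_depth(r_w, a_d, r_inf, g):
    """Depth (m) from one band, assuming z = [ln(a_d - r_inf) - ln(r_w - r_inf)] / g."""
    if r_w <= r_inf or a_d <= r_inf:
        raise ValueError("reflectances must exceed the deep-water value r_inf")
    return (math.log(a_d - r_inf) - math.log(r_w - r_inf)) / g


def averaged_depth(red, pan):
    """Average the single-band depths from the red and panchromatic bands."""
    return 0.5 * (single_band_depth(**red) + single_band_depth(**pan))


# Hypothetical per-band inputs for one lake pixel (illustrative numbers only).
red = {"r_w": 0.20, "a_d": 0.60, "r_inf": 0.04, "g": 0.80}
pan = {"r_w": 0.30, "a_d": 0.65, "r_inf": 0.05, "g": 0.40}
depth = averaged_depth(red, pan)
print(f"estimated depth: {depth:.2f} m")
```

A pixel whose reflectance equals the bottom albedo yields a depth of zero, which is a quick sanity check on any coefficients substituted for the placeholders.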

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:'''physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array, and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30 05.410'N 88 12.694'W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data that I processed from this dataset; however, all of the figures in that paper are composites that also include data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquiring and processing river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years by intensive human intervention but has been undergoing a restoration effort since 2010. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Document new software/applications. This set of scripts was written to meet the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest for understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic prediction for effective watershed management, very limited resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproduce published simulations of an existing model using its latest version. Benchmark the modeling application against a numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g. the data is from third parties and not openly available, etc), and in that case they would document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.
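
One way to picture a per-figure provenance record that touches all three criteria above is the minimal sketch below. The field names, the `provenance_record` helper, and the DOI-style identifiers are hypothetical and only illustrate the idea; the special issue does not prescribe this format.

```python
import hashlib
import json


def sha256_of(data: bytes) -> str:
    """Return a checksum that pins the exact dataset contents used."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(figure_id, dataset_bytes, dataset_id, software_id, parameters):
    """Bundle dataset, software, and parameter information for one figure."""
    return {
        "figure": figure_id,
        "dataset": {"identifier": dataset_id, "sha256": sha256_of(dataset_bytes)},
        "software": {"identifier": software_id},
        "parameters": parameters,
    }


# Hypothetical identifiers and inline data, for illustration only.
record = provenance_record(
    figure_id="Figure 2",
    dataset_bytes=b"time,discharge\n0,1.5\n1,1.7\n",
    dataset_id="doi:10.0000/hypothetical-dataset",
    software_id="doi:10.0000/hypothetical-code",
    parameters={"smoothing_window": 3},
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Storing such a record next to each figure lets a reviewer check that the cited dataset and code are the ones that produced the result.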

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: Sept 15, 2015
* Decisions out to authors: Sept 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:50:49Z

Allen: /* [Loh and Karlstrom 2015] */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by need to fully document and make research accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nanopublications are a possible way to share advances in a research process that may not merit a full publication but are still useful to the community. A challenge here is the stigma attached to publishing units of work that are considered too small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps, the concept of a 'paper' can be better reflected in the concept of a 'wrapper' or a collection of materials and resources. The purpose is to assure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to regenerate exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using documented workflows, data, and code. An important note (Allen, Sandra) is that figures are frequently generated by code, scripts, etc., yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of your experimental history does one include? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate Processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of which outcomes are worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (is the article based on a previously published article? is it new content? IF PREVIOUSLY PUBLISHED, PLS PROVIDE A POINTER TO THE PUBLISHED ARTICLE AND SPECIFY WHAT PERCENTAGE OF THE WORK PRESENTED WILL BE NEW)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on a single researcher's own machine evolve over time into software that is often much larger than initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
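
The kind of automated comparison described in the abstract above can be sketched as follows. This is a minimal illustration, not the RAPID test suite: the `rmse` and `reproduces` helpers, the tolerance, and the discharge values are all hypothetical.

```python
import math


def rmse(new, reference):
    """Root-mean-square difference between two equally long series."""
    if not reference or len(new) != len(reference):
        raise ValueError("series must be non-empty and of equal length")
    return math.sqrt(sum((n - r) ** 2 for n, r in zip(new, reference)) / len(new))


def reproduces(new, reference, tolerance=1e-6):
    """True if the updated output matches the reference within the tolerance."""
    return rmse(new, reference) <= tolerance


# Hypothetical discharge values standing in for stored results of a past study.
published = [10.0, 12.5, 15.0, 11.0]
updated = [10.0, 12.5, 15.0, 11.0]     # output of a code version that still agrees
regressed = [10.0, 12.5, 15.4, 11.0]   # output of a code version that drifted

print(reproduces(updated, published))
print(reproduces(regressed, published))
```

Running such a check after every code change turns the repetitive reproduction step into something a contributor can do without the original developer.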

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds in interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa, using a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and in storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''
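
To make the database question in the abstract above concrete, the sketch below stores a toy river network as an adjacency list (each node points to its downstream neighbour) and finds everything upstream of a node with a recursive SQL query. The adjacency list is only one common representation and is assumed here for illustration; the table name, column names, and five-node network are hypothetical and are not the methods or data benchmarked in the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, downstream_id INTEGER)")
# Toy network: nodes 1 and 2 drain to 3; nodes 3 and 4 drain to 5 (the outlet).
conn.executemany(
    "INSERT INTO node VALUES (?, ?)",
    [(1, 3), (2, 3), (3, 5), (4, 5), (5, None)],
)
# An index on the downstream pointer keeps each recursive step a lookup.
conn.execute("CREATE INDEX idx_downstream ON node (downstream_id)")


def upstream_nodes(conn, outlet_id):
    """Return the ids of all nodes that drain to outlet_id, sorted ascending."""
    rows = conn.execute(
        """
        WITH RECURSIVE upstream(id) AS (
            SELECT id FROM node WHERE downstream_id = ?
            UNION ALL
            SELECT node.id FROM node JOIN upstream ON node.downstream_id = upstream.id
        )
        SELECT id FROM upstream ORDER BY id
        """,
        (outlet_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(upstream_nodes(conn, 5))
```

With this representation the cost of a watershed query grows with the size of the upstream area, which is why alternative encodings are worth benchmarking on large networks.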

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where boundaries between clusters are poorly defined and cluster assignments are often ambiguous. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on the work of [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al. 2010], which outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test its ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, the Afar Depression, and Mars.
* '''Challenge:''' Reproducibility (i.e., quantifying clustering); we plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of these factors affects the performance of our algorithm. We test it against synthetic and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we study in the paper (Cascades Volcanic Range) will be based on a different paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating any climate model. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers that evaluate North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Sharing big data, better documenting source code, and encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:'''End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:'''Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:'''
* '''Challenge:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include the detailed procedures for generating the figures, such as how and where to download the data and how to transform the data into the input format required by my code.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]]<sup>1,2</sup>, [[John Gentle]]<sup>1</sup>, [[Daniel Noll]]<sup>2,3</sup>
*: <sup>1</sup> Texas Advanced Computing Center
*: <sup>2</sup> Jackson School of Geosciences, The University of Texas at Austin
*: <sup>3</sup> International Fellows, US Department of Energy

* '''Keywords of research area:''' Hydrogeology, Risk
* '''Tentative title:'''
* '''Short abstract:'''

* '''Challenge:''' Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has considerable potential for this application.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as to quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths derived from the red and panchromatic bands. With this best method, a preliminary investigation of the seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography of Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array, and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30° 05.410' N, 88° 12.694' W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data I processed from this dataset; however, all of the figures in that paper are composites that also include data from elsewhere that I had nothing to do with (and it wouldn't be feasible to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquiring and processing river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but has been undergoing a restoration effort since 2010. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Document new software/applications. This set of scripts was written to meet the need for daily estimates of metabolic rates over long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest in understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, testing hypotheses, and making hydrologic predictions for effective watershed management, very few resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of a hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproduce published simulations of an existing model using its latest version. Benchmark the modeling application against a numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g., the data come from third parties and are not openly available); in that case the authors should document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: September 15, 2015
* Decisions out to authors: September 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:50:19Z

Allen: /* [Demir 2015] */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by the need to fully document research and make it accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nanopublications are a possible way to reflect advances in a research process that may not merit a full publication but are still useful to share with the community. A challenge here is the stigma attached to publishing units that are very small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps, the concept of a 'paper' can be better reflected in the concept of a 'wrapper' or a collection of materials and resources. The purpose is to assure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to regenerate exactly the same figure automatically? Figures in the paper may be cleaned-up versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using documented workflows, data, and code. An important note (Allen, Sandra) is that figures are frequently generated by code, scripts, etc., yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of the experimental history should one include? (Ibrahim). The experimental process often leads nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate Processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good practice may include documenting and sharing data once you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process and methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (is the article based on a previously published article? is it new content? IF PREVIOUSLY PUBLISHED, PLS PROVIDE A POINTER TO THE PUBLISHED ARTICLE AND SPECIFY WHAT PERCENTAGE OF THE WORK PRESENTED WILL BE NEW)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience often proceeds incrementally over many years. Endeavors that generally start on a single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking back at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
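
The automated comparison described in the abstract above could be sketched roughly as follows. This is only an illustrative example, not the testing tool used for RAPID: the function name <code>compare_series</code>, the tolerances, and the discharge values are all hypothetical, and a real test would read the reference and updated results from model output files.

```python
import math


def compare_series(reference, candidate, rel_tol=1e-6, abs_tol=1e-9):
    """Return (index, reference, candidate) for every value that differs
    beyond the given tolerances. Hypothetical helper for illustration."""
    if len(reference) != len(candidate):
        raise ValueError("series have different lengths")
    mismatches = []
    for i, (ref, new) in enumerate(zip(reference, candidate)):
        if not math.isclose(ref, new, rel_tol=rel_tol, abs_tol=abs_tol):
            mismatches.append((i, ref, new))
    return mismatches


# Made-up discharge values standing in for a published simulation
# and for the output of an updated model version.
reference_discharge = [12.5, 13.1, 15.8, 14.2]
updated_discharge = [12.5, 13.1000001, 15.8, 14.9]

result = compare_series(reference_discharge, updated_discharge)
if result:
    print("Results differ from the reference run:", result)
else:
    print("Updated version reproduces the reference run.")
```

A tolerance-based comparison is used here on the assumption that bit-for-bit agreement may not hold across compilers or machines; whether that is acceptable is a choice for the model developers.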

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds in interactive map environments using server-side processing. With the increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa using a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and in storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Reproducibility, Transferability; Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect a complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where there are poor boundaries between clusters and much ambiguity in cluster assignments. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, building on the work of [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al. 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression, and Mars.
* '''Challenge:''' Quantifying clustering. We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against synthetic and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) is drawn from a different paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating climate models. In this paper, we provide a clear and easy-to-follow RCMES workflow for replicating published papers that evaluate North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Sharing big data, better documenting source code, and encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:'''
* '''Challenge:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]], Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures for generating the figures, such as how and where to download the data and how to transform the data into the input format required by my codes.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]]<sup>1,2</sup>, [[John Gentle]]<sup>1</sup>, [[Daniel Noll]]<sup>2,3</sup>
1 Texas Advanced Computing Center
2 Jackson School of Geosciences, The University of Texas at Austin
3 International Fellows, US Department of Energy

* '''Keywords of research area:''' Hydrogeology, Risk
* '''Tentative title:'''
* '''Short abstract:'''

* '''Challenge:''' Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 offers considerable potential for lake depth estimation.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as to quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths from the red and panchromatic bands. With this best method, a preliminary investigation of the seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015
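
The averaging of single-band depths mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes a commonly used physically based form in which lake reflectance decays exponentially with depth, and every parameter value (bottom albedo, deep-water reflectance, attenuation coefficients) is a hypothetical placeholder.

```python
import math

def forward_reflectance(z, a_d, r_inf, g):
    """Assumed model: R_w = R_inf + (A_d - R_inf) * exp(-g * z)."""
    return r_inf + (a_d - r_inf) * math.exp(-g * z)

def single_band_depth(r_w, a_d, r_inf, g):
    """Invert the assumed model for depth z (metres)."""
    if not (r_inf < r_w <= a_d):
        raise ValueError("expected r_inf < r_w <= a_d")
    return (math.log(a_d - r_inf) - math.log(r_w - r_inf)) / g

# Hypothetical per-band parameters: (bottom albedo, deep-water reflectance,
# attenuation coefficient in 1/m). These are placeholders, not calibrated values.
RED = (0.55, 0.04, 0.80)
PAN = (0.60, 0.05, 0.35)

# Synthetic reflectances for a lake pixel that is 2.0 m deep.
true_depth = 2.0
r_red = forward_reflectance(true_depth, *RED)
r_pan = forward_reflectance(true_depth, *PAN)

# Depth from each band, then the two-band average.
z_red = single_band_depth(r_red, *RED)
z_pan = single_band_depth(r_pan, *PAN)
mean_depth = 0.5 * (z_red + z_pan)
print(round(mean_depth, 6))
```

With real imagery the two single-band depths generally differ, which is why averaging them can be useful; here they agree because the reflectances are synthetic.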

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array, and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30° 05.410' N, 88° 12.694' W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data that I processed from this dataset; however, all of the figures in that paper are composites that also include data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but, since 2010, has been going through a restoration effort. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Document new software/applications. This set of scripts was written out of the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest in understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic predictions for effective management of watersheds, very limited resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproduce published simulations of an existing model with its latest version. Benchmark the modeling application against a numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g., the data are from third parties and not openly available); in that case the authors would document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.
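
As a rough illustration of what a per-figure provenance record covering these three criteria could contain, the sketch below builds a small JSON record. The field names and the placeholder identifiers are assumptions made for this example; the review criteria do not prescribe a format.

```python
import hashlib
import json

def sha256_of_bytes(data: bytes) -> str:
    """Return the SHA-256 checksum of the input data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def provenance_record(figure_id, dataset_ids, software_ids, input_bytes):
    """Bundle identifiers and an input checksum for one figure or result."""
    return {
        "figure": figure_id,
        "datasets": list(dataset_ids),      # unique identifiers of the datasets
        "software": list(software_ids),     # unique identifiers of the software
        "input_sha256": sha256_of_bytes(input_bytes),
    }

# Hypothetical identifiers and input content, for illustration only.
record = provenance_record(
    "Figure 1",
    ["doi:placeholder-dataset"],
    ["doi:placeholder-software-v1.0"],
    b"time,discharge\n0,12.5\n1,13.1\n",
)
print(json.dumps(record, indent=2))
```

The checksum lets a reader confirm that the data they downloaded are the same bytes that produced the figure.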

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: Sept 15, 2015
* Decisions out to authors: Sept 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Develop proposal for special issue

2015-04-03T17:49:53Z

Allen: /* [David 2015] */

[[Category:Task]]

== Background: Why a Special Issue on Geoscience Papers of the Future? ==

[[Discuss_what_we_will_consider_a_GPF#The_Vision | Include here our discussion for the vision]]

Background should be 1-2 pages.

Motivated by need to fully document and make research accessible and reproducible.

=== Motivation: The EarthCube Initiative and the GeoSoft Project ===

[http://www.geosoft-earthcube.org/about Include here background about GeoSoft from the web site]

OSTP memo. EarthCube reports.
Other reports that talk about the need for new approaches to editing.

It's possible that small or very large contributions are not well captured in the current publishing paradigms. Nanopublications.

For example, nano-publications are a possible way to reflect advances in a research process that may not merit a full publication but are still useful to share with the community. A challenge here is the stigma attached to publishing units that are considered too small.

Alternatively, a very large piece of research or work with many parts may be better suited to a GPF style publication.

Perhaps the concept of a 'paper' can be better reflected in the concept of a 'wrapper' or a collection of materials and resources. The purpose is to ensure that publications are representative of the work, effort, and results achieved in the research process.

=== What is a GPF ===

[[Discuss_what_we_will_consider_a_GPF#What_is_a_Geoscience_Paper_of_the_Future.3F | Include here our discussion of what is a GPF]]

=== The challenges of creating GPFs ===

The articles in this issue reflect the current best practice for generating a Geoscience Paper of the Future.

'''Figure discussions''': Do we want to regenerate exactly the same figure automatically? Figures in the paper may be clean versions of an image generated by software. To the extent possible, authors have included clear delineations of provenance. The goal is to ensure that readers can regenerate the figures using documented workflows, data, and code. An important note (Allen, Sandra) is that frequently figures are generated by code, scripts, etc. yet the actual figure is finalized with user..... Mimi is trying to say: is it really worth belaboring the point about how the prettified version of the figure is made? If it is: both of the visualization packages I've used (Matlab and SigmaPlot) have actual code in the background that specifies how to set up the prettification, and this code can be found, copied out, and rerun to generate the exact same figure with all of the prettification in the same place. SigmaPlot uses Visual Basic (I think) in its macros. If explicit code is an important point, this should be doable. But I'm not sure it's strictly necessary to specify exactly where all the prettifications are to get the gist across.

How much of one's experimental history should one include? (Ibrahim). The experimental process often ends up nowhere. Should we document all the failed experiments? Get one DOI for the results of the successful experiment? Another for failed trials?

'''''Documenting: Timing and Intermediate processes'''''
When should we document and what are the bounds on what we document?
For example, should we document and include data and workflows for 'failed' experiments? Or should we assign datasets DOIs before we know the results from using them?
The group thinks that good ideas/practices may include documenting and sharing data when you have a clear understanding of the outcomes worth reporting. For example, successful experiments should have clear, clean data documented and shared, whereas one strategy for 'failed' experiments could be to bundle the intermediate datasets under one DOI with a more general discussion of the process/methods.

=== Related work ===

[[Discuss_what_we_will_consider_a_GPF#New_Frameworks_to_Create_a_New_Generation_of_Scientific_Articles | Include here the related work we have discussed]]

== Papers to be included ==

Would it be worthwhile to group the papers into broader categories rather than giving specifics about every single paper?

For each submission, we describe:

* '''Authors and affiliations'''
* '''Keywords of research area'''
* '''Tentative title'''
* '''Short abstract'''
* '''Challenge'''
* '''Relationship to other publications''' (is the article based on a previously published article? is it new content? If previously published, please provide a pointer to the published article and specify what percentage of the work presented will be new)
* '''Pointer to the wiki page that documents the article'''
* '''Expected submission date'''

=== [David 2015] ===

* '''Authors and affiliations:''' [[Cedric David]]
* '''Keywords of research area:''' Hydrology, Rivers, Modeling, Testing, Reproducibility.
* '''Tentative title:''' Going beyond triple-checking, allowing for peace of mind in community model development.
* '''Short abstract:''' The development of computer models in the general field of geoscience is often made incrementally over many years. Endeavors that generally start on one single researcher's own machine evolve over time into software that is often much larger than was initially anticipated. Looking at years of building on their computer code, sometimes without much training in computer science, geoscience software developers can easily experience an overwhelming sense of incompetence when contemplating ways to further community usage of their software. How does one allow others to use their code? How can one foster survival of their tool? How could one possibly ensure the scientific integrity of ongoing developments, including those made by others? Common issues faced by geoscience developers include selecting a license, learning how to track and document past and ongoing changes, choosing a software repository, and allowing for community development. This paper provides a brief summary of experience with the first three of these steps of software growth by focusing on the almost decade-long code development of a river routing model. The core of this study, however, focuses on reproducing previously published experiments. This step is highly repetitive and can therefore benefit greatly from automation. Additionally, enabling automated software testing can arguably be considered the final step for sustainable software sharing, by allowing the main software developer to let go of a mental block concerning scientific integrity. Creating tools to automatically compare the results of an updated version of the software with those of previous studies can not only save the main developer's own time, it can also empower other researchers to check and justify that their potential additions have retained scientific integrity.
* '''Challenge:''' Reproducibility; Ensure that updates to an existing model are able to reproduce a series of simulations published previously.
* '''Relationship to other publications:''' This research is related to past and ongoing development of the Routing Application for Parallel computatIon of Discharge (RAPID). The primary focus of this paper is to allow automated reproducibility of at least the [http://dx.doi.org/10.1175/2011JHM1345.1 first RAPID publication]. The scientific subject of this GPF differs from the article(s) to be reproduced as its focus is on development of automatic testing methods. In that regard, the paper is expected to be 95% new.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Cedric_David | Page]]
* '''Expected submission date:'''
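
The automated comparison described in this entry can be sketched as a small regression check. This is a hedged illustration only: the function names, the tolerance, and the discharge values are hypothetical and are not taken from RAPID or its test suite.

```python
def max_relative_difference(new, reference):
    """Largest relative difference between two equally long result series."""
    if len(new) != len(reference):
        raise ValueError("result series must have the same length")
    worst = 0.0
    for n, r in zip(new, reference):
        scale = max(abs(r), 1e-12)  # avoid division by zero for zero reference values
        worst = max(worst, abs(n - r) / scale)
    return worst

def reproduces(new, reference, rel_tol=1e-6):
    """True if the updated results match the published ones within rel_tol."""
    return max_relative_difference(new, reference) <= rel_tol

# Hypothetical discharge values (m^3/s) from a published run and two updated runs.
reference_discharge = [12.5, 13.1, 40.2, 38.7]
updated_discharge = [12.5, 13.1, 40.2000001, 38.7]   # round-off level change
regressed_discharge = [12.5, 13.1, 41.0, 38.7]       # a genuine change in results

print(reproduces(updated_discharge, reference_discharge))
print(reproduces(regressed_discharge, reference_discharge))
```

Running such a check after every code change is what lets contributors demonstrate that an addition has not altered previously published results.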

=== [Demir 2015] ===

* '''Authors and affiliations:''' [[Ibrahim Demir]]
* '''Keywords of research area:''' hydrological network, optimization, network representation, database query
* '''Tentative title:''' Analysis and Optimization of Hydrological Network Database Representation Methods for Fast Access and Query in Web-based System
* '''Short abstract:''' Web-based systems allow users to delineate watersheds on interactive map environments using server-side processing. With increasing resolution of hydrological networks, optimized methods for storing network representations in databases, and efficient queries and actions on the river network structure, become critical. This paper presents a detailed analysis of widely used methods for representing hydrological networks in relational databases, and benchmarks common queries and modifications on the network structure using these methods. The analysis has been applied to the hydrological network of Iowa using a 90 m DEM and 600,000 network nodes. The application results indicate that the representation methods provide massive improvements in query times and storage of the network structure in the database. The suggested method allows watershed delineation tools to run on the client side with desktop-like performance.
* '''Challenge:''' Some of the internal steps to prepare data might require long computation time and different software environments.
* '''Relationship to other publications:''' The article is based on a new study
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ibrahim_Demir | Page]]
* '''Expected submission date:'''
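
To make the kind of query concrete, the sketch below stores a toy river network as an adjacency list in a relational table and retrieves all nodes upstream of an outlet with a recursive query. The schema, the node numbering, and the choice of an adjacency list are assumptions for illustration; the abstract does not say which representation methods the paper compares.

```python
import sqlite3

# In-memory database with a hypothetical adjacency-list table:
# each node stores the id of the node immediately downstream of it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (id INTEGER PRIMARY KEY, downstream_id INTEGER)")
conn.executemany(
    "INSERT INTO node VALUES (?, ?)",
    [(1, None), (2, 1), (3, 1), (4, 2), (5, 2), (6, 3)],  # node 1 is the outlet
)

def upstream_nodes(connection, outlet_id):
    """Return the outlet and every node draining to it, sorted by id."""
    rows = connection.execute(
        """
        WITH RECURSIVE upstream(id) AS (
            SELECT ?
            UNION
            SELECT node.id FROM node
            JOIN upstream ON node.downstream_id = upstream.id
        )
        SELECT id FROM upstream ORDER BY id
        """,
        (outlet_id,),
    ).fetchall()
    return [row[0] for row in rows]

print(upstream_nodes(conn, 2))  # nodes draining through node 2
print(upstream_nodes(conn, 1))  # the whole toy network
```

A recursive traversal like this touches one level of the network per step, which is the sort of cost that alternative representations aim to reduce on large networks.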

=== [Fulweiler 2015] ===

* '''Authors and affiliations:''' [[Wally Fulweiler]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Wally_Fulweiler | Page]]
* '''Expected submission date:'''

=== [Loh and Karlstrom 2015] ===

* '''Authors and affiliations:''' [[Lay Kuan Loh]] and [[Leif Karlstrom]]
* '''Keywords of research area:''' Spatial clustering, Eigenvector selection, Entropy Ranking, Cascades Volcanic Region, [http://geosphere.gsapubs.org/content/3/3/152.abstract Afar Depression], [http://astrogeology.usgs.gov/search/details/Mars/Research/Volcanic/TharsisVents/zip Tharsis province]
* '''Tentative title:''' Characterization of volcanic vent distributions using spectral clustering with eigenvector selection and entropy ranking
* '''Short abstract:''' Volcanic vents on the surface of Earth and other planets often appear in groups that exhibit spatial patterning. Such vent distributions reflect a complex interplay between time-evolving mechanical controls on the pathways of magma ascent, background tectonic stresses, and an unsteady supply of rising magma. With the ultimate aim of connecting surface vent distributions with the dynamics of magma ascent, we have developed a clustering method to quantify spatial patterns in vents. Clustering is typically used in exploratory data analysis to identify groups with similar behavior by partitioning a dataset into clusters that share similar attributes. Traditional clustering algorithms that work well on simple point-cloud type synthetic datasets generally do not scale well to the real-world data we are interested in, where there are poor boundaries between clusters and much ambiguity in cluster assignments. We instead use a spectral clustering algorithm with eigenvector selection based on entropy ranking, following [http://www.sciencedirect.com/science/article/pii/S0925231210001311 Zhao et al 2010], that outperforms traditional spectral clustering algorithms in choosing the right number of clusters for point data. We benchmark this algorithm on synthetic vent data with increasingly complex spatial distributions, to test the ability to accurately cluster vent data with variable spatial density, skewness, number of clusters, and proximity of clusters. We then apply our algorithm to several real-world datasets from the Cascades, Afar Depression and Mars.
* '''Challenge:''' Quantifying clustering. We plan to study how varying the statistical distribution, density, skewness, background noise, number of clusters, proximity of clusters, and combinations of any of these factors affects the performance of our algorithm. We test it against synthetic and real-world datasets.
* '''Relationship to other publications:''' New content, but one of the databases we are studying in the paper (Cascades Volcanic Range) is based on a different paper that we are preparing and plan to submit earlier.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Leif_Karlstrom | Page]]
* '''Expected submission date:''' June 2015
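
For readers unfamiliar with spectral clustering, the sketch below shows only its simplest form: splitting points into two groups by the sign of the Fiedler vector of a graph Laplacian. It does not implement the eigenvector selection or entropy ranking described in the abstract, and the point coordinates and the kernel width are hypothetical.

```python
import math

def gaussian_affinity(points, sigma=1.0):
    """Pairwise Gaussian similarity matrix with a zero diagonal."""
    n = len(points)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                w[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return w

def fiedler_vector(w, iterations=500):
    """Second-smallest eigenvector of L = D - W via shifted power iteration."""
    n = len(w)
    degree = [sum(row) for row in w]
    shift = 2.0 * max(degree)  # upper bound on the largest eigenvalue of L
    x = [float(i) for i in range(n)]  # deterministic starting vector
    for _ in range(iterations):
        mean = sum(x) / n
        x = [v - mean for v in x]  # project out the constant eigenvector
        # Multiply by (shift * I - L), whose dominant remaining eigenvector is the Fiedler vector.
        y = [(shift - degree[i]) * x[i] + sum(w[i][j] * x[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x

# Two hypothetical, well-separated groups of vent locations (arbitrary units).
vents = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
vector = fiedler_vector(gaussian_affinity(vents))
labels = [1 if v > 0 else 0 for v in vector]
print(labels)
```

Real vent data with poorly separated clusters is exactly where this two-way split stops being adequate, which is the case the selection and ranking steps in the abstract are meant to address.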

=== [Lee 2015] ===

* '''Authors and affiliations:''' [[Kyo Lee]], Maziyar Boustani and Chris Mattmann, Jet Propulsion Laboratory
* '''Keywords of research area:''' North American regional climate, regional climate model evaluation system, Open Climate Workbench
* '''Tentative title:''' Evaluation of simulated temperature, precipitation, cloud fraction and insolation over the conterminous United States using Regional Climate Model Evaluation System
* '''Short abstract:''' This study describes the detailed process of evaluating model fidelity in simulating four key climate variables (surface air temperature, precipitation, cloud fraction, and insolation) and their covariability over the conterminous United States. The Regional Climate Model Evaluation System (RCMES), a suite comprising a public database and an open-source software package, provides both observational datasets and data processors useful for evaluating any climate model. In this paper, we provide a clear and easy-to-follow RCMES workflow to replicate published papers evaluating North American Regional Climate Change Assessment Program (NARCCAP) regional climate model (RCM) hindcast simulations using observations from a variety of sources.
* '''Challenge:''' Sharing big data, better documenting source codes, encouraging the climate science community to use RCMES
* '''Relationship to other publications:''' [http://journals.ametsoc.org/doi/abs/10.1175/JCLI-D-12-00452.1 Kim et al. 2013], [http://link.springer.com/article/10.1007/s00382-014-2253-y Lee et al. 2014]
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kyo_Lee | Page]]
* '''Expected submission date:''' End of June 2015

=== [Miller 2015] ===

* '''Authors and affiliations:''' [[Kim Miller]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Kim_Miller | Page]]
* '''Expected submission date:'''

=== [Mills 2015] ===

* '''Authors and affiliations:''' [[Heath Mills]], University of Houston Clear Lake; Brandi Kiel Reese, Texas A&M Corpus Christi
* '''Keywords of research area:'''
* '''Tentative title:''' Iron and Sulfur Cycling Biogeography Using Advanced Geochemical and Molecular Analyses
* '''Short abstract:'''
* '''Challenge:''' My paper will develop and document a new pipeline to analyze a combined and robust genetic and geochemical data set. New, reproducible methods will be highlighted in this manuscript to help others better analyze similar data sets. There is a general lack of guidance within my field for such challenges. This manuscript will be unique and helpful from an analysis standpoint as well as for the science being presented.
* '''Relationship to other publications:''' Original Manuscript
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Heith_Mills | Page]]
* '''Expected submission date:'''

=== [Oh 2015] ===

* '''Authors and affiliations:''' [[Ji-Hyun Oh]] Jet Propulsion Laboratory/University of Southern California
* '''Keywords of research area:''' Tropical Meteorology, Madden-Julian Oscillation, Momentum budget analysis
* '''Tentative title:''' Tools for computing momentum budget for the westerly wind event associated with the Madden-Julian Oscillation
* '''Short abstract:''' As one of the most pronounced modes of tropical intraseasonal variability, the Madden-Julian Oscillation (MJO) prominently connects global weather and climate, and serves as one of the critical sources of predictability for extended-range forecasting. The zonal circulation of the MJO is characterized by low-level westerlies (easterlies) in and to the west (east) of the convective center. The direction of zonal winds in the upper troposphere is opposite to that in the lower troposphere. In addition to the convective signal as an identifier of MJO initiation, certain characteristics of the zonal circulation have been used as a standard metric for monitoring the state of the MJO and investigating features of the MJO and its impact on other atmospheric phenomena. This paper documents a tool for investigating the generation of low-level westerly winds during the MJO life cycle. The tool is used for momentum budget analysis to understand the respective contributions of the various processes involved in the wind evolution associated with the MJO, using European Centre for Medium-Range Weather Forecasts operational analyses during the Dynamics of the Madden–Julian Oscillation field campaign.

* '''Challenge:''' This paper will cover how to reproduce two key figures from the paper that I recently submitted to the Journal of the Atmospheric Sciences. This will include detailed procedures for generating the figures, such as how and where to download the data and how to transform the data into the input format required by my codes.
* '''Relationship to other publications:''' This article is related to part of the paper submitted to the Journal of the Atmospheric Sciences.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Ji_Hyun | Page]]
* '''Expected submission date:'''

=== [Pierce 2015] ===

* '''Authors and affiliations:''' [[Suzanne Pierce]]<sup>1,2</sup>, [[John Gentle]]<sup>1</sup>, [[Daniel Noll]]<sup>2,3</sup>
1 Texas Advanced Computing Center
2 Jackson School of Geosciences, The University of Texas at Austin
3 International Fellows, US Department of Energy

* '''Keywords of research area:''' Hydrogeology, Risk
* '''Tentative title:'''
* '''Short abstract:'''

* '''Challenge:''' Fully document a new software application and framework using example case study data and tutorials.
* '''Relationship to other publications:''' This article is new content
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Suzanne_Pierce | Page]]
* '''Expected submission date:'''

=== [Pope 2015] ===

* '''Authors and affiliations:''' [[Allen Pope]], National Snow and Ice Data Center, University of Colorado, Boulder
* '''Keywords of research area:''' Glaciology, Remote Sensing, Landsat 8, Polar Science
* '''Tentative title:''' Data and Code for Estimating and Evaluating Supraglacial Lake Depth With Landsat 8 and other Multispectral Sensors
* '''Short abstract:''' Supraglacial lakes play a significant role in glacial hydrological systems – for example, transporting water to the glacier bed in Greenland or leading to ice shelf fracture and disintegration in Antarctica. To investigate these important processes, multispectral remote sensing provides multiple methods for estimating supraglacial lake depth – either through single-band or band-ratio methods, both empirical and physically-based. Landsat 8 is the newest satellite in the Landsat series. With new bands, higher dynamic range, and higher radiometric resolution, the Operational Land Imager (OLI) aboard Landsat 8 has considerable potential for estimating lake depth.

This paper will document the data and code used in processing in situ reflectance spectra and depth measurements to investigate the ability of Landsat 8 to estimate lake depths using multiple methods, as well as quantify improvements over Landsat 7’s ETM+. A workflow, data, and code are provided to detail promising methods as applied to Landsat 8 OLI imagery of case study areas in Greenland, allowing calculation of regional volume estimates using 2013 and 2014 summer-season imagery. Altimetry from WorldView DEMs is used to validate lake depth estimates. The optimal method for supraglacial lake depth estimation with Landsat 8 is shown to be an average of the single-band depths from the red and panchromatic bands. With this best method, preliminary investigation of the seasonal behavior and elevation distribution of lakes is also discussed and documented.
* '''Challenge:''' Reproducibility, Dark Code
* '''Relationship to other publications:''' Documenting and explaining the data and code behind the analysis and results presented in another paper.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Allen_Pope | Page]]
* '''Expected submission date:''' Late June 2015

=== [Read and Winslow 2015] ===

* '''Authors and affiliations:''' [[Jordan Read]] and [[Luke Winslow]]
* '''Keywords of research area:'''
* '''Tentative title:'''
* '''Short abstract:'''
* '''Challenge:'''
* '''Relationship to other publications:''' (is the article based on a previously published article? is it new content?)
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Jordan_Read | Page]]
* '''Expected submission date:'''

=== [Tzeng 2015] ===

* '''Authors and affiliations:''' [[Mimi Tzeng]], Brian Dzwonkowski (DISL); Kyeong Park (TAMU Galveston)
* '''Keywords of research area:''' physical oceanography, remote sensing
* '''Tentative title:''' Fisheries Oceanography of Coastal Alabama (FOCAL): A Subset of a Time-Series of Hydrographic and Current Data from a Permanent Moored Station Outside Mobile Bay (27 Jan to 18 May 2011)
* '''Short abstract:''' The Fisheries Oceanography in Coastal Alabama (FOCAL) program began in 2006 as a way for scientists at Dauphin Island Sea Lab (DISL) to study the natural variability of Alabama's nearshore environment as it relates to fisheries production. FOCAL provided a long-term baseline data set that included time-series hydrographic data from a permanent offshore mooring (ADCP, vertical thermistor array, and CTDs at surface and bottom) and shipboard surveys (vertical CTD profiles and water sampling), as well as monthly ichthyoplankton and zooplankton (depth-discrete) sample collections at FOCAL sites. The subset of data presented here is from the mooring and includes a vertical array of thermistors, CTDs at surface and bottom, an ADCP at the bottom, and vertical CTD profiles collected at the mooring during maintenance surveys. The mooring is located at 30 05.410'N 88 12.694'W, 25 km southwest of the entrance to Mobile Bay. Temperature, salinity, density, depth, and current velocity data were collected at 20-minute intervals from 2006 to 2012. Other parameters, such as dissolved oxygen, are available for portions of the time series depending on which instruments were deployed at the time.
* '''Challenge:''' My paper will be about the processing of data in a larger dataset, from which peer-reviewed papers have been written. The processing I did was not specific to any particular paper. I can point to an example paper that used some of the data that I processed from this dataset; however, all of the figures in that paper are composites that also include data from elsewhere that I had nothing to do with (and it wouldn't be feasible to try to get hold of the other data within our timeframe).
* '''Relationship to other publications:''' A recent paper that used the part of the FOCAL data I'm documenting as the sample from the larger dataset: Dzwonkowski, Brian, Kyeong Park, Jungwoo Lee, Bret M. Webb, and Arnoldo Valle-Levinson. 2014. "Spatial variability of flow over a river-influenced inner shelf in coastal Alabama during spring." Continental Shelf Research 74:25-34.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Mimi_Tzeng | Page]]
* '''Expected submission date:'''

=== [Villamizar 2015] ===

* '''Authors and affiliations:''' [[Sandra Villamizar]], University of California, Merced
* '''Keywords of research area:''' river ecohydrology
* '''Tentative title:''' Producing long-term series of whole-stream metabolism using readily available data.
* '''Short abstract:''' Continuous water quality and river discharge data that are readily available through government websites may be used to produce valuable information about key processes within a river ecosystem. In this paper I describe in detail the steps for acquisition and processing of river flow, dissolved oxygen, temperature, and specific conductance data that, combined with atmospheric data and physical properties of the river reach of interest, allow for the production of a long-term series of whole-stream metabolism. This information is key to understanding the structure and function of an ecosystem such as the San Joaquin River in the Central Valley of California, which has been increasingly degraded during the last 60 years due to intensive human intervention but has been undergoing a restoration effort since 2010. The key advantage of this tool is that it uses readily available information to produce knowledge about a river ecosystem. This set of scripts, written in R, can be used immediately for any other river for which the key parameters (river flow, dissolved oxygen, temperature, and specific conductance) are available. The scripts can also be modified by users to fit their particular site conditions.

* '''Challenge:''' Document new software/applications. This set of scripts was written to meet the need to generate daily estimates of metabolic rates for long periods of time and at various sites within the San Joaquin River.
* '''Relationship to other publications:''' This will be a new publication
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Sandra_Villamizar | Page]]
* '''Expected submission date:''' To be defined

=== [Yu and Bhatt 2015] ===

* '''Authors and affiliations:''' [[Xuan Yu]], Department of Geological Sciences, University of Delaware. Gopal Bhatt, Department of Civil & Environmental Engineering, Pennsylvania State University.
* '''Keywords of research area:''' coupled processes, integrated hydrologic modeling, PIHM, surface flow, subsurface flow, open science
* '''Tentative title:''' Learning integrated modeling of surface and subsurface flow from scratch
* '''Short abstract:''' Integrated modeling of surface and subsurface flow has been of great interest for understanding not only the intimate interconnectedness of hydrological processes, but also land-surface energy balance, biogeochemical and ecological processes, and landscape evolution. Although a growing number of complex hydrologic models have been used for resolving environmental processes, hypothesis testing, and hydrologic predictions for effective watershed management, very limited resources on model implementation have been made accessible to a large group of model users. Users have to invest a significant amount of time and effort to reproduce and understand the workflow of a hydrologic simulation in a modeling paper. To provide a challenging and stimulating introduction to integrated modeling of surface and subsurface flow, in this paper we revisit the development of the Penn State Integrated Hydrologic Model (PIHM) by reproducing a numerical benchmarking example and a real-world catchment-scale application. Specifically, we document PIHM and its modeling workflow to enable a basic understanding of simulating coupled surface and subsurface flow processes. We provide the model and data to highlight the reciprocal roles between the two. In addition, we incorporate user experience as a third dimension in the modeling workflow to enable deeper communication between model developers and users. The workflow has important implications for smoothing and accelerating open scientific collaborations in geosciences research.
* '''Challenge:''' Reproduce published simulations with the latest version of an existing model. Benchmark the modeling application against a numerical experiment and field data.
* '''Relationship to other publications:''' The article is based on a previously published article.
* '''Pointer to the wiki page that documents the article:''' [[Document_GPF_activities_by_Xuan_Yu | Page]]
* '''Expected submission date:''' End of June 2015

== Special Issue Editors ==

* Co-editor: Chris Duffy and/or Scott Peckham
* Co-editor: Cedric David
* Co-editor: possibly Karan Venayagamoorthy

The editors will only accept submissions that follow the [[Develop_proposal_for_special_issue#Special_Issue_Review_Criteria | special issue review criteria]].

The editors will select a set of reviewers to handle the submissions. Reviewers will include computer scientists, library scientists, and geoscientists.

== Special Issue Review Criteria ==

The reviewers will be asked to provide feedback on the papers according to the following criteria. Note that some papers will have good reasons for limiting the information (e.g. the data is from third parties and not openly available, etc), and in that case they would document those reasons.

* Documentation of the datasets: descriptions of datasets, unique identifiers, repositories.
* Documentation of software: description of all software used (including pre-processing of data, visualization steps, etc), unique identifiers, repositories.
* Documentation of the provenance of results: provenance for each figure or result, such as the workflow or the provenance record.

== Tentative Timeline ==

* Journal committed to special issue: April 15, 2015
* Submissions due to editors: June 30, 2015
* Reviews due: Sept 15, 2015
* Decisions out to authors: Sept 30, 2015
* Revisions due: October 31, 2015
* Final versions due: November 15, 2015
* Issue published: December 31, 2015



{{#set:
Owner=Chris_Duffy|
Participants=Yolanda_Gil|
Participants=Scott_Peckham|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Kyo_Lee|
Participants=Kim_Miller|
Participants=Heath_Mills|
Participants=Ji-Hyun_Oh|
Participants=Suzanne_Pierce|
Participants=Allen_Pope|
Participants=Jordan_Read|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Progress=20|
StartDate=2015-03-10|
TargetDate=2015-03-16|
Type=Low}}

Make data accessible

2015-03-20T17:31:19Z

Allen: /* What To Do */

[[Category:Task]]

== What This Task Involves ==

The training session and training materials indicate how to:

# Get a permanent unique identifier for your dataset in a public repository
# Specify general (creator, license, version) and domain metadata (categories, tags)
# Upload or specify a pointer to the dataset

== Training Materials ==

This training session will be held on February 20, 2015:

* '''[https://www.dropbox.com/s/lekb1yl1r7ruzs8/GPF-MakingDataAccessible-20Feb2015.pdf?dl=0 Presentation]'''

=== Suggested Readings ===

* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003542 "Ten Simple Rules for the Care and Feeding of Scientific Data."] Alyssa Goodman, Alberto Pepe, Alexander W. Blocker, Christine L. Borgman, Kyle Cranmer, Merce Crosas, Rosanne Di Stefano, Yolanda Gil, Paul Groth, Margaret Hedstrom, David W. Hogg, Vinay Kashyap, Ashish Mahabal, Aneta Siemiginowska, Aleksandra Slavkovic. PLOS Computational Biology, Published: April 24, 2014. DOI: 10.1371/journal.pcbi.1003542
** ''A brief and practical introduction for how to publish and share data''

* [https://peerj.com/preprints/697/ "Achieving human and machine accessibility of cited data in scholarly publications."] Joan Starr, Eleni Castro, Mercè Crosas, Michel Dumontier, Robert R. Downs, Ruth Duerr, Laurel Haak, Melissa Haendel, Ivan Herman, Simon Hodson, Joe Hourclé, John Ernest Kratz, Jennifer Lin, Lars Holm Nielsen, Amy Nurnberger, Stefan Pröll, Andreas Rauber, Simone Sacchi, Arthur P. Smith, Michael Taylor, Tim Clark. PeerJ Preprint, Version of 11 February 2015.
** ''A good description of the different kinds of permanent unique identifiers''

* [http://www.dcc.ac.uk/resources/how-guides/cite-datasets "How to Cite Datasets and Link to Publications."] Alex Ball and Monica Duke. DCC How-to Guides. Edinburgh: Digital Curation Centre. Version of 20 June 2012.
** ''A good overview of the elements in a data citation and how to handle granularity and versioning''

* [http://commons.esipfed.org/node/308 "Data Citation Guidelines for Data Providers and Archives."] ESIP Technical Report. Version of 31 December 2011. doi:10.7269/P34F1NNJ
** ''Provides many examples of alternative formats for citing a dataset''

== What To Do ==

We described many options in the training. Here is a sketch of the most common approach:
# Create a public entry for your dataset with a permanent unique identifier.
## Select a repository
##* Option 1: Find a repository that your community uses
##* Option 2: Go to figshare.com or zenodo.org (supported by CERN) or a similar free service and create an account. Figshare has a 250MB per-file limit and 1GB of private storage, but unlimited open storage. Zenodo allows files up to 2GB (potentially larger if you talk to the site managers) and currently has no total storage limit.
## Create an entry for each of your datasets
### Specify the metadata
### Include license information: choose from [http://www.creativecommons.org/licenses Creative Commons], for example CC-BY or CC0.
### Upload or point to the data
## The repository should give you a unique identifier (a DOI)
# Create a data citation for each of your datasets
#* Include: authors, date of publication, dataset name, repository name, permanent unique identifier, timestamp of retrieval.
#* Specify the data citation in the repository entry for each dataset, so others can use it
# Include the data citations in the GPF
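
The citation elements listed in the steps above can be sketched as a small data structure. This is only an illustrative sketch: the class name, field names, output layout, and example values (including the placeholder DOI) are assumptions, not a format prescribed by the training materials or by any repository.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DataCitation:
    """Elements of a data citation, following the list in the steps above."""
    authors: List[str]
    year: int                         # date of publication (year only here)
    title: str                        # dataset name
    repository: str                   # repository name
    identifier: str                   # permanent unique identifier, e.g. a DOI
    retrieved: Optional[date] = None  # timestamp of retrieval

    def format(self) -> str:
        """Return the citation as a single line of text."""
        names = "; ".join(self.authors)
        text = (f"{names} ({self.year}). {self.title}. "
                f"{self.repository}. {self.identifier}.")
        if self.retrieved is not None:
            text += f" Retrieved {self.retrieved.isoformat()}."
        return text


# Hypothetical example values; the DOI is a placeholder, not a real dataset.
citation = DataCitation(
    authors=["Doe, J.", "Roe, R."],
    year=2015,
    title="Example river discharge time series",
    repository="Zenodo",
    identifier="doi:10.0000/example.12345",
    retrieved=date(2015, 3, 20),
)
print(citation.format())
```

The resulting string can be pasted into the repository entry for the dataset and into the reference list of the GPF, so that both use the same wording.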

Some interesting cases that you may run into:
* I have several related datasets in several files (e.g., each file has data for a time period)
** Create a DOI for each file and a DOI for the whole set. If there are too many files (dozens or hundreds), it may be best to create a DOI only for the whole set.
* My data is in a public repository, but it is not my data
** Create a DOI for the slice of data that you use. Describe the data by specifying the query that you did to the repository and put a pointer to the repository, so others can also retrieve it.
* My data is from a database
** Ask for permission to publish the data that you extracted, and mention that you will give appropriate credit. Get an understanding of the appropriate license to use. Put the data in a file and publish it.
* Some of the data that I use is from a colleague
** Encourage them to make the data public in Figshare or any public repository, and offer to help. Explain to them how the license works. If they do not want to make the data public, that is ok. In that case, you should create an entry that does not have the data but at least describes it with all the metadata, which would include information about your colleague as the data creator and other information about how to get the data from them.
* My data comes from many sources
** Credit each source, create repository entries as needed
** An option is to create in the paper a table with “microattributions” that summarize each data source
* My data has many versions (e.g., sensors that collect more data over time)
** Create an entry for either each slice or each snapshot
* My datasets are very large
** Leave the datasets in a repository that can contain data of that size, or put the data in a publicly accessible URL. Then get a PURL at [http://www.purl.org], and create an entry in Figshare or similar pointing to that PURL.
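
For the cases above where the data itself cannot be uploaded (a slice of a public repository, data from a colleague, or very large datasets), the repository entry can still carry a metadata-only description. The following is a rough sketch of such a record; the function name, the field names, the example URL, and the query values are all hypothetical and would need to be adapted to the repository you use.

```python
import json


def describe_slice(creator, license_name, source_repository, query, retrieved_on):
    """Build a metadata-only record for data that is described but not uploaded."""
    return {
        "creator": creator,                      # who produced the original data
        "license": license_name,                 # e.g. "CC-BY" or "CC0"
        "source_repository": source_repository,  # pointer to where the data lives
        "query": query,                          # the query used to extract the slice
        "retrieved_on": retrieved_on,            # date the slice was retrieved
    }


# Hypothetical example values for illustration only.
record = describe_slice(
    creator="Example Data Provider",
    license_name="CC-BY",
    source_repository="https://example.org/repository",
    query={"station": "EXAMPLE-01", "start": "2011-01-27", "end": "2011-05-18"},
    retrieved_on="2015-03-20",
)
print(json.dumps(record, indent=2, sort_keys=True))
```

Recording the query alongside the pointer lets others retrieve the same slice from the source repository.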



{{#set:
Expertise=Open_science|
Owner=Yolanda_Gil|
Participants=Erin_Robinson|
Progress=60|
StartDate=2015-02-20|
TargetDate=2015-03-06|
Type=Low}}

Develop proposal for special issue

2015-03-20T16:58:52Z

Allen: /* [Pope 2015] */

Develop proposal for special issue

2015-03-20T16:58:21Z

Allen: /* [Pope 2015] */

Write about making data accessible

2015-03-13T21:14:02Z

Allen: Set PropertyValue: Owner = Allen Pope

[[Category:Task]]


{{#set:
Owner=Allen_Pope|
Participants=Mimi_Tzeng|
SubTask=Write_about_using_data_from_public_repositories|
SubTask=Write_about_using_data_from_colleagues|
SubTask=Write_about_data_preparation|
SubTask=Write_about_large_datasets|
Type=Medium}}
'''Comments and General Discussion, Observations in the Group:'''

[[Data notation and DOIs]] Conversation at F2F meeting: In the conversation during the face-to-face meeting we looked at examples of each author's wiki posts. Ibrahim had posted data for his article and included the QR code with the entry. We discussed whether or not a QR code was appropriate to include in the actual journal articles to improve accessibility. The group determined that a hyperlink to the DOIs is an easy and accessible way to access data in the papers, whereas a QR code is more useful in the context of presentations (e.g. technical posters).
So, the group felt that including the data link via DOIs is the best practice in journal articles.

Write about making data accessible

2015-03-13T21:13:09Z

Allen: Added PropertyValue: Participants = Mimi Tzeng

[[Category:Task]]


{{#set:
Owner=Allen_Pope,_Mimi_Tzeng|
Participants=Mimi_Tzeng|
SubTask=Write_about_using_data_from_public_repositories|
SubTask=Write_about_using_data_from_colleagues|
SubTask=Write_about_data_preparation|
SubTask=Write_about_large_datasets|
Type=Medium}}
'''Comments and General Discussion, Observations in the Group:'''

[[Data notation and DOIs]] Conversation at F2F meeting: In the conversation during the face-to-face meeting we looked at examples of each author's wiki posts. Ibrahim had posted data for his article and included the QR code with the entry. We discussed whether or not a QR code was appropriate to include in the actual journal articles to improve accessibility. The group determined that a hyperlink to the DOIs is an easy and accessible way to access data in the papers, whereas a QR code is more useful in the context of presentations (e.g. technical posters).
So, the group felt that including the data link via DOIs is the best practice in journal articles.

Write about making data accessible

2015-03-13T21:12:11Z

Allen: Set PropertyValue: Owner = Allen Pope, Mimi Tzeng

[[Category:Task]]


{{#set:
Owner=Allen_Pope,_Mimi_Tzeng|
SubTask=Write_about_using_data_from_public_repositories|
SubTask=Write_about_using_data_from_colleagues|
SubTask=Write_about_data_preparation|
SubTask=Write_about_large_datasets|
Type=Medium}}
'''Comments and General Discussion, Observations in the Group:'''

[[Data notation and DOIs]] Conversation at F2F meeting: In the conversation during the face-to-face meeting we looked at examples of each author's wiki posts. Ibrahim had posted data for his article and included the QR code with the entry. We discussed whether or not a QR code was appropriate to include in the actual journal articles to improve accessibility. The group determined that a hyperlink to the DOIs is an easy and accessible way to access data in the papers, whereas a QR code is more useful in the context of presentations (e.g. technical posters).
So, the group felt that including the data link via DOIs is the best practice in journal articles.

Develop proposal for special issue

2015-03-13T16:58:02Z

Allen: /* [Pope 2015] */

Develop proposal for special issue

2015-03-13T16:57:18Z

Allen: /* [Pope 2015] */

Discuss what we will consider a GPF

2015-03-13T16:36:05Z

Allen: /* What is a Geoscience Paper of the Future? */

[[Category:Task]]

= New Frameworks to Create a New Generation of Scientific Articles =

Several frameworks have been developed to document scientific articles so that they are more useful to researchers than just a simple PDF. These include iPython Notebook, Weaver (for R), etc.

Elsevier has invested in some initiatives in this direction. They carried out an [http://www.executablepapers.com/about-challenge.html Executable Papers Challenge]. They have a new type of paper called a ''[http://www.elsevier.com/about/content-innovation/original-software-publications#overview software paper]''.

== The Case of the Tuberculosis Drugome ==

This is a case where a workflow system was used to make data and software explicit and published as linked open data in RDF (i.e., accessible Web objects in the Semantic Web). The data were assigned DOIs, as was the workflow.

* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000976 the original "drugome" paper]
* [http://www.wings-workflows.org/drugome/ the web site that describes how that paper was reproduced]
* [http://www.wings-workflows.org/drugome/index.php/Main_Page#Augmenting_the_original_article detailed documentation of the drugome method]
* [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0080278 a publication that reports on that work].

= Looking at the Future =

== The Vision ==
In the future, scientists will use radically new tools to generate papers. As scientists do their work, those tools will document the work and all the associated digital objects (data, software, etc.) so that when it comes time to publish a paper everything will be easily documented and included. Today, several research tools exist for working in this way, but they are not routinely used and they do not always fit the scientist's research workflow.

In the future, publishers will accept submissions that do not just contain PDF but also data, software, and other digital objects relevant to the research. Today, many journals accept datasets together with papers, some journals accept software and software papers, but no journal includes the full details of the data, software, workflow, and visualizations of a paper.

In the future, readers of papers will be able to interact with the paper document, modify its figures to explore the data, reproduce the results, run the method with new data. Today, readers simply get a static paper, and even if the data is available they have to download it and analyze it themselves.

In the future, data producers and software developers will get credit for the work that they do because all publications that build on their work will acknowledge their work through citations. Today, there is limited credit and reward for those that create data and software.

== What is a Geoscience Paper of the Future? ==
A paper is one thing (think of a larger wrapper, with a conceptual framework), as opposed to the smaller pieces (code, datasets, individual figures) that are updated along the way (e.g., they get associated with your ORCID).
We do not want to get into the stigma of least publishable units. We recognize that there are different types of publications (letter, full paper, etc.) for different-sized contributions, too.

A GPF paper includes:
* data: documented, described in a public repository, has a license specified and is open if possible, and cited with DOIs
* software: documented, in a public repository, has a license specified and is open source if possible, and cited with DOIs
* provenance: explicitly documented as a workflow sketch, a formal workflow, or a provenance record (in PROV or similar standard), possibly in a shared repository and given a DOI
* figures/visualizations: generated by explicit code (if possible) and the result of a workflow or provenance record. {Figures may be a "prettified" version of the published version.}
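
The four components listed above can be tracked with a simple checklist while a paper is being prepared. This is a minimal sketch under stated assumptions: the class and field names are invented for illustration and are not part of any GPF tooling.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class GPFChecklist:
    """One flag per GPF component listed above (names are illustrative)."""
    data_documented_and_cited: bool = False
    software_documented_and_cited: bool = False
    provenance_documented: bool = False
    figures_generated_by_code: bool = False

    def missing(self) -> List[str]:
        """Return the names of the components that are not yet complete."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


# Hypothetical state of a paper in progress.
checklist = GPFChecklist(
    data_documented_and_cited=True,
    software_documented_and_cited=True,
)
print(checklist.missing())
```

An empty list from `missing()` would indicate that all four components have been addressed.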



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Yolanda_Gil|
Participants=Kyo_Lee|
Participants=Chris_Duffy|
Participants=Scott_Peckham|
Participants=Erin_Robinson|
Participants=Chris_Mattmann|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Heath_Mills|
Participants=Suzanne_Pierce|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Jordan_Read|
Participants=Kim_Miller|
Participants=Ji-Hyun_Oh|
Participants=Allen_Pope|
Progress=50|
StartDate=2015-02-06|
TargetDate=2015-05-29|
Type=Low}}

Discuss what we will consider a GPF

2015-03-13T16:22:12Z

Allen:

[[Category:Task]]

= New Frameworks to Create a New Generation of Scientific Articles =

Several frameworks have been developed to document scientific articles so that they are more useful to researchers than just a simple PDF. These include iPython Notebook, Weaver (for R), etc.

Elsevier has invested in some initiatives in this direction. They carried out an [http://www.executablepapers.com/about-challenge.html Executable Papers Challenge]. They have a new type of paper called a ''[http://www.elsevier.com/about/content-innovation/original-software-publications#overview software paper]''.

== The Case of the Tuberculosis Drugome ==

This is a case where a workflow system was used to make data and software explicit and published as linked open data in RDF (i.e., accessible Web objects in the Semantic Web). The data were assigned DOIs, as was the workflow.

* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000976 the original "drugome" paper]
* [http://www.wings-workflows.org/drugome/ the web site that describes how that paper was reproduced]
* [http://www.wings-workflows.org/drugome/index.php/Main_Page#Augmenting_the_original_article detailed documentation of the drugome method]
* [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0080278 a publication that reports on that work].

= Looking at the Future =

== The Vision ==
In the future, scientists will use radically new tools to generate papers. As scientists do their work, those tools will document the work and all the associated digital objects (data, software, etc.) so that when it comes time to publish a paper everything will be easily documented and included. Today, several research tools exist for working in this way, but they are not routinely used and they do not always fit the scientist's research workflow.

In the future, publishers will accept submissions that do not just contain PDF but also data, software, and other digital objects relevant to the research. Today, many journals accept datasets together with papers, some journals accept software and software papers, but no journal includes the full details of the data, software, workflow, and visualizations of a paper.

In the future, readers of papers will be able to interact with the paper document, modify its figures to explore the data, reproduce the results, run the method with new data. Today, readers simply get a static paper, and even if the data is available they have to download it and analyze it themselves.

In the future, data producers and software developers will get credit for the work that they do because all publications that build on their work will acknowledge their work through citations. Today, there is limited credit and reward for those that create data and software.

== What is a Geoscience Paper of the Future? ==

A GPF paper includes:
* data: documented, described in a public repository, has a license specified and is open if possible, and cited with DOIs
* software: documented, in a public repository, has a license specified and is open source if possible, and cited with DOIs
* provenance: explicitly documented as a workflow sketch, a formal workflow, or a provenance record (in PROV or similar standard), possibly in a shared repository and given a DOI
* figures/visualizations: generated by explicit code (if possible) and the result of a workflow or provenance record. {Figures may be a "prettified" version of the published version.}



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Yolanda_Gil|
Participants=Kyo_Lee|
Participants=Chris_Duffy|
Participants=Scott_Peckham|
Participants=Erin_Robinson|
Participants=Chris_Mattmann|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Heath_Mills|
Participants=Suzanne_Pierce|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Jordan_Read|
Participants=Kim_Miller|
Participants=Ji-Hyun_Oh|
Participants=Allen_Pope|
Progress=50|
StartDate=2015-02-06|
TargetDate=2015-05-29|
Type=Low}}

Discuss what we will consider a GPF

2015-03-13T16:20:08Z

Allen:

[[Category:Task]]

= New Frameworks to Create a New Generation of Scientific Articles =

Several frameworks have been developed to document scientific articles so that they are more useful to researchers than just a simple PDF. These include iPython Notebook, Weaver (for R), etc.

Elsevier has invested in some initiatives in this direction. They carried out an [http://www.executablepapers.com/about-challenge.html Executable Papers Challenge]. They have a new type of paper called a ''[http://www.elsevier.com/about/content-innovation/original-software-publications#overview software paper]''.

== The Case of the Tuberculosis Drugome ==

This is a case where a workflow system was used to make data and software explicit and published as linked open data in RDF (i.e., accessible Web objects in the Semantic Web). The data were assigned DOIs, as was the workflow.

* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000976 the original "drugome" paper]
* [http://www.wings-workflows.org/drugome/ the web site that describes how that paper was reproduced]
* [http://www.wings-workflows.org/drugome/index.php/Main_Page#Augmenting_the_original_article detailed documentation of the drugome method]
* [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0080278 a publication that reports on that work].

= Looking at the Future =

== The Vision ==
In the future, scientists will use radically new tools to generate papers. As scientists do their work, those tools will document the work and all the associated digital objects (data, software, etc.) so that when it comes time to publish a paper everything will be easily documented and included. Today, several research tools exist for working in this way, but they are not routinely used and they do not always fit the scientist's research workflow.

In the future, publishers will accept submissions that do not just contain PDF but also data, software, and other digital objects relevant to the research. Today, many journals accept datasets together with papers, some journals accept software and software papers, but no journal includes the full details of the data, software, workflow, and visualizations of a paper.

In the future, readers of papers will be able to interact with the paper document, modify its figures to explore the data, reproduce the results, run the method with new data. Today, readers simply get a static paper, and even if the data is available they have to download it and analyze it themselves.

In the future, data producers and software developers will get credit for the work that they do because all publications that build on their work will acknowledge their work through citations. Today, there is limited credit and reward for those that create data and software.

== What is a Geoscience Paper of the Future? ==

A GPF includes:
* data: documented, described in a public repository, released under a specified license (open if possible), and cited with DOIs
* software: documented, available in a public repository, released under a specified license (open source if possible), and cited with DOIs
* provenance: explicitly documented as a workflow sketch, a formal workflow, or a provenance record (in PROV or a similar standard), possibly in a shared repository and given a DOI
* figures/visualizations: generated by explicit code (if possible) and resulting from a workflow or provenance record. (Figures may be a "prettified" version of the published version.)



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Yolanda_Gil|
Participants=Kyo_Lee|
Participants=Chris_Duffy|
Participants=Scott_Peckham|
Participants=Erin_Robinson|
Participants=Chris_Mattmann|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Heath_Mills|
Participants=Suzanne_Pierce|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Jordan_Read|
Participants=Kim_Miller|
Participants=Ji-Hyun_Oh|
Participants=Allen_Pope|
Progress=50|
StartDate=2015-02-06|
TargetDate=2015-05-29|
Type=Low}}

Discuss what we will consider a GPF

2015-03-13T16:17:45Z

Allen:

[[Category:Task]]

= New Frameworks to Create a New Generation of Scientific Articles =

Several frameworks have been developed to document scientific articles so that they are more useful to researchers than a simple PDF. Examples include IPython Notebook and Weaver (for R).

Elsevier has invested in some initiatives in this direction. They carried out an [http://www.executablepapers.com/about-challenge.html Executable Papers Challenge]. They have a new type of paper called a ''[http://www.elsevier.com/about/content-innovation/original-software-publications#overview software paper]''.

== The Case of the Tuberculosis Drugome ==

In this case, a workflow system was used to make the data and software explicit and to publish them as linked open data in RDF (i.e., as accessible Web objects in the Semantic Web). The data were assigned DOIs, as was the workflow.

* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000976 the original "drugome" paper]
* [http://www.wings-workflows.org/drugome/ the web site that describes how that paper was reproduced]
* [http://www.wings-workflows.org/drugome/index.php/Main_Page#Augmenting_the_original_article detailed documentation of the drugome method]
* [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0080278 a publication that reports on that work]

= Looking at the Future =

== The Vision ==
In the future, scientists will use radically new tools to generate papers. As scientists do their work, those tools will document the work and all the associated digital objects (data, software, etc.), so that when it comes time to publish a paper everything will be easily documented and included. Today, several research tools exist for working in this way, but they are not routinely used and do not always fit the scientist's research workflow.

In the future, publishers will accept submissions that contain not just a PDF but also data, software, and other digital objects relevant to the research. Today, many journals accept datasets together with papers and some journals accept software and software papers, but no journal includes the full details of the data, software, workflow, and visualizations of a paper.

In the future, readers of papers will be able to interact with the paper document, modify its figures to explore the data, reproduce the results, and run the method with new data. Today, readers simply get a static paper, and even if the data are available they have to download and analyze them themselves.

In the future, data producers and software developers will get credit for their work because all publications that build on it will acknowledge it through citations. Today, there is limited credit and reward for those who create data and software.

== What is a Geoscience Paper of the Future? ==

A GPF includes:
* data: documented, described in a public repository, released under a specified license (open if possible), and cited with DOIs
* software: documented, available in a public repository, released under a specified license (open source if possible), and cited with DOIs
* provenance: explicitly documented as a workflow sketch, a formal workflow, or a provenance record (in PROV or a similar standard), possibly in a shared repository and given a DOI
* figures/visualizations: generated by explicit code (if possible) and resulting from a workflow or provenance record.



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Yolanda_Gil|
Participants=Kyo_Lee|
Participants=Chris_Duffy|
Participants=Scott_Peckham|
Participants=Erin_Robinson|
Participants=Chris_Mattmann|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Heath_Mills|
Participants=Suzanne_Pierce|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Jordan_Read|
Participants=Kim_Miller|
Participants=Ji-Hyun_Oh|
Participants=Allen_Pope|
Progress=50|
StartDate=2015-02-06|
TargetDate=2015-05-29|
Type=Low}}

Discuss what we will consider a GPF

2015-03-13T16:15:33Z

Allen:

[[Category:Task]]

= New Frameworks to Create a New Generation of Scientific Articles =

Several frameworks have been developed to document scientific articles so that they are more useful to researchers than a simple PDF. Examples include IPython Notebook and Weaver (for R).

Elsevier has invested in some initiatives in this direction. They carried out an [http://www.executablepapers.com/about-challenge.html Executable Papers Challenge]. They have a new type of paper called a ''[http://www.elsevier.com/about/content-innovation/original-software-publications#overview software paper]''.

== The Case of the Tuberculosis Drugome ==

In this case, a workflow system was used to make the data and software explicit and to publish them as linked open data in RDF (i.e., as accessible Web objects in the Semantic Web). The data were assigned DOIs, as was the workflow.

* [http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1000976 the original "drugome" paper]
* [http://www.wings-workflows.org/drugome/ the web site that describes how that paper was reproduced]
* [http://www.wings-workflows.org/drugome/index.php/Main_Page#Augmenting_the_original_article detailed documentation of the drugome method]
* [http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0080278 a publication that reports on that work]

= Looking at the Future =

== The Vision ==
In the future, scientists will use radically new tools to generate papers. As scientists do their work, those tools will document the work and all the associated digital objects (data, software, etc.), so that when it comes time to publish a paper everything will be easily documented and included. Today, several research tools exist for working in this way, but they are not routinely used and do not always fit the scientist's research workflow.

In the future, publishers will accept submissions that contain not just a PDF but also data, software, and other digital objects relevant to the research. Today, many journals accept datasets together with papers and some journals accept software and software papers, but no journal includes the full details of the data, software, workflow, and visualizations of a paper.

In the future, readers of papers will be able to interact with the paper document, modify its figures to explore the data, reproduce the results, and run the method with new data. Today, readers simply get a static paper, and even if the data are available they have to download and analyze them themselves.

In the future, data producers and software developers will get credit for their work because all publications that build on it will acknowledge it through citations. Today, there is limited credit and reward for those who create data and software.

== What is a Geoscience Paper of the Future? ==

A GPF includes:
* data: documented, described in a public repository, released under a specified license (open if possible), and cited with DOIs
* software: documented, available in a public repository, released under a specified license (open source if possible), and cited with DOIs
* provenance: explicitly documented as a workflow sketch, a formal workflow, or a provenance record (in PROV or a similar standard), possibly in a shared repository and given a DOI
* figures/visualizations: authors are encouraged to generate them with explicit code, as the result of a workflow or provenance record. (Consider sharing interactive versions, if possible.)



{{#set:
Expertise=Open_science|
Expertise=Geosciences|
Owner=Yolanda_Gil|
Participants=Kyo_Lee|
Participants=Chris_Duffy|
Participants=Scott_Peckham|
Participants=Erin_Robinson|
Participants=Chris_Mattmann|
Participants=Cedric_David|
Participants=Ibrahim_Demir|
Participants=Heath_Mills|
Participants=Suzanne_Pierce|
Participants=Mimi_Tzeng|
Participants=Sandra_Villamizar|
Participants=Xuan_Yu|
Participants=Wally_Fulweiler|
Participants=Leif_Karlstrom|
Participants=Jordan_Read|
Participants=Kim_Miller|
Participants=Ji-Hyun_Oh|
Participants=Allen_Pope|
Progress=50|
StartDate=2015-02-06|
TargetDate=2015-05-29|
Type=Low}}