MIDAS Seminar

MIDAS Reproducibility Challenge Showcase: Sharon Glotzer – An Open Software Approach for Reproducible Research for Materials Design

SHARON GLOTZER
Professor – Chemical Engineering and Materials Science and Engineering, University of Michigan

Dr. Joshua Anderson (Lead Research Specialist)*, Allen LaCour (Ph.D. Candidate)*, Dr. Tim Moore (Postdoctoral Researcher)*, Kelly Wang (Ph.D. Candidate)**

*Chemical Engineering, Biointerfaces Institute, University of Michigan
**Macromolecular Science and Engineering, Biointerfaces Institute, University of Michigan

AN OPEN SOFTWARE APPROACH FOR REPRODUCIBLE RESEARCH FOR MATERIALS DESIGN

Abstract: Self-assembly is a process in which a disordered system of components forms an organized structure or pattern as a consequence of interactions among the components, without external direction, and it is one of the most promising methods for creating next-generation materials. As experimentalists have become able to create increasingly complex building blocks, simulations are now an essential first step in narrowing down the enormous space of possible design parameters, providing needed explanation and guidance to experiments. Led by Prof. Sharon C. Glotzer, our research efforts, which are based on fundamental statistical thermodynamic principles, are at the forefront of this exciting new area of materials by design. To carry out our research, we develop open-source software and use it to execute simulations using millions of hours of CPU and GPU time on Great Lakes and on national HPC resources every year, storing and accessing petabytes of simulation, analysis and visualization data. As a result of our professional software engineering practices and the powerful features they provide [1], tools from our software stack are used by thousands of researchers worldwide.

Our research questions are domain-specific, but our software is applicable across many fields of research. We strive to ensure that our research is reproducible by making both the generation and analysis of our data transparent, reproducible, usable by others and extensible (TRUE) [2]. Our tools are written as Python packages that provide a complete API that users drive from scripts that can be shared and re-run by others. The generation and analysis of simulation data are complex and occur over many coupled, yet logistically independent steps. Our signac data and workflow management toolset allows users to define these steps and reproducibly execute them on data spaces across heterogeneous compute environments. Our workhorse particle simulation toolkit, HOOMD-blue, performs molecular dynamics and hard particle Monte Carlo simulations. Our freud library provides users the ability to analyze simulation trajectories and calculate advanced metrics. We work with hundreds of external users to develop extensions that become permanent parts of our open-source codes, increasing the reproducibility of work done by the community. We provide containers tuned for performance on Great Lakes and national supercomputing centers so that users can deploy our full software stack reproducibly on a variety of systems. Combining our tools with others in the scientific Python ecosystem, researchers can implement a complete research project, from the selection of the data space through simulations, analysis and data visualization, in a handful of Python scripts managed in source control and fully archived for reproducibility.
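
To make the idea of defining workflow steps over a data space concrete, here is a minimal sketch using only the Python standard library. It is not the signac, HOOMD-blue, or freud API. Every function name (state_point_id, init_job, simulate, analyze, run_workflow), the file names, and the toy ideal-gas "simulation" are assumptions made for illustration. The sketch shows two ingredients that a workflow of this kind plausibly relies on: each parameter set maps to a deterministic storage location, and each step can be re-run safely because it skips work whose output already exists.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def state_point_id(state_point: dict) -> str:
    """Return a deterministic ID for a parameter set (key order does not matter)."""
    canonical = json.dumps(state_point, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def init_job(root: Path, state_point: dict) -> Path:
    """Create (or reuse) the directory that holds all data for one state point."""
    job_dir = root / state_point_id(state_point)
    job_dir.mkdir(parents=True, exist_ok=True)
    (job_dir / "state_point.json").write_text(json.dumps(state_point, sort_keys=True))
    return job_dir


def simulate(job_dir: Path) -> None:
    """Hypothetical 'simulation' step; skipped if its output already exists."""
    out = job_dir / "result.json"
    if out.exists():
        return
    sp = json.loads((job_dir / "state_point.json").read_text())
    # Toy stand-in for a real simulation: ideal-gas pressure P = N * kT / V.
    out.write_text(json.dumps({"pressure": sp["N"] * sp["kT"] / sp["V"]}))


def analyze(root: Path) -> dict:
    """Collect the stored pressure from every job directory, keyed by job ID."""
    results = {}
    for job_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        result_file = job_dir / "result.json"
        if result_file.exists():
            results[job_dir.name] = json.loads(result_file.read_text())["pressure"]
    return results


def run_workflow(root: Path, state_points: list) -> dict:
    """Initialize and simulate every state point, then gather the results."""
    for sp in state_points:
        simulate(init_job(root, sp))
    return analyze(root)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Illustrative data space: one parameter (kT) varied at fixed N and V.
        space = [{"N": 100, "kT": kT, "V": 1000.0} for kT in (1.0, 1.5, 2.0)]
        print(run_workflow(Path(tmp), space))
```

Because the job ID depends only on the parameters, someone who receives the script and the data directory can re-run only the analysis step, or regenerate everything from scratch, and arrive at the same directory layout either way.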

In this talk we will give an overview of our tools and discuss next steps for those in the MIDAS community who might be interested in adopting parts of our framework. We will also briefly describe the software engineering practices we follow to ensure that our tools yield reproducible results. We will present two exemplar research projects that demonstrate these ideas in practice. Both projects require nontrivial workflows; we illustrate how we use our software stack to carry out these projects in a way that allows researchers outside our group to reproduce the entire study (simulations included), reproduce only the analyses on our raw data, or apply the same methodology to similar systems.

The Reproducibility Showcase features a series of online presentations and tutorials from May to August 2020. Presenters are selected from among the MIDAS Reproducibility Challenge 2020 entries.

A significant challenge across scientific fields is the reproducibility of research results, and third-party assessment of such reproducibility. The goal of the MIDAS Reproducibility Challenge is to highlight high-quality, reproducible work at the University of Michigan by collecting examples of best practices across diverse fields. We received a large number of entries that illustrate wonderful work in the following areas:

    1. Theory – A definition of reproducibility and what aspects of reproducibility are critical in a particular domain or in general.
    2. Reproducing a Particular Study – Comprehensive record of parameters and code that allows for others to reproduce the results in a particular project.
    3. Generalizable Tools – A general platform for coding or running analyses that standardizes the methods for reproducible results across studies.
    4. Robustness – Metadata, tools and processes to improve the robustness of results to variations in data, computational hardware and software, and human decisions.
    5. Assessments of Reproducibility – Methods to test the consistency of results from multiple projects, such as meta-analysis or the provision of parameters that can be compared across studies.
    6. Reproducibility under Constraints – Sharing code and/or data to reproduce results without violating privacy or other restrictions.