"Open science" encompasses efforts on the part of scientists to improve reproducibility of original research. This includes publishing data sets, providing free and open access to resulting publications, and releasing code under open source licenses. Proprietary software and closely guarded datasets have given way to vibrant open source communities and open access journals.
Big data applications, however, have been uniquely challenging to incorporate into open science. The complications include datasets too large to host publicly, extensive codebases tied to highly customized compute platforms, and a lack of available computing resources to efficiently replicate the original research environment.
Open source big data analytics frameworks, such as Apache Hadoop and Apache Spark, were the first steps toward democratizing big data and including it in open science. New peer-to-peer protocols and open data repositories have since appeared, and platforms for serving interactive scientific notebooks, together with open repositories for containerized applications, have enabled scientists to publish entire virtual research environments.