"Open science" encompasses efforts on the part of scientists to improve reproducibility of original research. This includes publishing data sets, providing free and open access to resulting publications, and releasing code under open source licenses. Proprietary software and closely guarded datasets have given way to vibrant open source communities and open access journals.
Big data applications, however, have been uniquely challenging to incorporate into open science. Obstacles include datasets too large to host publicly, extensive codebases built on highly customized compute platforms, and a lack of computing resources sufficient to replicate the original research environment efficiently.
This workshop will focus on current practices of and future directions for democratizing big data analytics and improving the reproducibility of big data research. Topics include, but are not limited to:
- Core research in big data that uses open source frameworks
- Open source tools and subprojects for specific big data use cases such as neuroimaging or multimodal data integration (e.g. DL4J, thunder, Alluxio, Arrow)
- Next-generation open source big data paradigms (e.g. Beam, Flink, Apex)
- Creating, sharing, and maintaining large and open datasets (e.g. dat)
- Open cloud resources for interacting with big data (e.g. Databricks Community Edition, mybinder)
- Containerized and open source big data applications and environments (e.g. Docker)
- Open science and big data in the classroom
- Other open science use cases in big data analytics
As such, submissions must have a strong open source / open science component, in addition to relevance to big data analytics.