Workshop Topics

Exploring the Intersection of Open Science and Big Data

"Open science" encompasses efforts on the part of scientists to improve reproducibility of original research. This includes publishing data sets, providing free and open access to resulting publications, and releasing code under open source licenses. Proprietary software and closely guarded datasets have given way to vibrant open source communities and open access journals.

Big data applications, however, have been uniquely challenging to incorporate into open science. The complications include datasets too large to host publicly, extensive codebases on highly customized compute platforms, and a lack of available computing resources to efficiently replicate the original research environment.

Open source big data analytics frameworks, such as Apache Hadoop and Apache Spark, have been the first steps in the democratization of big data and its inclusion in open science. New peer-to-peer protocols and open data repositories have appeared, and platforms for serving interactive scientific notebooks and open repositories for containerized applications have enabled scientists to publish entire virtual research environments.

This workshop will focus on current practices and future directions in democratizing big data analytics and improving the reproducibility of big data research. Topics include, but are not limited to:

  • Fundamental research in big data that makes significant use of or extends open source frameworks in big data analytics (e.g. Spark, Hadoop)
  • Open source tools and subprojects developed for specific big data use cases (e.g. DL4J, thunder)
  • Next-generation open source big data paradigms (e.g. Beam, Flink, Apex)
  • Open source tools and methods for creating, sharing, and maintaining large and open datasets (e.g. dat)
  • Open cloud resources for interacting with big data (e.g. mybinder)
  • Containerized and open source big data applications and environments (e.g. Docker)
  • Other open science use cases in big data analytics

In short: the OSBD workshop solicits papers in big data analytics that have significant open source / open science components.