
Improving Reproducibility of Data Science Experiments

Date published: 
Tuesday, 2 February, 2016
Document type: 
Conference paper
Author(s): 
T. Likhomanenko
A. Rogozhnikov
A. Baranov
E. Khairullin
A. Ustyuzhanin
Data analysis in fundamental sciences is nowadays an essential process that pushes the frontiers of our knowledge and leads to new discoveries. At the same time, the complexity of these analyses is increasing rapidly due to a) the enormous volumes of datasets being analyzed, b) the variety of techniques and algorithms one has to check within a single analysis, and c) the distributed nature of research teams, which requires special communication media for knowledge and information exchange between individual researchers. There is a lot of resemblance between the techniques and problems arising in the areas of industrial information retrieval and particle physics. To address these problems we propose the Reproducible Experiment Platform (REP), a software infrastructure to support a collaborative ecosystem for computational science. It is a Python-based solution that allows research teams to run computational experiments on shared datasets, obtain repeatable results, and make consistent comparisons of the obtained results. Key features of REP are illustrated on several practical analysis cases performed at the LHCb experiment at CERN.

Keywords: machine learning, reproducibility, computation infrastructure, analysis preservation
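As a rough illustration of the kind of workflow the abstract describes (shared dataset, repeatable results, consistent comparison of models), the following minimal sketch uses plain scikit-learn rather than REP's own API; the dataset, classifier choices, and seed are assumptions for illustration only, not taken from the paper.

```python
# Hypothetical sketch (plain scikit-learn, not REP's API): compare two
# classifiers on one shared dataset with fixed seeds, so that the
# comparison is repeatable -- the kind of workflow REP standardizes.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

SEED = 42  # fixing the random seed is one ingredient of repeatability

# Stand-in for a shared, versioned dataset used by the whole team.
X, y = make_classification(n_samples=2000, n_features=20, random_state=SEED)

classifiers = {
    "gradient_boosting": GradientBoostingClassifier(random_state=SEED),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=SEED),
}

# Consistent comparison: identical cross-validation folds and metric
# for every model under study.
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: ROC AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```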
Event published at: 
AutoML 2015 workshop @ ICML 2015, Lille, France