CMS Data-Services Ingestion into CERN’s Hadoop Big Data Analytics Infrastructure

Date published: 
Tuesday, 1 September, 2015
Document type: 
Summer student report
Author(s): 
A. Bose
This document introduces HLoader, a new data ingestion framework built around Apache Sqoop to run data ingestion jobs between relational database management systems (RDBMS) and the Hadoop Distributed File System (HDFS). The HLoader framework, deployed as a service inside CERN, will be used for ingesting CMS Data Popularity data into Hadoop clusters; it could also serve similar use cases, such as CMS and ATLAS Job Monitoring and the ACCLOG databases. The first part of the report describes the architecture of HLoader and gives some background on Apache Sqoop. The rest of the report focuses on the HLoader programming API; it is meant as an entry point for developers, describing how HLoader works and possible directions for extending the framework in the future.
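To make the RDBMS-to-HDFS ingestion concrete, the sketch below shows the kind of Sqoop 1 import invocation an HLoader-style scheduler might assemble and submit. The flags are standard Sqoop import options, but the database URL, table name, credentials, and HDFS target path are hypothetical placeholders, not taken from the report; the script only prints the command rather than executing it, so it stays self-contained.

```shell
#!/bin/sh
# Sketch of a Sqoop 1 import job that an HLoader-like scheduler could submit.
# All connection details below are hypothetical placeholders.

DB_URL="jdbc:oracle:thin:@//dbhost.example.cern.ch:1521/cmsdb"  # hypothetical
TABLE="POPULARITY_LOG"                                          # hypothetical
TARGET_DIR="/project/cms/popularity/$(date +%Y%m%d)"            # hypothetical HDFS path

# Assemble the command as a string and print it, instead of running Sqoop,
# so the sketch can be inspected without a Hadoop installation.
CMD="sqoop import \
  --connect $DB_URL \
  --username cms_reader \
  --password-file /user/hloader/.dbpass \
  --table $TABLE \
  --target-dir $TARGET_DIR \
  --num-mappers 4 \
  --as-avrodatafile"

echo "$CMD"
```

In a real deployment the scheduler would run this command (or the equivalent Sqoop Java API call) per configured job, with the connection parameters drawn from its own job registry rather than hard-coded as above.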