SparkMonitor

Final Report | Installation | How it Works | Use Cases | Code | License

Google Summer of Code 2017 Final Report

Big Data Tools for Physics Analysis

Introduction

Jupyter Notebook is an interactive computing environment used to create notebooks that combine code, output, plots, widgets and explanatory text. It offers a convenient platform for interactive data analysis, scientific computing and rapid prototyping of code. Apache Spark, a framework for large-scale cluster computing in Big Data contexts, is a powerful tool for computation-intensive tasks. This project brings these existing big data tools into an interactive scientific analysis environment.

Spark jobs can be submitted from an IPython kernel in Jupyter Notebook using the PySpark module, and the results of the computation can be visualized and plotted within the notebook interface. However, to see what is happening to a running job, the user has to connect separately to the Spark web UI server. This project implements SparkMonitor, an extension for Jupyter Notebook that enables jobs submitted from a notebook to be monitored from within the notebook itself. The extension integrates seamlessly with the cell structure of the notebook and provides real-time monitoring capabilities.
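As a concrete illustration, the snippet below shows a notebook cell submitting a Spark job that the extension can monitor. It is a minimal sketch, assuming a local Spark installation and the pyspark package; the project's documentation describes the kernel extension injecting a SparkConf object named `conf` into the notebook namespace, so the fallback to a plain local configuration here is only so the sketch also runs without the extension.

```python
from pyspark import SparkConf, SparkContext

# SparkMonitor's kernel extension places a SparkConf named `conf` in the
# notebook namespace (per the project documentation); fall back to a plain
# local configuration so this sketch also runs without the extension.
conf = globals().get("conf", SparkConf().setMaster("local[*]"))
sc = SparkContext.getOrCreate(conf=conf)

# Any Spark action triggers jobs, which the extension displays
# beneath the cell in real time as they run.
rdd = sc.parallelize(range(1000000), 8)
total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)
```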

Features

The Monitoring Display

Jobs

Tasks

Timeline

Spark UI

Example Use Cases

The extension has been tested with a range of Spark applications and use cases.
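For illustration, the sketch below shows the kind of long-running, many-task application the extension is designed to monitor. It is not one of the project's recorded use cases, just the standard Monte Carlo estimate of pi, chosen because its many tasks exercise the job, task and timeline views of the monitoring display.

```python
import random
from operator import add
from pyspark import SparkContext

sc = SparkContext.getOrCreate()

def inside(_):
    # Sample a random point in the unit square and test whether it
    # falls inside the quarter circle.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y < 1.0 else 0

n = 10 * 1000 * 1000
# 100 partitions produce 100 tasks, enough to populate the task
# table and timeline of the monitoring display.
count = sc.parallelize(range(n), 100).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))
```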

Integration in SWAN and CERN IT Infrastructure

Documentation

How it Works

Code Documentation

Installation

Future Work

Pending Work

Future Ideas