Telemetry/Custom analysis with spark
Revision as of 14:16, 11 August 2016

This page is currently in progress.

Introduction

Spark is a data processing engine designed to be fast and easy to use. We have set up Jupyter notebooks that use Spark to analyze our Telemetry data. These notebooks can be easily shared and updated among colleagues, enabling richer analysis than SQL alone.

Spark clusters can be spun up on analysis.telemetry.mozilla.org, abbreviated as ATMO. The Spark Python API is called pyspark.

Setting Up a Spark Cluster On ATMO

  1. Go to analysis.telemetry.mozilla.org
  2. Click “Launch an ad-hoc Spark cluster”.
  3. Enter some details:
    1. The “Cluster Name” field should be a short descriptive name, like “chromehangs analysis”.
    2. Set the number of workers for the cluster. Please use resources sparingly: a single worker is sufficient for writing and debugging a job.
    3. Upload your SSH public key.
  4. Click “Submit”.
  5. A cluster will be launched on AWS preconfigured with Spark, IPython and some handy data analysis libraries like pandas and matplotlib.

Once the cluster is ready, you can tunnel IPython through SSH by following the instructions on the dashboard and running the ssh shell command shown there. For example:

ssh -i ~/.ssh/id_rsa -L 8888:localhost:8888 hadoop@ec2-54-70-129-221.us-west-2.compute.amazonaws.com

Finally, you can launch IPython in Firefox by visiting http://localhost:8888.
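Once the notebook is open, a quick way to check that the cluster works is to run a tiny map/reduce pipeline before loading real Telemetry data. Below is a runnable pure-Python analogue of that first cell; the equivalent pyspark calls are shown in comments, and the name `sc` (a SparkContext predefined by the notebook environment) is an assumption based on typical Spark notebook setups, not something this page specifies.

```python
# Pure-Python analogue of a minimal pyspark smoke test.
# In the ATMO notebook the equivalent would be (assuming `sc` is
# the notebook's predefined SparkContext):
#   rdd = sc.parallelize(range(10))
#   total = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
from functools import reduce

data = range(10)
# map: transform each element independently (square it)
squares = map(lambda x: x * x, data)
# reduce: combine all elements pairwise into a single value
total = reduce(lambda a, b: a + b, squares)
print(total)  # 285, the sum of squares 0..9
```

If the pyspark version of this cell returns 285, the workers are reachable and you can move on to loading Telemetry data.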