Telemetry/Custom analysis with spark

This document is now hosted in the unified documentation, [http://docs.telemetry.mozilla.org/tools/spark.html here]
 
== Introduction ==
Spark is a data processing engine designed to be fast and easy to use. We have set up Jupyter notebooks that use Spark to analyze our Telemetry data. Jupyter notebooks can be easily shared and updated among colleagues, enabling richer analysis than SQL alone.
 
Spark clusters can be spun up on analysis.telemetry.mozilla.org, abbreviated as atmo. The Spark Python API is called pyspark.
 
== Setting Up a Spark Cluster On atmo ==
 
# Go to analysis.telemetry.mozilla.org
# Click “Launch an ad-hoc Spark cluster”.
# Enter some details:
## The “Cluster Name” field should be a short descriptive name, like “chromehangs analysis”.
## Set the number of workers for the cluster. Please use resources sparingly; a single worker is sufficient for writing and debugging a job.
## Upload your SSH public key.
# Click “Submit”.
# A cluster will be launched on AWS preconfigured with Spark, IPython and some handy data analysis libraries like pandas and matplotlib.
 
Once the cluster is ready, you can tunnel IPython through SSH by following the instructions on the dashboard and running the ssh shell command shown there. For example:
 
  ssh -i ~/.ssh/id_rsa -L 8888:localhost:8888 hadoop@ec2-54-70-129-221.us-west-2.compute.amazonaws.com
 
Finally, you can launch IPython in Firefox by visiting http://localhost:8888.
 
== The Python Jupyter Notebook ==
When you access http://localhost:8888, two example Jupyter notebooks are available to peruse. To create a new Jupyter notebook, select New -> Python 2.
 
Starting out, we recommend looking through the "Telemetry Hello World" notebook. It gives a nice overview of Jupyter and analyzing telemetry data using pyspark and plotly.
 
=== Using Jupyter ===
Jupyter Notebooks contain a series of cells. Each cell contains code or markdown. To switch between the two, use the dropdown at the top. To run a cell, use shift-enter; this either renders the markdown or runs the code. To create a new cell, select Insert -> Insert Cell Below.
 
A cell can output text or plots. To output plots inline with the cell, run the following command, usually below your import statements:
  %pylab inline
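
For example, after running %pylab inline, a cell like the following (a minimal sketch; the pylab magic imports numpy and matplotlib names into the notebook namespace) renders its figure directly below the cell:

  # plot one period of a sine wave; the figure appears inline below the cell
  x = linspace(0, 2 * pi, 100)
  plot(x, sin(x))
  title("Inline plot example")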
 
The notebook is set up to work with Spark. See the "Using Spark" section for more information.
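
As a quick check that Spark is wired up (a minimal sketch, assuming the preconfigured SparkContext is exposed as sc, as in the example notebooks):

  # sc is the SparkContext preconfigured by the notebook
  rdd = sc.parallelize(range(1000))
  rdd.filter(lambda x: x % 2 == 0).count()  # evaluates to 500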
 
== MozTelemetry ==
We have provided a library that gives easy access to the raw telemetry ping data. For example usage, see the `Telemetry Hello World.ipynb` example notebook. Detailed documentation for the library can be found at the [http://python-moztelemetry.readthedocs.io Python MozTelemetry Documentation].
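
As a sketch of what loading pings looks like with the library's Dataset API (the filter names and values here follow the Telemetry Hello World example; treat them as assumptions and consult the documentation above for the authoritative API):

  from moztelemetry.dataset import Dataset

  # fetch a 1% sample of main pings submitted on one day from the nightly channel
  pings = (Dataset.from_source("telemetry")
           .where(docType="main",
                  submissionDate="20170601",
                  appUpdateChannel="nightly")
           .records(sc, sample=0.01))
  pings.count()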
 
== FAQ ==
Please add more FAQ entries as questions are answered by you or for you.
 
=== How can I load parquet datasets in a Jupyter notebook? ===
Use sqlContext.read.load, e.g.:
 
  dataset = sqlContext.read.load("s3://the_bucket/the_prefix/the_version", "parquet")
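
Once loaded, you can inspect and query the resulting DataFrame; a sketch (registerTempTable is the Spark 1.x-era name, still available in Spark 2 alongside createOrReplaceTempView):

  dataset.printSchema()
  dataset.registerTempTable("dataset")
  sqlContext.sql("SELECT count(*) FROM dataset").show()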
 
=== I got a REMOTE HOST IDENTIFICATION HAS CHANGED! error ===
AWS recycles hostnames, so removing the offending key from $HOME/.ssh/known_hosts will remove the warning. You can identify the line to remove from the error output, which includes a line like
  Offending key in /path/to/hosts/known_hosts:2
where 2 is the line number of the key to delete. Remove that line, save the file, and try again.
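
Alternatively, ssh-keygen can remove the stale entry for you (substitute the hostname of your cluster, as shown on the dashboard):

  ssh-keygen -R ec2-54-70-129-221.us-west-2.compute.amazonaws.com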
 
=== Why is my notebook hanging? ===
There are a couple of common causes for this:
# Currently, our Spark notebooks can only run a single Python kernel at a time. If you open multiple notebooks on the same cluster and try to run both, the second notebook will hang. Be sure to close notebooks using "Close and Halt" under the "File" dropdown.
# The connection from PySpark to the Spark driver might be lost. Unfortunately, the best way to recover from this at the moment is to spin up a new cluster.
