Telemetry/Custom analysis with spark



=== Spark SQL and Spark Dataframes/Datasets ===
Spark also supports traditional SQL, along with special data structures that require schemas. The Spark SQL API can be accessed with the `spark` object. For example:


     longitudinal = spark.sql('SELECT * FROM longitudinal')


creates a DataFrame that contains all the longitudinal data. A Spark DataFrame is essentially a distributed table, a la Pandas or R data frames. Under the covers, a DataFrame is an RDD of Row objects, so the entire RDD API is available for DataFrames, along with a DataFrame-specific API. For example, here is a SQL-like way to get the count of a specific OS:
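(A minimal sketch: the longitudinal schema is not shown here, so the `os` column name below is an assumption for illustration.)

    # SQL-like query: count rows for one OS (the "os" column is illustrative)
    spark.sql("SELECT count(*) FROM longitudinal WHERE os = 'Darwin'").show()

    # The same count via the DataFrame API
    longitudinal.where(longitudinal.os == 'Darwin').count()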


===Accessing the Spark UI===
Go to localhost:8888/spark after ssh-ing into the Spark cluster to see the Spark UI. It has information about job statuses and task completion, and may help you debug your job.


== The MozTelemetry Library ==


=== How can I load parquet datasets in a Jupyter notebook? ===
Use `spark.read.parquet`, e.g.:


   dataset = spark.read.parquet("s3://the_bucket/the_prefix/the_version")
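
If you then want to query the loaded data with SQL, you can register the DataFrame as a temporary view; a minimal sketch (the view name is arbitrary):

    # Register the DataFrame so it can be referenced from SQL queries
    dataset.createOrReplaceTempView("dataset")
    spark.sql("SELECT count(*) FROM dataset").show()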


=== I got a REMOTE HOST IDENTIFICATION HAS CHANGED! error ===
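This is standard SSH behavior: the host key for the cluster's address changed, which typically happens when a cluster is terminated and a new one comes up at the same address. As a general fix (not specific to this setup), remove the stale entry from `~/.ssh/known_hosts`, e.g. with `ssh-keygen -R <hostname>`, and connect again.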