Telemetry/Custom analysis with spark



=== Spark SQL and Spark Dataframes/Datasets ===
Spark also supports traditional SQL, along with special data structures that require schemas. The Spark SQL API can be accessed with the `spark` object. For example:


     longitudinal = spark.sql('SELECT * FROM longitudinal')


creates a DataFrame that contains all the longitudinal data. A Spark DataFrame is essentially a distributed table, a la Pandas or R data frames. Under the covers, a DataFrame is an RDD of Row objects, so the entire RDD API is available for DataFrames, along with a DataFrame-specific API. For example, here is a SQL-like way to get the count of a specific OS:
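(A minimal sketch: the longitudinal schema is not shown here, so the `os` column name below is an assumption for illustration.)

    # SQL-like query: count rows for one OS (the "os" column is illustrative)
    spark.sql("SELECT count(*) FROM longitudinal WHERE os = 'Darwin'").show()

    # The same count via the DataFrame API
    longitudinal.where(longitudinal.os == 'Darwin').count()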


===Accessing the Spark UI===
Go to localhost:8888/spark after ssh-ing into the Spark cluster to see the Spark UI. It has information about job statuses and task completion, and may help you debug your job.


== The MozTelemetry Library ==


=== How can I load parquet datasets in a Jupyter notebook? ===
Use `spark.read.parquet`, e.g.:


   dataset = spark.read.parquet("s3://the_bucket/the_prefix/the_version")
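
If you then want to query the loaded data with SQL, you can register the DataFrame as a temporary view; a minimal sketch (the view name is arbitrary):

    # Register the DataFrame so it can be referenced from SQL queries
    dataset.createOrReplaceTempView("dataset")
    spark.sql("SELECT count(*) FROM dataset").show()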


=== I got a REMOTE HOST IDENTIFICATION HAS CHANGED! error ===
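This is standard SSH behavior: the host key for the cluster's address changed, which typically happens when a cluster is terminated and a new one comes up at the same address. As a general fix (not specific to this setup), remove the stale entry from `~/.ssh/known_hosts`, e.g. with `ssh-keygen -R <hostname>`, and connect again.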