m (→How do I load an external library into the cluster?: remove redundant steps) |
Amiyaguchi (talk | contribs) (Use spark instead of sqlContext) |
||
Line 101: | Line 101: | ||
=== Spark SQL and Spark Dataframes/Datasets === | === Spark SQL and Spark Dataframes/Datasets === | ||
Spark also supports traditional SQL, along with special data structures that require schemas. The Spark SQL API can be accessed with ` | Spark also supports traditional SQL, along with special data structures that require schemas. The Spark SQL API can be accessed with the `spark` object. For example: | ||
longitudinal = | longitudinal = spark.sql('SELECT * FROM longitudinal') | ||
creates a DataFrame that contains all the longitudinal data. A Spark DataFrame is essentially a distributed table, a la pandas or R data frames. Under the covers it is an RDD of Row objects, so the entire RDD API is available for DataFrames, as well as a DataFrame-specific API. For example, a SQL-like way to get the count of a specific OS: | creates a DataFrame that contains all the longitudinal data. A Spark DataFrame is essentially a distributed table, a la pandas or R data frames. Under the covers it is an RDD of Row objects, so the entire RDD API is available for DataFrames, as well as a DataFrame-specific API. For example, a SQL-like way to get the count of a specific OS: | ||
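A minimal sketch of this kind of count, expressed through `spark.sql` rather than the DataFrame API. It assumes the `longitudinal` table exposes an `os` column and that `'Darwin'` is one of its values; neither is confirmed by the text here. The helper names `os_count_query` and `count_os` are hypothetical.

```python
def os_count_query(os_name, table="longitudinal"):
    """Build a SQL string that counts rows whose `os` column equals os_name.

    Assumption: the table has an `os` column; adjust to the real schema.
    """
    # Refuse quotes and backslashes instead of guessing at SQL escaping rules.
    if "'" in os_name or "\\" in os_name:
        raise ValueError("os_name must not contain quotes or backslashes")
    return "SELECT COUNT(*) AS n FROM {} WHERE os = '{}'".format(table, os_name)


def count_os(spark, os_name):
    """Run the count with the `spark` object and return the number of rows."""
    # spark.sql returns a DataFrame; first() fetches its single Row.
    return spark.sql(os_count_query(os_name)).first()["n"]
```

In a notebook where `spark` is already defined, `count_os(spark, "Darwin")` would return the count as a plain integer.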
Line 123: | Line 123: | ||
===Accessing the Spark UI=== | ===Accessing the Spark UI=== | ||
Go to localhost:8888/spark after | Go to localhost:8888/spark after SSHing into the Spark cluster to see the Spark UI. It shows job statuses and task completion, and may help you debug your job. | ||
== The MozTelemetry Library == | == The MozTelemetry Library == | ||
Line 154: | Line 154: | ||
=== How can I load parquet datasets in a Jupyter notebook? === | === How can I load parquet datasets in a Jupyter notebook? === | ||
Use | Use spark.read.parquet, e.g.: | ||
dataset = | dataset = spark.read.parquet("s3://the_bucket/the_prefix/the_version") | ||
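The path in the example follows a bucket/prefix/version layout. A small sketch, assuming that layout holds for the dataset you want: `parquet_path` and `load_dataset` are hypothetical helpers, not part of Spark or MozTelemetry, and the bucket, prefix, and version values remain placeholders.

```python
def parquet_path(bucket, prefix, version):
    """Join the pieces into an s3:// URI shaped like the one in the example."""
    # Drop stray slashes so the parts join cleanly.
    parts = [p.strip("/") for p in (bucket, prefix, version)]
    if not all(parts):
        raise ValueError("bucket, prefix and version must be non-empty")
    return "s3://" + "/".join(parts)


def load_dataset(spark, bucket, prefix, version):
    """Read the Parquet files at that location into a DataFrame."""
    # `spark` is the same object used for spark.read.parquet in the example.
    return spark.read.parquet(parquet_path(bucket, prefix, version))
```

Keeping the path construction separate from the read makes it easy to print and check the URI before Spark touches S3.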
=== I got a REMOTE HOST IDENTIFICATION HAS CHANGED! error === | === I got a REMOTE HOST IDENTIFICATION HAS CHANGED! error === |