With the release of Apache Spark 1.4.1 on July 15th, 2015, I wanted to write a step-by-step guide to help new users get up and running with SparkR locally on a Windows machine using the command shell and RStudio. SparkR provides an R frontend to Apache Spark, and Spark’s distributed computation engine lets R users run large-scale data analysis from the R shell. The steps listed here will also be documented in my upcoming online book titled “Getting Started with SparkR for Big Data Analysis”, which can be accessed at: http://www.danielemaasit.com/getting-started-with-sparkr/. These steps should get you up and running in less than 5 minutes.
Before you begin, make sure you have Java 6 or later installed on your computer and that the system environment variables for Java are set.
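If you already have R installed, one quick way to check the Java prerequisite is from an R session. The snippet below is only a sketch: it assumes that "system environments set" means that java is on your PATH and that a JAVA_HOME variable is defined. JAVA_HOME is my assumption here, not something Spark's download page asks for by name.

```r
# Look for the java executable on the PATH (returns "" if it is not found).
java_path <- Sys.which("java")

# Read JAVA_HOME (assumed variable name; returns "" if it is not set).
java_home <- Sys.getenv("JAVA_HOME")

if (nzchar(java_path)) {
  message("java found at: ", java_path)
} else {
  message("java was not found on the PATH")
}

if (nzchar(java_home)) {
  message("JAVA_HOME is set to: ", java_home)
} else {
  message("JAVA_HOME is not set")
}
```

If either message reports a problem, fix your Java installation or environment variables before continuing with Step 1.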
Step 1: Download Spark
Open your web browser and open this web page: http://spark.apache.org/. This is the official website for the Apache Spark project. You should see a large green button to the right of the page that reads “Download Spark”, as shown in Figure 1. Click the green button.
Clicking the green button will take you to the download page as shown in Figure 2 below.
Follow steps 1 to 3 to create a download link for a Spark package of your choice. On the “2. Choose a package type” option, select any pre-built package type from the drop-down list (Figure 3). Since we want to experiment locally on Windows, a pre-built package for Hadoop 2.6 and later will suffice.
On the “3. Choose a download type” option, select “Direct Download” from the drop-down list (Figure 4).
After selecting the download type, a link is created next to the option “4. Download Spark” (Figure 5). Click this link to download Spark.
Save the zipped file to your computer (Figure 6).
Step 2: Unzip Built Package
Unzip the package and save the files to a folder of your choice. In Figure 7 below, I chose to save to “C:/Apache/Spark-1.4.1”.
Step 3: Run in Command Prompt
Now start your favorite command shell and change directory to your Spark folder as shown in Figure 8.
To start SparkR, simply run the command ".\bin\sparkR" from the top-level Spark directory as shown in Figure 9 below.
Logs will scroll on your screen while SparkR launches, which should take at most 15 seconds. If everything ran smoothly you should see a welcome message that reads “Welcome to SparkR!” as shown in Figure 10.
At this point you are ready to start prototyping with SparkR on the command shell. Note that a Spark context and a SQL Context have been initialized for you as “sc” and “sqlContext” respectively. You can now start experimenting using the example shown in Step 4.5.
Running in RStudio
While using SparkR in the command shell is good for quickly getting started, most R users work in an Integrated Development Environment (IDE) like RStudio for developing and running production-ready code. Step 4 below will guide you through getting started with SparkR in RStudio.
Step 4: Run in RStudio
- Step 4.1: Set System Environment
Once you have opened RStudio, you need to set the system environment first, so that your R session points to the installed version of SparkR. Use the code shown in Figure 11 below, but replace the value of SPARK_HOME with the path to your Spark folder. Mine is “C:/Apache/Spark-1.4.1”.
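In case the figure is not visible, setting the variable can be sketched in one line of R. The path below is the folder I used in Step 2, so treat it as an assumption and substitute your own.

```r
# Point the R session at the Spark installation.
# Assumption: Spark was unzipped to this folder; replace it with your own path.
Sys.setenv(SPARK_HOME = "C:/Apache/Spark-1.4.1")

# Confirm that the variable is now visible to this R session.
Sys.getenv("SPARK_HOME")
```

Forward slashes work fine in R paths on Windows, so there is no need to escape backslashes.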
- Step 4.2: Set the Library Paths
Second, you have to set the library path for Spark as shown in Figure 12 below.
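As a rough sketch of this step: the SparkR package ships inside the Spark folder under R/lib, so that folder is what gets added to R's library search path. The SPARK_HOME value is again my own install path and is only an assumption for your machine.

```r
# Assumption: Spark was unzipped to this folder; replace it with your own path.
Sys.setenv(SPARK_HOME = "C:/Apache/Spark-1.4.1")

# Build the path to the R packages bundled with Spark (<SPARK_HOME>/R/lib).
sparkr_lib <- file.path(Sys.getenv("SPARK_HOME"), "R", "lib")

# Put that folder in front of the existing library paths.
.libPaths(c(sparkr_lib, .libPaths()))

# Inspect the result.
.libPaths()
```

R silently drops library paths that do not exist, so if the Spark folder is missing from the output of .libPaths(), double-check the value of SPARK_HOME.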
- Step 4.3: Load SparkR Library
Next, you can load SparkR just as you would any other R library, using the library() command as shown in Figure 13.
- Step 4.4: Initialize Spark Context and SQL Context
Initialize SparkR by creating a Spark context using the command sparkR.init(). The argument in this command is master = "local[N]", where N stands for the number of threads that you want to use.
You also need to create a SQL context to be able to work with DataFrames (the main abstraction in SparkR). Use the command sparkRSQL.init() to create a SQL context from your Spark context as shown in Figure 14.
When you run the above commands (Steps 4.1 to 4.4), R invokes the “spark-submit” script, which launches Java, as shown in Figure 15. If this runs successfully, your Spark context and SQL context should be created, and at this stage you should be able to start experimenting with SparkR.
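Put together, Steps 4.1 to 4.4 can be sketched as the short script below. It assumes my install path from Step 2 and uses two local threads; both values are illustrative, so adjust them for your machine. The names sc and sqlContext simply mirror the ones the SparkR shell creates for you.

```r
# Step 4.1: point R at the Spark installation.
# Assumption: replace this path with your own Spark folder.
Sys.setenv(SPARK_HOME = "C:/Apache/Spark-1.4.1")

# Step 4.2: add the SparkR package folder to the library search path.
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

# Step 4.3: load SparkR like any other R package.
library(SparkR)

# Step 4.4: create a Spark context running locally with 2 threads
# (the thread count is an arbitrary choice for this sketch) ...
sc <- sparkR.init(master = "local[2]")

# ... and a SQL context on top of it, needed for working with DataFrames.
sqlContext <- sparkRSQL.init(sc)
```

This script needs a working Spark 1.4.x installation at the given path; without one, library(SparkR) will fail because the package cannot be found.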
- Step 4.5: A Quick Example
You can now start experimenting with SparkR, either in the command shell or in RStudio. While your jobs are running, you can monitor them using the Spark UI at http://localhost:4040.
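As one possible starting point, here is a minimal sketch that turns R's built-in faithful dataset into a SparkR DataFrame and runs a few basic operations on it. The choice of dataset, the filter threshold, and the variable names are my own illustrative assumptions. The setup at the top runs only if no sqlContext exists yet, so the same script should work both in the SparkR shell (where sc and sqlContext are already created) and in RStudio.

```r
# Set up SparkR only if a SQL context does not exist yet (e.g. in RStudio).
if (!exists("sqlContext")) {
  # Assumption: replace this path with your own Spark folder.
  Sys.setenv(SPARK_HOME = "C:/Apache/Spark-1.4.1")
  .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
  library(SparkR)
  sc <- sparkR.init(master = "local[2]")
  sqlContext <- sparkRSQL.init(sc)
}

# Convert a local R data frame into a distributed SparkR DataFrame.
df <- createDataFrame(sqlContext, faithful)

# Peek at the first rows (head() returns an ordinary R data frame).
head(df)

# Select a single column.
head(select(df, df$eruptions))

# Keep only the rows with a waiting time under 50 minutes.
head(filter(df, df$waiting < 50))

# Count how often each waiting time occurs, most frequent first.
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))
```

Each head() call triggers a Spark job, so this is also a convenient way to see something appear in the Spark UI. When you are done in RStudio, sparkR.stop() shuts the Spark context down.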
The purpose of this blog post was to get you up and running quickly with SparkR locally on a personal computer. In the next blog post, I will show you how to use SparkR on a cloud computing framework like Amazon Elastic Compute Cloud (EC2) to manipulate large datasets with millions of records.