The objective of this blog post is to demonstrate how to use Apache SparkR to power Shiny applications. I have been curious about what the use cases for a “Shiny-SparkR” application would be, and about how to develop and deploy such an app.
SparkR is an R package that provides a lightweight frontend for using Apache Spark from R. It provides a distributed data frame implementation that supports operations such as selection, filtering, and aggregation (similar to R data frames and dplyr), but on large datasets. SparkR also supports distributed machine learning using MLlib.
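To make the selection, filtering, and aggregation operations concrete, here is a minimal sketch using the Spark 1.5-era SparkR API. It assumes Spark is installed locally and that the SPARK_HOME environment variable points at it. The built-in faithful dataset and the application name are illustrative choices, not something the posts prescribe.

```r
# Assumption: SPARK_HOME points at a local Spark 1.5.x installation.
library(SparkR, lib.loc = file.path(Sys.getenv("SPARK_HOME"), "R", "lib"))

# Start a local Spark context and the SQL context that DataFrames need.
# (The app name is a hypothetical label chosen for this sketch.)
sc <- sparkR.init(master = "local[*]", appName = "sparkr-dataframe-sketch")
sqlContext <- sparkRSQL.init(sc)

# Turn a small local R data frame into a distributed SparkR DataFrame.
df <- createDataFrame(sqlContext, faithful)

# Selection: keep a single column.
eruptions_only <- select(df, "eruptions")

# Filtering: keep rows with a waiting time under 50 minutes.
short_waits <- filter(df, df$waiting < 50)

# Aggregation: count the rows for each distinct waiting time.
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))

# Bring the (small) results back into ordinary R data frames.
local_short <- collect(short_waits)
local_counts <- collect(waiting_counts)
print(head(local_short))
print(head(local_counts))

# Shut the Spark context down when finished.
sparkR.stop()
```

The operations are evaluated lazily on the Spark side, and only collect() and head() pull data back into the R session, so it is worth filtering or aggregating before collecting when the dataset is large. Later Spark releases changed the entry point, so this sketch may need adjusting for versions beyond the ones discussed here.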
In this blog post, we shall learn how to launch a Spark standalone cluster on Amazon Web Services (AWS) Elastic Compute Cloud (EC2) for analysis of Big Data. This is a continuation of our previous blog post, which showed how to download Apache Spark and start SparkR locally on Windows, both from the command shell and from RStudio.
We shall use Spark 1.5.1 (released on October 2, 2015), which ships with a spark-ec2 script for installing standalone Spark on AWS EC2. A nice feature of this script is that it installs RStudio Server as well, so you don’t need to install it separately and can start working with your data as soon as Spark is installed.
With the release of Apache Spark 1.4.1 on July 15th, 2015, I wanted to write a step-by-step guide to help new users get up and running with SparkR locally on a Windows machine using the command shell and RStudio. SparkR provides an R frontend to Apache Spark, and Spark’s distributed computation engine allows R users to run large-scale data analysis from the R shell. The steps listed here will also be documented in my upcoming online book titled “Getting Started with SparkR for Big Data Analysis”, which can be accessed at: http://www.danielemaasit.com/getting-started-with-sparkr/. These steps will get you up and running in less than five minutes.