The objective of this blog post is demonstrate how to use Apache SparkR to power Shiny applications. I have been curious about what the use cases for a “Shiny-SparkR” application would be and how to develop and deploy such an app.
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R. SparkR provides a distributed data frame implementation that supports operations like selection, filtering, aggregation etc. (similar to R data frames, dplyr) but on large datasets. SparkR also supports distributed machine learning using MLlib.
The objective of this blog post is to help you get started with Apache Zeppelin notebook for your R data science requirements. Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with Scala(with Apache Spark), Python(with Apache Spark), SparkSQL, Hive, Markdown, Shell and more.
However, the latest official release, version 0.5.0, does not yet support the R programming language. Fortunately NFLabs, the company driving this open source project, pointed me to this pull request that provides an R Interpreter. An Interpreter is a plug-in which enables zeppelin users to use a specific language/data-processing-backend. For example to use scala code in Zeppelin, you need a
spark interpreter. So, if you are impatient like I am for R-integration into Zeppelin, this tutorial will show you how to setup Zeppelin for use with R by building from source.
In this blog post, we shall learn how to launch a Spark stand alone cluster on Amazon Web Services (AWS) Elastic Compute Cloud (EC2) for analysis of Big Data. This is a continuation from our previous blog, which showed us how to download Apache Spark and start SparkR locally on windows OS and RStudio.
We shall use Spark 1.5.1 (released on October 02, 2015) which has a spark-ec2 script that is used to install stand alone Spark on AWS EC2. A nice feature about this spark-ec2 script is that it installs RStudio server as well. This means that you don’t need to install RStudio server separately. Thus you can start working with your data immediately after Spark is installed.
With the release of Apache Spark 1.4.1 on July 15th, 2015, I wanted to write a step-by-step guide to help new users get up and running with SparkR locally on a Windows machine using command shell and RStudio. SparkR provides an R frontend to Apache Spark and using Spark’s distributed computation engine allows R-Users to run large scale data analysis from the R shell. The steps listed here WILL also be documented in my upcoming online book titled “Getting Started with SparkR for Big Data Analysis” which can be accessed at: http://www.danielemaasit.com/getting-started-with-sparkr/. These steps will get you up and running in less than 5 mins.