Exposes the Spark API (written in Scala) from R.
Spark allows access to the Hadoop ecosystem.
Simply put: using dplyr syntax!
Created by the RStudio team in 2016.
Reading and writing data at very large scale!
Thanks to HDFS, the Hadoop Distributed File System.
Different from NTFS, FAT, Ext3…
Optimized for files larger than 100 MB.
Sequential file reading… but how?
Resilient Distributed Dataset (RDD).
Spark's basic data structure, backed by files in HDFS.
Composed of blocks, replicated on several nodes.
Recoverable if any node is lost.
They form the basis of the dataframes that sparklyr will use.
Blocks of 128 MB.
Replicated 3 times (by default) on different nodes.
The NameNode holds the directory and metadata tree.
Classic paradigm: data goes to where the data-processing software is located (e.g. MS Excel).
Big-Data paradigm: data-processing programs go where data is located (Spark).
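A minimal sketch of this paradigm with sparklyr, assuming a Spark connection sc (created as shown next); the HDFS path and table name are hypothetical. The file stays distributed in HDFS, and each block typically becomes one Spark partition:

library(sparklyr)
library(dplyr)

# Hypothetical file already stored in HDFS
taxi <- spark_read_csv(sc, name = "taxi",
                       path = "hdfs:///user/myuser/taxi.csv")

# Roughly one Spark partition per HDFS block
sdf_num_partitions(taxi)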
# Spark context: local (for 'low-memory' tasks only!)
sc <- spark_connect(master = "local",
                    spark_home = "/usr/hdp/2.4.2.0-258/spark")
# Spark context: yarn (for loading bigger datasets)
sc <- spark_connect(master = "yarn-client",
                    spark_home = "/usr/hdp/2.4.2.0-258/spark")
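Once connected, a quick sanity check is to copy a small dataset into Spark and list the tables registered in the context; this is just a sketch, with mtcars as a toy local data frame:

# Copy a small local data frame into the Spark context
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)
src_tbls(sc)   # should list "mtcars"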
From the CESGA console we can access two different filesystems:
Driver: where the program that connects to Spark resides; it translates sparklyr calls into Spark operations.
Master: launches the operations (local, client, or cluster mode).
YARN manages the resources requested by Spark.
Executors carry out the operations on the data.
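A minimal sketch of how executor resources can be requested through spark_config(); the values below are illustrative and should match the cluster's YARN limits:

config <- spark_config()
config$spark.executor.memory    <- "2G"   # memory per executor
config$spark.executor.cores     <- 2      # cores per executor
config$spark.executor.instances <- 4      # number of executors

sc <- spark_connect(master = "yarn-client",
                    spark_home = "/usr/hdp/2.4.2.0-258/spark",
                    config = config)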
Concept introduced by Google in 2004.
Implemented in Apache Hadoop.
Two phases, with natural R analogues:
Map – apply()
Reduce – summarise()
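The same idea can be sketched in plain R, where Map() applies a function to each element and Reduce() combines the partial results:

# Square each element (the Map phase)...
squares <- Map(function(x) x^2, 1:4)   # list(1, 4, 9, 16)
# ...then aggregate the partial results (the Reduce phase)
Reduce(`+`, squares)                   # 30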
Map – apply()
Filter – filter(), group_by()
Reduce – count(), summarise()
Collect – collect()
collect() triggers the execution of all the previously declared operations and downloads the resulting data to the $HOME NFS.
Spark = ‘Lazy evaluation’.
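A sketch of what this means in practice, using the municipality table (compostela) introduced just below; the filter threshold is illustrative. The first assignment runs nothing on the cluster:

query <- compostela %>%
  filter(num_habitants > 1000) %>%   # not executed yet
  group_by(district) %>%
  summarise(n = n())

result <- collect(query)             # execution happens here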
Municipality data
compostela %>%
  filter(area_services > area_built * 0.5) %>%
  head(6) %>%
  collect()
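Since evaluation is lazy, the SQL that sparklyr generates for such a pipeline can be inspected before anything runs, e.g. with dplyr's show_query():

compostela %>%
  filter(area_services > area_built * 0.5) %>%
  show_query()   # prints the translated Spark SQL, runs nothing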
compostela %>%
  group_by(district, boulevard) %>%
  summarise(max_num_habitants = max(num_habitants),
            mean_num_estates  = mean(num_estates),
            total_num_houses  = sum(num_houses)) %>%
  head(4) %>%
  collect()
Amazon book reviews
Github: github.com/aurora-mareviv
Sparklyr tutorial at CESGA: /sparklyr_start
Web UI CESGA: hadoop.cesga.es
Carlos Gil Bellosta (datanalytics.com) @gilbellosta
Comunidad de Usuarios de R de Galicia (the Galician R users community).
Another Jupyter notebook for interactive data exploration and modelling with a much bigger dataset.
View it at nbviewer.jupyter.org
spark-submit --deploy-mode client sparklyr_script.R
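A sketch of what a minimal sparklyr_script.R could contain for batch execution; the HDFS path and column names are hypothetical:

# sparklyr_script.R: minimal batch job (illustrative only)
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client",
                    spark_home = "/usr/hdp/2.4.2.0-258/spark")

# Hypothetical input file and grouping column
result <- spark_read_csv(sc, name = "mydata",
                         path = "hdfs:///user/myuser/mydata.csv") %>%
  group_by(group_col) %>%
  summarise(n = n()) %>%
  collect()

write.csv(result, "result.csv", row.names = FALSE)
spark_disconnect(sc)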
git clone https://github.com/aurora-mareviv/sparklyr_test
git clone https://github.com/aurora-mareviv/sparklyr_start