Sparklyr: a Big Data API for R

16 July 2019


Slides: i.gal/mFA0F

Background  

What is sparklyr?

  • Exposes the Spark API (written in Scala) from R.

  • Spark gives access to the Hadoop ecosystem.

  • Simply put: you use dplyr syntax! (See the sketch below.)

  • Created by the RStudio team in 2016.
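
A minimal sketch of that dplyr syntax (the local connection and the mtcars copy are illustrative, not from the original deck):

# Connect to a local Spark instance and query it with dplyr verbs
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a small R data frame into Spark...
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark")

# ...and query it with ordinary dplyr syntax
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  collect()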

What is ‘Big Data’?

  • Reading and writing data at very large scale!

  • Large by:

    • Volume (GB, TB).
    • Variety / complexity.
    • Velocity (streaming).


Big Data and Sparklyr

How is data processed? HDFS

“Hadoop Distributed File System”

  • Different from NTFS, FAT, ext3…

  • Optimized for files larger than 100 MB.

  • Read/write layer written in Java.
    • Parallel writing.
    • Parallel and sequential reading.
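
As a sketch, sparklyr reads from and writes to HDFS directly (sc is an open connection; the paths and table names are hypothetical):

# Read a CSV stored on HDFS into a Spark DataFrame
events_tbl <- spark_read_csv(sc, name = "events",
                             path = "hdfs:///user/me/events.csv")

# Write it back to HDFS in Parquet format
spark_write_parquet(events_tbl, path = "hdfs:///user/me/events_parquet")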

“Sequential file reading… but how?”

RDDs

Resilient Distributed Dataset.

  • Spark’s basic data abstraction over files stored in HDFS.

  • Composed of blocks, duplicated across several nodes.

  • Recoverable in case any node is lost.

  • They form the basis of the dataframes that sparklyr uses.

HDFS Architecture

  • Blocks of 128 MB.

  • Replicated 3 times (by default) across different nodes.

  • The NameNode is the directory and metadata tree:

    • It indexes and locates blocks and files.

How does Spark work?  

How does Spark work?

Compute-centric

Data moves to where the data-processing software runs (e.g., MS Excel).

Data-centric

The Big Data paradigm: data-processing programs move to where the data lives (Spark).

How does Spark work?

# Spark context: local (for 'low memory' tasks only!)
sc <- spark_connect(master = "local", 
                    spark_home = "/usr/hdp/2.4.2.0-258/spark")

# Spark context: yarn (for loading bigger datasets)
sc <- spark_connect(master = "yarn-client", 
                    spark_home = "/usr/hdp/2.4.2.0-258/spark")
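
Once connected, the session can be inspected and closed (both calls are part of sparklyr):

spark_version(sc)    # which Spark version the connection is using
spark_disconnect(sc) # release the cluster resources when done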

How does Spark work?

  • From the CESGA console we can access two different filesystems:

    • $HOME: remote NFS filesystem (UNIX-like).
    • HDFS: parallel, remote filesystem designed for MapReduce applications.

  • Driver: hosts the program that connects to Spark; it translates sparklyr calls into Spark operations.

  • Master: launches the operations (local, client, cluster).

  • YARN manages the resources requested by Spark (see the sketch after this list).

  • Executors carry out the file operations.
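
A sketch of requesting YARN resources through spark_config() (the values are illustrative):

# Illustrative executor settings for a YARN-managed connection
config <- spark_config()
config$spark.executor.memory <- "2G"  # memory per executor
config$spark.executor.instances <- 4  # number of executors

sc <- spark_connect(master = "yarn-client", config = config)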

File operations in Spark  

Map-Reduce

  • Concept introduced by Google in 2004.

  • Implemented in Apache Hadoop.

  • Map: the R analogue is apply().

  • Reduce: the dplyr analogue is summarise().

Relevant Operations

  • Map – apply()

  • Filter – filter(), group_by()

  • Reduce – count(), summarise()

  • Collect – collect()

collect() triggers the execution of all the previously declared operations and downloads the resulting data to $HOME (NFS).

Spark = ‘lazy evaluation’.
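
A minimal sketch of that laziness (events_tbl, value and category are hypothetical names, reusing the HDFS example above):

# Declaring the pipeline computes nothing yet...
pending <- events_tbl %>%
  filter(value > 100) %>%
  group_by(category) %>%
  summarise(n = n())

# ...Spark only runs the job once collect() is called
result <- collect(pending)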

Examples

Municipality data
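
(compostela below is assumed to be a Spark DataFrame already registered through sparklyr; a hypothetical setup would be:)

# Hypothetical setup: load the municipality data from HDFS into Spark
compostela <- spark_read_csv(sc, name = "compostela",
                             path = "hdfs:///data/compostela.csv")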

compostela %>%
  filter(area_services > area_built * 0.5) %>%
  head(6) %>%
  collect()

compostela %>%
  group_by(district, boulevard) %>%
  summarise(max_num_habitants = max(num_habitants),
            mean_num_estates  = mean(num_estates),
            total_num_houses  = sum(num_houses)) %>%
  head(4) %>%
  collect()

Examples

Book reviews in Amazon

Machine-learning and Regression models
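
A hedged sketch of sparklyr’s machine-learning interface (reusing the illustrative mtcars_tbl from the first example; not from the original deck):

# Fit a linear regression inside Spark with sparklyr's ML API
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)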

And much more

AUTHORS


ACKNOWLEDGEMENTS

Stickers!

Sparklyr Tutorial  

Sparklyr tutorial

Jupyter Notebooks

Sparklyr tutorial

R script

spark-submit --deploy-mode client sparklyr_script.R
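
A sketch of what such a batch script might contain (the master and dataset are illustrative, not the repo’s actual script):

#!/usr/bin/env Rscript
# sparklyr_script.R - minimal batch job sketch
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")

result <- copy_to(sc, mtcars, "mtcars_spark") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  collect()

print(result)
spark_disconnect(sc)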


Clone these repos

git clone https://github.com/aurora-mareviv/sparklyr_test
git clone https://github.com/aurora-mareviv/sparklyr_start
