Sparklyr: a Big Data API for R

16 July 2019


Slides: i.gal/mFA0F

Background  

What is sparklyr?

  • Exposes the Spark API (written in Scala) from R.

  • Spark gives access to the Hadoop ecosystem.

  • Simply put: you use dplyr syntax! (See the sketch below.)

  • Created by the RStudio team in 2016.
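
A minimal sketch of that dplyr syntax (the local connection and the mtcars copy are illustrative, not from the original deck):

# Connect to a local Spark instance and query it with dplyr verbs
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copy a small R data frame into Spark...
mtcars_tbl <- copy_to(sc, mtcars, "mtcars_spark")

# ...and query it with ordinary dplyr syntax
mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  collect()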

What is ‘Big Data’?

  • Reading and writing data at very large scale!

  • Large by:

    • Volume (GB, TB).
    • Variety / complexity.
    • Velocity (streaming).


Big Data and Sparklyr

How is data processed? HDFS

“Hadoop Distributed File System”

  • Different from NTFS, FAT, ext3…

  • Optimized for files larger than 100 MB.

  • Read/write layer written in Java.
    • Parallel writing.
    • Parallel and sequential reading.
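
As a sketch, sparklyr reads from and writes to HDFS directly (sc is an open connection; the paths and table names are hypothetical):

# Read a CSV stored on HDFS into a Spark DataFrame
events_tbl <- spark_read_csv(sc, name = "events",
                             path = "hdfs:///user/me/events.csv")

# Write it back to HDFS in Parquet format
spark_write_parquet(events_tbl, path = "hdfs:///user/me/events_parquet")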

“Sequential file reading… but how?”

RDDs

Resilient Distributed Dataset.

  • Spark’s basic data abstraction over files stored in HDFS.

  • Composed of blocks, duplicated across several nodes.

  • Recoverable in case any node is lost.

  • They form the basis of the dataframes that sparklyr uses.

HDFS Architecture

  • Blocks of 128 MB.

  • Replicated 3 times (by default) across different nodes.

  • The NameNode is the directory and metadata tree:

    • It indexes and locates blocks and files.

How does Spark work?  

How does Spark work?

Compute-centric

Data moves to where the data-processing software runs (e.g., MS Excel).

Data-centric

The Big Data paradigm: data-processing programs move to where the data lives (Spark).

How does Spark work?

# Spark context: local (for 'low memory' tasks only!)
sc <- spark_connect(master = "local", 
                    spark_home = "/usr/hdp/2.4.2.0-258/spark")

# Spark context: yarn (for loading bigger datasets)
sc <- spark_connect(master = "yarn-client", 
                    spark_home = "/usr/hdp/2.4.2.0-258/spark")
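
Once connected, the session can be inspected and closed (both calls are part of sparklyr):

spark_version(sc)    # which Spark version the connection is using
spark_disconnect(sc) # release the cluster resources when done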

How does Spark work?

  • From the CESGA console we can access two different filesystems:

    • $HOME: remote NFS filesystem (UNIX-like).
    • HDFS: parallel, remote filesystem designed for MapReduce applications.

  • Driver: hosts the program that connects to Spark; it translates sparklyr calls into Spark operations.

  • Master: launches the operations (local, client, cluster).

  • YARN manages the resources requested by Spark (see the sketch after this list).

  • Executors carry out the file operations.
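
A sketch of requesting YARN resources through spark_config() (the values are illustrative):

# Illustrative executor settings for a YARN-managed connection
config <- spark_config()
config$spark.executor.memory <- "2G"  # memory per executor
config$spark.executor.instances <- 4  # number of executors

sc <- spark_connect(master = "yarn-client", config = config)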

File operations in Spark  

Map-Reduce

  • Concept introduced by Google in 2004.

  • Implemented in Apache Hadoop.

  • Map: the R analogue is apply().

  • Reduce: the dplyr analogue is summarise().

Relevant Operations

  • Map – apply()

  • Filter – filter(), group_by()

  • Reduce – count(), summarise()

  • Collect – collect()

collect() triggers the execution of all the previously declared operations and downloads the resulting data to $HOME (NFS).

Spark = ‘lazy evaluation’.
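
A minimal sketch of that laziness (events_tbl, value and category are hypothetical names, reusing the HDFS example above):

# Declaring the pipeline computes nothing yet...
pending <- events_tbl %>%
  filter(value > 100) %>%
  group_by(category) %>%
  summarise(n = n())

# ...Spark only runs the job once collect() is called
result <- collect(pending)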

Examples

Municipality data
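
(compostela below is assumed to be a Spark DataFrame already registered through sparklyr; a hypothetical setup would be:)

# Hypothetical setup: load the municipality data from HDFS into Spark
compostela <- spark_read_csv(sc, name = "compostela",
                             path = "hdfs:///data/compostela.csv")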

compostela %>%
  filter(area_services > area_built * 0.5) %>%
  head(6) %>%
  collect()

compostela %>%
  group_by(district, boulevard) %>%
  summarise(max_num_habitants = max(num_habitants),
            mean_num_estates  = mean(num_estates),
            total_num_houses  = sum(num_houses)) %>%
  head(4) %>%
  collect()

Examples

Book reviews in Amazon

Machine-learning and Regression models
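
A hedged sketch of sparklyr’s machine-learning interface (reusing the illustrative mtcars_tbl from the first example; not from the original deck):

# Fit a linear regression inside Spark with sparklyr's ML API
fit <- ml_linear_regression(mtcars_tbl, mpg ~ wt + cyl)
summary(fit)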

And much more

AUTHORS


ACKNOWLEDGEMENTS

Stickers!

Sparklyr Tutorial  

Sparklyr tutorial

Jupyter Notebooks

Sparklyr tutorial

R script

spark-submit --deploy-mode client sparklyr_script.R
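
A sketch of what such a batch script might contain (the master and dataset are illustrative, not the repo’s actual script):

#!/usr/bin/env Rscript
# sparklyr_script.R - minimal batch job sketch
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")

result <- copy_to(sc, mtcars, "mtcars_spark") %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  collect()

print(result)
spark_disconnect(sc)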


Clone these repos

git clone https://github.com/aurora-mareviv/sparklyr_test
git clone https://github.com/aurora-mareviv/sparklyr_start
