Introduction

There are several ways to mine tables and other content from a PDF using R. After a lot of trial and error, here’s how I managed to extract the global results of a massive, international, yearly examination: the EDAIC.

This is my first “pdf mining” use case with R, and a fairly simple one. More complex and very fine examples can be found elsewhere, using both the pdftools and tabulizer packages.

As can be seen from the original pdf, exam results are anonymous. They consist of a numeric, 6-digit code and a binary result: “FAIL / PASS”. I was particularly interested in seeing how many candidates passed the exam, as an indirect measure of how “hard” it can be.

Mining the table

In this case I preferred pdftools as it allowed me to extract the whole content from the pdf:

install.packages("pdftools")

library(pdftools) 
txt <- pdf_text("EDAIC.pdf") 
txt[1] 
class(txt[1])

   [1] "EDAIC Part I 2017                                                  Overall Results\n                                         Candidate N°       Result\n                                            107131            FAIL\n                                            119233            PASS\n                                            123744            FAIL\n                                            127988            FAIL\n                                            133842            PASS\n                                            135692            PASS\n                                            140341            FAIL\n                                            142595            FAIL\n                                            151479            PASS\n                                            151632            PASS\n                                            152787            PASS\n                                            157691            PASS\n                                            158867            PASS\n                                            160211            PASS\n                                            161970            FAIL\n                                            162536            PASS\n                                            163331            PASS\n                                            164442            FAIL\n                                            164835            PASS\n                                            165734            PASS\n                                            165900            PASS\n                                            166469            PASS\n                                            167241            FAIL\n                                            167740            PASS\n                                            168151            FAIL\n                                            168331            PASS\n                                            168371            FAIL\n    
                                        168711            FAIL\n                                            169786            PASS\n                                            170721            FAIL\n                                            170734            FAIL\n                                            170754            PASS\n                                            170980            PASS\n                                            171894            PASS\n                                            171911            PASS\n                                            172047            FAIL\n                                            172128            PASS\n                                            172255            FAIL\n                                            172310            PASS\n                                            172706            PASS\n                                            173136            FAIL\n                                            173229            FAIL\n                                            174336            PASS\n                                            174360            PASS\n                                            175177            FAIL\n                                            175180            FAIL\n                                            175184            FAIL\nYour candidate number is indicated on your admission document        Page 1 of 52\n"
   [1] "character"

These commands return a lengthy blob of text, one character string per page. Fortunately, the \n symbols mark where each line of the original document ends.

We will use these to split the blob into something more approachable, using tidyversal methods…

Split the blob.
Transform the resulting list into a character vector with unlist.
Trim leading white spaces with stringr::str_trim.

library(tidyverse) 
library(stringr) 
tx2 <- strsplit(txt, "\n") %>% # split each page on newline characters
  unlist() %>% 
  str_trim(side = "both") # trim white spaces
tx2[1:10]
    [1] "EDAIC Part I 2017                                                  Overall Results"
    [2] "Candidate N°       Result"                                                         
    [3] "107131            FAIL"                                                            
    [4] "119233            PASS"                                                            
    [5] "123744            FAIL"                                                            
    [6] "127988            FAIL"                                                            
    [7] "133842            PASS"                                                            
    [8] "135692            PASS"                                                            
    [9] "140341            FAIL"                                                            
   [10] "142595            FAIL"
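
To try the splitting step without the PDF at hand, here is a minimal sketch on a made-up, two-page input. The `pages` vector is hypothetical and only mimics the shape of what `pdf_text()` returns; `trimws()` is used as the base R counterpart of `stringr::str_trim(side = "both")` so that the example needs no extra packages.

```r
# Hypothetical input: one character string per page, "\n" marks line breaks.
pages <- c(
  "EDAIC Part I 2017      Overall Results\n      Candidate No.       Result\n      107131            FAIL\n",
  "      119233            PASS\nPage 2 of 2\n"
)

# strsplit() returns a list with one character vector per page.
lines_per_page <- strsplit(pages, "\n")

# unlist() flattens the list; trimws() drops leading and trailing spaces.
all_lines <- trimws(unlist(lines_per_page))

print(all_lines)
```

Note that the spaces between the code and the result are still there after trimming: only the padding at both ends of each line is removed.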

Remove the very first row.
Transform into a tibble.

tx3 <- tx2[-1] %>% 
  tibble() # data_frame() is deprecated; the single column is still named "."
tx3
   # A tibble: 2,579 x 1
      .                        
      <chr>                    
    1 Candidate N°       Result
    2 107131            FAIL   
    3 119233            PASS   
    4 123744            FAIL   
    5 127988            FAIL   
    6 133842            PASS   
    7 135692            PASS   
    8 140341            FAIL   
    9 142595            FAIL   
   10 151479            PASS   
   # … with 2,569 more rows

Use tidyr::separate to split each row into two columns.
Remove all spaces.

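tx4 <- separate(tx3, ., c("key", "value"), " ", extra = "merge") %>%  # split at the first space only
  mutate(key = gsub('\\s+', '', key)) %>%   # remove any whitespace left in the key
  mutate(value = gsub('\\s+', '', value))   # remove the padding in front of the result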
# tx4[c(1:6, 48:52),]
tx4
   # A tibble: 2,579 x 2
      key       value   
      <chr>     <chr>   
    1 Candidate N°Result
    2 107131    FAIL    
    3 119233    PASS    
    4 123744    FAIL    
    5 127988    FAIL    
    6 133842    PASS    
    7 135692    PASS    
    8 140341    FAIL    
    9 142595    FAIL    
   10 151479    PASS    
   # … with 2,569 more rows
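
The idea behind this step, splitting each row at its first space and then stripping the remaining whitespace, can be sketched in base R on a few made-up rows. The `rows` vector is hypothetical, and this is only an illustration of the technique, not a replacement for `tidyr::separate`. It also shows why header and footer lines survive this step.

```r
# Hypothetical rows: a header, a result row, and a page footer.
rows <- c("Candidate No.       Result",
          "107131            FAIL",
          "Page 1 of 52")

# Position of the first space in each row.
first_space <- regexpr(" ", rows, fixed = TRUE)

# Everything before the first space is the key ...
key <- substr(rows, 1, first_space - 1)

# ... and everything after it is the value, with all whitespace removed.
value <- gsub("\\s+", "", substr(rows, first_space + 1, nchar(rows)))

print(data.frame(key, value))
```

Only the middle row yields a usable code/result pair; the other two produce keys such as "Candidate" and "Page", which still have to be filtered out.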

Remove rows that do not represent table elements.

tx5 <- tx4[grep('^[0-9]', tx4[[1]]),] 
tx5
   # A tibble: 2,424 x 2
      key    value
      <chr>  <chr>
    1 107131 FAIL 
    2 119233 PASS 
    3 123744 FAIL 
    4 127988 FAIL 
    5 133842 PASS 
    6 135692 PASS 
    7 140341 FAIL 
    8 142595 FAIL 
    9 151479 PASS 
   10 151632 PASS 
   # … with 2,414 more rows
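
The filter above keeps every row whose key starts with a digit. Since the codes are 6-digit numbers and the result is either PASS or FAIL, a stricter check is possible. The sketch below compares both filters on a made-up table; the `toy` data frame and its odd rows are hypothetical.

```r
# Hypothetical table with a header row, two results and two stray rows.
toy <- data.frame(
  key   = c("Candidate", "107131", "119233", "Your", "52"),
  value = c("No.Result", "FAIL", "PASS", "candidatenumber", "pages"),
  stringsAsFactors = FALSE
)

# Loose filter, as in the post: the key starts with a digit.
loose <- toy[grep("^[0-9]", toy$key), ]

# Stricter filter: exactly six digits and a valid result.
strict <- toy[grepl("^[0-9]{6}$", toy$key) &
                toy$value %in% c("PASS", "FAIL"), ]

print(loose)
print(strict)
```

On real data, comparing `nrow()` of both results is a cheap way to spot rows that passed the loose filter by accident.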

Extracting the results

We already have the table! Now it’s time to get the summary:

library(knitr)
tx5 %>%
  group_by(value) %>%
  summarise(count = n()) %>%
  mutate(percent = paste(round(count / sum(count) * 100, 1), "%")) %>% 
  kable()

value	count	percent
FAIL	1017	42 %
PASS	1407	58 %
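
As a quick check of the arithmetic, the percentages can be recomputed in base R from the two counts reported in the table. The variable names here are only illustrative.

```r
# Counts taken from the summary table above.
counts <- c(FAIL = 1017, PASS = 1407)

# Share of each result, in percent, rounded to one decimal.
percent <- round(100 * counts / sum(counts), 1)

summary_tbl <- data.frame(
  value   = names(counts),
  count   = as.integer(counts),
  percent = paste0(percent, "%")
)

print(sum(counts))   # total number of candidates
print(summary_tbl)
```

The total of 2424 matches the number of rows in tx5, and `paste0` avoids the space that `paste` puts before the percent sign.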

From these results we see that the EDAIC-Part 1 exam doesn’t have a particularly high pass rate. It is currently taken by medical specialists, and its difficulty lies in the very broad list of subjects covered, ranging from applied physics and the whole of human physiology to pharmacology, clinical medicine and the latest guidelines.

The EDAIC-Part 2 is an oral examination which may take place in English, French, German, Spanish or Scandinavian languages. The examinee is evaluated by 4 pairs of examiners in 4 different sessions of 1-hour duration. Part 2 is currently considered the tougher part by examinees.

Despite being a hard test to pass, and despite the exam fee, it’s becoming increasingly popular among anesthesiologists and critical care specialists who wish to stay up to date with current medical knowledge and practice.

Taming exam results in pdf with pdftools

aurora-mareviv
