Introduction  

There are several ways to mine tables and other content from a PDF using R. After a lot of trial and error, here’s how I managed to extract the overall results of the EDAIC, a large international examination held every year.

This is my first use case of “pdf mining” with R, and a fairly simple one. More complex and very fine examples can be found elsewhere, using both the pdftools and tabulizer packages.

As can be seen from the original pdf, exam results are anonymous. They consist of a numeric, 6-digit code and a binary result: “FAIL / PASS”. I was particularly interested in seeing how many candidates passed the exam, as an indirect measure of how “hard” it can be.

Mining the table  

In this case I preferred pdftools as it allowed me to extract the whole content from the pdf:

install.packages("pdftools") # only needed once
library(pdftools) 
txt <- pdf_text("EDAIC.pdf") # returns one character string per page
txt[1] # text of the first page
class(txt[1]) 
   [1] "EDAIC Part I 2017                                                  Overall Results\n                                         Candidate N°       Result\n                                            107131            FAIL\n                                            119233            PASS\n                                            123744            FAIL\n                                            127988            FAIL\n                                            133842            PASS\n                                            135692            PASS\n                                            140341            FAIL\n                                            142595            FAIL\n                                            151479            PASS\n                                            151632            PASS\n                                            152787            PASS\n                                            157691            PASS\n                                            158867            PASS\n                                            160211            PASS\n                                            161970            FAIL\n                                            162536            PASS\n                                            163331            PASS\n                                            164442            FAIL\n                                            164835            PASS\n                                            165734            PASS\n                                            165900            PASS\n                                            166469            PASS\n                                            167241            FAIL\n                                            167740            PASS\n                                            168151            FAIL\n                                            168331            PASS\n                                            168371            FAIL\n    
                                        168711            FAIL\n                                            169786            PASS\n                                            170721            FAIL\n                                            170734            FAIL\n                                            170754            PASS\n                                            170980            PASS\n                                            171894            PASS\n                                            171911            PASS\n                                            172047            FAIL\n                                            172128            PASS\n                                            172255            FAIL\n                                            172310            PASS\n                                            172706            PASS\n                                            173136            FAIL\n                                            173229            FAIL\n                                            174336            PASS\n                                            174360            PASS\n                                            175177            FAIL\n                                            175180            FAIL\n                                            175184            FAIL\nYour candidate number is indicated on your admission document        Page 1 of 52\n"
   [1] "character"

These commands return a lengthy blob of text. Fortunately, it contains \n characters that mark the line breaks of the original document.

We will use these to split the blob into something more approachable, using tidyverse methods:

  • Split the blob.
  • Transform the resulting list into a character vector with unlist.
  • Trim leading and trailing white space with stringr::str_trim.
library(tidyverse) 
library(stringr) # already attached by tidyverse; kept for clarity
tx2 <- strsplit(txt, "\n") %>% # split at newline characters
  unlist() %>% 
  str_trim(side = "both") # trim leading and trailing white space
tx2[1:10]
    [1] "EDAIC Part I 2017                                                  Overall Results"
    [2] "Candidate N°       Result"                                                         
    [3] "107131            FAIL"                                                            
    [4] "119233            PASS"                                                            
    [5] "123744            FAIL"                                                            
    [6] "127988            FAIL"                                                            
    [7] "133842            PASS"                                                            
    [8] "135692            PASS"                                                            
    [9] "140341            FAIL"                                                            
   [10] "142595            FAIL"
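A note on the unlist step: pdf_text returns one string per page, so strsplit gives back a list with one character vector per page, and unlist is what merges them into a single vector of lines. The sketch below reproduces this in base R (trimws instead of str_trim) on two made-up "pages"; the page contents and variable names are illustrative assumptions, not data from the pdf.

```r
# Two made-up "pages", mimicking what pdf_text() returns (one string per page).
pages <- c(
  "Title\n   107131    FAIL\n   119233    PASS\n",
  "   123744    FAIL\nPage 2 of 2\n"
)

by_page <- strsplit(pages, "\n")       # a list: one character vector per page
length(by_page)                        # number of pages, not number of lines

all_lines <- trimws(unlist(by_page))   # flatten to one vector, trim both ends
all_lines
```

Without unlist, the trimming step would be applied to a list of pages rather than to the individual lines.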
  • Remove the very first row.
  • Transform into a tibble.
tx3 <- tx2[-1] %>% 
  tibble() # data_frame() is deprecated; the piped column is still named "."
tx3
   # A tibble: 2,579 x 1
      .                        
      <chr>                    
    1 Candidate N°       Result
    2 107131            FAIL   
    3 119233            PASS   
    4 123744            FAIL   
    5 127988            FAIL   
    6 133842            PASS   
    7 135692            PASS   
    8 140341            FAIL   
    9 142595            FAIL   
   10 151479            PASS   
   # … with 2,569 more rows
  • Use tidyr::separate to split each row into two columns at the first space.
  • Remove the remaining white space from both columns.
tx4 <- separate(tx3, ., c("key", "value"), " ", extra = "merge") %>%  
  mutate(key = gsub('\\s+', '', key)) %>%
  mutate(value = gsub('\\s+', '', value)) 
# tx4[c(1:6, 48:52),]
tx4
   # A tibble: 2,579 x 2
      key       value   
      <chr>     <chr>   
    1 Candidate N°Result
    2 107131    FAIL    
    3 119233    PASS    
    4 123744    FAIL    
    5 127988    FAIL    
    6 133842    PASS    
    7 135692    PASS    
    8 140341    FAIL    
    9 142595    FAIL    
   10 151479    PASS    
   # … with 2,569 more rows
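With sep = " " and extra = "merge", separate only splits at the first space and keeps everything after it in the second column, which is why the padding has to be removed afterwards. As a rough illustration, the base R sketch below mimics that behaviour on two made-up rows; it is an approximation for this particular layout, not how tidyr works internally.

```r
# Made-up rows in the same shape as the trimmed lines (one footer, one result).
rows <- c("Page 1 of 52", "107131            FAIL")

first_space <- regexpr(" ", rows, fixed = TRUE)        # position of the first space
key   <- substr(rows, 1, first_space - 1)              # text before the first space
value <- substr(rows, first_space + 1, nchar(rows))    # everything after it, kept whole
value <- gsub("\\s+", "", value)                       # drop the remaining padding

data.frame(key, value)
```

Rows that are not part of the table still end up with a key and a value here, so they have to be filtered out in a later step.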
  • Remove rows that do not represent table elements.
tx5 <- tx4[grep('^[0-9]', tx4[[1]]),] 
tx5
   # A tibble: 2,424 x 2
      key    value
      <chr>  <chr>
    1 107131 FAIL 
    2 119233 PASS 
    3 123744 FAIL 
    4 127988 FAIL 
    5 133842 PASS 
    6 135692 PASS 
    7 140341 FAIL 
    8 142595 FAIL 
    9 151479 PASS 
   10 151632 PASS 
   # … with 2,414 more rows
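As an alternative sketch, the split and the filter can be combined with a single regular expression that only accepts lines made of a 6-digit code followed by PASS or FAIL. This assumes every genuine row has exactly that shape; the input lines, the pattern and the variable names below are illustrative and use base R only.

```r
# Made-up lines in the same shape as tx2 (header, table rows, page footer).
x <- c("Candidate No       Result",
       "107131            FAIL",
       "119233            PASS",
       "Your candidate number is indicated on your admission document        Page 1 of 52")

pattern <- "^([0-9]{6})\\s+(PASS|FAIL)$"   # assumes 6-digit codes and a PASS/FAIL result
is_row  <- grepl(pattern, x)               # TRUE only for genuine table rows

results <- data.frame(
  key   = sub(pattern, "\\1", x[is_row]),  # first capture group: candidate code
  value = sub(pattern, "\\2", x[is_row]),  # second capture group: result
  stringsAsFactors = FALSE
)

# Sanity checks: every code has six digits and every result is PASS or FAIL.
stopifnot(all(nchar(results$key) == 6),
          all(results$value %in% c("PASS", "FAIL")))
results
```

A stricter pattern like this one drops anything unexpected instead of passing it through, so it is worth comparing sum(is_row) with the number of rows obtained the other way.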

Extracting the results  

We already have the table! Now it’s time to summarise it:

library(knitr)
tx5 %>%
  group_by(value) %>%
  summarise(count = n()) %>%
  mutate(percent = paste(round(count / sum(count) * 100, 1), "%")) %>% 
  kable()
value   count   percent
FAIL     1017   42 %
PASS     1407   58 %
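The percentages can be checked by hand from the two counts. The short sketch below redoes the arithmetic in base R; the variable names are mine, and the sprintf line is just one possible way to keep the decimal visible.

```r
counts <- c(FAIL = 1017, PASS = 1407)    # counts taken from the summary table
total  <- sum(counts)                    # 2424, the number of rows in tx5

pct <- round(100 * counts / total, 1)    # percentage per group, one decimal
pct

pct_label <- sprintf("%.1f %%", 100 * counts / total)  # formatted with one decimal
pct_label
```

Both values happen to round to a whole number at one decimal place, which is why the table shows 42 % and 58 % without decimals.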

From these results we see that the EDAIC-Part 1 exam doesn’t have a particularly high pass rate. It is taken by medical specialists, but its difficulty lies in the very broad list of subjects covered, ranging from applied physics and the whole of human physiology to pharmacology, clinical medicine and the latest guidelines.

The EDAIC-Part 2 is an oral examination which may take place in English, French, German, Spanish or Scandinavian languages. The examinee is evaluated by 4 pairs of examiners in 4 different sessions of 1-hour duration. Examinees currently consider Part 2 the tougher of the two.

Despite being a hard test to pass, and despite the exam fee, it’s becoming increasingly popular among anesthesiologists and critical care specialists who wish to stay up to date with current medical knowledge and practice.