[Solved] Improving my R code – advice wanted on better way of coding? [closed]


You could try using purrr instead of the loop as follows:

library(rvest)
library(purrr)
library(tibble)

URLs %>% 
  map(read_html) %>%                                # fetch and parse each page
  map(html_nodes, "#yw1 .spielprofil_tooltip") %>%  # select the player links
  map_df(~tibble(Player = html_text(.),             # link text: player name
                 P_URL = html_attr(., "href")))     # link target: profile URL
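
Here URLs is the character vector of pages from your original loop; purely as a hypothetical illustration (these are not your actual pages):

URLs <- c(
  "https://www.example.com/team-a/squad",  # hypothetical, replace with your real pages
  "https://www.example.com/team-b/squad"
)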

Timing:

   user  system elapsed 
  2.939   2.746   5.699 

The step that takes the most time is the crawling via map(read_html).
To parallelize it you can use, e.g., the parallel backend of plyr as follows:

library(httr)
doMC::registerDoMC(cores = 3) # number of cores depends on your system
plyr::llply(URLs, GET, .parallel = TRUE) %>%        # fetch the pages in parallel
  map(read_html) %>%                                # parse the responses sequentially
  map(html_nodes, "#yw1 .spielprofil_tooltip") %>%  # select the player links
  map_df(~tibble(Player = html_text(.), P_URL = html_attr(., "href")))

Somehow my RStudio session crashed when using plyr::llply(URLs, read_html, .parallel = TRUE), which is why I fetch with the underlying httr::GET and parse the results in the next step via map(read_html) (read_html also accepts httr response objects). So the scraping is done in parallel, but the parsing of the responses is done sequentially.
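
If plyr + doMC is not an option, base R's parallel::mclapply could handle the same fetch step; a minimal sketch, assuming a Unix-alike system (mclapply relies on forking and runs sequentially on Windows):

library(parallel)
library(httr)
# fetch all pages in parallel via forked workers, then parse as above
responses <- mclapply(URLs, GET, mc.cores = 3)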

Timing:

   user  system elapsed 
  2.505   0.337   2.940 

In both cases the result looks as follows:

# A tibble: 1,036 × 2
          Player                                P_URL
           <chr>                                <chr>
1   David de Gea   /david-de-gea/profil/spieler/59377
2      D. de Gea   /david-de-gea/profil/spieler/59377
3  Sergio Romero  /sergio-romero/profil/spieler/30690
4      S. Romero  /sergio-romero/profil/spieler/30690
5  Sam Johnstone /sam-johnstone/profil/spieler/110864
6   S. Johnstone /sam-johnstone/profil/spieler/110864
7    Daley Blind    /daley-blind/profil/spieler/12282
8       D. Blind    /daley-blind/profil/spieler/12282
9    Eric Bailly   /eric-bailly/profil/spieler/286384
10     E. Bailly   /eric-bailly/profil/spieler/286384
# ... with 1,026 more rows
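
Each player shows up twice because the selector matches both the full-name link and the abbreviated one; if you only want one row per player, a possible follow-up, assuming dplyr and calling the pipeline's output result (a hypothetical name):

library(dplyr)
result %>% distinct(P_URL, .keep_all = TRUE)  # keep the first row per profile URL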
