You could try using purrr
instead of the loop as follows:
require(rvest)
require(purrr)
require(tibble)

URLs %>%
  map(read_html) %>%                                # crawl each page
  map(html_nodes, "#yw1 .spielprofil_tooltip") %>%  # select the player links
  map_df(~tibble(Player = html_text(.), P_URL = html_attr(., "href")))
Timing:
user system elapsed
2.939 2.746 5.699
The step that takes the most time is the crawling via map(read_html).
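If you want to confirm that on your own machine, a purely illustrative check is to time the crawl step on its own and compare it with the rest of the pipeline (pages is just a temporary name here, and URLs is assumed to be your vector of page URLs):

# time only the crawling step
system.time(
  pages <- map(URLs, read_html)
)

# time the extraction on the already-crawled pages
system.time(
  pages %>%
    map(html_nodes, "#yw1 .spielprofil_tooltip") %>%
    map_df(~tibble(Player = html_text(.), P_URL = html_attr(., "href")))
)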
To parallelize that, you can use e.g. the parallel backend of plyr as follows:
require(httr)
doMC::registerDoMC(cores = 3)  # number of cores depending on your system

plyr::llply(URLs, GET, .parallel = TRUE) %>%        # fetch the pages in parallel
  map(read_html) %>%                                # parse the responses sequentially
  map(html_nodes, "#yw1 .spielprofil_tooltip") %>%
  map_df(~tibble(Player = html_text(.), P_URL = html_attr(., "href")))
Somehow my RStudio crashed when using plyr::llply(URLs, read_html, .parallel = TRUE), which is why I use the underlying httr::GET and parse the responses in the next step via map(read_html). So the scraping is done in parallel, but the parsing of the responses is done sequentially.
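If you want the parsing to run in parallel as well, one possible workaround (just a sketch, not tested here; scrape_one is only an illustrative helper name) is to do the fetch, parse and extraction entirely inside each worker, so that only a plain tibble is returned rather than the xml2 document objects (external pointers), which may be what caused the crash:

require(httr)
require(rvest)
require(tibble)
doMC::registerDoMC(cores = 3)

# hypothetical helper: fetch, parse and extract within one worker
scrape_one <- function(url) {
  page  <- read_html(GET(url))
  nodes <- html_nodes(page, "#yw1 .spielprofil_tooltip")
  tibble(Player = html_text(nodes), P_URL = html_attr(nodes, "href"))
}

plyr::ldply(URLs, scrape_one, .parallel = TRUE) %>% as_tibble()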
Timing:
user system elapsed
2.505 0.337 2.940
In both cases the result looks as follows:
# A tibble: 1,036 × 2
Player P_URL
<chr> <chr>
1 David de Gea /david-de-gea/profil/spieler/59377
2 D. de Gea /david-de-gea/profil/spieler/59377
3 Sergio Romero /sergio-romero/profil/spieler/30690
4 S. Romero /sergio-romero/profil/spieler/30690
5 Sam Johnstone /sam-johnstone/profil/spieler/110864
6 S. Johnstone /sam-johnstone/profil/spieler/110864
7 Daley Blind /daley-blind/profil/spieler/12282
8 D. Blind /daley-blind/profil/spieler/12282
9 Eric Bailly /eric-bailly/profil/spieler/286384
10 E. Bailly /eric-bailly/profil/spieler/286384
# ... with 1,026 more rows