[Solved] How to scrape multiple result having same tags and class

You need to parse your data from the script tag rather than the spans and divs. Try this: import requests from bs4 import BeautifulSoup import re import pandas as pd from pandas import json_normalize import json def get_page(url): response = requests.get(url) if not response.ok: print('server responded:', response.status_code) else: soup = BeautifulSoup(response.text, 'lxml') return soup def … Read more
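The script-tag approach the answer describes can be sketched offline: locate the script whose body contains the data, cut out the JSON literal with a regex, and hand it to the json module. The page markup and the `__DATA__` variable name below are made up for illustration.

```python
import json
import re
from bs4 import BeautifulSoup

# Hypothetical page whose data lives in a <script> tag rather than the DOM.
html = """
<html><body>
<script>window.__DATA__ = {"results": [{"name": "Item A", "price": 9.99}]};</script>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Find the script tag whose text mentions the variable we want.
script = soup.find("script", string=re.compile(r"__DATA__"))
# Strip the JS assignment so only the JSON object literal remains.
raw = re.search(r"__DATA__\s*=\s*(\{.*\})", script.string, re.DOTALL).group(1)
data = json.loads(raw)
print(data["results"][0]["name"])  # Item A
```

Once decoded, the dictionary can go straight into json_normalize, as the full answer suggests.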

[Solved] Scraping data off site using 4 urls for one day using R

You can turn all the tables into a wide data frame with list operations: library(rvest) library(magrittr) library(dplyr) date <- 20130701 rng <- c(1:4) my_tabs <- lapply(rng, function(i) { url <- sprintf("http://apims.doe.gov.my/apims/hourly%d.php?date=%s", i, date) pg <- html(url) pg %>% html_nodes("table") %>% extract2(1) %>% html_table(header=TRUE) }) glimpse(plyr::join_all(my_tabs, by=colnames(my_tabs[[1]][1:2]))) ## Observations: 52 ## Variables: ## $ NEGERI / … Read more
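The same collect-then-join pattern translates to pandas if you are working in Python rather than R: scrape each page into its own frame, then merge them all on the shared key columns. The column names here are stand-ins, not the site's real headers.

```python
import pandas as pd

# Hypothetical per-page tables that share the first two key columns,
# mirroring the R answer's join_all over the four hourly URLs.
tables = [
    pd.DataFrame({"NEGERI": ["Johor"], "KAWASAN": ["A"], "h1": [50]}),
    pd.DataFrame({"NEGERI": ["Johor"], "KAWASAN": ["A"], "h2": [55]}),
]

# Join every table on the key columns of the first one.
keys = list(tables[0].columns[:2])
wide = tables[0]
for t in tables[1:]:
    wide = wide.merge(t, on=keys)
print(wide.shape)  # (1, 4)
```

functools.reduce with pd.merge does the same fold in one expression.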

[Solved] How to get data from a combobox using Beautifulsoup and Python?

From what I can see of the HTML, there is no span with id="sexo-button", so BeautifulSoup(login_request.text, 'lxml').find("span", id="sexo-button") would have returned None, which is why you got the error from get_text. As for your second attempt, I don't think bs4 Tags have a value property, which is why you'd be getting None that time. … Read more
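The two pitfalls the answer names — a missed find() returning None, and reading HTML attributes as if they were object properties — can be shown on a small snippet; the markup and ids below are illustrative.

```python
from bs4 import BeautifulSoup

html = '<span id="sexo-button">M</span><input name="sexo" value="F">'
soup = BeautifulSoup(html, "html.parser")

# Tags expose HTML attributes via dict-style access, not a .value property.
inp = soup.find("input", {"name": "sexo"})
print(inp.get("value"))  # F
print(inp["value"])      # F

# find() returns None on a miss; guard before calling get_text().
span = soup.find("span", id="missing-id")
print(span.get_text() if span else "not found")  # not found
```

Tag.get() is the safer of the two accessors, since it returns None instead of raising KeyError when the attribute is absent.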

[Solved] How to get a link with web scraping

In the future, provide some code to show what you have attempted. I have expanded on Fabix's answer. The following code gets the YouTube link, song name, and artist for all 20 pages on the source website. from bs4 import BeautifulSoup import requests master_url = "https://www.last.fm/tag/rock/tracks?page={}" headers = { "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 5_1 like … Read more
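The paging idea — format the page number into a URL template, then parse each page — can be sketched without hitting the network. The td class names below are a stand-in for last.fm's real chart markup, which may differ.

```python
from bs4 import BeautifulSoup

# Build the 20 page URLs from the template used in the answer.
master_url = "https://www.last.fm/tag/rock/tracks?page={}"
urls = [master_url.format(page) for page in range(1, 21)]

# Offline stand-in for one chart row; real class names may vary.
sample_html = """
<td class="chartlist-play"><a href="https://www.youtube.com/watch?v=abc123"></a></td>
<td class="chartlist-name"><a>Song Title</a></td>
"""
soup = BeautifulSoup(sample_html, "html.parser")
link = soup.find("td", class_="chartlist-play").a["href"]
title = soup.find("td", class_="chartlist-name").a.get_text(strip=True)
print(urls[0], link, title)
```

In the real scraper each URL in `urls` would be fetched with requests.get before parsing.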

[Solved] organizing data that I am pulling and saving to CSV

You can use pandas to do that. Collect all the data into a dataframe, then just write the dataframe to file. import pandas as pd import requests import bs4 root_url = "https://www.estatesales.net" url_list = ['https://www.estatesales.net/companies/NJ/Northern-New-Jersey'] results = pd.DataFrame() for url in url_list: response = requests.get(url) soup = bs4.BeautifulSoup(response.text, 'html.parser') companies = soup.find_all('app-company-city-view-row') for company in companies: try: link = … Read more
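The collect-then-write shape of that answer looks like this in miniature; the company names and links are fabricated, and accumulating rows in a plain list before building one DataFrame is cheaper than appending to a DataFrame inside the loop.

```python
import pandas as pd

# Hypothetical scrape results accumulated row by row.
rows = []
for name, link in [("Estate Co A", "/companies/1"), ("Estate Co B", "/companies/2")]:
    rows.append({"company": name, "link": link})

# One DataFrame at the end, then one CSV write.
df = pd.DataFrame(rows)
csv_text = df.to_csv(index=False)  # pass a path instead to write to disk
print(csv_text)
```

df.to_csv("companies.csv", index=False) writes the same content to a file.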

[Solved] How to make this crawler more efficient [closed]

Provided your intentions are not nefarious: as mentioned in the comment, one way to achieve this is to execute the crawler in parallel (multithreading), as opposed to doing one domain at a time. Something like: exec('php crawler.php > /dev/null 2>&1 &'); exec('php crawler.php > /dev/null 2>&1 &'); exec('php crawler.php > /dev/null 2>&1 &'); exec('php crawler.php > /dev/null … Read more
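The answer's approach spawns several background PHP processes; the equivalent idea in Python, sketched here with a thread pool and a stubbed-out fetch function, crawls several domains concurrently instead of one at a time.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for fetching one domain; a real crawler would issue HTTP
# requests and parse the responses here.
def crawl(domain):
    return f"crawled {domain}"

domains = ["a.example", "b.example", "c.example", "d.example"]

# Four workers, mirroring the four exec() calls in the PHP answer.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(crawl, domains))
print(results)
```

pool.map preserves input order, so results line up with the domain list even though the fetches overlap in time.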

[Solved] I believe my scraper got blocked, but I can access the website via a regular browser, how can they do this? [closed]

I am wondering both how the website was able to do this without blocking my IP outright and … By examining all manner of things about your request, some straightforward and some arcane. Straightforward items include user-agent headers, cookies, and the correct spelling of dynamic URLs. Arcane items include your IP address, the timing of your request, … Read more
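The straightforward signals the answer lists (user-agent, cookies) are the easiest to address from the client side. A minimal sketch with requests, using an illustrative browser-like header set, keeps cookies across requests via a session:

```python
import random

import requests

# A Session persists cookies between requests and lets you set
# browser-like headers once; these header values are illustrative.
session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
})

# Randomized delays address the "timing of your request" signal:
# sleep this long between fetches rather than hammering the server.
delay = random.uniform(1.0, 3.0)
print(session.headers["User-Agent"], round(delay, 1))
```

None of this guarantees access — the arcane checks (IP reputation, request timing patterns) operate server-side and are out of the client's control.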

[Solved] Web Scraping From .asp URLs

I would recommend using JSoup for this. To do so, add the following to pom.xml: <dependency> <groupId>org.jsoup</groupId> <artifactId>jsoup</artifactId> <version>1.11.2</version> </dependency> Then you fire a first request just to get cookies: Connection.Response initialPage = Jsoup.connect("https://www.flightview.com/flighttracker/") .headers(headers) .method(Connection.Method.GET) .userAgent(userAgent) .execute(); Map<String, String> initialCookies = initialPage.cookies(); Then you fire the next request with these cookies: Connection.Response flights = Jsoup.connect("https://www.flightview.com/TravelTools/FlightTrackerQueryResults.asp") … Read more
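The same two-step cookie handoff works in Python with a requests.Session, which stores cookies from the first response and replays them automatically on the second request. This offline sketch sets a cookie manually to stand in for the first GET; the cookie name is hypothetical.

```python
import requests

session = requests.Session()
# In the real flow, session.get("https://www.flightview.com/flighttracker/")
# would populate the cookie jar from the server's Set-Cookie headers.
# Here we plant an illustrative cookie to stand in for that step.
session.cookies.set("ASPSESSIONID", "abc123", domain="www.flightview.com")

# A subsequent session.get() to the .asp results URL would send this
# cookie automatically, just like JSoup's .cookies(initialCookies).
print(session.cookies.get("ASPSESSIONID"))  # abc123
```

The key point in both languages is the same: the .asp results page only responds usefully when the session cookie from the landing page accompanies the request.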

[Solved] Extracting variables from Javascript inside HTML

You could use BeautifulSoup to extract the <script> tag, but you would still need an alternative approach to extract the information inside. Some Python can be used to first extract flashvars and then pass this to demjson to convert the Javascript dictionary into a Python one. For example: import demjson content = """<script type="text/javascript">/* <![CDATA[ … Read more
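The extraction step the answer describes is a regex pull of the flashvars object followed by decoding. In this sketch the object is JSON-valid so the standard json module can decode it; demjson's advantage is that it also accepts looser JavaScript syntax such as unquoted keys. The script content below is fabricated.

```python
import json
import re

# flashvars as it might appear inside a script tag; keys are quoted here
# so json.loads works -- demjson.decode would also handle unquoted JS keys.
content = '<script>var flashvars = {"file": "video.mp4", "autostart": "true"};</script>'

# Capture the object literal assigned to flashvars.
match = re.search(r"flashvars\s*=\s*(\{.*?\})", content, re.DOTALL)
flashvars = json.loads(match.group(1))
print(flashvars["file"])  # video.mp4
```

If the real page uses unquoted keys or single quotes, swap json.loads for demjson.decode as the original answer does.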

[Solved] Click on “Show more deals” in webpage with Selenium

To click on the element with text as Show 10 more deals on the page https://www.uswitch.com/broadband/compare/deals_and_offers/ you can use the following solution: Code Block: from selenium import webdriver from selenium.webdriver.support.wait import WebDriverWait from selenium.webdriver.support import expected_conditions as EC from selenium.webdriver.common.by import By url = "https://www.uswitch.com/broadband/compare/deals_and_offers/" options = webdriver.ChromeOptions() options.add_argument("start-maximized") options.add_argument('disable-infobars') browser = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Utility\BrowserDrivers\chromedriver.exe') browser.get(url) … Read more

[Solved] Python – ETFs Daily Data Web Scraping

Yes, I agree that Beautiful Soup is a good approach. Here is some Python code which uses the Beautiful Soup library to extract the intraday price from the IVV fund page: import requests from bs4 import BeautifulSoup r = requests.get("https://www.marketwatch.com/investing/fund/ivv") html = r.text soup = BeautifulSoup(html, "html.parser") if soup.h1.string == "Pardon Our Interruption…": print("They detected … Read more
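The parsing half of that answer can be exercised against a local snippet; the element and class names below are illustrative stand-ins for MarketWatch's markup, which changes over time, so treat the selectors as assumptions to verify against the live page.

```python
from bs4 import BeautifulSoup

# Offline stand-in for the fund page; real selectors may differ.
html = (
    '<h1 class="company__name">iShares Core S&P 500 ETF</h1>'
    '<bg-quote class="value">415.23</bg-quote>'
)
soup = BeautifulSoup(html, "html.parser")

# Custom elements like <bg-quote> parse like any other tag.
price = soup.find("bg-quote", class_="value").get_text()
name = soup.h1.get_text()
print(name, price)
```

The bot-detection check in the original answer (comparing soup.h1.string against the interstitial's heading) still belongs before any extraction on the live site.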