[Solved] Parsing bot protected site

There are multiple ways of bypassing the site's protection. You have to see exactly how they are blocking you. One common way of blocking requests is to look at the User-Agent header. The client (in your case the requests library) will inform the server about its identity. Generally speaking, a browser will … Read more
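
A minimal sketch of that idea, assuming the block really is User-Agent based; the target URL and the exact header string below are placeholders, not taken from the original answer:

import requests

# Hypothetical target URL; replace with the site that is blocking you.
url = "https://example.com/"

# Present a browser-like User-Agent instead of the default "python-requests/x.y".
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

response = requests.get(url, headers=headers)
print(response.status_code)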

[Solved] How to extract only certain data with file_get_contents

The best solution is probably to process the $homepage variable after it has been loaded. Have a look at String functions and regular expressions. file_get_contents() supports offset and maxlen options that can be used to control what parts of the file get loaded, but offset has behavior described by the documentation as “unpredictable” when used … Read more

[Solved] How many times a word is present in a web page using htmlagility C#

You could treat the whole page/web request as a string and do something like this: https://msdn.microsoft.com/en-us/library/bb546166.aspx It might not be efficient, and it would search CSS classes and everything else, but it might be a starting point. Otherwise you need to use the Agility Pack and scrape through each node and check each bit of … Read more

[Solved] Regex for specific html tag in C# [duplicate]

Instead of using a regex, something like an XML parser may be more useful for your situation. Load the page into an XML document and then use something like SelectNodes to pull out the data you are looking for: http://msdn.microsoft.com/en-us/library/4bektfx9.aspx

[Solved] What would be the appropriate syntax for clicking the “send to” drop down menu? (See image for reference)

Try an attribute = value CSS selector to target the element by an attribute and its value:

IE.document.querySelector("[sourcecontent='send_to_menu']").click

Make sure you have a sufficient page-load wait before trying to click. As a minimum you need:

While IE.Busy Or IE.readyState < 4: DoEvents: Wend
IE.document.querySelector("[sourcecontent='send_to_menu']").click

You could also use:

IE.document.querySelector("#sendto > a").click

… Read more

[Solved] Scraped CSV pandas dataframe I get: ValueError(‘Length of values does not match length of ‘ ‘index’)

You need merge with an inner join:

print('####CURRIES###')
df1 = pd.read_csv('C:\\O\\df1.csv', index_col=False, usecols=[0,1,2], names=["EW", "WE", "DA"], header=None)
print(df1.head())

####CURRIES###
                                 EW    WE  \
0                         can v can  1.90
1    Lanus U20 v Argentinos Jrs U20  2.10
2      Botafogo RJ U20 v Toluca U20  1.83
3  Atletico Mineiro U20 v Bahia U20  2.10
4                 FC Porto v Monaco … Read more
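
The excerpt cuts off before the merge itself; here is a minimal sketch of a pandas inner join, assuming a hypothetical second frame df2 that shares the "EW" key column (the data and column names below are illustrative only):

import pandas as pd

# Hypothetical frames sharing an "EW" key column.
df1 = pd.DataFrame({"EW": ["can v can", "FC Porto v Monaco"], "WE": [1.90, 1.83]})
df2 = pd.DataFrame({"EW": ["FC Porto v Monaco"], "DA": [2.05]})

# An inner join keeps only rows whose key appears in both frames,
# so the resulting columns always have matching lengths.
merged = pd.merge(df1, df2, on="EW", how="inner")
print(merged)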

[Solved] This code for Web Scraping using python returning None. Why? Any help would be appreciated

Your code works fine, but there is a robot check before the product page, so your request looks for the span tag in that robot-check page, fails, and returns None. Here is a link which may help you: python requests & beautifulsoup bot detection … Read more
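
A minimal sketch of how you might confirm that a robot-check page is being served instead of the product page; the URL and the span id below are placeholders, not the asker's actual values:

import requests
from bs4 import BeautifulSoup

url = "https://example.com/product/123"  # hypothetical product URL
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(response.text, "html.parser")

span = soup.find("span", id="priceblock_ourprice")  # hypothetical target span
if span is None:
    # Print the page title to see whether a captcha / robot-check page came back.
    print("Target span not found; page title is:", soup.title.string if soup.title else "n/a")
else:
    print(span.get_text(strip=True))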

[Solved] how to scrape web page that is not written directly using HTML, but is auto-generated using JavaScript? [closed]

Run this script and I suppose it will give you everything the table contains including a csv output.

import csv
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
wait = WebDriverWait(driver, 10)

outfile = open('table_data.csv', 'w', newline="")
writer = csv.writer(outfile)

driver.get("http://washingtonmonthly.com/college_guide?ranking=2016-rankings-national-universities")
wait.until(EC.frame_to_be_available_and_switch_to_it("iFrameResizer0"))
wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'table.tablesaw'))) … Read more
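
The excerpt stops before the row-extraction loop. A sketch of how the remaining part might look, assuming the table uses ordinary tr/th/td markup; these selectors are an assumption, not the original answer's code:

# Continuing from the driver, wait, writer and outfile objects set up above.
for row in driver.find_elements(By.CSS_SELECTOR, "table.tablesaw tr"):
    cells = [cell.text for cell in row.find_elements(By.CSS_SELECTOR, "th, td")]
    if cells:
        writer.writerow(cells)

outfile.close()
driver.quit()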

[Solved] Can’t deal with some complicated laid-out content from a webpage

You can take advantage of the CSS selector span[id$=lblResultsRaceName], which finds all spans whose id ends with lblResultsRaceName, and 'td > span', which finds all spans that have a direct parent <td>. This code snippet will go through all racing results and print all races:

import requests
from bs4 import BeautifulSoup

url = "https://www.thedogs.com.au/Racing/Results.aspx?SearchDate=3-Jun-2018"

def get_info(session, link):
    session.headers['User-Agent'] … Read more
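
A minimal illustration of those two selectors with BeautifulSoup's select(); the request handling around them is assumed, since the original snippet is truncated:

import requests
from bs4 import BeautifulSoup

url = "https://www.thedogs.com.au/Racing/Results.aspx?SearchDate=3-Jun-2018"
soup = BeautifulSoup(requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text, "html.parser")

# Spans whose id ends with "lblResultsRaceName" -> race names.
for span in soup.select("span[id$=lblResultsRaceName]"):
    print(span.get_text(strip=True))

# Spans that sit directly under a <td> cell.
cells = [span.get_text(strip=True) for span in soup.select("td > span")]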

[Solved] scrapy/Python crawls but does not scrape data

Your imports didn't work that well over here, but that might be a configuration issue on my side. I think the scraper below does what you're searching for:

import scrapy

class YelpSpider(scrapy.Spider):
    name = "yelp_spider"
    allowed_domains = ["yelp.com"]
    headers = ['venuename', 'services', 'address', 'phone', 'location']

    def __init__(self):
        self.start_urls = ['https://www.yelp.com/search?find_desc=&find_loc=Springfield%2C+IL&ns=1']

    def start_requests(self):
        requests = []
        for item in self.start_urls:
            requests.append(scrapy.Request(url=item, headers={'Referer': 'http://www.google.com/'}))
        return requests

    def parse(self, … Read more
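
The excerpt is cut off inside parse(). A sketch of what a parse callback yielding those fields could look like on the spider above, with purely hypothetical CSS selectors (Yelp's real markup will differ):

def parse(self, response):
    # Hypothetical selectors for illustration only.
    for venue in response.css("div.search-result"):
        yield {
            "venuename": venue.css("a.business-name::text").get(),
            "services": venue.css("span.category-str-list a::text").getall(),
            "address": venue.css("address::text").get(),
            "phone": venue.css("div.phone::text").get(),
            "location": venue.css("span.neighborhood-str-list::text").get(),
        }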

[Solved] How to crawl the url of url in scrapy?

At last I have done this; please follow the code below to crawl values from the URL of a URL.

def parse(self, response):
    item = ProductItem()
    url_list = [content for content in response.xpath("//div[@class='listing']/div/a/@href").extract()]
    item['product_DetailUrl'] = url_list
    for url in url_list:
        request = Request(str(url), callback=self.page2_parse)
        request.meta['item'] = item
        yield request

def page2_parse(self, response):
    item = response.meta['item']
    item['product_ColorAvailability'] = [content for content … Read more