[Solved] Where are my mistakes in my scrapy codes?

The main issue with your code is using .select instead of .css. Here is what you need, though I'm not sure about the titles part (you may need it on other pages): def parse(self, response): titles = response.xpath('//div[@class="artist"]') # items = [] for title in titles: item = ArtistlistItem() item["artist"] = title.css("h2::text").get() item["biograpy"] … Read more
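For orientation, here is a minimal sketch of what the complete callback might look like. The spider scaffolding, start URL, items module path, and the biography selector are assumptions, since the excerpt cuts off mid-item:

    import scrapy
    from artistlist.items import ArtistlistItem  # assumed module path

    class ArtistSpider(scrapy.Spider):
        name = "artist"
        start_urls = ["https://example.com/artists"]  # placeholder URL

        def parse(self, response):
            # one div.artist block per artist on the listing page
            for title in response.xpath('//div[@class="artist"]'):
                item = ArtistlistItem()
                item["artist"] = title.css("h2::text").get()
                item["biography"] = title.css("p::text").get()  # assumed selector and field name
                yield item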

[Solved] Scrapy crawler function not executing

Found the issue: because I was using extract(), its output is a list, so I had a list within a list (with only one element) and the request wasn't calling the URL. Changing it to extract_first() made it work: HierarchyItem["hierarchy_url"] = lvl3.css("a::attr(href)").extract_first() … Read more
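To make the list-versus-string distinction concrete, a small sketch; lvl3 comes from the excerpt, while the urljoin call and callback name are illustrative assumptions:

    # extract()/getall() always return a list, even for a single match
    urls = lvl3.css("a::attr(href)").extract()        # e.g. ['/some/path']
    # extract_first()/get() return the first match as a plain string, or None
    url = lvl3.css("a::attr(href)").extract_first()   # e.g. '/some/path'
    # Request expects a URL string, not a one-element list
    yield scrapy.Request(response.urljoin(url), callback=self.parse)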

[Solved] scrapy/Python crawls but does not scrape data

Your imports didn't work that well over here, but that might be a configuration issue on my side. I think the scraper below does what you're searching for: import scrapy class YelpSpider(scrapy.Spider): name = "yelp_spider" allowed_domains = ["yelp.com"] headers = ['venuename', 'services', 'address', 'phone', 'location'] def __init__(self): self.start_urls = ['https://www.yelp.com/search?find_desc=&find_loc=Springfield%2C+IL&ns=1'] def start_requests(self): requests = [] for item in self.start_urls: requests.append(scrapy.Request(url=item, headers={'Referer': 'http://www.google.com/'})) return requests def parse(self, … Read more
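If the Referer header is wanted on every request, Scrapy can also apply it globally, which makes the start_requests() override unnecessary. A minimal sketch using the stock DEFAULT_REQUEST_HEADERS setting (spider name and URL are from the excerpt):

    import scrapy

    class YelpSpider(scrapy.Spider):
        name = "yelp_spider"
        allowed_domains = ["yelp.com"]
        start_urls = ["https://www.yelp.com/search?find_desc=&find_loc=Springfield%2C+IL&ns=1"]
        # merged into the headers of every outgoing request
        custom_settings = {
            "DEFAULT_REQUEST_HEADERS": {"Referer": "http://www.google.com/"},
        }

        def parse(self, response):
            self.logger.info("fetched %s", response.url)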

[Solved] How to crawl the url of url in scrapy?

At last I have done this; follow the code below to crawl values from the URL of a URL. def parse(self, response): item = ProductItem() url_list = [content for content in response.xpath('//div[@class="listing"]/div/a/@href').extract()] item['product_DetailUrl'] = url_list for url in url_list: request = Request(str(url), callback=self.page2_parse) request.meta['item'] = item yield request def page2_parse(self, response): item = response.meta['item'] item['product_ColorAvailability'] = [content for content … Read more
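On Scrapy 1.7+, cb_kwargs is the recommended replacement for request.meta when handing an item to the next callback; a minimal sketch under that assumption (ProductItem and the first selector are from the excerpt, the color selector is hypothetical):

    from scrapy import Request

    def parse(self, response):
        item = ProductItem()
        url_list = response.xpath('//div[@class="listing"]/div/a/@href').extract()
        item["product_DetailUrl"] = url_list
        for url in url_list:
            # cb_kwargs hands the item to the callback as a named argument
            yield Request(url, callback=self.page2_parse, cb_kwargs={"item": item})

    def page2_parse(self, response, item):
        # hypothetical selector; the real one is elided in the excerpt
        item["product_ColorAvailability"] = response.css("span.color::text").getall()
        yield item

Note that, as in the original, every detail request shares the same item object; if each product needs its own item, create a fresh one per request.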

[Solved] i want to scrape this part

This data is taken from an additional request to https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=142632059, which returns JSON with the whole information. UPD: url_id = re.search(r'/(\d+)\.htm', response.url).group(1) details_url = "https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce={}" # make request to url yield Request(details_url.format(url_id))
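A minimal sketch of a follow-up callback that would consume that JSON endpoint; the callback name is an assumption, and the shape of the payload is not confirmed here:

    import json
    import re
    from scrapy import Request

    def parse(self, response):
        url_id = re.search(r'/(\d+)\.htm', response.url).group(1)
        details_url = "https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce={}"
        yield Request(details_url.format(url_id), callback=self.parse_details)

    def parse_details(self, response):
        data = json.loads(response.text)  # the endpoint answers with a JSON document
        yield {"raw_details": data}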

[Solved] How to build a powerful crawler like google’s? [closed]

For Python you could go with Frontera by Scrapinghub (https://github.com/scrapinghub/frontera; distributed-architecture docs: https://github.com/scrapinghub/frontera/blob/distributed/docs/source/topics/distributed-architecture.rst). They're the same people who make Scrapy. There's also Apache Nutch, which is a much older project: http://nutch.apache.org/

[Solved] Why am I getting IndexError: list index out of range? [closed]

This loop is indeed not exceeding the range of the list: for i in range(len(my_list)): Within that loop, you can safely access list elements using i as the index. But that's not what you're doing; you're using hard-coded index values: motMake = motordata.get('displayAttributes')[0]['value'] or None motModel = motordata.get('displayAttributes')[1]['value'] or None motYear = motordata.get('displayAttributes')[2]['value'] or None … Read more
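A minimal sketch of one way to make those lookups safe; the helper function is illustrative, not from the original post:

    attrs = motordata.get('displayAttributes') or []

    def attr_value(index):
        # return the value at index, or None when the list is too short
        return attrs[index]['value'] if len(attrs) > index else None

    motMake = attr_value(0)
    motModel = attr_value(1)
    motYear = attr_value(2)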

[Solved] How do i get a list of all the urls from a website with python? [closed]

Since all the links have a class in common (class="blue"), you can select all the web elements with this code and then get the "href" attribute values: elements = driver.find_elements_by_class_name('blue') urls = [element.get_attribute('href') for element in elements] I recommend this site if you want to learn more about Selenium with Python: Learn to Locate Elements … Read more
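On Selenium 4 the find_elements_by_* helpers are gone in favor of By locators; a minimal equivalent sketch (the class name 'blue' is from the excerpt, the driver setup and URL are assumptions):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()           # assumes a local chromedriver
    driver.get('https://example.com')     # placeholder URL

    elements = driver.find_elements(By.CLASS_NAME, 'blue')
    urls = [element.get_attribute('href') for element in elements]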

[Solved] organizing data that I am pulling and saving to CSV

You can use pandas to do that. Collect all the data into a dataframe, then just write the dataframe to a file. import pandas as pd import requests import bs4 root_url = 'https://www.estatesales.net' url_list = ['https://www.estatesales.net/companies/NJ/Northern-New-Jersey'] results = pd.DataFrame() for url in url_list: response = requests.get(url) soup = bs4.BeautifulSoup(response.text, 'html.parser') companies = soup.find_all('app-company-city-view-row') for company in companies: try: link = … Read more
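A minimal sketch of the collect-then-write pattern the answer describes; the column names and the anchor lookup inside each row are assumptions about the page, not verified against it:

    import bs4
    import pandas as pd
    import requests

    root_url = 'https://www.estatesales.net'
    url_list = ['https://www.estatesales.net/companies/NJ/Northern-New-Jersey']

    rows = []
    for url in url_list:
        soup = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')
        for company in soup.find_all('app-company-city-view-row'):
            anchor = company.find('a')
            if anchor is None:
                continue
            rows.append({
                'name': anchor.get_text(strip=True),
                'link': root_url + anchor.get('href', ''),
            })

    # one write at the end instead of appending to the dataframe row by row
    pd.DataFrame(rows).to_csv('companies.csv', index=False)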

[Solved] How to retrieve data from json response with scrapy?

You'll want to access ['hits']['hits'][x]['_source']['apply_url'], where x is the index of the item/node under hits. See https://jsoneditoronline.org/#left=cloud.22e871cf105e40a5ba32408f6aa5afeb&right=cloud.e1f56c3bd6824a3692bf3c80285ae727 As you can see, there are 10 items or nodes under hits -> hits, and apply_url is under _source for each item. def parse(self, response): jsonresponse = json.loads(response.body_as_unicode()) for x, node in enumerate(jsonresponse['hits']['hits']): print(node['_source']['apply_url']) For example, print(jsonresponse['hits']['hits'][0]['_source']['apply_url']) would produce: … Read more
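On Scrapy 2.2+ the response can decode itself; a minimal sketch under that assumption (the hits/_source/apply_url path is taken from the answer):

    def parse(self, response):
        data = response.json()  # replaces json.loads(response.body_as_unicode())
        for node in data['hits']['hits']:
            # apply_url sits under _source in every hit
            yield {'apply_url': node['_source']['apply_url']}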

[Solved] How to remove \r\n in command prompt after running?

strip() can remove \r\n only at the ends of a string (leading and trailing), not inside it. If you have \r\n inside the text, then use text = text.replace('\r\n', '') It seems you get \r\n in the list created by extract(), so you have to use a list comprehension to remove it from every element of the list: data = response.css(find).extract() data = [x.replace('\r\n', … Read more
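For completeness, a minimal sketch of how that cleanup line usually finishes; the extra strip() call is an assumption that surrounding whitespace should go too:

    data = response.css(find).extract()
    # drop embedded \r\n from every element, then trim leading/trailing whitespace
    data = [x.replace('\r\n', '').strip() for x in data]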

[Solved] Scraping data from a dynamic web database with Python [closed]

You can solve it with requests (for maintaining a web-scraping session), BeautifulSoup (for HTML parsing), a regex for extracting the value of a JavaScript variable that holds the desired data inside a script tag, and ast.literal_eval() for turning the JS list into a Python list: from ast import literal_eval import re from bs4 import BeautifulSoup … Read more
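A minimal sketch of the technique under stated assumptions: the URL, the variable name chartData, and the surrounding page structure are invented for illustration:

    from ast import literal_eval
    import re

    import requests
    from bs4 import BeautifulSoup

    with requests.Session() as session:  # the session keeps cookies between requests
        html = session.get('https://example.com/data-page').text  # placeholder URL

    soup = BeautifulSoup(html, 'html.parser')
    script = soup.find('script', string=re.compile(r'chartData'))  # hypothetical variable name
    match = re.search(r'chartData\s*=\s*(\[.*?\]);', script.string, re.DOTALL)
    data = literal_eval(match.group(1))  # works when the JS list is also a valid Python literal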

[Solved] Scrapy keeps getting blocked

Your xpath expressions aren't correct. When you use relative xpath expressions they need to start with "./", and using class specifiers is much easier than indexing, in my opinion. def parse(self, response): for row in response.xpath('//table[@class="list"]//tr'): name = row.xpath('./td[@class="name"]/a/text()').get() address = row.xpath('./td[@class="location"]/text()').get() yield { 'Name': name, 'Address': address, } next_page = response.xpath('//a[@class="next-page"]/@href').get() if next_page: yield … Read more
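The excerpt cuts off at the pagination yield; a small sketch of how that branch is commonly finished with response.follow, which resolves relative hrefs (an assumption about the intent, since the original is truncated):

    next_page = response.xpath('//a[@class="next-page"]/@href').get()
    if next_page:
        # response.follow accepts a relative URL and re-enters this callback
        yield response.follow(next_page, callback=self.parse)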