[Solved] Where are my mistakes in my scrapy codes?

The main issue with your code is using .select instead of .css. Here is what you need, though I'm not sure about the titles part (you may need it on other pages): def parse(self, response): titles = response.xpath('//div[@class="artist"]') # items = [] for title in titles: item = ArtistlistItem() item["artist"] = title.css("h2::text").get() item["biograpy"] … Read more
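For orientation, here is a minimal sketch of what the complete callback might look like. The spider scaffolding, start URL, items module path, and the biography selector are assumptions, since the excerpt cuts off mid-item:

    import scrapy
    from artistlist.items import ArtistlistItem  # assumed module path

    class ArtistSpider(scrapy.Spider):
        name = "artist"
        start_urls = ["https://example.com/artists"]  # placeholder URL

        def parse(self, response):
            # one div.artist block per artist on the listing page
            for title in response.xpath('//div[@class="artist"]'):
                item = ArtistlistItem()
                item["artist"] = title.css("h2::text").get()
                item["biography"] = title.css("p::text").get()  # assumed selector and field name
                yield item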

[Solved] Scrapy crawler function not executing

Found the issue: because I was using extract(), its output is a list, so I had a list within a list (with only one element) and the request wasn't calling the URL. Changing it to extract_first() made it work: HierarchyItem["hierarchy_url"] = lvl3.css("a::attr(href)").extract_first() … Read more
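To make the list-versus-string distinction concrete, a small sketch; lvl3 comes from the excerpt, while the urljoin call and callback name are illustrative assumptions:

    # extract()/getall() always return a list, even for a single match
    urls = lvl3.css("a::attr(href)").extract()        # e.g. ['/some/path']
    # extract_first()/get() return the first match as a plain string, or None
    url = lvl3.css("a::attr(href)").extract_first()   # e.g. '/some/path'
    # Request expects a URL string, not a one-element list
    yield scrapy.Request(response.urljoin(url), callback=self.parse)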

[Solved] scrapy/Python crawls but does not scrape data

Your imports didn't work that well over here, but that might be a configuration issue on my side. I think the scraper below does what you're searching for: import scrapy class YelpSpider(scrapy.Spider): name = "yelp_spider" allowed_domains = ["yelp.com"] headers = ['venuename', 'services', 'address', 'phone', 'location'] def __init__(self): self.start_urls = ['https://www.yelp.com/search?find_desc=&find_loc=Springfield%2C+IL&ns=1'] def start_requests(self): requests = [] for item in self.start_urls: requests.append(scrapy.Request(url=item, headers={'Referer': 'http://www.google.com/'})) return requests def parse(self, … Read more
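If the Referer header is wanted on every request, Scrapy can also apply it globally, which makes the start_requests() override unnecessary. A minimal sketch using the stock DEFAULT_REQUEST_HEADERS setting (spider name and URL are from the excerpt):

    import scrapy

    class YelpSpider(scrapy.Spider):
        name = "yelp_spider"
        allowed_domains = ["yelp.com"]
        start_urls = ["https://www.yelp.com/search?find_desc=&find_loc=Springfield%2C+IL&ns=1"]
        # merged into the headers of every outgoing request
        custom_settings = {
            "DEFAULT_REQUEST_HEADERS": {"Referer": "http://www.google.com/"},
        }

        def parse(self, response):
            self.logger.info("fetched %s", response.url)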

[Solved] How to crawl the url of url in scrapy?

At last I have done this; follow the code below to crawl values from the URL of a URL. def parse(self, response): item = ProductItem() url_list = [content for content in response.xpath('//div[@class="listing"]/div/a/@href').extract()] item['product_DetailUrl'] = url_list for url in url_list: request = Request(str(url), callback=self.page2_parse) request.meta['item'] = item yield request def page2_parse(self, response): item = response.meta['item'] item['product_ColorAvailability'] = [content for content … Read more
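On Scrapy 1.7+, cb_kwargs is the recommended replacement for request.meta when handing an item to the next callback; a minimal sketch under that assumption (ProductItem and the first selector are from the excerpt, the color selector is hypothetical):

    from scrapy import Request

    def parse(self, response):
        item = ProductItem()
        url_list = response.xpath('//div[@class="listing"]/div/a/@href').extract()
        item["product_DetailUrl"] = url_list
        for url in url_list:
            # cb_kwargs hands the item to the callback as a named argument
            yield Request(url, callback=self.page2_parse, cb_kwargs={"item": item})

    def page2_parse(self, response, item):
        # hypothetical selector; the real one is elided in the excerpt
        item["product_ColorAvailability"] = response.css("span.color::text").getall()
        yield item

Note that, as in the original, every detail request shares the same item object; if each product needs its own item, create a fresh one per request.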

[Solved] i want to scrape this part

This data is taken from an additional request to https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=142632059, which returns JSON with the whole information. UPD: url_id = re.search(r'/(\d+)\.htm', response.url).group(1) details_url = "https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce={}" # make request to url yield Request(details_url.format(url_id))
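A minimal sketch of a follow-up callback that would consume that JSON endpoint; the callback name is an assumption, and the shape of the payload is not confirmed here:

    import json
    import re
    from scrapy import Request

    def parse(self, response):
        url_id = re.search(r'/(\d+)\.htm', response.url).group(1)
        details_url = "https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce={}"
        yield Request(details_url.format(url_id), callback=self.parse_details)

    def parse_details(self, response):
        data = json.loads(response.text)  # the endpoint answers with a JSON document
        yield {"raw_details": data}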

[Solved] How to build a powerful crawler like google’s? [closed]

For Python you could go with Frontera by Scrapinghub (https://github.com/scrapinghub/frontera; distributed-architecture docs: https://github.com/scrapinghub/frontera/blob/distributed/docs/source/topics/distributed-architecture.rst). They're the same people who make Scrapy. There's also Apache Nutch, which is a much older project: http://nutch.apache.org/

[Solved] Why am I getting IndexError: list index out of range? [closed]

This loop is indeed not exceeding the range of the list: for i in range(len(my_list)): Within that loop, you can safely access list elements using i as the index. But that's not what you're doing; you're using hard-coded index values: motMake = motordata.get('displayAttributes')[0]['value'] or None motModel = motordata.get('displayAttributes')[1]['value'] or None motYear = motordata.get('displayAttributes')[2]['value'] or None … Read more
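A minimal sketch of one way to make those lookups safe; the helper function is illustrative, not from the original post:

    attrs = motordata.get('displayAttributes') or []

    def attr_value(index):
        # return the value at index, or None when the list is too short
        return attrs[index]['value'] if len(attrs) > index else None

    motMake = attr_value(0)
    motModel = attr_value(1)
    motYear = attr_value(2)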

[Solved] How do i get a list of all the urls from a website with python? [closed]

Since all the links have a class in common (class="blue"), you can select all the web elements with this code and then get the "href" attribute values: elements = driver.find_elements_by_class_name('blue') urls = [element.get_attribute('href') for element in elements] I recommend this site if you want to learn more about Selenium with Python: Learn to Locate Elements … Read more
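On Selenium 4 the find_elements_by_* helpers are gone in favor of By locators; a minimal equivalent sketch (the class name 'blue' is from the excerpt, the driver setup and URL are assumptions):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()           # assumes a local chromedriver
    driver.get('https://example.com')     # placeholder URL

    elements = driver.find_elements(By.CLASS_NAME, 'blue')
    urls = [element.get_attribute('href') for element in elements]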

[Solved] organizing data that I am pulling and saving to CSV

You can use pandas to do that. Collect all the data into a dataframe, then just write the dataframe to a file. import pandas as pd import requests import bs4 root_url = 'https://www.estatesales.net' url_list = ['https://www.estatesales.net/companies/NJ/Northern-New-Jersey'] results = pd.DataFrame() for url in url_list: response = requests.get(url) soup = bs4.BeautifulSoup(response.text, 'html.parser') companies = soup.find_all('app-company-city-view-row') for company in companies: try: link = … Read more
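A minimal sketch of the collect-then-write pattern the answer describes; the column names and the anchor lookup inside each row are assumptions about the page, not verified against it:

    import bs4
    import pandas as pd
    import requests

    root_url = 'https://www.estatesales.net'
    url_list = ['https://www.estatesales.net/companies/NJ/Northern-New-Jersey']

    rows = []
    for url in url_list:
        soup = bs4.BeautifulSoup(requests.get(url).text, 'html.parser')
        for company in soup.find_all('app-company-city-view-row'):
            anchor = company.find('a')
            if anchor is None:
                continue
            rows.append({
                'name': anchor.get_text(strip=True),
                'link': root_url + anchor.get('href', ''),
            })

    # one write at the end instead of appending to the dataframe row by row
    pd.DataFrame(rows).to_csv('companies.csv', index=False)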

[Solved] How to retrieve data from json response with scrapy?

You'll want to access ['hits']['hits'][x]['_source']['apply_url'], where x is the index of the item/node under hits. See https://jsoneditoronline.org/#left=cloud.22e871cf105e40a5ba32408f6aa5afeb&right=cloud.e1f56c3bd6824a3692bf3c80285ae727 As you can see, there are 10 items or nodes under hits -> hits, and apply_url is under _source for each item. def parse(self, response): jsonresponse = json.loads(response.body_as_unicode()) for x, node in enumerate(jsonresponse['hits']['hits']): print(node['_source']['apply_url']) For example, print(jsonresponse['hits']['hits'][0]['_source']['apply_url']) would produce: … Read more
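On Scrapy 2.2+ the response can decode itself; a minimal sketch under that assumption (the hits/_source/apply_url path is taken from the answer):

    def parse(self, response):
        data = response.json()  # replaces json.loads(response.body_as_unicode())
        for node in data['hits']['hits']:
            # apply_url sits under _source in every hit
            yield {'apply_url': node['_source']['apply_url']}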

[Solved] How to remove \r\n in command prompt after running?

strip() can remove \r\n only at the ends of a string (leading and trailing), not inside it. If you have \r\n inside the text, then use text = text.replace('\r\n', '') It seems you get \r\n in the list created by extract(), so you have to use a list comprehension to remove it from every element of the list: data = response.css(find).extract() data = [x.replace('\r\n', … Read more
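For completeness, a minimal sketch of how that cleanup line usually finishes; the extra strip() call is an assumption that surrounding whitespace should go too:

    data = response.css(find).extract()
    # drop embedded \r\n from every element, then trim leading/trailing whitespace
    data = [x.replace('\r\n', '').strip() for x in data]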

[Solved] Scraping data from a dynamic web database with Python [closed]

You can solve it with requests (for maintaining a web-scraping session), BeautifulSoup (for HTML parsing), a regex for extracting the value of a JavaScript variable that holds the desired data inside a script tag, and ast.literal_eval() for turning the JS list into a Python list: from ast import literal_eval import re from bs4 import BeautifulSoup … Read more
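A minimal sketch of the technique under stated assumptions: the URL, the variable name chartData, and the surrounding page structure are invented for illustration:

    from ast import literal_eval
    import re

    import requests
    from bs4 import BeautifulSoup

    with requests.Session() as session:  # the session keeps cookies between requests
        html = session.get('https://example.com/data-page').text  # placeholder URL

    soup = BeautifulSoup(html, 'html.parser')
    script = soup.find('script', string=re.compile(r'chartData'))  # hypothetical variable name
    match = re.search(r'chartData\s*=\s*(\[.*?\]);', script.string, re.DOTALL)
    data = literal_eval(match.group(1))  # works when the JS list is also a valid Python literal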

[Solved] Scrapy keeps getting blocked

Your xpath expressions aren't correct. When you use relative xpath expressions they need to start with "./", and using class specifiers is much easier than indexing, in my opinion. def parse(self, response): for row in response.xpath('//table[@class="list"]//tr'): name = row.xpath('./td[@class="name"]/a/text()').get() address = row.xpath('./td[@class="location"]/text()').get() yield { 'Name': name, 'Address': address, } next_page = response.xpath('//a[@class="next-page"]/@href').get() if next_page: yield … Read more
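The excerpt cuts off at the pagination yield; a small sketch of how that branch is commonly finished with response.follow, which resolves relative hrefs (an assumption about the intent, since the original is truncated):

    next_page = response.xpath('//a[@class="next-page"]/@href').get()
    if next_page:
        # response.follow accepts a relative URL and re-enters this callback
        yield response.follow(next_page, callback=self.parse)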