[Solved] scrapy/Python crawls but does not scrape data


Your imports didn’t work on my machine, though that may be a configuration issue on my side. I think the scraper below does what you’re looking for:

import scrapy


class YelpSpider(scrapy.Spider):
    name = "yelp_spider"
    allowed_domains = ["yelp.com"]
    headers = ['venuename', 'services', 'address', 'phone', 'location']
    # A class attribute replaces the custom __init__, which never called the base constructor.
    start_urls = ['https://www.yelp.com/search?find_desc=&find_loc=Springfield%2C+IL&ns=1']

    def start_requests(self):
        # Yield one request per start URL, each with a Referer header.
        # (Returning from inside the loop would stop after the first URL.)
        for url in self.start_urls:
            yield scrapy.Request(url=url, headers={'Referer': 'http://www.google.com/'})

    def parse(self, response):
        # Every search result sits in its own div, so loop over them.
        for restaurant in response.xpath('//div[@class="biz-listing-large"]'):
            item = {}
            # Paths starting with "." are relative to the current restaurant.
            item['venuename'] = restaurant.xpath('.//h3[@class="search-result-title"]/span/a/span/text()').extract_first()
            item['services'] = u",".join(line.strip() for line in restaurant.xpath('.//span[@class="category-str-list"]/a/text()').extract())
            item['address'] = restaurant.xpath('.//address/text()').extract_first()
            item['phone'] = restaurant.xpath('.//span[@class="biz-phone"]/text()').extract_first()
            # location and url are page-level values, so they are read from response.
            item['location'] = response.xpath('.//input[@id="dropperText_Mast"]/@value').extract_first()
            item['url'] = response.url
            yield item

Some explanation:

I’ve changed the start URL. This URL actually shows an overview of all restaurants, while the other one did not (or at least not when viewed from my location).

I’ve removed the pipeline because it isn’t defined on my system, so I couldn’t run the code with it in place.
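If you want to put a pipeline back, here is a minimal sketch of what one could look like. I don’t know what your original pipeline did, so this is only a stand-in: the class name CsvExportPipeline, the file name restaurants.csv, and the column order are all my assumptions. It relies only on the open_spider, process_item and close_spider hooks that Scrapy calls on item pipelines.

```python
import csv


class CsvExportPipeline:
    """Hypothetical stand-in pipeline: writes each scraped item as one CSV row."""

    # Assumed column order; these are the keys the spider puts in each item.
    fieldnames = ['venuename', 'services', 'address', 'phone', 'location', 'url']

    def __init__(self, path='restaurants.csv'):
        self.path = path  # assumed output file name

    def open_spider(self, spider):
        # Called once when the spider starts: open the file and write the header row.
        self.file = open(self.path, 'w', newline='', encoding='utf-8')
        self.writer = csv.DictWriter(self.file, fieldnames=self.fieldnames)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Missing keys become empty cells instead of raising an error.
        self.writer.writerow({key: item.get(key) for key in self.fieldnames})
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()
```

To enable something like this you would list the class in the ITEM_PIPELINES setting of your project. If a plain file is all you need, Scrapy’s built-in feed export (the -o option of scrapy crawl) does the job without any pipeline.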

The parse function is where I made the real changes. The XPaths you defined weren’t very clear, so the code now loops over each listed restaurant.

response.xpath('//div[@class="biz-listing-large"]')

This expression selects every restaurant listing on the page. I use it in a for loop so we can process the restaurants one at a time; inside the loop, the current listing is available in the variable restaurant.

So to extract data for a single restaurant, I query this variable. The XPath also has to start with a . to make it relative to the current restaurant; a path that starts with // searches the whole page again, which gives the same result as calling it on response.

I could explain each XPath in my answer, but there is plenty of documentation available that probably explains it better than I can.

Some documentation

And some more

Note that I’ve used restaurant for most values of item. The location and url values are not really restaurant data; they come from elsewhere on the page, which is why they use response instead of restaurant.
