Depending on how regular the structure of the data you want to extract is, there are several tools you could use.
- If the data you want is always stored in the same DOM structure, Scrapy could do the job.
- If the data is sparse and stored in various places, BeautifulSoup4 or lxml may be a better fit.
- If the data is generated by some JS code, have a look at Selenium, which drives a real browser.
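
To make the "sparse data in various places" case concrete, here is a minimal sketch that looks for a link to an "about" page anywhere in a document. It deliberately uses only the standard library's `html.parser` so it runs without installing anything; BeautifulSoup4 or lxml would express the same search more compactly. The names `LinkCollector`, `find_about_links`, and `SAMPLE_HTML`, the `example.com` URL, and the "href or link text contains 'about'" heuristic are all assumptions made for illustration, not something taken from a particular site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_about_links(html, base_url):
    """Return absolute URLs of links whose href or text mentions 'about'."""
    parser = LinkCollector()
    parser.feed(html)
    return [
        urljoin(base_url, href)
        for href, text in parser.links
        if "about" in href.lower() or "about" in text.lower()
    ]


# Hypothetical page: the interesting link is buried in the footer.
SAMPLE_HTML = """
<html><body>
  <nav><a href="/">Home</a> <a href="/products">Products</a></nav>
  <footer><p>Read <a href="/company/about-us">Who we are</a>.</p></footer>
</body></html>
"""

print(find_about_links(SAMPLE_HTML, "https://example.com"))
# → ['https://example.com/company/about-us']
```

In a real crawl you would fetch the page first and pass its HTML and URL to `find_about_links`; a heuristic this simple will likely miss sites that label the page differently (for example "Our story"), so treat it as a starting point.
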
Here are a couple of resources you might find useful:
- PyCon 2012 Tutorial about web-scraping: http://pyvideo.org/video/609/web-scraping-reliably-and-efficiently-pull-data/
- http://isbullsh.it/2012/04/Web-crawling-with-scrapy/ (full disclosure, I wrote that)
- http://www.packtpub.com/article/web-scraping-with-python
- http://wwwsearch.sourceforge.net/mechanize/