This code extracts that particular site’s content a little better.
    def keyInfo(div):
        # div is a parsed BeautifulSoup tag wrapping the page's <h1> and <article>.
        print(div.find("h1").get_text())
        article = div.find("article")
        divText = article.find("div", id="storytext")
        # Drop asides and nested divs, which hold non-article content.
        for tag in divText.find_all(["aside", "div"]):
            tag.extract()
        print(divText.get_text())
Approach
After looking at the structure of the content using Chrome dev tools, I noticed the story content was in article > div[id=storytext], but div[id=storytext] also included a few asides and divs with non-article content. Removing those left the paragraphs of the article.
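To try the approach without fetching the real page, here is a minimal sketch that runs on a made-up HTML snippet. The sample markup, the `extract_story` name, and the returned tuple are my own assumptions for illustration; only the article > div[id=storytext] structure comes from the site described above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup imitating the structure described above.
SAMPLE_HTML = """
<div id="main">
  <h1>Example headline</h1>
  <article>
    <div id="storytext">
      <aside>Related links</aside>
      <div class="ad">Advertisement</div>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </article>
</div>
"""

def extract_story(html):
    """Return (headline, list of paragraph strings) for the sample structure."""
    soup = BeautifulSoup(html, "html.parser")
    headline = soup.find("h1").get_text(strip=True)
    story = soup.find("article").find("div", id="storytext")
    # Remove asides and nested divs in place, leaving only the paragraphs.
    for tag in story.find_all(["aside", "div"]):
        tag.extract()
    paragraphs = [p.get_text(strip=True) for p in story.find_all("p")]
    return headline, paragraphs

headline, paragraphs = extract_story(SAMPLE_HTML)
print(headline)
print("\n".join(paragraphs))
```

Returning the values instead of printing them inside the function makes the result easier to reuse or check, but that is a design preference, not something the site requires.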
Looking for something a little more generic?
If you need an approach that is not tied to one site's markup, consider a boilerplate-removal library such as Boilerpipe. Here is a Python wrapper for Boilerpipe: https://github.com/misja/python-boilerpipe
solved Python Web Scraping Get the main content only