[Solved] Python Web Scraping Get the main content only

Accepted Answer

This code extracts that particular site’s content a little better.

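def keyInfo(div):
  # div is a BeautifulSoup tag wrapping the whole page (or story container).
  print(div.find("h1").get_text())
  article = div.find("article")
  divText = article.find("div", id="storytext")
  # Remove asides and nested divs, which hold non-article content.
  for tag in divText.find_all(["aside", "div"]):
    tag.extract()
  # What remains is the text of the article's paragraphs.
  print(divText.get_text())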

Approach

After inspecting the page structure with Chrome DevTools, I noticed the story content was in article > div[id=storytext], but that div also contained a few asides and nested divs with non-article content. Removing those left only the paragraphs of the article.
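To make the approach concrete, here is a minimal, self-contained sketch run against a made-up page fragment. The sample HTML, the `extract_story` name, and the `ad` class are assumptions for illustration only; the real site's markup will differ in detail.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the structure described above:
# article > div#storytext, polluted by an aside and a nested div.
SAMPLE_HTML = """
<div id="page">
  <h1>Sample headline</h1>
  <article>
    <div id="storytext">
      <aside>Related links</aside>
      <div class="ad">Advertisement</div>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </article>
</div>
"""


def extract_story(html):
    """Return (headline, body) with asides and nested divs stripped out."""
    soup = BeautifulSoup(html, "html.parser")
    headline = soup.find("h1").get_text(strip=True)
    story = soup.find("article").find("div", id="storytext")
    # Detach the non-article elements from the tree.
    for tag in story.find_all(["aside", "div"]):
        tag.extract()
    # Only the story paragraphs are left inside the container.
    paragraphs = [p.get_text(strip=True) for p in story.find_all("p")]
    return headline, "\n".join(paragraphs)


headline, body = extract_story(SAMPLE_HTML)
print(headline)
print(body)
```

Returning the text instead of printing it makes the function easier to reuse, for example when looping over several story URLs.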

Looking for something a little more generic?

The code above depends on one site's markup. For pages in general, you may want to consider something like Boilerpipe, a library that strips boilerplate and extracts the main text of a page. Here is a Python wrapper for Boilerpipe: https://github.com/misja/python-boilerpipe