[Solved] Parsing bot protected site


There are multiple ways of bypassing the site's protection, but first you have to figure out exactly how they are blocking you.

One common way of blocking requests is to look at the User-Agent header. The client (in your case the requests library) uses it to inform the server about its identity.

Generally speaking, a browser will say "I am a browser" and a library will say "I am a library". The server can then say "I allow browsers but not libraries to access my content."

However, for this particular case, you can simply lie to the server by sending your own User-Agent header.

You can see an example below. Try to use your own browser's user agent.
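
Here is a minimal sketch using requests (the URL and the User-Agent string are placeholders; copy the real string from your own browser, e.g. by evaluating navigator.userAgent in the devtools console):

```python
import requests

# Hypothetical target URL; replace with the site you are scraping.
URL = "https://example.com/"

# A desktop Chrome User-Agent string. Substitute the exact string
# your own browser sends so the request looks like a normal visit.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

response = requests.get(URL, headers=headers)
print(response.status_code)
print(response.text[:500])
```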

Other blocking techniques include banning IP ranges. One way to bypass this is via a VPN. This is one of the easiest VPNs to set up: just spin up a machine on Amazon and get this container running.

Something else that could happen: you might be trying to access a single-page application that is not rendered server side. In that case, what you receive from that GET request is a very small HTML file that essentially just references a JavaScript file. If so, what you need is an actual browser that you control programmatically. I would suggest you look at headless Google Chrome, though there are others. You can also use Selenium, as in the sketch below.
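
A minimal Selenium sketch, assuming Chrome and a reasonably recent Selenium (4.6+, which downloads the driver for you); the URL is a placeholder:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window.
options = Options()
options.add_argument("--headless=new")

driver = webdriver.Chrome(options=options)
try:
    # Hypothetical URL; replace with the single-page application you need.
    driver.get("https://example.com/")
    # By now Chrome has executed the JavaScript, so page_source contains
    # the rendered DOM rather than the tiny bootstrap HTML file.
    html = driver.page_source
    print(html[:500])
finally:
    driver.quit()
```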

Web crawling is a beautiful but very deep subject. I think these pointers should point you in the right direction.


Also, as a quick mention, my advice is to avoid `from bs4 import BeautifulSoup as soup`. I would recommend html2text instead.
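
For instance (a minimal sketch; in practice you would feed it response.text from requests):

```python
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = False  # keep links in the output

# Hypothetical snippet; in practice pass the HTML you fetched.
html = "<h1>Title</h1><p>Some <b>bold</b> text and a <a href='https://example.com'>link</a>.</p>"
print(converter.handle(html))
```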
