{"id":33515,"date":"2023-02-09T09:59:05","date_gmt":"2023-02-09T04:29:05","guid":{"rendered":"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/"},"modified":"2023-02-09T09:59:05","modified_gmt":"2023-02-09T04:29:05","slug":"solved-parsing-bot-protected-site","status":"publish","type":"post","link":"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/","title":{"rendered":"[Solved] Parsing bot protected site"},"content":{"rendered":"<p> [ad_1]<br \/>\n<\/p>\n<div id=\"answer-50274125\" class=\"answer js-answer accepted-answer js-accepted-answer\" data-answerid=\"50274125\" data-parentid=\"49739994\" data-score=\"0\" data-position-on-page=\"1\" data-highest-scored=\"1\" data-question-has-accepted-highest-score=\"1\" itemprop=\"acceptedAnswer\" itemscope itemtype=\"https:\/\/schema.org\/Answer\">\n<div class=\"post-layout\">\n<div class=\"votecell post-layout--left\"><\/div>\n<div class=\"answercell post-layout--right\">\n<div class=\"s-prose js-post-body\" itemprop=\"text\">\n<p>There are multiple ways of bypassing the site protection. You have to see exactly how they are blocking you.<\/p>\n<p>One common way of blocking requests is to look at the <code>User Agent<\/code> header. The client ( in your case the <code>requests<\/code> library ) will inform the server about it&#8217;s identity.<\/p>\n<p>Generally speaking, a browser will say <code>I am a browser<\/code> and a library will say <code>I am a library<\/code>. The server can then say <code>I allow browsers but not libraries to access my content<\/code>.<\/p>\n<p>However, for this particular case, you can simply lie to the server by sending your own <code>User Agent<\/code> header.<\/p>\n<p>You can see a <a rel=\"nofollow noopener\" target=\"_blank\" href=\"http:\/\/docs.python-requests.org\/en\/master\/user\/quickstart\/#custom-headers\">example<\/a> here. Try to use your browsers user agent.<\/p>\n<p>Other blocking techniques include ip ranges. One way to bypass this is via a <code>vpn<\/code>. <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/github.com\/kylemanna\/docker-openvpn\">This<\/a> is one of the easiest <code>vpns<\/code> to set up. Just spin up a machine on amazon and get this container running.<\/p>\n<p>What else could happen, you might try to access a single page application that is not rendered server side. In this case, what you should receive with that <code>get<\/code> requests is a very small html file that essentially references a javascript file. If this is the case, what you need is a actual browser that you control programatically. I would suggest you look at <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/developers.google.com\/web\/updates\/2017\/04\/headless-chrome\">Google Chrome Headless<\/a> however there are others. You can also use <a rel=\"nofollow noopener\" target=\"_blank\" href=\"http:\/\/selenium-python.readthedocs.io\/\">Selenium<\/a><\/p>\n<p>Web crawling is a beautiful but very deep subject. I think these pointers should set you on the right direction.<\/p>\n<hr>\n<p>Also, as a quick mention, my advice is to avoid <code>from bs4 import BeautifulSoup as soup<\/code>. I would recommend <a rel=\"nofollow noopener\" target=\"_blank\" href=\"https:\/\/github.com\/aaronsw\/html2text\"><code>html2text<\/code><\/a><\/p>\n<\/p><\/div>\n<div class=\"mt24\"><\/div>\n<\/div>\n<p>            <span class=\"d-none\" itemprop=\"commentCount\"><\/span> <\/p><\/div>\n<\/div>\n<p>[ad_2]<\/p>\n<p>solved Parsing bot protected site <\/p>\n","protected":false},"excerpt":{"rendered":"<p>[ad_1] There are multiple ways of bypassing the site protection. You have to see exactly how they are blocking you. One common way of blocking requests is to look at the User Agent header. The client ( in your case the requests library ) will inform the server about it&#8217;s identity. Generally speaking, a browser &#8230; <a title=\"[Solved] Parsing bot protected site\" class=\"read-more\" href=\"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/\" aria-label=\"More on [Solved] Parsing bot protected site\">Read more<\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[320],"tags":[380,349,760],"class_list":["post-33515","post","type-post","status-publish","format-standard","hentry","category-solved","tag-parsing","tag-python","tag-web-scraping"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v26.5 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>[Solved] Parsing bot protected site - JassWeb<\/title>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"[Solved] Parsing bot protected site - JassWeb\" \/>\n<meta property=\"og:description\" content=\"[ad_1] There are multiple ways of bypassing the site protection. You have to see exactly how they are blocking you. One common way of blocking requests is to look at the User Agent header. The client ( in your case the requests library ) will inform the server about it&#8217;s identity. Generally speaking, a browser ... Read more\" \/>\n<meta property=\"og:url\" content=\"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/\" \/>\n<meta property=\"og:site_name\" content=\"JassWeb\" \/>\n<meta property=\"article:published_time\" content=\"2023-02-09T04:29:05+00:00\" \/>\n<meta name=\"author\" content=\"Kirat\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Kirat\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/\"},\"author\":{\"name\":\"Kirat\",\"@id\":\"https:\/\/jassweb.com\/solved\/#\/schema\/person\/65c9c7b7958150c0dc8371fa35dd7c31\"},\"headline\":\"[Solved] Parsing bot protected site\",\"datePublished\":\"2023-02-09T04:29:05+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/\"},\"wordCount\":247,\"publisher\":{\"@id\":\"https:\/\/jassweb.com\/solved\/#organization\"},\"keywords\":[\"parsing\",\"python\",\"web-scraping\"],\"articleSection\":[\"Solved\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/\",\"url\":\"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/\",\"name\":\"[Solved] Parsing bot protected site - JassWeb\",\"isPartOf\":{\"@id\":\"https:\/\/jassweb.com\/solved\/#website\"},\"datePublished\":\"2023-02-09T04:29:05+00:00\",\"breadcrumb\":{\"@id\":\"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/\"]}]},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/jassweb.com\/solved\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"[Solved] Parsing bot protected site\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/jassweb.com\/solved\/#website\",\"url\":\"https:\/\/jassweb.com\/solved\/\",\"name\":\"JassWeb\",\"description\":\"Build High-quality Websites\",\"publisher\":{\"@id\":\"https:\/\/jassweb.com\/solved\/#organization\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/jassweb.com\/solved\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Organization\",\"@id\":\"https:\/\/jassweb.com\/solved\/#organization\",\"name\":\"Jass Web\",\"url\":\"https:\/\/jassweb.com\/solved\/\",\"logo\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/jassweb.com\/solved\/#\/schema\/logo\/image\/\",\"url\":\"https:\/\/jassweb.com\/wp-content\/uploads\/2021\/02\/jass-website-logo-1.png\",\"contentUrl\":\"https:\/\/jassweb.com\/wp-content\/uploads\/2021\/02\/jass-website-logo-1.png\",\"width\":693,\"height\":132,\"caption\":\"Jass Web\"},\"image\":{\"@id\":\"https:\/\/jassweb.com\/solved\/#\/schema\/logo\/image\/\"}},{\"@type\":\"Person\",\"@id\":\"https:\/\/jassweb.com\/solved\/#\/schema\/person\/65c9c7b7958150c0dc8371fa35dd7c31\",\"name\":\"Kirat\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/jassweb.com\/solved\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/jassweb.com\/solved\/wp-content\/litespeed\/avatar\/1261af3c9451399fa1336d28b98ea3bb.jpg?ver=1775798750\",\"contentUrl\":\"https:\/\/jassweb.com\/solved\/wp-content\/litespeed\/avatar\/1261af3c9451399fa1336d28b98ea3bb.jpg?ver=1775798750\",\"caption\":\"Kirat\"},\"sameAs\":[\"http:\/\/jassweb.com\"],\"url\":\"https:\/\/jassweb.com\/solved\/author\/jaspritsinghghumangmail-com\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. -->","yoast_head_json":{"title":"[Solved] Parsing bot protected site - JassWeb","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/","og_locale":"en_US","og_type":"article","og_title":"[Solved] Parsing bot protected site - JassWeb","og_description":"[ad_1] There are multiple ways of bypassing the site protection. You have to see exactly how they are blocking you. One common way of blocking requests is to look at the User Agent header. The client ( in your case the requests library ) will inform the server about it&#8217;s identity. Generally speaking, a browser ... Read more","og_url":"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/","og_site_name":"JassWeb","article_published_time":"2023-02-09T04:29:05+00:00","author":"Kirat","twitter_card":"summary_large_image","twitter_misc":{"Written by":"Kirat","Est. reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/#article","isPartOf":{"@id":"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/"},"author":{"name":"Kirat","@id":"https:\/\/jassweb.com\/solved\/#\/schema\/person\/65c9c7b7958150c0dc8371fa35dd7c31"},"headline":"[Solved] Parsing bot protected site","datePublished":"2023-02-09T04:29:05+00:00","mainEntityOfPage":{"@id":"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/"},"wordCount":247,"publisher":{"@id":"https:\/\/jassweb.com\/solved\/#organization"},"keywords":["parsing","python","web-scraping"],"articleSection":["Solved"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/","url":"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/","name":"[Solved] Parsing bot protected site - JassWeb","isPartOf":{"@id":"https:\/\/jassweb.com\/solved\/#website"},"datePublished":"2023-02-09T04:29:05+00:00","breadcrumb":{"@id":"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/"]}]},{"@type":"BreadcrumbList","@id":"https:\/\/jassweb.com\/solved\/solved-parsing-bot-protected-site\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/jassweb.com\/solved\/"},{"@type":"ListItem","position":2,"name":"[Solved] Parsing bot protected site"}]},{"@type":"WebSite","@id":"https:\/\/jassweb.com\/solved\/#website","url":"https:\/\/jassweb.com\/solved\/","name":"JassWeb","description":"Build High-quality Websites","publisher":{"@id":"https:\/\/jassweb.com\/solved\/#organization"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/jassweb.com\/solved\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Organization","@id":"https:\/\/jassweb.com\/solved\/#organization","name":"Jass Web","url":"https:\/\/jassweb.com\/solved\/","logo":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/jassweb.com\/solved\/#\/schema\/logo\/image\/","url":"https:\/\/jassweb.com\/wp-content\/uploads\/2021\/02\/jass-website-logo-1.png","contentUrl":"https:\/\/jassweb.com\/wp-content\/uploads\/2021\/02\/jass-website-logo-1.png","width":693,"height":132,"caption":"Jass Web"},"image":{"@id":"https:\/\/jassweb.com\/solved\/#\/schema\/logo\/image\/"}},{"@type":"Person","@id":"https:\/\/jassweb.com\/solved\/#\/schema\/person\/65c9c7b7958150c0dc8371fa35dd7c31","name":"Kirat","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/jassweb.com\/solved\/#\/schema\/person\/image\/","url":"https:\/\/jassweb.com\/solved\/wp-content\/litespeed\/avatar\/1261af3c9451399fa1336d28b98ea3bb.jpg?ver=1775798750","contentUrl":"https:\/\/jassweb.com\/solved\/wp-content\/litespeed\/avatar\/1261af3c9451399fa1336d28b98ea3bb.jpg?ver=1775798750","caption":"Kirat"},"sameAs":["http:\/\/jassweb.com"],"url":"https:\/\/jassweb.com\/solved\/author\/jaspritsinghghumangmail-com\/"}]}},"_links":{"self":[{"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/posts\/33515","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/comments?post=33515"}],"version-history":[{"count":0,"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/posts\/33515\/revisions"}],"wp:attachment":[{"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/media?parent=33515"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/categories?post=33515"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jassweb.com\/solved\/wp-json\/wp\/v2\/tags?post=33515"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}