[Solved] Extracting information from website [closed]
Use cURL to fetch the page, then use something like DOMDocument to pull out the exact data that you want.
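For readers working in Python rather than PHP, roughly the same fetch-then-parse pattern might look like the sketch below, with requests standing in for cURL and BeautifulSoup for DOMDocument; the URL and selector are placeholders, not from the original answer.

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page (the cURL step); example.com is a placeholder
html = requests.get("https://example.com/page").text

# Parse the DOM (the DOMDocument step)
soup = BeautifulSoup(html, "html.parser")

# Pull out exactly the data you want, e.g. every level-2 heading
for node in soup.select("h2"):
    print(node.get_text(strip=True))
```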
This data comes from an additional request to https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce=142632059, which returns JSON with the full information. Update:

```python
url_id = re.search(r'/(\d+)\.htm', response.url).group(1)
details_url = "https://www.seloger.com/detail,json,caracteristique_bien.json?idannonce={}"
# make request to url
yield Request(details_url.format(url_id))
```
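Put together as a Scrapy callback, that flow might look like the sketch below; the spider skeleton and the parse_details name are illustrative, not from the original answer.

```python
import json
import re

import scrapy
from scrapy import Request


class SeLogerSpider(scrapy.Spider):
    name = "seloger"
    # start_urls omitted: point the spider at the listing pages you crawl

    def parse(self, response):
        # Listing URLs end in .../<idannonce>.htm; pull out the id
        url_id = re.search(r"/(\d+)\.htm", response.url).group(1)
        details_url = ("https://www.seloger.com/detail,json,"
                       "caracteristique_bien.json?idannonce={}")
        yield Request(details_url.format(url_id), callback=self.parse_details)

    def parse_details(self, response):
        # The endpoint answers with JSON describing the listing
        yield json.loads(response.text)
```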
Here is the answer for someone who has no code yet:

- Use this URL: https://340bopais.hrsa.gov/reports
- Connect to this URL with 'WebClient'
- Get the page with 'HtmlPage'
- Wait until the JavaScript files have loaded and executed
- Download the result to the given path

Maybe this already-asked example code can help you.
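WebClient and HtmlPage are classes from Java's HtmlUnit. For comparison only, here is a rough Python sketch of the same idea (load the page, give the JavaScript time to run, save the rendered result) using Selenium instead; the wait time and output path are placeholders.

```python
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://340bopais.hrsa.gov/reports")
time.sleep(10)  # crude wait for the JavaScript to finish loading

# Save the rendered page to a given path
with open("reports.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)
driver.quit()
```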
Yes. Pull it from the API:

```python
import requests
import pandas as pd

url = "https://api.verivest.com/sponsors/find"
payload = {
    'page[number]': '1',
    'page[size]': '9999',
    'sort': '-capital_managed,name',
    'returns': 'compact'}

jsonData = requests.get(url, params=payload).json()
data = jsonData['data']

df = pd.json_normalize(data)
df['links'] = 'https://verivest.com/s/' + df['attributes.slug']
```

Output:

```
print(df['links'])
0       https://verivest.com/s/fairway-america
1       https://verivest.com/s/trion-properties
2       https://verivest.com/s/procida-funding-advisors
3       https://verivest.com/s/legacy-group-capital
4       https://verivest.com/s/tricap-residential-group
                  ...
1291    https://verivest.com/s/zapolski-real-estate-llc
1292    https://verivest.com/s/zaragon-inc
…
```
Indent the business dict and the append line so they are inside the for loop:

```python
for item in data:
    phone_url = "https://yellowpages.com.eg" + item["data-tooltip-phones"]
    title = item.find_previous(class_="item-title").text
    address = item.find_previous(class_="address-text").text.strip().replace('\n', '')
    phones = requests.get(phone_url).json()
    business = {
        'name': title,
        'address': address,
        'telephone': phones
    }
    my_list.append(business)
```
You can mimic what the page does for paginated results (https://www.nowtv.com/stream/all-movies/page/1) and extract the movies from the script tag of each page. Although the code below could use some refactoring, it shows how to obtain the total number of films, calculate the films per page, and issue requests to get all films using a Session …
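The answer's code was cut off, but the pagination pattern it describes might be sketched as follows; extract_films is a hypothetical stub for the site-specific script-tag parsing, and the total-films value is likewise a placeholder you would read from the first page.

```python
import math
import requests
from bs4 import BeautifulSoup

BASE = "https://www.nowtv.com/stream/all-movies/page/{}"

def extract_films(soup):
    # Placeholder: the answer pulls the film data out of a script
    # tag on each page; the real selector and JSON shape are site-specific.
    return []

with requests.Session() as session:
    first = BeautifulSoup(session.get(BASE.format(1)).text, "html.parser")
    films = extract_films(first)
    per_page = max(len(films), 1)   # films per page, taken from page 1
    total_films = 0                 # hypothetical: read the real total from the page
    pages = math.ceil(total_films / per_page)
    for page in range(2, pages + 1):
        soup = BeautifulSoup(session.get(BASE.format(page)).text, "html.parser")
        films.extend(extract_films(soup))
```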
Try this:

```python
from selenium import webdriver
import time
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from bs4.element import Tag

driver = webdriver.Chrome("C:/Users/RoshanB/Desktop/sentiment1/chromedriver_win32/chromedriver")
driver.get('http://www.careratings.com/brief-rationale.aspx')
time.sleep(4)
companyArray = []
try:
    search = driver.find_element_by_name('txtSearchCompany_brief')
    search.send_keys("Reliance Capital Limited")
    search.send_keys(Keys.RETURN)
    time.sleep(4)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    companies = soup.find("table", class_="table1")
    for tag in companies.findChildren():
        if isinstance(tag, Tag) and tag.name in 'a' …
```
Since all the links have a class in common (class="blue"), you can select all the web elements with this code and then get the "href" attribute values:

```python
elements = driver.find_elements_by_class_name('blue')
urls = [element.get_attribute('href') for element in elements]
```

I recommend this site if you want to learn more about Selenium Python: Learn to Locate Elements …
To do pagination, use an infinite while loop: check whether the next-pagination-item button has the disabled attribute and, if so, break out of the loop; otherwise click the next button. Code:

```python
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time

driver = webdriver.Chrome(executable_path="chromedriver")
driver.get("https://www.lazada.sg/products/loreal-paris-uv-perfect-even-complexion-sunscreen-spf50pa-30ml-i214861100-s325723972.html?spm=a2o42.seller.list.1.758953196tH2Mn&mp=1")
review_csv = []
product_csv = []
rating_csv = []
date_review_csv …
```
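The snippet above is cut off before the loop itself; continuing from it, a minimal sketch of the described pattern might look like this (the button selector is inferred from the answer's wording, not verified against the live page):

```python
while True:
    next_button = driver.find_element_by_css_selector("button.next-pagination-item")
    # On the last page the button carries the disabled attribute
    if next_button.get_attribute("disabled"):
        break
    next_button.click()
    time.sleep(2)  # crude wait for the next batch of reviews to render
```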
You should fix your XPath expressions. Use findElement for the first three, findElements for the last one.

To get the home odd: //td[a[.="bet365"]]/following-sibling::td[span][1]/span

To get the draw odd: //td[a[.="bet365"]]/following-sibling::td[span][2]/span

To get the away odd: //td[a[.="bet365"]]/following-sibling::td[span][3]/span

To get them all: //td[a[.="bet365"]]/following-sibling::td[span]/span

Getting them all is probably better, since you call driver.find_elements_by_xpath only once.
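For instance, grabbing all three odds with the single combined expression might look like this in Python, assuming a Selenium driver is already set up (the variable names are illustrative):

```python
# One round-trip to the driver instead of three separate lookups
odds = driver.find_elements_by_xpath(
    '//td[a[.="bet365"]]/following-sibling::td[span]/span')
home, draw, away = [el.text for el in odds[:3]]
```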
To answer my own question: I implemented the same logic with Jsoup, and benchmarking both on a fixed amount of data yielded: Selenium: 2 minutes 46 seconds; Jsoup: 16 seconds. Thus it seems that Selenium is much slower. I cannot give a technical reason why this is so. I can only …
I love this library for scraping the internets: http://jsoup.org/. I had a parser up and running in about 30 minutes, and I have only been writing Java in my spare time for 3 months.
You need to check whether titletext.a is None before you use it:

```python
for titles in title:
    titleheading = soup.findAll('h2')
    for titletext in titleheading:
        if titletext.a:
            titlename = titletext.a
            titlelink = titlename.get('href')
            print(i)
            print(titlelink)
            i += 1
```
You have to make a POST HTTP request with an appropriate JSON parameter. Once you get the response, parse the two fields objectId and nombreFichero and use them to build the right links to the PDFs. The following should work:

```python
import os
import json
import requests

url = "https://bancaonline.bankinter.com/publico/rs/documentacionPrix/list"
base = "https://bancaonline.bankinter.com/publico/DocumentacionPrixGet?doc={}&nameDoc={}"
payload = {"cod_categoria": 2, "cod_familia": 3, "divisaDestino": None, "vencimiento": None, "edadActuarial": …
```
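The snippet is truncated, but given the description, the rest of the flow would presumably continue from the (incomplete) payload above roughly like this; the response shape is an assumption, and both field names are taken from the answer's wording.

```python
# Hypothetical continuation: POST the payload, then build each PDF link
# from the objectId and nombreFichero fields of every returned item
resp = requests.post(url, json=payload)
for item in resp.json():
    object_id = item["objectId"]      # field name taken from the answer
    nombre = item["nombreFichero"]    # field name taken from the answer
    print(base.format(object_id, nombre))
```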
- Rotating proxies
- Delays
- Avoid the same pattern
- IP rate limit (probably your issue)

IP rate limit. It's a basic security system that can ban or block incoming requests from the same IP. It means that a regular user would not make 100 requests to the same domain in a few seconds with the exact same …
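A minimal sketch of the first three countermeasures in Python with requests: a small rotating proxy pool plus randomized delays, so requests do not follow a fixed pattern. The proxy addresses and URLs are placeholders.

```python
import random
import time
import requests

# Placeholder proxy pool: substitute real proxy endpoints here
PROXIES = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
]

urls = ["https://example.com/page/{}".format(i) for i in range(1, 6)]

for url in urls:
    proxy = random.choice(PROXIES)  # rotate proxies across requests
    resp = requests.get(url, proxies={"http": proxy, "https": proxy})
    print(url, resp.status_code)
    time.sleep(random.uniform(2, 6))  # randomized delay to break the pattern
```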