[Solved] How to extract URL from HTML anchor element using Python3? [closed]


You can use built-in xml.etree.ElementTree instead:

>>> import xml.etree.ElementTree as ET
>>> url = '<a rel="nofollow" href="https://stackoverflow.com/example/hello/get/9f676bac2bb3.zip">XYZ</a>'
>>> ET.fromstring(url).attrib.get('href')
'https://stackoverflow.com/example/hello/get/9f676bac2bb3.zip'

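This works for this particular example, but xml.etree.ElementTree is an XML parser, not an HTML parser: it raises ParseError on markup that is not well-formed XML, such as an unclosed <br>. Consider using BeautifulSoup: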

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(url, 'html.parser').a.get('href')
'https://stackoverflow.com/example/hello/get/9f676bac2bb3.zip'

Or, lxml.html:

>>> import lxml.html
>>> lxml.html.fromstring(url).attrib.get('href')
'https://stackoverflow.com/example/hello/get/9f676bac2bb3.zip'

Personally, I prefer BeautifulSoup: it makes HTML parsing easy, transparent and fun.
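
If installing a third-party package is not an option, the standard library's html.parser module can also pull out href values, and it tolerates markup that is not well-formed XML. The following is only a minimal sketch: HrefCollector, extract_hrefs and the snippet string are names made up for this illustration, not part of the original answer.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html):
    """Return all anchor hrefs found in an HTML string, in document order."""
    parser = HrefCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


# Assumed sample: the anchor from the question, preceded by unclosed <p> and <br> tags.
snippet = ('<p>Unclosed paragraph<br>'
           '<a rel="nofollow" href="https://stackoverflow.com/example/hello/get/9f676bac2bb3.zip">XYZ</a>')

try:
    ET.fromstring(snippet)
except ET.ParseError as exc:
    print('ElementTree rejected the snippet:', exc)

print(extract_hrefs(snippet))
```

This trades BeautifulSoup's convenient tree navigation for zero dependencies, so it is probably only worth it for simple "collect the links" tasks.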


To follow the link and download the file, you need a full URL including the scheme and domain (urljoin() helps here; it leaves an already absolute href unchanged), which you then pass to urlretrieve(). Example:

>>> BASE_URL = 'http://example.com'
>>> from urllib.parse import urljoin
>>> from urllib.request import urlretrieve
>>> href = BeautifulSoup(url, 'html.parser').a.get('href')
>>> filename, headers = urlretrieve(urljoin(BASE_URL, href))  # saved to a temporary file; its path is in filename
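
To make the urljoin() step concrete, here is a small sketch of how relative and absolute href values resolve, and one possible way to choose a local file name instead of relying on a temporary file. The helpers local_name and download, the 'download.bin' fallback and the relative href are assumptions made for illustration; download() is defined but not called, since it needs network access.

```python
import os
import posixpath
from urllib.parse import urljoin, urlsplit
from urllib.request import urlretrieve

BASE_URL = 'http://example.com'  # same placeholder base as in the answer

# A relative href is resolved against the base URL ...
relative = urljoin(BASE_URL, '/example/hello/get/9f676bac2bb3.zip')
# ... while an already absolute href is returned unchanged.
absolute = urljoin(BASE_URL, 'https://stackoverflow.com/example/hello/get/9f676bac2bb3.zip')


def local_name(absolute_url, default='download.bin'):
    """Derive a file name from the last path segment of a URL (hypothetical helper)."""
    name = posixpath.basename(urlsplit(absolute_url).path)
    return name or default


def download(base_url, href, directory='.'):
    """Resolve href against base_url and save the target into directory."""
    absolute_url = urljoin(base_url, href)
    target = os.path.join(directory, local_name(absolute_url))
    path, _headers = urlretrieve(absolute_url, target)
    return path


print(relative)
print(absolute)
print(local_name(absolute))
```

Calling download(BASE_URL, href) would then write the file into the current directory under the name taken from the URL.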

UPD (for the different HTML posted in the comments):

>>> from bs4 import BeautifulSoup
>>> data = '<html> <head> <body><example><example2> <a rel="nofollow" href="https://stackoverflow.com/example/hello/get/9f676bac2bb3.zip">XYZ</a> </example2></example></body></head></html>'
>>> href = BeautifulSoup(data, 'html.parser').find('a', string='XYZ').get('href')
>>> href
'https://stackoverflow.com/example/hello/get/9f676bac2bb3.zip'
