[Solved] Extract breadcrumb from html with regex and remove html tags


Using DOMDocument and DOMXPath, you can load the entire HTML document and query for the li elements.
Then it's simply a matter of outputting each element's nodeValue, which holds the text content with the tags stripped.

The $xpath->query() call below selects every li element that is a direct child of a ul whose class attribute contains "breadcrumb".

Example

// Single quotes are used because the markup itself contains double quotes.
$html = '
    <html>
        <body>
            <div class="container">
                <ul itemprop="breadcrumb" class="breadcrumb">
                     <li><a href="https://stackoverflow.com/">Home</a><i class="ico-breadcrumb"></i></li>
                     <li><a href="http://stackoverflow.com/inspiration/0.iroot">Inspiration</a><i class="ico-breadcrumb"></i></li>
                     <li><a href="http://stackoverflow.com/inspiration/loft/CC_npccat_100031.icat">Loft</a><i class="ico-breadcrumb"></i></li>
                     <li>First impressions count - bringing your hallway to life</li>
                </ul>
            </div>
        </body>
    </html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$categories = $xpath->query('//ul[contains(@class,"breadcrumb")]/li');

foreach($categories as $category){
    print $category->nodeValue . PHP_EOL;
}

This will output:

Home
Inspiration
Loft
First impressions count - bringing your hallway to life
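
If you need the breadcrumb as an array rather than printed lines, the same approach can be wrapped in a function. The following is only a sketch: the function name extractBreadcrumb and the short sample markup are made up for illustration and are not part of the original answer. It also swaps the substring test for the common XPath idiom that matches "breadcrumb" as a whole class token, and it suppresses libxml warnings, which real-world HTML tends to trigger.

```php
<?php
// Hypothetical helper (not from the original answer): returns the breadcrumb
// labels as an array of trimmed strings.
function extractBreadcrumb(string $html): array
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);

    // Match "breadcrumb" as a whole class token, so a ul with a class such as
    // "ico-breadcrumb" is not selected.
    $query = '//ul[contains(concat(" ", normalize-space(@class), " "), " breadcrumb ")]/li';

    $labels = [];
    foreach ($xpath->query($query) as $li) {
        // nodeValue is the text content of the li with all tags removed.
        $labels[] = trim($li->nodeValue);
    }

    return $labels;
}

// Shortened sample markup, assumed for illustration only.
$html = '<ul class="breadcrumb"><li><a href="/">Home</a><i class="ico-breadcrumb"></i></li><li>Loft</li></ul>';

print implode(' > ', extractBreadcrumb($html)) . PHP_EOL;
```

For the shortened sample above this prints "Home > Loft". Returning an array leaves it to the caller to decide how to join or store the labels.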
