I discovered that my browser and wget both send a non-empty User-Agent field in the request header, so I am assuming Goutte sets nothing here. Adding this header to the browser object prior to the fetch fixes the problem:
// Load a crawler/browser system
require_once 'vendor/goutte/goutte.phar';

use Goutte\Client;

// Here's a demo of a page we want to parse
$uri = '(removed)';

// Create the client and set up headers
$client = new Client();
$headers = array(
    'User-Agent' => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:21.0) Gecko/20100101 Firefox/21.0',
);
foreach ($headers as $header => $value)
{
    $client->setHeader($header, $value);
}

// Fetch the page and print its text content
$crawler = $client->request('GET', $uri);
echo $crawler->text() . "\n";
Here I’ve copied in my browser’s user agent string, but in this case I think anything would work, as long as it is set.
Incidentally, I used a browser UA here as I was trying to accurately replicate the browser environment for debugging this particular problem. Once it worked I switched to a custom UA, so target sites can detect it as a bot if they wish to (for this project I don’t think anyone has).
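For anyone wanting to do the same, here is a rough sketch of what a custom, bot-identifying UA could look like. The helper name `buildBotUserAgent`, the bot name and the info URL are all made up for illustration; the "Name/version (+URL)" shape is just a common convention, not something the target site requires. The snippet uses PHP's built-in HTTP stream context rather than Goutte so that it runs on its own; with Goutte the same string would go into the `setHeader('User-Agent', ...)` call shown earlier.

```php
<?php
// Hypothetical helper: builds a User-Agent string that identifies the
// script as a bot and points site owners at a page describing it.
function buildBotUserAgent($name, $version, $infoUrl)
{
    return sprintf('%s/%s (+%s)', $name, $version, $infoUrl);
}

// Placeholder values, assumed for illustration only
$userAgent = buildBotUserAgent('ExampleBot', '1.0', 'http://example.com/bot');

// Plain-PHP equivalent of setting the header on the client:
// the 'user_agent' HTTP context option sets the User-Agent request header.
$context = stream_context_create(array(
    'http' => array(
        'method'     => 'GET',
        'user_agent' => $userAgent,
    ),
));

// The actual fetch is left commented out, since it needs a real URL:
// $html = file_get_contents($uri, false, $context);
```

Including a URL in the string gives site owners a way to find out who is crawling them, which fits the aim of being detectable as a bot.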