
PHP screen scrape help

  • 01-12-2009 10:19AM
    #1
    Registered Users, Registered Users 2 Posts: 872 ✭✭✭


    Hi,

    I need to extract car images from a website to include in another site. I was following this tutorial and it's kind of working, but I am having problems with the regular expression.

    Basically I need a regex to search for li class='car', but whenever I include the class name in the regex it returns nothing.

    Below is the code I would like to scrape. It's the A tag with the IMG inside that I need; everything else can go!
    <li class="car">	                
      <a href="#"><img src="http://image" alt="BMW 3 Series"/></a>     
      <span><img src="a.gif" class="icon"/>8</span>    
    </li>
    

    My code so far

    [PHP]$url = "http://www.url.com";
    $raw = file_get_contents($url);
    $newlines = array("\t","\n","\r","\x20\x20","\0","\x0B");
    $content = str_replace($newlines, "", html_entity_decode($raw));
    $start = strpos($content,'<ol"');
    $end = strpos($content,'</ol>',$start) + 6;
    $table = substr($content,$start,$end-$start);
    preg_match_all("|<li class='cars'(.*)</li>|U",$table,$rows);
    foreach ($rows[0] as $row){
        if ((strpos($row,'<th')===false)){
            preg_match_all("|<a(.*)/>|U",$row,$cells);
            $number = strip_tags($cells[0][0]);
            echo "{Number {$number} <br>\n";
        }
    }
    [/PHP]

    Does anyone know what I am doing wrong?

    Thanks in advance


Comments

  • Registered Users, Registered Users 2 Posts: 1,505 ✭✭✭viking


    A quick look shows your preg_match_all contains class='cars', whereas the HTML you want to scrape is <li class="car">. car != cars, and you are also using different quote types (single vs. double), so the pattern can never match.

    You could always create a DOMDocument object representing the HTML, then use getElementsByTagName('li') to get all your <li> nodes and check each one for the class attribute "car". When you find one, use getElementsByTagName('img') to grab your <img> tag and getAttribute('src') to get your image URL.

    http://php.net/manual/en/class.domdocument.php
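
    The DOMDocument approach described above could look roughly like this. It is only a sketch: the helper name carImageUrls() is made up for illustration, and the sample markup is the snippet from the first post wrapped in an <ol>, which is an assumption about the real page.

```php
<?php
// Hypothetical helper: returns the src of the first <img> inside each <li class="car">.
function carImageUrls(string $html): array
{
    $doc = new DOMDocument();
    // Scraped pages are rarely valid HTML; collect parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $urls = [];
    foreach ($doc->getElementsByTagName('li') as $li) {
        // Treat class as a space-separated list so class="car featured" also matches.
        $classes = preg_split('/\s+/', trim($li->getAttribute('class')));
        if (!in_array('car', $classes, true)) {
            continue;
        }
        $imgs = $li->getElementsByTagName('img');
        if ($imgs->length > 0) {
            // The first <img> is the one inside the <a>; the icon in the <span> comes later.
            $urls[] = $imgs->item(0)->getAttribute('src');
        }
    }
    return $urls;
}

// Sample markup from the question, wrapped in an <ol> (assumed).
$html = '<ol><li class="car">
  <a href="#"><img src="http://image" alt="BMW 3 Series"/></a>
  <span><img src="a.gif" class="icon"/>8</span>
</li></ol>';

print_r(carImageUrls($html));
```

    Because the parser handles quoting and whitespace, this does not care whether the page uses single or double quotes, and there is no need to strip newlines first.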


  • Registered Users, Registered Users 2 Posts: 2,238 ✭✭✭techguy


    Are you trying to grab both images here or just one of them? You probably only need the first one, as the second one is probably a standard icon button or a small version of the first image.

    This regex will catch the first image. It only works for me when all the HTML is on one line, i.e. no line breaks. It may or may not work for you in PHP with the line breaks. -> Try googling how to include line breaks in regex.
    <li class="car">.*<img src=".*"/></a>
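
    For the line-break question, PHP's PCRE patterns accept an s modifier that makes "." match newlines as well. A possible variation on the regex above is sketched here. The helper name carImageUrlsByRegex() is made up, and the pattern also swaps the greedy .* for lazy and negated-class matches and accepts either quote style, which goes beyond the original one-liner.

```php
<?php
// Hypothetical helper: pulls the first image URL out of each <li class="car"> with a regex.
function carImageUrlsByRegex(string $html): array
{
    // s: "." also matches newlines; i: tag and attribute names match in any case.
    // ["\'] accepts single or double quotes; .*? and [^"\']+ stop at the first <img> src.
    $pattern = '~<li\s+class=["\']car["\'][^>]*>.*?<img\s+[^>]*?src=["\']([^"\']+)["\']~si';
    preg_match_all($pattern, $html, $matches);
    // $matches[1] holds the captured src values, one per matched <li>.
    return $matches[1];
}

// Sample markup from the question, line breaks included.
$html = '<li class="car">
  <a href="#"><img src="http://image" alt="BMW 3 Series"/></a>
  <span><img src="a.gif" class="icon"/>8</span>
</li>';

print_r(carImageUrlsByRegex($html));
```

    One caveat with this sketch: if a car <li> contained no image at all, the lazy .*? would run on into the next item, so it relies on every item having an image.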

    P.S. How do you test/create your expressions? If you do it in code via trial and error maybe you should take a look at RegexBuddy. Also, www.regular-expressions.info is a good site for learning Regex.

    HTH


  • Registered Users, Registered Users 2 Posts: 6,476 ✭✭✭MOH


    A bit off topic, but you'd want to be careful about scraping content off sites to use in another site - whoever owns the rights to the images mightn't be best pleased.

