Is regex or PHP's DOM better for extracting image src from a string (HTML content)?

I would like to create a page where all images which reside on my website are listed with title and alternative representation.

I already wrote a little program to find and load all HTML files, but now I am stuck at how to extract src, title and alt from this HTML:

<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />

I guess this should be done with some regex, but since the order of the attributes may vary, and I need all of them, I don't really know how to parse this in an elegant way (I could do it the hard way, char by char, but that's painful).

Replies

EDIT : now that I know better

Using regexp to solve this kind of problem is a bad idea and will likely lead to unmaintainable and unreliable code. Better use an HTML parser.

Solution With regexp

In that case it's better to split the process into two parts :

  • get all the img tags
  • extract their metadata

I will assume your doc is not strict XHTML, so you can't use an XML parser. For example, with this web page's source code:

/* preg_match_all matches the regexp against the whole $html string and outputs
every match as an array in $result. The "i" option makes it case insensitive */

preg_match_all('/<img[^>]+>/i',$html, $result); 

print_r($result);
Array
(
    [0] => Array
        (
            [0] => <img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />
            [1] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />
            [2] => <img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />
            [3] => <img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />
            [4] => <img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />

[...]
        )

)

Then we get all the img tag attributes with a loop :

$img = array();
foreach( $result[0] as $img_tag) // $result[0] holds the full <img> matches
{
    preg_match_all('/(alt|title|src)=("[^"]*")/i',$img_tag, $img[$img_tag]);
}

print_r($img);

Array
(
    [<img src="/Content/Img/stackoverflow-logo-250.png" width="250" height="70" alt="logo link to homepage" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/Content/Img/stackoverflow-logo-250.png"
                    [1] => alt="logo link to homepage"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )

            [2] => Array
                (
                    [0] => "/Content/Img/stackoverflow-logo-250.png"
                    [1] => "logo link to homepage"
                )

        )

    [<img class="vote-up" src="/content/img/vote-arrow-up.png" alt="vote up" title="This was helpful (click again to undo)" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-up.png"
                    [1] => alt="vote up"
                    [2] => title="This was helpful (click again to undo)"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )

            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-up.png"
                    [1] => "vote up"
                    [2] => "This was helpful (click again to undo)"
                )

        )

    [<img class="vote-down" src="/content/img/vote-arrow-down.png" alt="vote down" title="This was not helpful (click again to undo)" />] => Array
        (
            [0] => Array
                (
                    [0] => src="/content/img/vote-arrow-down.png"
                    [1] => alt="vote down"
                    [2] => title="This was not helpful (click again to undo)"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                    [2] => title
                )

            [2] => Array
                (
                    [0] => "/content/img/vote-arrow-down.png"
                    [1] => "vote down"
                    [2] => "This was not helpful (click again to undo)"
                )

        )

    [<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />] => Array
        (
            [0] => Array
                (
                    [0] => src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                    [1] => alt="gravatar image"
                )

            [1] => Array
                (
                    [0] => src
                    [1] => alt
                )

            [2] => Array
                (
                    [0] => "http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG"
                    [1] => "gravatar image"
                )

        )

   [..]
        )

)

Regexps are CPU intensive, so you may want to cache this page. If you have no cache system, you can roll your own by using ob_start and loading / saving the output from a text file.
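A minimal sketch of such a home-made cache, assuming the web server can write to the temp directory. The helper name render_with_cache, the cache file name and the one-hour lifetime are made up for illustration.

```php
<?php
// Hypothetical helper: returns the cached output if it is fresh enough,
// otherwise runs $render, captures what it echoes with ob_start(),
// and saves the result to $cacheFile.
function render_with_cache(string $cacheFile, int $maxAgeSeconds, callable $render): string
{
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        return file_get_contents($cacheFile);
    }

    ob_start();
    $render();                  // e.g. the preg_match_all loops go here
    $output = ob_get_clean();

    file_put_contents($cacheFile, $output, LOCK_EX);
    return $output;
}

// Usage: the expensive part only runs when the cache file is missing or stale.
$cacheFile = sys_get_temp_dir() . '/image-list-cache.html';
echo render_with_cache($cacheFile, 3600, function () {
    echo "<ul><li>/image/fluffybunny.jpg</li></ul>";
});
```

Deleting the cache file forces a refresh on the next request.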

How does this stuff work ?

First, we use preg_match_all, a function that finds every string matching the pattern and outputs the matches in its third parameter.

The regexps :

<img[^>]+>

We apply it to the full HTML of each page. It reads as: every string that starts with "<img", is followed by one or more characters other than ">", and ends with ">".

(alt|title|src)=("[^"]*")

We then apply it to each img tag in turn. It reads as: every string that starts with "alt", "title" or "src", followed by "=", then a double quote, any number of characters that are not a double quote, and a closing double quote. The parentheses capture the attribute name and the quoted value as separate sub-strings.

Finally, every time you want to deal with regexps, it is handy to have good tools to quickly test them. Check this online regexp tester.

EDIT : answer to the first comment.

It's true that I did not think about the (hopefully few) people using single quotes.

Well, if you use only ', just replace all the " by '.

If you mix both, first you should slap yourself :-), then try to use ("|') instead of " and [^ø] to replace [^"].
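Another way to cope with mixed quoting, as a sketch rather than a definitive pattern: capture the opening quote and require the same character to close the value with a backreference. The sample tag is made up, and the pattern still assumes every value is quoted.

```php
<?php
// (["']) captures the opening quote and \2 requires the same one to close,
// so a double-quoted value may contain an apostrophe and vice versa.
$pattern = '/\b(alt|title|src)\s*=\s*(["\'])(.*?)\2/is';

// Made-up sample tag mixing both quote styles
$img_tag = '<img src=\'/image/fluffybunny.jpg\' title="Harvey\'s bunny" alt="a cute bunny" />';

preg_match_all($pattern, $img_tag, $matches, PREG_SET_ORDER);

$attributes = array();
foreach ($matches as $match) {
    // $match[1] is the attribute name, $match[3] the value without its quotes
    $attributes[strtolower($match[1])] = $match[3];
}

print_r($attributes);
```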

$url="http://example.com";

$html = file_get_contents($url);

$doc = new DOMDocument();
@$doc->loadHTML($html);

$tags = $doc->getElementsByTagName('img');

foreach ($tags as $tag) {
       echo $tag->getAttribute('src');
}

Just to give a small example of using PHP's XML functionality for the task:

$doc=new DOMDocument();
$doc->loadHTML("<html><body>Test<br><img src=\"myimage.jpg\" title=\"title\" alt=\"alt\"></body></html>");
$xml=simplexml_import_dom($doc); // just to make xpath more simple
$images=$xml->xpath('//img');
foreach ($images as $img) {
    echo $img['src'] . ' ' . $img['alt'] . ' ' . $img['title'];
}

I used the DOMDocument::loadHTML() method because it can cope with HTML syntax and does not force the input document to be XHTML. Strictly speaking, the conversion to a SimpleXMLElement is not necessary; it just makes using xpath and handling the xpath results simpler.
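For comparison, here is a sketch that stays with DOMDocument only and collects all three attributes the question asks for. The function name list_images is hypothetical. It uses libxml_use_internal_errors() so that warnings about sloppy markup are collected quietly instead of being printed.

```php
<?php
// Hypothetical helper: returns src, title and alt for every <img> in $html.
function list_images(string $html): array
{
    // Collect parser warnings internally instead of emitting them
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        $images[] = array(
            'src'   => $img->getAttribute('src'),
            'title' => $img->getAttribute('title'), // empty string when missing
            'alt'   => $img->getAttribute('alt'),
        );
    }
    return $images;
}

$html = '<p>Test<br><img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"></p>';
print_r(list_images($html));
```

Because the parser handles the attributes, their order and quoting style in the markup do not matter.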

Use xpath.

For php you can use simplexml or domxml

see also this question

If it's XHTML, as your example is, you only need SimpleXML.

<?php
$input = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny"/>';
$sx = simplexml_load_string($input);
var_dump($sx);
?>

Output:

object(SimpleXMLElement)#1 (1) {
  ["@attributes"]=>
  array(3) {
    ["src"]=>
    string(22) "/image/fluffybunny.jpg"
    ["title"]=>
    string(16) "Harvey the bunny"
    ["alt"]=>
    string(26) "a cute little fluffy bunny"
  }
}

$url="your url";

$page = file_get_contents($url);

$newDom = new DOMDocument();
@$newDom->loadHTML($page);

$tag = $newDom->getElementsByTagName('img');

foreach ($tag as $tag1) {
       echo $tag1->getAttribute('src');
}

The loop in the regexp solution must iterate over $result[0]:

foreach( $result[0] as $img_tag)

because preg_match_all returns an array of arrays, and index 0 holds the full matches.

RE this solution:

    $url="http://example.com";

    $html = file_get_contents($url);

    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    $tags = $doc->getElementsByTagName('img');

    foreach ($tags as $tag) {
            echo $tag->getAttribute('src');
    }

How do you get the tag and attribute from multiple files/urls?

Doing this didn't work for me:

    foreach (glob("path/to/files/*.html") as $html) {

      $doc = new DOMDocument();
      $doc->loadHTML($html);

      $tags = $doc->getElementsByTagName('img');

      foreach ($tags as $tag) {
         echo $tag->getAttribute('src');
      }
    }
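A likely cause: glob() returns file names, while loadHTML() expects markup, so the file name itself gets parsed as HTML. Below is a sketch that loads each file with loadHTMLFile() instead. The helper name image_sources_in_files is made up, and the path is the placeholder from the snippet above.

```php
<?php
// Hypothetical helper: maps each file name to the src values found in it.
function image_sources_in_files(array $files): array
{
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $sources = array();

    foreach ($files as $file) {
        $doc = new DOMDocument();
        // loadHTMLFile() reads the file; loadHTML() would treat the name as markup
        if (!$doc->loadHTMLFile($file)) {
            continue; // skip files that cannot be loaded
        }
        $sources[$file] = array();
        foreach ($doc->getElementsByTagName('img') as $tag) {
            $sources[$file][] = $tag->getAttribute('src');
        }
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $sources;
}

// glob() can return false on error, hence the fallback to an empty array
print_r(image_sources_in_files(glob("path/to/files/*.html") ?: array()));
```

Reading each file with file_get_contents() and passing the result to loadHTML() should work as well.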

You may use simplehtmldom. Most of the jQuery selectors are supported in simplehtmldom. An example is given below

// Create DOM from URL or file
$html = file_get_html('http://www.google.com/');

// Find all images
foreach($html->find('img') as $element)
       echo $element->src . '<br>';

// Find all links
foreach($html->find('a') as $element)
       echo $element->href . '<br>';

Here's a PHP function I cobbled together from all of the above info for a similar purpose, namely adjusting image tag width and height attributes on the fly. It is a bit clunky, perhaps, but seems to work dependably:

function ReSizeImagesInHTML($HTMLContent,$MaximumWidth,$MaximumHeight) {

// find image tags
preg_match_all('/<img[^>]+>/i',$HTMLContent, $rawimagearray,PREG_SET_ORDER); 

// put image tags in a simpler array
$imagearray = array();
for ($i = 0; $i < count($rawimagearray); $i++) {
    array_push($imagearray, $rawimagearray[$i][0]);
}

// put image attributes in another array
$imageinfo = array();
foreach($imagearray as $img_tag) {

    preg_match_all('/(src|width|height)=("[^"]*")/i',$img_tag, $imageinfo[$img_tag]);
}

// combine everything into one array
$AllImageInfo = array();
foreach($imagearray as $img_tag) {

    $ImageSource = str_replace('"', '', $imageinfo[$img_tag][2][0]);
    $OrignialWidth = str_replace('"', '', $imageinfo[$img_tag][2][1]);
    $OrignialHeight = str_replace('"', '', $imageinfo[$img_tag][2][2]);

    $NewWidth = $OrignialWidth;
    $NewHeight = $OrignialHeight;
    $AdjustDimensions = "F";

    if($OrignialWidth > $MaximumWidth) {
        // scale down proportionally so the width fits $MaximumWidth
        $diff = $OrignialWidth-$MaximumWidth;
        $percnt_reduced = (($diff/$OrignialWidth)*100);
        $NewHeight = floor($OrignialHeight-(($percnt_reduced*$OrignialHeight)/100));
        $NewWidth = floor($OrignialWidth-$diff);
        $AdjustDimensions = "T";
    }

    if($OrignialHeight > $MaximumHeight) {
        // scale down proportionally so the height fits $MaximumHeight
        $diff = $OrignialHeight-$MaximumHeight;
        $percnt_reduced = (($diff/$OrignialHeight)*100);
        $NewWidth = floor($OrignialWidth-(($percnt_reduced*$OrignialWidth)/100));
        $NewHeight= floor($OrignialHeight-$diff);
        $AdjustDimensions = "T";
    } 

    $thisImageInfo = array('OriginalImageTag' => $img_tag , 'ImageSource' => $ImageSource , 'OrignialWidth' => $OrignialWidth , 'OrignialHeight' => $OrignialHeight , 'NewWidth' => $NewWidth , 'NewHeight' => $NewHeight, 'AdjustDimensions' => $AdjustDimensions);
    array_push($AllImageInfo, $thisImageInfo);
}

// build array of before and after tags
$ImageBeforeAndAfter = array();
for ($i = 0; $i < count($AllImageInfo); $i++) {

    if($AllImageInfo[$i]['AdjustDimensions'] == "T") {
        $NewImageTag = str_ireplace('width="' . $AllImageInfo[$i]['OrignialWidth'] . '"', 'width="' . $AllImageInfo[$i]['NewWidth'] . '"', $AllImageInfo[$i]['OriginalImageTag']);
        $NewImageTag = str_ireplace('height="' . $AllImageInfo[$i]['OrignialHeight'] . '"', 'height="' . $AllImageInfo[$i]['NewHeight'] . '"', $NewImageTag);

        $thisImageBeforeAndAfter = array('OriginalImageTag' => $AllImageInfo[$i]['OriginalImageTag'] , 'NewImageTag' => $NewImageTag);
        array_push($ImageBeforeAndAfter, $thisImageBeforeAndAfter);
    }
}

// execute search and replace
for ($i = 0; $i < count($ImageBeforeAndAfter); $i++) {
    $HTMLContent = str_ireplace($ImageBeforeAndAfter[$i]['OriginalImageTag'],$ImageBeforeAndAfter[$i]['NewImageTag'], $HTMLContent);
}

return $HTMLContent;

}

I used preg_match to do it.

In my case, I had a string containing exactly one <img> tag (and no other markup) that I got from Wordpress and I was trying to get the src attribute so I could run it through timthumb.

// get the featured image
$image = get_the_post_thumbnail($photos[$i]->ID);

// get the src for that image
$pattern = '/src="([^"]*)"/';
preg_match($pattern, $image, $matches);
$src = $matches[1];
unset($matches);

To grab the title or the alt instead, you could simply use $pattern = '/title="([^"]*)"/'; for the title or $pattern = '/alt="([^"]*)"/'; for the alt. Sadly, my regex isn't good enough to grab all three (alt/title/src) in one pass.
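One way to get all three in a single pass, sketched under the assumption that the string holds exactly one img tag with double-quoted values. The sample tag below is made up and stands in for the output of get_the_post_thumbnail().

```php
<?php
// Made-up stand-in for the single <img> tag returned by WordPress
$image = '<img width="150" src="/uploads/bunny.jpg" class="thumb" alt="a bunny" title="Harvey" />';

// One pass: group 1 collects the attribute names, group 2 the matching values
preg_match_all('/\b(src|title|alt)="([^"]*)"/i', $image, $matches);

// Pair names with values, e.g. $attributes['src']
$attributes = array_combine($matches[1], $matches[2]);

print_r($attributes);
```

Attributes missing from the tag are simply absent from $attributes, so check with isset() before using one.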

Here is THE solution, in PHP:

Just download QueryPath, and then do as follows:

$doc= qp($myHtmlDoc);

foreach($doc->xpath('//img') as $img) {

   $src= $img->attr('src');
   $title= $img->attr('title');
   $alt= $img->attr('alt');

}

That's it, you're done !

You can also try SimpleXML if the HTML is guaranteed to be XHTML - it will parse the markup for you and you will be able to access the attributes just by their name. (There are DOM libraries as well if it's just HTML and you can't depend on the XML syntax.)

You can write a regexp to get all img tags (<img[^>]*>), and then use simple explode: $res = explode("\"", $tags), the output will be something like this:

$res[0] = "<img src=";
$res[1] = "/image/fluffybunny.jpg";
$res[2] = "title=";
$res[3] = "Harvey the bunny";
$res[4] = "alt=";
$res[5] = "a cute little fluffy bunny";
$res[6] = "/>";

If you delete the <img tag before the explode, then you will get an array in the form of

property=
value

so the order of the properties is irrelevant; you only use the ones you need.
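The explode idea can be sketched as follows, assuming every attribute value is double-quoted and contains no double quote itself. The variable names are illustrative only.

```php
<?php
$tag = '<img src="/image/fluffybunny.jpg" title="Harvey the bunny" alt="a cute little fluffy bunny" />';

// Drop the leading "<img" so the pieces alternate between "name=" and value
$res = explode('"', substr($tag, strlen('<img')));

$properties = array();
for ($i = 0; $i + 1 < count($res); $i += 2) {
    // " src=" becomes "src" once spaces and the equals sign are trimmed
    $name = trim($res[$i], " =");
    $properties[$name] = $res[$i + 1];
}

print_r($properties);
```

An unquoted attribute such as width=32 would shift the pairs, so this only suits markup whose quoting you control.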

The code below worked for me in WordPress.

It extracts all the image sources from the code:

$search = "any html code with image tags";
$xml = '';

// PREG_SET_ORDER yields one entry per match; $match[1] holds the src value
preg_match_all( '/src="([^"]*)"/', $search, $matches, PREG_SET_ORDER);

foreach ($matches as $match)
{
    $image = parse_url($match[1], PHP_URL_PATH);
    $xml .= " <image:image>\n";
    $xml .= " <image:loc>".home_url().$image."</image:loc>\n";
    // $title is assumed to be defined elsewhere, e.g. the post title
    $xml .= " <image:caption>".htmlentities($title)."</image:caption>\n";
    $xml .= " <image:license>".home_url()."</image:license>\n";
    $xml .= " </image:image>\n";
}

cheers!

$content =  "<img src='http://google.com/2af5e6ae749d523216f296193ab0b146.jpg' width='40' height='40'>";
$image   =  preg_match_all('~<img rel="imgbot" remote="(.*?)" width="(.*?)" height="(.*?)" linktext="(.*?)" linkhref="(.*?)" src="(.*?)" />~is', $content, $matches);

If you want to use regEx why not as easy as this:

preg_match_all('% (.*)=\"(.*)\"%Uis', $code, $matches, PREG_SET_ORDER);

This will return something like:

array(2) {
    [0]=>
    array(3) {
        [0]=>
        string(10) " src="abc""
        [1]=>
        string(3) "src"
        [2]=>
        string(3) "abc"
    }
    [1]=>
    array(3) {
        [0]=>
        string(10) " bla="123""
        [1]=>
        string(3) "bla"
        [2]=>
        string(3) "123"
    }
}

Here is my solution for retrieving only the images from the content of any post in WordPress, or from any HTML content.

$content = get_the_content();
$image = array();
$offset = 0;
// walk through the content, one <img ...> tag at a time
while (($imgBeg = strpos($content, '<img', $offset)) !== false) {
  $imgEnd = strpos($content, '>', $imgBeg);
  if ($imgEnd === false) {
    break; // unterminated tag
  }
  $postOutput = substr($content, $imgBeg, $imgEnd - $imgBeg + 1);
  // drop hard-coded dimensions
  $postOutput = preg_replace('/width="([0-9]*)" height="([0-9]*)"/', '', $postOutput);
  $image[] = $postOutput;
  $offset = $imgEnd + 1;
}
print_r($image);

"]+>]+>/)?>"

this will extract anchor tag nested with image tag

How about using a regular expression to find the img tags (something like "<img[^>]*>"), and then, for each img tag, you could use another regular expression to find each attribute.

Maybe something like " ([a-zA-Z]+)=\"([^"]*)\"" to find the attributes, though you might want to allow for quotes not being there if you're dealing with tag soup... If you went with that, you could get the parameter name and value from the groups within each match.
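A sketch of that second step which also tolerates unquoted values, such as the height=32 in the gravatar tag shown earlier. The pattern is an illustration, not a complete HTML attribute grammar; it relies on PCRE's branch reset group (?|...) so that the value always lands in group 2.

```php
<?php
// Value is either "double-quoted", 'single-quoted', or a bare run of
// characters up to the next space, quote or ">".
$pattern = '/\s([a-zA-Z-]+)\s*=\s*(?|"([^"]*)"|\'([^\']*)\'|([^\s"\'>]+))/';

$img_tag = '<img src="http://www.gravatar.com/avatar/df299babc56f0a79678e567e87a09c31?s=32&d=identicon&r=PG" height=32 width=32 alt="gravatar image" />';

preg_match_all($pattern, $img_tag, $matches, PREG_SET_ORDER);

$attributes = array();
foreach ($matches as $m) {
    $attributes[strtolower($m[1])] = $m[2]; // name => value, quotes removed
}

print_r($attributes);
```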

Maybe this will give you the right answers :

<img.*?(?:(?:\s+(src)="([^"]+)")|(?:\s+(alt)="([^"]+)")|(?:\s+(title)="([^"]+)")|(?:\s+[^\s]+))+.*/>

Category: php Time: 2008-09-26 Views: 1
