
DOMDocument::saveHTML

Creates an HTML document from the DOM representation. This function is usually called after building a new DOM document from scratch, as in the example below.

Parameters

node: Optional parameter to output a subset of the document.

Return Values

Returns the HTML, or false if an error occurred.

Examples

Example #1 Saving the HTML tree into a string

$doc = new DOMDocument('1.0');

$root = $doc->createElement('html');
$root = $doc->appendChild($root);

$head = $doc->createElement('head');
$head = $root->appendChild($head);

$title = $doc->createElement('title');
$title = $head->appendChild($title);

$text = $doc->createTextNode('This is the title');
$text = $title->appendChild($text);

echo $doc->saveHTML();

See Also

  • DOMDocument::saveHTMLFile() - Dumps the internal document into a file using HTML formatting
  • DOMDocument::loadHTML() - Load HTML from a string
  • DOMDocument::loadHTMLFile() - Load HTML from a file

User Contributed Notes 18 notes

As of PHP 5.4 and Libxml 2.6, there is a simpler approach.

When you load HTML like this:

$html->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

the output will contain no doctype, html, or body tags.
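A minimal sketch of the approach described above. The fragment in $content is a made-up example, chosen with a single root element so that libxml does not need to repair it:

```php
<?php
// Hypothetical fragment; it has a single root element (<p>).
$content = '<p>Hello <b>world</b></p>';

$doc = new DOMDocument();
// LIBXML_HTML_NOIMPLIED: do not add implied html/body elements.
// LIBXML_HTML_NODEFDTD: do not add a default doctype.
$doc->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// saveHTML() appends a trailing newline, so trim it for display.
$out = trim($doc->saveHTML());
echo $out, "\n";
```

Both constants require PHP 5.4 or later built against a sufficiently recent libxml, as the note states.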

When saving an HTML fragment loaded with the LIBXML_HTML_NOIMPLIED option, the result ends up "broken", because libxml requires a root element. libxml attempts to fix the fragment by adding a closing tag at the end of the string, based on the first opened tag it encounters in the fragment.

For example:

<h1>Foo</h1><p>bar</p>

will end up as:

<h1>Foo<p>bar</p></h1>

The easiest workaround is to add a root tag yourself and strip it later:

$html->loadHTML('<html>' . $content . '</html>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$content = str_replace(array('<html>', '</html>'), '', $html->saveHTML());

This method, as of 5.2.6, will automatically add html and body tags to the document if they are missing, without asking whether you want them. In my application, I needed to use the DOM methods to manipulate just a fragment of HTML, so these tags were rather unhelpful.

Here’s a simple hack to remove them in case, like me, all you wanted to do was perform a few operations on an HTML fragment.
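One way such a hack could look, as a sketch only: it strips the doctype and the html/body wrappers from the saveHTML() output with plain string functions. The fragment is a made-up example, and the approach assumes the wrapper tags carry no attributes:

```php
<?php
// Hypothetical fragment to manipulate.
$fragment = '<p>Hello <em>world</em></p>';

$doc = new DOMDocument();
$doc->loadHTML($fragment);

// saveHTML() returns the fragment wrapped in a doctype plus html and body tags.
$full = $doc->saveHTML();

// Drop the doctype line, then the wrapper tags that were added on load.
$stripped = preg_replace('/^<!DOCTYPE.+?>/', '', $full);
$stripped = str_replace(['<html>', '</html>', '<body>', '</body>'], '', $stripped);
$stripped = trim($stripped);

echo $stripped, "\n";
```

Because this is string replacement rather than DOM manipulation, it would also remove any literal occurrence of those tags inside the fragment itself.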

I am using this solution to prevent html tags and the doctype from being added to the HTML string automatically:

<?php
$html = '<h1>Hello world!</h1>';
$html = '<div>' . $html . '</div>';
$doc = new DOMDocument;
$doc->loadHTML($html);
// substr() strips the 5-character "<div>" prefix and the 6-character "</div>" suffix
echo substr($doc->saveXML($doc->getElementsByTagName('div')->item(0)), 5, -6);

// Outputs: "<h1>Hello world!</h1>"
?>

Since PHP 5.3.6, DOMDocument::saveHTML() accepts an optional DOMNode parameter, similarly to DOMDocument::saveXML().
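A short sketch of passing a node to saveHTML(), assuming PHP 5.3.6 or later. The markup and the id value are made up for illustration:

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<div id="main"><p>Hello world!</p></div>');

// Pick the node to serialize; only this subtree is output.
$div = $doc->getElementsByTagName('div')->item(0);
$out = $doc->saveHTML($div);

echo $out, "\n";
```

Serializing a single node this way yields just that element and its children, without the doctype or the html and body wrappers.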

If you load HTML from a string ensure the charset is set.

Otherwise the charset will be ISO-8859-1!

Tested in PHP 5.2.9-2 and PHP 5.2.17.
saveHTML() ignores the DOMDocument->encoding property. It saves the HTML document in the encoding specified in the meta tag of the source (loaded) HTML document.
Example:
file.html, whose meta tag declares charset=windows-1251 (the file's actual encoding must match the one declared in the meta tag), contains the text:

Русский язык

<?php
error_reporting(-1);
$document = new DOMDocument('1.0', 'UTF-8');
$document->preserveWhiteSpace = false;
$document->loadHTMLFile('file.html');
$document->formatOutput = true;
$document->encoding = 'UTF-8'; // ignored by saveHTML()
$htm = $document->saveHTML();
echo "Bytes written: " . file_put_contents('file_new.html', $htm);
?>
file_new.html will be encoded in Windows-1251 (not in UTF-8).

Together, saveHTML() and file_put_contents() let you work around a shortcoming of saveHTMLFile().
See my comment on the saveHTMLFile() method:
http://php.net/manual/ru/domdocument.savehtmlfile.php

To solve the script tag problem, just add an empty text node to the script node and DOMDocument will render it correctly.

To avoid script tags being output as <script />, you can use a DOMDocumentFragment:

$doc = new DOMDocument();
$doc->loadXML($xmlstring);
$fragment = $doc->createDocumentFragment();
// Placeholder markup for the script element; substitute your own.
$scriptXml = '<script type="text/javascript" src="script.js"></script>';
/* Append the script element to the fragment as a raw XML string and, if successful, insert it into the DOM tree. */
if ($fragment->appendXML($scriptXml)) {
    $xpath = new DOMXPath($doc);
    /* Namespace-safe way to find every head element that is a child of the html element; should return only one match. */
    $resultlist = $xpath->query("//*[local-name() = 'html']/*[local-name() = 'head']");
    foreach ($resultlist as $headnode) { // insert the script tag
        $headnode->appendChild($fragment);
    }
}
$doc->saveXML(); /* and our script tags will still be <script></script> */

If you want a simpler way to get around the <script /> problem, try:

$script = $doc->createElement('script');
// Creating an empty text node forces <script></script>
$script->appendChild($doc->createTextNode(''));
$head->appendChild($script);

If you created your DOMDocument object using loadHTML() (where the source is from another site) and want to pass your changes back to the browser, make sure the HTTP Content-Type header matches the value of your meta content-type tag, because modern browsers seem to ignore the meta tag and trust only the HTTP header. For example, if you're reading an ISO-8859-1 document and your web server claims UTF-8, you need to correct it using the header() function.

<?php
header('Content-Type: text/html; charset=iso-8859-1');
?>


DOMDocument::loadHTML

The function parses the HTML contained in the string source. Unlike loading XML, HTML does not have to be well-formed to load. Prior to PHP 8.0.0, this function could also be called statically to load and create a DOMDocument object; the static invocation was intended for cases where no DOMDocument properties need to be set prior to loading.

Parameters

source: The HTML string.

options: Since Libxml 2.6.0, you may also use the options parameter to specify additional Libxml parameters.

Return Values

Returns true on success or false on failure. If called statically, returns a DOMDocument or false on failure.

Errors/Exceptions

If an empty string is passed as the source, a warning will be generated. This warning is not generated by libxml and cannot be handled using libxml's error handling functions.

Prior to PHP 8.0.0 this method could be called statically, but would issue an E_DEPRECATED error. As of PHP 8.0.0, calling this method statically throws an Error exception.

While malformed HTML should load successfully, this function may generate E_WARNING errors when it encounters bad markup. libxml’s error handling functions may be used to handle these errors.

Examples

Example #1 Creating a Document
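A minimal sketch of creating a document with loadHTML(), assuming a small inline HTML string (the markup is illustrative):

```php
<?php
$doc = new DOMDocument();

// loadHTML() returns true on success and false on failure.
$ok = $doc->loadHTML('<html><body>Test<br></body></html>');

// Dump the parsed document back out as an HTML string.
$out = $doc->saveHTML();
echo $out;
```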

See Also

  • DOMDocument::loadHTMLFile() — Load HTML from a file
  • DOMDocument::saveHTML() — Dumps the internal document into a string using HTML formatting
  • DOMDocument::saveHTMLFile() — Dumps the internal document into a file using HTML formatting

User Contributed Notes 19 notes

You can also load HTML as UTF-8 using this simple hack:

$doc = new DOMDocument();
$doc->loadHTML('<?xml encoding="UTF-8">' . $html);

// dirty fix
foreach ($doc->childNodes as $item) {
    if ($item->nodeType == XML_PI_NODE) {
        $doc->removeChild($item); // remove hack
    }
}
$doc->encoding = 'UTF-8'; // insert proper

DOMDocument is very good at dealing with imperfect markup, but it throws warnings all over the place when it does.

This isn't well documented here. The solution is to implement a separate apparatus for dealing with just these errors.

Set libxml_use_internal_errors(true) before calling loadHTML. This will prevent errors from bubbling up to your default error handler. And you can then get at them (if you desire) using other libxml error functions.
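A sketch of that pattern, using the libxml error functions libxml_get_errors() and libxml_clear_errors(). The malformed markup is a made-up example, and which messages are reported (if any) depends on the libxml version:

```php
<?php
// Collect libxml errors internally instead of emitting PHP warnings.
// The call returns the previous setting so it can be restored afterwards.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
// Hypothetical malformed input: an unclosed <p> and a stray </span>.
$doc->loadHTML('<div><p>Unclosed paragraph</div></span>');

// Each entry is a LibXMLError object with message, line, and other fields.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo trim($error->message), ' (line ', $error->line, ")\n";
}

// Empty the error buffer and restore the previous error handling mode.
libxml_clear_errors();
libxml_use_internal_errors($previous);
```

Clearing the buffer matters in long-running scripts, because collected errors otherwise accumulate across parses.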

When using loadHTML() to process UTF-8 pages, you may find that the output of the DOM functions does not match the input. For example, where you expect "Cạnh tranh", you will receive mojibake such as "Cáº¡nh tranh". I suggest using mb_convert_encoding() before loading a UTF-8 page (note that the 'HTML-ENTITIES' target encoding is deprecated as of PHP 8.2):
$pageDom = new DOMDocument();
$searchPage = mb_convert_encoding($htmlUTF8Page, 'HTML-ENTITIES', 'UTF-8');
@$pageDom->loadHTML($searchPage);
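On PHP 8.2 and later, where converting to 'HTML-ENTITIES' with mb_convert_encoding() is deprecated, one possible substitute is mb_encode_numericentity(). This is a different function from the one the note uses, shown here only as a sketch; it assumes the mbstring extension is loaded and that the input string is valid UTF-8:

```php
<?php
// Hypothetical UTF-8 page content.
$htmlUTF8Page = '<html><body><p>Cạnh tranh</p></body></html>';

// Convert every code point from U+0080 upward to a numeric character
// reference, leaving ASCII (including the markup itself) untouched.
$convmap = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($htmlUTF8Page, $convmap, 'UTF-8');

$pageDom = new DOMDocument();
$pageDom->loadHTML($encoded);

// The parser decodes the references, so DOM text comes back as UTF-8.
$text = $pageDom->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```

Since the encoded string is pure ASCII, the parser's default charset assumption no longer affects the result.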

Pay attention when loading HTML that uses a charset other than ISO-8859-1. Since this method does not actively try to figure out what encoding the HTML you are loading is in (as most browsers do), you have to specify it in the HTML head. If, for instance, your HTML is in UTF-8, make sure you have a meta tag in the head section:

<meta http-equiv="content-type" content="text/html; charset=utf-8">

If you do not specify the charset like this, all high-ASCII bytes will be HTML-encoded. It is not enough to set the encoding of the DOMDocument you are loading the HTML into to UTF-8.
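A small sketch of declaring the charset in the markup before loading. The page content is a made-up example, and the source is assumed to be stored as UTF-8:

```php
<?php
// Hypothetical UTF-8 document that declares its charset in a meta tag.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
      . '</head><body><p>Привет, мир</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// With the charset declared, the text is decoded as UTF-8 rather than
// being treated as ISO-8859-1 bytes.
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
echo $text, "\n";
```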

Warning: This does not function well with HTML5 elements such as SVG. Most of the advice on the Web is to turn off errors in order to have it work with HTML5.

If we load HTML5 tags such as <section>, we get the following error:

DOMDocument::loadHTML(): Tag section invalid in Entity

We can disable standard libxml errors (and enable user error handling) using libxml_use_internal_errors(true); before loadHTML();

This is quite useful in phpunit custom assertions as given in following example (if using phpunit test cases):

// Create a DOMDocument
$dom = new DOMDocument();

// fix html5/svg errors
libxml_use_internal_errors(true);

// Load html
$dom->loadHTML('<section></section>');
$htmlNodes = $dom->getElementsByTagName('section');

if ($htmlNodes->length == 0) {
    $this->assertFalse(TRUE);
} else {
    $this->assertTrue(TRUE);
}

Remember: If you use an HTML5 doctype and a meta element like so

<meta charset="utf-8">

your HTML code will be interpreted as ISO-8859-something and non-ASCII characters will be converted into HTML entities. However, the HTML4-like version will work (as was pointed out 10 years ago by "bigtree at 29a"):

<meta http-equiv="content-type" content="text/html; charset=utf-8">

It should be noted that when any text is provided within the body tag outside of a containing element, DOMDocument will wrap that text in a paragraph tag (<p>).

For those of you who want to get elements by class from an external URL, I have two useful functions. In this example we get the elements with class 'r' (search result headers) back from a Google search:

1. Check the URL (whether it exists and is reachable)
<?php
# URL Check
function url_check($url) {
    $headers = @get_headers($url);
    // preg_match() returns 1 only when the status line reports a 2xx response
    return is_array($headers) ? preg_match('/^HTTP\/\d+\.\d+\s+2\d\d\s+.*$/', $headers[0]) : false;
}
?>

2. Clean the element you want to get (remove all tags, tabs, newlines, etc.)
<?php
# Function to clean a string
function clean($text) {
    // strip tags, collapse whitespace runs to one space, replace ';' with '-', trim, decode entities
    $clean = html_entity_decode(trim(str_replace(';', '-', preg_replace('/\s+/S', ' ', strip_tags($text)))));
    return $clean;
}
?>

After doing that, we can output the search result headers as follows:
<?php
$searchstring = 'djceejay';
$url = 'http://www.google.de/webhp#q=' . $searchstring;
if (url_check($url)) {
    $doc = new DOMDocument;
    $doc->validateOnParse = true;
    $doc->loadHTML(file_get_contents($url));
    // DOMDocument has no getElementByClass() method; query by class with XPath instead
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query("//*[contains(concat(' ', normalize-space(@class), ' '), ' r ')]") as $node) {
        echo clean($node->textContent) . '<br>';
    }
} else {
    echo 'URL not reachable!'; // shown when the URL cannot be fetched
}
?>

Be aware that this function doesn’t actually understand HTML — it fixes tag-soup input using the general rules of SGML, so it creates well-formed markup, but has no idea which element contexts are allowed.

For example, with input like this, where the first element isn't closed:

<span>hello<div>world</div>

loadHTML will change it to this, which is well-formed but invalid:

<span>hello<div>world</div></span>
