Ilhan Yumer edited this page May 16, 2022 · 6 revisions

html5-php is typically used to parse HTML5 into a DOM or to serialize a DOM back into HTML5.

Instantiating

To create a new HTML5 parser, load the Composer autoloader and instantiate the class:

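// Load the Composer autoloader.
require "vendor/autoload.php";

use Masterminds\HTML5;

// $options is an optional array of default settings, for example
// array('encode_entities' => TRUE). An empty array keeps the defaults.
$options = array();
$html5 = new HTML5($options);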

Parsing

html5-php can parse three kinds of input: HTML5 strings, HTML5 files, and HTML5 fragments.

Parsing html5 strings

// An example HTML document:
$html = <<< 'HERE'
  <html>
  <head>
    <title>TEST</title>
  </head>
  <body id='foo'>
    <h1>Hello World</h1>
    <p>This is a test of the HTML5 parser.</p>
  </body>
  </html>
HERE;

// Parse the document. $dom is a DOMDocument.
$dom = $html5->loadHTML($html);

This is the same DOMDocument class that PHP's built-in libxml-based tools return when parsing HTML4, XML, and XHTML.
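Because the result is an ordinary DOMDocument, the usual DOM calls apply to it. The sketch below is only an illustration: `headingTexts` is a hypothetical helper, not part of html5-php, and the stand-in document is built with PHP's built-in `DOMDocument::loadXML` so the snippet runs without the library. The assumption is that a document returned by `$html5->loadHTML($html)` can be passed to the helper in the same way.

```php
<?php
// Hypothetical helper (not part of html5-php): collect the text of every
// <h1> element in a DOMDocument using only standard DOM calls.
function headingTexts(DOMDocument $dom): array
{
    $texts = [];
    foreach ($dom->getElementsByTagName('h1') as $h1) {
        $texts[] = trim($h1->textContent);
    }
    return $texts;
}

// Stand-in document built with the DOM extension so the example is
// self-contained. Assumption: a DOMDocument returned by
// $html5->loadHTML($html) can be passed in exactly the same way.
$dom = new DOMDocument();
$dom->loadXML(
    '<html xmlns="http://www.w3.org/1999/xhtml"><body id="foo">'
    . '<h1>Hello World</h1><p>This is a test.</p></body></html>'
);

$headings = headingTexts($dom);
echo implode("\n", $headings), "\n"; // prints "Hello World"
```

The helper takes any DOMDocument, so code written against documents from the built-in parsers should not need to change when the document comes from html5-php instead.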

Parsing html5 files

A file or resource can be parsed directly, without first loading the markup into a string.

// Parse the document. $dom is a DOMDocument.
$dom = $html5->loadHTMLFile('path/to/file.html');

Parsing html5 fragments

// An example HTML fragment:
$fragment = "<p>This is a test of the HTML5 parser.</p>";

// Parse the fragment. $dom is a DOMDocumentFragment.
$dom = $html5->loadHTMLFragment($fragment);

A DOMDocumentFragment is similar to a DOMDocument in that it is a container for elements. A fragment can be attached to a DOMDocument; when that happens, all of its children are moved into the document.
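The "children are moved" behaviour can be seen with PHP's built-in DOM extension alone. This is a minimal sketch that builds the fragment by hand rather than with html5-php; the assumption is that a fragment returned by `$html5->loadHTMLFragment()` behaves the same way once it belongs to the target document.

```php
<?php
// Uses only PHP's built-in DOM extension, no html5-php required.
$doc = new DOMDocument();
$body = $doc->appendChild($doc->createElement('body'));

// The fragment is created by, and therefore owned by, $doc. A fragment
// owned by a different document would first need DOMDocument::importNode().
$fragment = $doc->createDocumentFragment();
$fragment->appendChild($doc->createElement('p', 'First'));
$fragment->appendChild($doc->createElement('p', 'Second'));

$before = $fragment->childNodes->length; // 2 children in the fragment

// Appending the fragment moves its children into <body>. The fragment
// node itself is not inserted and is left empty.
$body->appendChild($fragment);

$after  = $fragment->childNodes->length; // 0, the children have moved
$inBody = $body->childNodes->length;     // 2

echo $doc->saveXML($body), "\n";
```

After the call, the fragment is an empty container, so it is the `<p>` elements, not the fragment, that show up in the serialized `<body>`.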

Serializing (Writing)

The serializer can write DOMDocuments and DOMDocumentFragments to strings and files.

Writing to a string

// $dom is either a DOMDocument, DOMDocumentFragment, or DOMNodeList.
$string = $html5->saveHTML($dom);

Writing to a file

// $dom is a DOMDocument, DOMDocumentFragment, or DOMNodeList.
// save() writes to the file and does not return the markup.
$html5->save($dom, 'path/to/file.html');

Entity Encoding

HTML5 defines a long list of named entities that goes well beyond the typical use cases, covering periods, commas, and thousands of other characters. An option controls whether the serializer encodes the entire list or only the basic characters, as htmlspecialchars does. By default, only the basic characters are encoded.
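For reference, the "basic" set can be seen by calling `htmlspecialchars` directly. This is a plain PHP sketch that does not involve html5-php, and it is only meant to show roughly which characters the default mode is modelled on; the serializer's own output may differ in detail.

```php
<?php
// Plain PHP, no html5-php involved: shows which characters the "basic"
// set handled by htmlspecialchars() covers.
$text = 'Tom & "Jerry" <b>, etc.';

$basic = htmlspecialchars($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');

echo $basic, "\n";
// Tom &amp; &quot;Jerry&quot; &lt;b&gt;, etc.
```

Only the ampersand, the quotes, and the angle brackets are rewritten. The comma and the period pass through unchanged, which is the difference from encoding the full entity list.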

To change the default value to encode all entities:

$html5 = new HTML5(array('encode_entities' => TRUE));

To encode all entities at call time:

// $dom is either a DOMDocument, DOMDocumentFragment, or DOMNodeList.
$string = $html5->saveHTML($dom, array('encode_entities' => TRUE));