Problem description
How to prevent the doctype from being added to the HTML?
I have been tidying up messy HTML tags with DOM, but I have now run into a bigger problem:
$content = '<p><a href="#">this is a link</a></p>';
function tidy_html($content, $allowable_tags = null, $span_regex = null)
{
    $dom = new DOMDocument();
    $dom->loadHTML($content);
    // other code
    return $dom->saveHTML();
}
echo tidy_html($content);
It outputs the entire document:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p><a href="#">this is a link</a></p></body></html>
but I only want this in the return value:
<p><a href="#">this is a link</a></p>
I don't want:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>...</body></html>
Is this possible?
EDIT:
The innerHTML simulation generates some strange codes in my database, such as Â and ’:
<p>Monday July 5th 10am ‑ 3.30pm £20</p>
<p>Be one of the first visitors to the ...at this special event.Â</p>
<p>All participants will receive a free copy of the ‘Contemporary Art Kit’ produced exclusively for Art on....</p>
The innerHTML simulation:
$innerHTML = '';
$nodeBody = $dom->getElementsByTagName('body')->item(0);
foreach ($nodeBody->childNodes as $child) {
    $innerHTML .= $nodeBody->ownerDocument->saveXML($child);
}
I found out that the strange codes appear when the content contains a line break, and that they are caused by saveXML($child).
So when I have input like this:
$content = '<p><br/><a href="#">xx</a></p>
<p><br/><a href="#">xx</a></p>';
It returns something like this:
<p><a href="#">xx</a></p>
<p><a href="#">xx</a></p>
But what I actually want is this:
<p><a href="#">xx</a></p>
<p><a href="#">xx</a></p>
‑‑‑‑‑
Reference solutions
Method 1:
If you're working on a fragment, you normally need only the body contents.
DOMDocument in PHP does not offer anything like innerHTML. You can simulate it, however:
$innerHTML = '';
$nodeBody = $dom->getElementsByTagName('body')->item(0);
foreach ($nodeBody->childNodes as $child) {
    $innerHTML .= $nodeBody->ownerDocument->saveXML($child);
}
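The loop above can also be wrapped in a small helper. What follows is a minimal sketch, not the answerer's code: the function name body_inner_html is hypothetical, it assumes UTF-8 input and PHP 7 or later, and it calls saveHTML($child) (the node argument is accepted since PHP 5.3.6) in place of saveXML($child), so each node is serialized as HTML rather than XML. The <?xml encoding="UTF-8"> prefix is a commonly used workaround, assumed here, because loadHTML() may otherwise treat the input as ISO-8859-1.

```php
<?php
// Hypothetical helper (not part of the original answer): returns the markup
// found inside <body> after DOMDocument has parsed an HTML fragment.
function body_inner_html(string $fragment): string
{
    // Collect parser warnings internally instead of printing them,
    // since the input is expected to be messy.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Assumption: $fragment is UTF-8. The prefix hints that to the parser.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $fragment);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $innerHTML = '';
    foreach ($body->childNodes as $child) {
        // Serialize one child node at a time as HTML, so no doctype,
        // <html> or <body> wrapper is included.
        $innerHTML .= $dom->saveHTML($child);
    }

    return $innerHTML;
}

echo body_inner_html('<p><a href="#">this is a link</a></p>');
```

Whether this also removes the strange codes described in the question depends on the actual encoding of the stored content, so it is worth checking against real data.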
If you just want to repair a fragment, you can make use of the tidy library as well:
$html = tidy_repair_string($html, array('output-xhtml' => 1, 'show-body-only' => 1));
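As a usage sketch of the call above, assuming the tidy extension is installed and the input is UTF-8 (the sample markup and the 'utf8' encoding argument are illustrative, not from the original answer):

```php
<?php
// Sample input with a deliberately unclosed <p> (illustrative only).
$html = '<p><a href="#">this is a link</a>';

$config = array(
    'output-xhtml'   => true, // emit XHTML-style markup
    'show-body-only' => true, // return only the contents of <body>
);

// The third argument names the encoding; 'utf8' is an assumption here.
$repaired = tidy_repair_string($html, $config, 'utf8');

echo $repaired;
```

With show-body-only enabled, Tidy leaves out the doctype and the html, head and body wrappers, which is the behaviour the question asks for.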
Method 2:
Hakre already mentioned the show-body-only option to HTML Tidy, which is probably what you want.
P.S. Here's the Tidy config file used by MediaWiki for pretty much just this purpose.
(by Run、hakre、Ilmari Karonen)