How to prevent the doctype from being added to the HTML?

Problem description

I have been using DOMDocument to tidy up messy HTML tags, but I have now run into a bigger problem:

$content = '<p><a href="#">this is a link</a></p>';

function tidy_html($content, $allowable_tags = null, $span_regex = null)
{
    $dom = new DOMDocument();
    $dom->loadHTML($content);

    // other code
    return $dom->saveHTML();
}

echo tidy_html($content);

It will output the entire DOM,

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><p><a href="#">this is a link</a></p></body></html>

but I only want something like this in the return,

<p><a href="#">this is a link</a></p>

I don't want,

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
    <html><body>...</body></html>

Is this possible?

EDIT:

The innerHTML simulation generates some strange characters in my database, such as Â and â€™:

<p>Monday July 5th 10am ‑ 3.30pm Â£20</p>&#13;
<p>Be one of the first visitors to the ...at this special event.Â</p>&#13;
<p>All participants will receive a free copy of the â€˜Contemporary Art Kitâ€™ produced exclusively for Art on....</p>&#13;

The innerHTML simulation:

$innerHTML = '';
$nodeBody = $dom->getElementsByTagName('body')->item(0);
foreach ($nodeBody->childNodes as $child) {
  $innerHTML .= $nodeBody->ownerDocument->saveXML($child);
}

I found out that the strange codes appear when there is a break, and that they are caused by saveXML($child).

So when I have something like this,

$content = '<p><br/><a href="#">xx</a></p>
<p><br/><a href="#">xx</a></p>';

It will return something like this,

<p><a href="#">xx</a></p>&#13;
<p><a href="#">xx</a></p>

But what I actually want is this,

<p><a href="#">xx</a></p>
<p><a href="#">xx</a></p>

‑‑‑‑‑

Reference solutions

Method 1:

If you're working on a fragment, you normally need only the body contents.
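
For a fragment with a single root element, one option that is not part of the original answer is to pass libxml flags to loadHTML() so that no doctype or html/body wrapper is added in the first place. The sketch below assumes PHP 5.4 or later built against libxml2 2.7.8 or later, where the constants LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD are available. The variable names are illustrative only.

```php
<?php
// Assumption: PHP 5.4+ with libxml2 2.7.8+ (needed for both constants).
$content = '<p><a href="#">this is a link</a></p>';

$dom = new DOMDocument();
// LIBXML_HTML_NOIMPLIED: do not add implied <html> and <body> elements.
// LIBXML_HTML_NODEFDTD:  do not add a default doctype.
$dom->loadHTML($content, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// saveHTML() on the whole document may end with a newline, so trim it.
$output = trim($dom->saveHTML());
echo $output;
```

This tends to behave well only when the fragment has exactly one top-level element. With several top-level siblings the parser may nest them in unexpected ways, in which case looping over the body's children is the safer route.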

DOMDocument in PHP does not offer anything like innerHTML. You can simulate it, however:

$innerHTML = '';
$nodeBody = $dom->getElementsByTagName('body')->item(0);
foreach ($nodeBody->childNodes as $child) {
  $innerHTML .= $nodeBody->ownerDocument->saveXML($child);
}
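
The loop above can be wrapped in a function, as in the minimal sketch below. Several things here are assumptions rather than part of the answer: the helper name inner_html_of_body is hypothetical, the input is taken to be UTF-8, and PHP 7 or later is assumed for the type hints. It serializes with saveHTML($child) (accepted since PHP 5.3.6) rather than saveXML($child) and strips carriage returns first, which may avoid the &#13; entities shown in the question. The encoding prefix is a common workaround for loadHTML() otherwise treating the input as ISO-8859-1, a likely source of characters such as Â.

```php
<?php
// Hypothetical helper: returns the markup inside <body> for an HTML fragment.
function inner_html_of_body(string $fragment): string
{
    // Carriage returns in text nodes are serialized as &#13; by saveXML(),
    // so remove them before parsing.
    $fragment = str_replace("\r", '', $fragment);

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Assumption: the input is UTF-8. The prefix hints the encoding to libxml.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $fragment);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $innerHTML = '';
    foreach ($body->childNodes as $child) {
        // saveHTML() with a node argument serializes just that node as HTML.
        $innerHTML .= $dom->saveHTML($child);
    }
    return $innerHTML;
}

echo inner_html_of_body('<p><a href="#">this is a link</a></p>');
```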

If you just want to repair a fragment, you can make use of the tidy library as well:

$html = tidy_repair_string($html, array('output-xhtml' => 1, 'show-body-only' => 1));
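
As a slightly fuller sketch of the tidy approach, the example below repairs a fragment with an unclosed tag. It assumes the tidy extension is installed and enabled. The sample input and the 'utf8' encoding argument are illustrative choices, not something the answer specifies.

```php
<?php
// Assumption: the tidy extension (ext-tidy) is available.
$html = '<p><a href="#">this is a link</p>'; // the <a> tag is left unclosed

$config = array(
    'output-xhtml'   => true, // emit XHTML-style markup
    'show-body-only' => true, // omit the doctype, <html>, <head> and <body>
);

// The third argument tells tidy to treat input and output as UTF-8.
$repaired = tidy_repair_string($html, $config, 'utf8');

echo $repaired;
```

Because show-body-only drops everything outside the body, the result can be stored or embedded directly without any further stripping.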

Method 2:

Hakre already mentioned the show-body-only option of HTML Tidy, which is probably what you want.

P.S. Here's the Tidy config file used by MediaWiki for pretty much just this purpose.

(by Run, hakre, Ilmari Karonen)

Reference

How to prevent the doctype from being added to the HTML? (CC BY-SA 3.0/4.0)
