How to determine if an HTML tag splits across multiple lines


Problem description

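I'm writing a PHP script that involves scraping web pages. Currently, the script analyzes the page line by line, but it breaks if a tag spans multiple lines, for example: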

<img src="example.jpg"
alt="example">

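If worse comes to worst, I could preprocess the page by removing all line breaks and then re-inserting them at the nearest >, but that seems like a kludge.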

Ideally, I'd be able to detect a tag that spans lines, join only those lines, and continue processing. So what's the best way to detect this?


Reference solutions

Method 1:

This is one of my pet peeves: never parse HTML by hand. Never parse HTML with regexps. Never parse HTML with string comparisons. Always use an HTML parser to parse HTML – that's what they're there for.

It's been a long time since I've done any PHP, but a quick search turned up this PHP5 HTML parser.

Method 2:

Don't write a parser, use someone else's: DOMDocument::loadHTML. That's just one; I think there are a lot of others.
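
A minimal sketch of how DOMDocument::loadHTML copes with the markup from the question, assuming the page has already been fetched into a string and PHP's DOM extension is available. The helper name image_attributes and the sample markup are made up for illustration:

```php
<?php
// Markup from the question: one <img> tag split across two lines.
$html = "<p>before</p>\n<img src=\"example.jpg\"\nalt=\"example\">\n<p>after</p>";

// Hypothetical helper: collect src and alt of every <img> element.
function image_attributes($html)
{
    $doc = new DOMDocument();

    // Scraped pages are rarely valid HTML, so collect parser
    // warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = array();
    foreach ($doc->getElementsByTagName('img') as $img) {
        $images[] = array(
            'src' => $img->getAttribute('src'),
            'alt' => $img->getAttribute('alt'),
        );
    }
    return $images;
}

foreach (image_attributes($html) as $image) {
    echo $image['src'], ' / ', $image['alt'], "\n";
}
```

The parser works on the whole document, so the line break inside the tag never has to be detected at all.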

Method 3:

Well, this doesn't answer the question and is more of an opinion, but...

I think that the best scraping strategy (and consequently, the way to eliminate this problem) is not to analyze HTML line by line, which is unnatural to HTML, but to analyze it by its natural delimiter: <> pairs.

There will be two types of course:

  • Tag elements that are immediately closed, e.g., <br />
  • Tag elements that need a separate closing tag, e.g., <p>text</p>

You can immediately see the advantage of using this strategy in the case of paragraph (p) tags: it will be easier to parse multiline paragraphs instead of having to track where the closing tag is.
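
A rough sketch of that idea, splitting on < and > instead of on line breaks. The function name split_by_angle_brackets is made up for illustration, and the sketch assumes that > never appears inside attribute values, comments or scripts, which real pages do not guarantee:

```php
<?php
// Hypothetical helper: split markup into tag tokens and text tokens
// by scanning for '<' and '>' rather than for line breaks.
function split_by_angle_brackets($html)
{
    $tokens = array();
    $pos = 0;
    $length = strlen($html);

    while ($pos < $length) {
        $open = strpos($html, '<', $pos);
        if ($open === false) {
            $open = $length;          // no more tags: the rest is text
        }
        if ($open > $pos) {
            $text = trim(substr($html, $pos, $open - $pos));
            if ($text !== '') {
                $tokens[] = array('type' => 'text', 'value' => $text);
            }
        }
        if ($open === $length) {
            break;
        }
        $close = strpos($html, '>', $open);
        if ($close === false) {
            break;                    // unterminated tag: stop scanning
        }
        $tag = substr($html, $open, $close - $open + 1);
        // Collapse line breaks inside the tag into single spaces.
        $tokens[] = array('type' => 'tag', 'value' => preg_replace('/\s+/', ' ', $tag));
        $pos = $close + 1;
    }
    return $tokens;
}

$html = "<p>some\ntext</p>\n<img src=\"example.jpg\"\nalt=\"example\">";
foreach (split_by_angle_brackets($html) as $token) {
    echo $token['type'], ': ', $token['value'], "\n";
}
```

Because the scan ignores line breaks, the split img tag comes out as a single token and the multiline paragraph text as another.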

Method 4:

Perhaps for future projects I'll use a parsing library, but that's kind of aside from the question at hand. This is my current solution. rstrpos is strpos, but from the reverse direction. Example use:

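for ($i = 0; $i < count($lines); $i++)
{
    // $i is taken by reference in the function signature, so no & at the
    // call site (call-time pass-by-reference is an error since PHP 5.4).
    $line = handle_multiline_tags($i, $lines[$i], $lines);
}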

And here's that implementation:

// Search backwards from $relativePos for $charToFind.
// Returns the position of the match, or FALSE if there is none.
function rstrpos($string, $charToFind, $relativePos)
{
    $searchPos = $relativePos - strlen($charToFind);

    while ($searchPos >= 0)
    {
        if (substr($string, $searchPos, strlen($charToFind)) === $charToFind)
        {
            return $searchPos;
        }
        $searchPos--;
    }

    return FALSE;
}

function handle_multiline_tags(&$i, $line, $lines)
{
    $line = trim($line);
    $open = rstrpos($line, '<', strlen($line));
    $close = rstrpos($line, '>', strlen($line));

    // A tag is opened but not closed before the line break
    // when the last '<' has no '>' after it.
    $unclosed = ($open !== FALSE) && ($close === FALSE || $open > $close);
    if ($unclosed && isset($lines[$i + 1]))
    {
        $i++;
        // Join the next line and check again, in case the tag spans further.
        return handle_multiline_tags($i, $line.' '.$lines[$i], $lines);
    }
    else
    {
        return $line;
    }
}

This could probably be optimized in some way, but for my purposes, it's sufficient.

Method 5:

Why don't you read in a line and store it in a string, then check the string for tag openings and closings? If a tag spans more than one line, add the next line to the string and move the part before the opening brace to your processed string. Then just parse through the entire file doing this. It's not beautiful, but it should work.
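
One way this could look in PHP, as a simplified sketch: it keeps whole lines in the buffer rather than moving the part before the opening brace out separately, and it assumes < and > appear only as tag delimiters. The function name join_split_tags is made up for illustration:

```php
<?php
// Hypothetical helper: rejoin tags that were split across lines.
function join_split_tags(array $lines)
{
    $processed = array();
    $buffer = '';

    foreach ($lines as $line) {
        // Append the new line to whatever is still pending.
        $buffer = ($buffer === '') ? trim($line) : $buffer . ' ' . trim($line);

        $open = strrpos($buffer, '<');
        $close = strrpos($buffer, '>');

        // The last '<' has no '>' after it: the tag continues on the next line.
        $tagIsOpen = ($open !== false) && ($close === false || $close < $open);
        if (!$tagIsOpen) {
            $processed[] = $buffer;
            $buffer = '';
        }
    }

    if ($buffer !== '') {
        $processed[] = $buffer;   // unterminated tag at end of input
    }
    return $processed;
}

$lines = array('<p>intro</p>', '<img src="example.jpg"', 'alt="example">', '<p>end</p>');
print_r(join_split_tags($lines));
```

Each element of the result can then go through the existing line-by-line processing unchanged.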

(by Factor Mystic, Jörg W Mittag, Josh, Jon Limjap, Factor Mystic, corymathews)

References

  1. How to determine if an html tag splits across multiple lines (CC BY-SA 2.5/3.0/4.0)
