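Problem description
How to determine if an HTML tag splits across multiple lines
I'm writing a PHP script that involves scraping web pages. Currently, the script analyzes the page line by line, but it breaks if a tag spans multiple lines, for example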
<img src="example.jpg"
alt="example">
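If worst comes to worst, I could preprocess the page by removing all line breaks and then re-inserting them at the nearest >, but this seems like a kludge.
Ideally, I'd be able to detect a tag that spans lines, join only those lines, and continue processing. So what's the best way to detect this?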
Reference solutions
Method 1:
This is one of my pet peeves: never parse HTML by hand. Never parse HTML with regexps. Never parse HTML with string comparisons. Always use an HTML parser to parse HTML – that's what they're there for.
It's been a long time since I've done any PHP, but a quick search turned up this PHP5 HTML parser.
Method 2:
Don't write a parser; use someone else's: DOMDocument::loadHTML. That's just one; I think there are a lot of others.
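As a rough illustration of the DOMDocument route, here is a minimal sketch that assumes PHP's DOM extension is available. The helper name extract_images and the sample markup are made up for this example; only DOMDocument, loadHTML, getElementsByTagName, getAttribute and the libxml error functions are standard PHP calls. The parser does not care where the line breaks fall, so the split tag from the question needs no special handling.

```php
<?php
// Sample input: the <img> tag is split across two lines, as in the question.
$html = "<p>Intro</p>\n<img src=\"example.jpg\"\nalt=\"example\">\n";

// Hypothetical helper: returns [src, alt] pairs for every <img> in $html.
function extract_images(string $html): array
{
    $doc = new DOMDocument();
    // Scraped pages are rarely valid; collect libxml warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $images = [];
    foreach ($doc->getElementsByTagName('img') as $img) {
        $images[] = [$img->getAttribute('src'), $img->getAttribute('alt')];
    }
    return $images;
}

foreach (extract_images($html) as [$src, $alt]) {
    echo "$src ($alt)\n";
}
```

The same pattern should work for other elements by changing the tag name passed to getElementsByTagName.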
Method 3:
Well, this doesn't answer the question and is more of an opinion, but...
I think that the best scraping strategy (and consequently, a way to eliminate this problem) is not to analyze HTML line by line, which is unnatural to HTML, but to analyze it by its natural delimiter: <> pairs.
There are two kinds of tags, of course:
- Tag elements that are immediately closed, e.g., <br />
- Tag elements that need a separate closing tag, e.g., <p>text</p>
You can immediately see the advantage of this strategy in the case of paragraph (p) tags: it is easier to parse multiline paragraphs when line breaks no longer matter and you only have to find the matching closing tag.
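The idea of splitting on <> pairs rather than on line breaks could be sketched roughly as follows. The function name tokenize_markup and the token type labels are invented for this illustration. It is a naive splitter, not a real HTML parser: it assumes no literal > appears inside attribute values, comments or script blocks.

```php
<?php
// Hypothetical helper: split markup into [type, text] tokens at <...> boundaries.
function tokenize_markup(string $html): array
{
    // [^>]* also matches newlines, so a tag split across lines stays one token.
    $parts = preg_split('/(<[^>]*>)/', $html, -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

    $tokens = [];
    foreach ($parts as $part) {
        if ($part[0] !== '<' || substr($part, -1) !== '>') {
            $type = 'text';          // anything between tags
        } elseif (substr($part, 1, 1) === '/') {
            $type = 'close';         // e.g. </p>
        } elseif (substr($part, -2) === '/>') {
            $type = 'self-closing';  // e.g. <br />
        } else {
            $type = 'open';          // e.g. <p>, or a void tag written without "/"
        }
        $tokens[] = [$type, $part];
    }
    return $tokens;
}

foreach (tokenize_markup("<p>multi\nline</p><br />") as [$type, $text]) {
    echo $type, ': ', str_replace("\n", '\n', $text), "\n";
}
```

Void elements written without the trailing slash, such as <br> or the <img> from the question, come out as 'open' here, so a real scraper would need a list of void element names or, better, a proper parser.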
Method 4:
Perhaps for future projects I'll use a parsing library, but that's kind of aside from the question at hand. This is my current solution. rstrpos is strpos, but from the reverse direction. Example use:
for ($i = 0; $i < count($lines); $i++)
{
    // $i is taken by reference and advanced past any continuation lines
    $line = handle_multiline_tags($i, $lines[$i], $lines);
}
And here's that implementation:
function rstrpos($string, $charToFind, $relativePos)
{
    // Scan backwards from just before $relativePos. Returns the index of the
    // last occurrence of $charToFind, or FALSE if there is none.
    $length = strlen($charToFind);
    for ($searchPos = $relativePos - 1; $searchPos > -1; $searchPos--)
    {
        if (substr($string, $searchPos, $length) === $charToFind)
        {
            return $searchPos;
        }
    }
    return FALSE;
}
function handle_multiline_tags(&$i, $line, $lines)
{
    // If a tag is opened but not closed before a line break (the last '<' has
    // no '>' after it), append the next line and check again.
    $line = trim($line);
    $open = rstrpos($line, '<', strlen($line));
    $close = rstrpos($line, '>', strlen($line));
    $unclosed = ($open !== FALSE) && ($close === FALSE || $open > $close);
    if ($unclosed && isset($lines[$i + 1]))
    {
        $i++;
        // Join with a space so attributes on adjacent lines don't run together;
        // re-checking the joined text handles tags spanning three or more lines.
        return handle_multiline_tags($i, $line.' '.trim($lines[$i]), $lines);
    }
    else
    {
        return $line;
    }
}
This could probably be optimized in some way, but for my purposes, it's sufficient.
Method 5:
Why don't you read in a line and store it in a string, then check the string for tag openings and closings? If a tag spans more than one line, add the next line to the string and move the part before the opening bracket to your processed string. Then just parse through the entire file doing this. It's not beautiful, but it should work.
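One possible reading of this buffering approach is sketched below. The function name join_split_tags is invented for the example, and it simplifies the description slightly: it carries the whole pending line in the buffer rather than only the part from the opening bracket onward. It assumes a tag is still open whenever the last < in the buffer comes after the last >, which will misfire on a literal < or > inside text or attribute values.

```php
<?php
// Hypothetical helper: merge physical lines until no tag is left open.
function join_split_tags(array $lines): array
{
    $joined = [];
    $buffer = '';
    foreach ($lines as $line) {
        // Append the new line to whatever is still pending.
        $buffer = ($buffer === '') ? trim($line) : $buffer . ' ' . trim($line);

        $open = strrpos($buffer, '<');
        $close = strrpos($buffer, '>');
        // A tag is still open when the last '<' has no '>' after it.
        $tagOpen = $open !== false && ($close === false || $open > $close);

        if (!$tagOpen) {
            $joined[] = $buffer;
            $buffer = '';
        }
    }
    if ($buffer !== '') {
        $joined[] = $buffer; // unterminated tag at end of input
    }
    return $joined;
}

$lines = ['<p>Intro</p>', '<img src="example.jpg"', 'alt="example">', 'done'];
foreach (join_split_tags($lines) as $logicalLine) {
    echo $logicalLine, "\n";
}
```

Each element of the returned array can then be fed to the existing line-by-line logic unchanged.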
(by Factor Mystic, Jörg W Mittag, Josh, Jon Limjap, Factor Mystic, corymathews)