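Problem description
Scraping an HTML page with several repeated elements and extracting data from specific ones
I have this HTML source code stored in a database field. I want to parse it, in particular some of the table fields, and print them on screen. This is the code for the tables: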
<table cellspacing="1" cellpadding="1" class="troop_details inReturn"
>
<thead>
<tr>
<td class="role">
<a href="/karte.php?d=91628">01] #WorkInProgress</a>
</td>
<td colspan="11" class="troopHeadline">
<a href="/karte.php?d=91611">Return from 01‑soldier</a>
</td>
</tr>
</thead>
<tbody class="units">
<tr>
<th class="coords">
‭<span class="coordinates coordinatesWrapper coordinatesAligned coordinatesltr"><span class="coordinateX">(‭−‭1‬‬</span><span class="coordinatePipe">|</span><span class="coordinateY">‭−‭28‬‬)</span></span>‬ </th>
<td class="uniticon">
<img class="unit u21" title="Phalanx: 1:12:51" alt="Phalanx" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u22" title="Swordsman: 1:25:00" alt="Swordsman" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u23" title="Pathfinder: 0:30:00" alt="Pathfinder" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u24" title="Theutates Thunder: 0:26:51" alt="Theutates Thunder" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u25" title="Druidrider: 0:31:53" alt="Druidrider" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u26" title="Haeduan: 0:39:14" alt="Haeduan" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u27" title="Ram: 2:07:30" alt="Ram" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u28" title="Trebuchet: 2:50:00" alt="Trebuchet" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u29" title="Chieftain: 1:42:00" alt="Chieftain" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u30" title="Settler: 1:42:00" alt="Settler" src="/img/x.gif" /> </td>
<td class="uniticon last">
<img class="unit uhero" title="Hero" alt="Hero" src="/img/x.gif" /> </td>
</tr>
</tbody>
<tbody class="units last">
<tr>
<th>Troops</th>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit">
500 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none last">
0 </td>
</tr>
</tbody>
<tbody class="infos">
<tr>
<th>Bounty</th>
<td colspan="11">
<div class="res">
<div class="inlineIconList resourceWrapper"><div class="inlineIcon resources" title="Lumber"><i class="r1"></i><span class="value ">6758</span></div><div class="inlineIcon resources" title="Clay"><i class="r2"></i><span class="value ">8093</span></div><div class="inlineIcon resources" title="Iron"><i class="r3"></i><span class="value ">6908</span></div><div class="inlineIcon resources" title="Crop"><i class="r4"></i><span class="value ">15741</span></div></div> </div>
<div class="carry">
<img class="carry full" title="carry"
alt="carry"
src="/img/x.gif"/> ‭‭37500‬ / ‭37500‬‬ </div>
</td>
</tr>
</tbody>
<tbody class="infos">
<tr>
<th>Arrival</th>
<td colspan="11">
<div class="in">in <span class="timer" counting="down" value="85">0:01:25</span> hrs.</div>
<div class="at"><span>at 00:43:10</span><span> </span></div>
</td>
</tr>
</tbody>
</table>
<a name="at"></a>
<table cellspacing="1" cellpadding="1" class="troop_details inReturn"
>
<thead>
<tr>
<td class="role">
<a href="/karte.php?d=91628">01] #WorkInProgress</a>
</td>
<td colspan="11" class="troopHeadline">
<a href="/karte.php?d=94829">Return from 0‑New Hulk</a>
</td>
</tr>
</thead>
<tbody class="units">
<tr>
<th class="coords">
‭<span class="coordinates coordinatesWrapper coordinatesAligned coordinatesltr"><span class="coordinateX">(‭−‭1‬‬</span><span class="coordinatePipe">|</span><span class="coordinateY">‭−‭28‬‬)</span></span>‬ </th>
<td class="uniticon">
<img class="unit u21" title="Phalanx: 0:45:33" alt="Phalanx" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u22" title="Swordsman: 0:53:09" alt="Swordsman" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u23" title="Pathfinder: 0:18:46" alt="Pathfinder" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u24" title="Theutates Thunder: 0:16:47" alt="Theutates Thunder" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u25" title="Druidrider: 0:19:56" alt="Druidrider" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u26" title="Haeduan: 0:24:32" alt="Haeduan" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u27" title="Ram: 1:19:44" alt="Ram" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u28" title="Trebuchet: 1:46:18" alt="Trebuchet" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u29" title="Chieftain: 1:03:47" alt="Chieftain" src="/img/x.gif" /> </td>
<td class="uniticon">
<img class="unit u30" title="Settler: 1:03:47" alt="Settler" src="/img/x.gif" /> </td>
<td class="uniticon last">
<img class="unit uhero" title="Hero" alt="Hero" src="/img/x.gif" /> </td>
</tr>
</tbody>
<tbody class="units last">
<tr>
<th>Troops</th>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit">
400 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none">
0 </td>
<td class="unit none last">
0 </td>
</tr>
</tbody>
<tbody class="infos">
<tr>
<th>Bounty</th>
<td colspan="11">
<div class="res">
<div class="inlineIconList resourceWrapper"><div class="inlineIcon resources" title="Lumber"><i class="r1"></i><span class="value ">6130</span></div><div class="inlineIcon resources" title="Clay"><i class="r2"></i><span class="value ">5835</span></div><div class="inlineIcon resources" title="Iron"><i class="r3"></i><span class="value ">5638</span></div><div class="inlineIcon resources" title="Crop"><i class="r4"></i><span class="value ">12397</span></div></div> </div>
<div class="carry">
<img class="carry full" title="carry"
alt="carry"
src="/img/x.gif"/> ‭‭30000‬ / ‭30000‬‬ </div>
</td>
</tr>
</tbody>
<tbody class="infos">
<tr>
<th>Arrival</th>
<td colspan="11">
<div class="in">in <span class="timer" counting="down" value="920">0:15:20</span> hrs.</div>
<div class="at"><span>at 00:57:05</span><span> </span></div>
</td>
</tr>
</tbody>
</table>
The data I am interested in is the following:
- Return from 01‑soldier 00:43:10
- Return from 0‑New Hulk 00:57:05
Thanks for any suggestions. This is my code so far:
<?php include 'database.php' ?>
<?php session_start(); ?>
<?php
include_once('simple_html_dom.php');
$caserma = $_SESSION["caserma"];
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($_SESSION["caserma"], LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$texts = [];
foreach ($xpath->query("//table[contains(@class, 'troop_details') and contains(@class, 'inReturn')]//td[@class='troopHeadline']//a[@href]/text()") as $textNode) {
    $texts[] = $textNode->nodeValue;
}
var_export($texts);
?>
But as output it gives me array()
Reference solution
Method 1:
The code below assumes $_SESSION["caserma"] contains your full HTML document:
$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($_SESSION["caserma"]);
$xpath = new DOMXPath($dom);
$texts = [];
foreach ($xpath->query("//table[contains(@class, 'troop_details') and contains(@class, 'inReturn')]//td[@class='troopHeadline']//a[@href]/text()") as $textNode) {
    $texts[] = $textNode->nodeValue;
}
var_export($texts);
Output from your sample input:
array (
0 => 'Return from 01‑soldier',
1 => 'Return from 0‑New Hulk',
)
XPath Breakdown:
// # search to any depth in the document
table[contains(@class, 'troop_details') and contains(@class, 'inReturn')] # find all table tags with both `troop_details` and `inReturn` classes
// # continue searching any descendants of any matches
td[@class='troopHeadline'] # match all td tags with `troopHeadline` as its class
// # continue searching any descendants of any matches
a[@href] # match all a tags with an href attribute
/ # search the immediate descendant (any first generation child)
text() # match the text of the parent a tag
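One caveat on the breakdown above: contains(@class, '...') is a substring test, so it would also accept a class such as troop_details_old. If that matters, a common XPath 1.0 idiom is to pad the normalized class attribute with spaces and look for the padded token. The following is a minimal sketch of that idiom, not part of the original answer; the hasClass() helper and the two-table sample markup are made up for illustration.

```php
<?php
// Hypothetical helper (not from the original answer): builds an XPath 1.0
// predicate that matches a whole class token instead of a substring.
function hasClass(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Illustrative sample: the second table only looks similar by substring.
$html = '<table class="troop_details inReturn"><tr><td class="troopHeadline"><a href="/a">kept</a></td></tr></table>'
      . '<table class="troop_details_old inReturn"><tr><td class="troopHeadline"><a href="/b">skipped</a></td></tr></table>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$query = "//table[" . hasClass('troop_details') . " and " . hasClass('inReturn') . "]"
       . "//td[" . hasClass('troopHeadline') . "]//a[@href]/text()";

$texts = [];
foreach ($xpath->query($query) as $textNode) {
    $texts[] = $textNode->nodeValue;
}
var_export($texts); // only the link text from the first table is collected
```

For the markup in the question, the plain contains() version is likely sufficient; the stricter form is mainly useful when class names share prefixes.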
- libxml_use_internal_errors(true) is used to silence any potential errors from an "invalid" document.
- It is important to use contains(...) and contains(...) in the XPath so that even if the class attributes change their order or new classes are added to the element, the XPath will still match correctly.
- The foreach() loop will iterate all qualifying text nodes.
- Extract the nodeValue and push it into the result array.
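The code above collects only the headline text, while the question also asks for the arrival time next to each headline. One possible way to pair them, sketched under assumptions: loop over the matching tables and run relative XPath queries (starting with ".") against each table. The sample markup is a trimmed-down stand-in for the question's HTML, and the $rows variable and the regular expression are illustrative choices, not part of the original answer.

```php
<?php
// Trimmed-down stand-in for the question's markup (a single table).
$html = <<<MARKUP
<table class="troop_details inReturn">
  <thead><tr>
    <td class="role"><a href="/karte.php?d=91628">village</a></td>
    <td colspan="11" class="troopHeadline"><a href="/karte.php?d=91611">Return from 01-soldier</a></td>
  </tr></thead>
  <tbody class="infos"><tr><th>Arrival</th><td colspan="11">
    <div class="in">in <span class="timer">0:01:25</span> hrs.</div>
    <div class="at"><span>at 00:43:10</span><span> </span></div>
  </td></tr></tbody>
</table>
MARKUP;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$rows = [];
$tables = $xpath->query("//table[contains(@class, 'troop_details') and contains(@class, 'inReturn')]");
foreach ($tables as $table) {
    // A leading "." keeps each query inside the current table.
    $headline = $xpath->evaluate("string(.//td[@class='troopHeadline']//a[@href])", $table);
    $arrival  = $xpath->evaluate("string(.//div[@class='at']/span[1])", $table);
    // The span text reads like "at 00:43:10"; keep only the clock time.
    if (preg_match('/\d{1,2}:\d{2}:\d{2}/', $arrival, $m)) {
        $rows[] = trim($headline) . ' ' . $m[0];
    }
}
print_r($rows);
```

Querying per table keeps each headline tied to its own arrival time, which two independent document-wide queries would not guarantee if one table lacked an Arrival row.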
(by Luigi, mickmackusa)