Scrape HTML page with multiple <table> tags and extract text from specific <a> tag descendants


Problem description


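I have this HTML source code stored in a database field. I want to parse it, specifically certain fields of some of the tables, and print them on screen. This is the code for the tables: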

<table cellspacing="1" cellpadding="1" class="troop_details inReturn"
    >
        <thead>
            <tr>
                <td class="role">
                                            <a href="/karte.php?d=91628">01] #WorkInProgress</a>
                                    </td>
                <td colspan="11" class="troopHeadline">
                                                                <a href="/karte.php?d=91611">Return from 01‑soldier</a>
                                    </td>
            </tr>
        </thead>
        <tbody class="units">
            <tr>
                <th class="coords">
                                            &#x202d;<span class="coordinates coordinatesWrapper coordinatesAligned coordinatesltr"><span class="coordinateX">(&#x202d;&minus;&#x202d;1&#x202c;&#x202c;</span><span class="coordinatePipe">|</span><span class="coordinateY">&#x202d;&minus;&#x202d;28&#x202c;&#x202c;)</span></span>&#x202c;                                    </th>
                                    <td class="uniticon">
                        <img class="unit u21" title="Phalanx: 1:12:51" alt="Phalanx" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u22" title="Swordsman: 1:25:00" alt="Swordsman" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u23" title="Pathfinder: 0:30:00" alt="Pathfinder" src="/img/x.gif" />                  </td>
                                    <td class="uniticon">
                        <img class="unit u24" title="Theutates Thunder: 0:26:51" alt="Theutates Thunder" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u25" title="Druidrider: 0:31:53" alt="Druidrider" src="/img/x.gif" />                  </td>
                                    <td class="uniticon">
                        <img class="unit u26" title="Haeduan: 0:39:14" alt="Haeduan" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u27" title="Ram: 2:07:30" alt="Ram" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u28" title="Trebuchet: 2:50:00" alt="Trebuchet" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u29" title="Chieftain: 1:42:00" alt="Chieftain" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u30" title="Settler: 1:42:00" alt="Settler" src="/img/x.gif" />                    </td>
                                                    <td class="uniticon last">
                        <img class="unit uhero" title="Hero" alt="Hero" src="/img/x.gif" />                 </td>
                            </tr>
        </tbody>

        <tbody class="units last">
            <tr>
                <th>Troops</th>
                                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit">
                                                    500                                         </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none last">
                                                    0                                           </td>
                            </tr>
        </tbody>

                    <tbody class="infos">
                <tr>
                    <th>Bounty</th>
                    <td colspan="11">
                        <div class="res">
                            <div class="inlineIconList resourceWrapper"><div class="inlineIcon resources" title="Lumber"><i class="r1"></i><span class="value ">6758</span></div><div class="inlineIcon resources" title="Clay"><i class="r2"></i><span class="value ">8093</span></div><div class="inlineIcon resources" title="Iron"><i class="r3"></i><span class="value ">6908</span></div><div class="inlineIcon resources" title="Crop"><i class="r4"></i><span class="value ">15741</span></div></div>                       </div>
                        <div class="carry">
                            <img class="carry full" title="carry"
                                 alt="carry"
                                 src="/img/x.gif"/> &#x202d;&#x202d;37500&#x202c;&nbsp;/&nbsp;&#x202d;37500&#x202c;&#x202c;                     </div>
                    </td>
                </tr>
            </tbody>

        <tbody class="infos">
            <tr>
                <th>Arrival</th>
                <td colspan="11">
                    <div class="in">in&nbsp;<span  class="timer" counting="down" value="85">0:01:25</span>&nbsp;hrs.</div>
                    <div class="at"><span>at&nbsp;00:43:10</span><span> </span></div>
                </td>
            </tr>
        </tbody>
    </table>
            <a name="at"></a>
    <table cellspacing="1" cellpadding="1" class="troop_details inReturn"
    >
        <thead>
            <tr>
                <td class="role">
                                            <a href="/karte.php?d=91628">01] #WorkInProgress</a>
                                    </td>
                <td colspan="11" class="troopHeadline">
                                                                <a href="/karte.php?d=94829">Return from 0‑New Hulk</a>
                                    </td>
            </tr>
        </thead>
        <tbody class="units">
            <tr>
                <th class="coords">
                                            &#x202d;<span class="coordinates coordinatesWrapper coordinatesAligned coordinatesltr"><span class="coordinateX">(&#x202d;&minus;&#x202d;1&#x202c;&#x202c;</span><span class="coordinatePipe">|</span><span class="coordinateY">&#x202d;&minus;&#x202d;28&#x202c;&#x202c;)</span></span>&#x202c;                                    </th>
                                    <td class="uniticon">
                        <img class="unit u21" title="Phalanx: 0:45:33" alt="Phalanx" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u22" title="Swordsman: 0:53:09" alt="Swordsman" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u23" title="Pathfinder: 0:18:46" alt="Pathfinder" src="/img/x.gif" />                  </td>
                                    <td class="uniticon">
                        <img class="unit u24" title="Theutates Thunder: 0:16:47" alt="Theutates Thunder" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u25" title="Druidrider: 0:19:56" alt="Druidrider" src="/img/x.gif" />                  </td>
                                    <td class="uniticon">
                        <img class="unit u26" title="Haeduan: 0:24:32" alt="Haeduan" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u27" title="Ram: 1:19:44" alt="Ram" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u28" title="Trebuchet: 1:46:18" alt="Trebuchet" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u29" title="Chieftain: 1:03:47" alt="Chieftain" src="/img/x.gif" />                    </td>
                                    <td class="uniticon">
                        <img class="unit u30" title="Settler: 1:03:47" alt="Settler" src="/img/x.gif" />                    </td>
                                                    <td class="uniticon last">
                        <img class="unit uhero" title="Hero" alt="Hero" src="/img/x.gif" />                 </td>
                            </tr>
        </tbody>

        <tbody class="units last">
            <tr>
                <th>Troops</th>
                                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit">
                                                    400                                         </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none">
                                                    0                                           </td>
                                    <td class="unit none last">
                                                    0                                           </td>
                            </tr>
        </tbody>

                    <tbody class="infos">
                <tr>
                    <th>Bounty</th>
                    <td colspan="11">
                        <div class="res">
                            <div class="inlineIconList resourceWrapper"><div class="inlineIcon resources" title="Lumber"><i class="r1"></i><span class="value ">6130</span></div><div class="inlineIcon resources" title="Clay"><i class="r2"></i><span class="value ">5835</span></div><div class="inlineIcon resources" title="Iron"><i class="r3"></i><span class="value ">5638</span></div><div class="inlineIcon resources" title="Crop"><i class="r4"></i><span class="value ">12397</span></div></div>                       </div>
                        <div class="carry">
                            <img class="carry full" title="carry"
                                 alt="carry"
                                 src="/img/x.gif"/> &#x202d;&#x202d;30000&#x202c;&nbsp;/&nbsp;&#x202d;30000&#x202c;&#x202c;                     </div>
                    </td>
                </tr>
            </tbody>

        <tbody class="infos">
            <tr>
                <th>Arrival</th>
                <td colspan="11">
                    <div class="in">in&nbsp;<span  class="timer" counting="down" value="920">0:15:20</span>&nbsp;hrs.</div>
                    <div class="at"><span>at&nbsp;00:57:05</span><span> </span></div>
                </td>
            </tr>
        </tbody>
    </table>

The data I am interested in is the following:

  1. Return from 01‑soldier 00:43:10
  2. Return from 0‑New Hulk 00:57:05

Thanks for any suggestions. This is my current code:

  <?php include 'database.php' ?>
<?php session_start(); ?>

<?php
include_once('simple_html_dom.php');
$caserma = $_SESSION["caserma"];

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($_SESSION["caserma"], LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$xpath = new DOMXPath($dom);
$texts = [];
foreach ($xpath->query("//table[contains(@class, 'troop_details') and contains(@class, 'inReturn')]//td[@class='troopHeadline']//a[@href]/text()") as $textNode) {
    $texts[] = $textNode->nodeValue;
}
var_export($texts);
 ?>

However, the output it gives me is an empty array: array()


Reference solution

Method 1:

Code, assuming $_SESSION["caserma"] contains your full HTML document:

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($_SESSION["caserma"]);
$xpath = new DOMXPath($dom);
$texts = [];
foreach ($xpath->query("//table[contains(@class, 'troop_details') and contains(@class, 'inReturn')]//td[@class='troopHeadline']//a[@href]/text()") as $textNode) {
    $texts[] = $textNode->nodeValue;
}
var_export($texts);

Output from your sample input:

array (
  0 => 'Return from 01‑soldier',
  1 => 'Return from 0‑New Hulk',
)

XPath Breakdown:

//                                                                         # search to any depth in the document
table[contains(@class, 'troop_details') and contains(@class, 'inReturn')]  # find all table tags with both `troop_details` and `inReturn` classes
//                                                                         # continue searching any descendants of any matches
td[@class='troopHeadline']                                                 # match all td tags whose class attribute is exactly `troopHeadline`
//                                                                         # continue searching any descendants of any matches
a[@href]                                                                   # match all a tags with an href attribute
/                                                                          # step down to the immediate children (first generation only)
text()                                                                     # match the text nodes directly inside the a tag
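
One caveat about the `contains(@class, ...)` steps in the breakdown: `contains()` is a substring test, so a class such as `troop_details_old` would also satisfy `contains(@class, 'troop_details')`. A common XPath 1.0 idiom pads the attribute with spaces and matches a whole token instead. The sketch below illustrates it; the helper name `classToken` and the two-table sample markup are made up for the example and are not part of the original answer.

```php
<?php
// Hypothetical helper (not from the original answer): builds an XPath 1.0
// predicate that matches one whole class token instead of a substring.
function classToken(string $class): string
{
    return "contains(concat(' ', normalize-space(@class), ' '), ' $class ')";
}

// Made-up sample: the second table only *looks* similar by substring.
$html = '<table class="troop_details inReturn"><tr><td class="troopHeadline"><a href="/a">kept</a></td></tr></table>'
      . '<table class="troop_details_old inReturn"><tr><td class="troopHeadline"><a href="/b">skipped</a></td></tr></table>';

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

$query = '//table[' . classToken('troop_details') . ' and ' . classToken('inReturn') . ']'
       . '//td[' . classToken('troopHeadline') . ']//a[@href]/text()';

$texts = [];
foreach ($xpath->query($query) as $textNode) {
    $texts[] = $textNode->nodeValue;
}
var_export($texts); // only the first table's link text is collected
```

For the markup in the question the plain `contains()` version is enough, since no other class name contains `troop_details` or `inReturn` as a substring. The token form only matters if the page later gains look-alike class names.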

  • libxml_use_internal_errors(true) suppresses the warnings that libxml would otherwise emit while parsing an "invalid" document.
  • Using a separate contains(...) check for each class name means the table still matches even if the class names change order or new classes are added to the element.
  • The foreach() loop iterates over all qualifying text nodes.
  • Each node's nodeValue is extracted and pushed into the result array.
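
The question also asks for the arrival time next to each headline (for example `Return from 01‑soldier 00:43:10`), which the query above does not collect. One possible extension, offered as a sketch and not as the original answer's method, is to loop over the matching tables and run relative queries inside each one. The sample HTML is a trimmed-down stand-in for the question's markup (with ASCII hyphens), and it assumes the real document keeps the same class names.

```php
<?php
// Trimmed-down stand-in for the question's markup (assumption: same classes).
$html = <<<HTML
<table class="troop_details inReturn">
  <thead><tr><td class="role"><a href="/karte.php?d=91628">01] #WorkInProgress</a></td>
    <td class="troopHeadline"><a href="/karte.php?d=91611">Return from 01-soldier</a></td></tr></thead>
  <tbody class="infos"><tr><th>Arrival</th><td>
    <div class="in">in&nbsp;<span class="timer">0:01:25</span>&nbsp;hrs.</div>
    <div class="at"><span>at&nbsp;00:43:10</span><span> </span></div></td></tr></tbody>
</table>
<table class="troop_details inReturn">
  <thead><tr><td class="role"><a href="/karte.php?d=91628">01] #WorkInProgress</a></td>
    <td class="troopHeadline"><a href="/karte.php?d=94829">Return from 0-New Hulk</a></td></tr></thead>
  <tbody class="infos"><tr><th>Arrival</th><td>
    <div class="in">in&nbsp;<span class="timer">0:15:20</span>&nbsp;hrs.</div>
    <div class="at"><span>at&nbsp;00:57:05</span><span> </span></div></td></tr></tbody>
</table>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();
$xpath = new DOMXPath($dom);

$rows = [];
$tables = $xpath->query("//table[contains(@class, 'troop_details') and contains(@class, 'inReturn')]");
foreach ($tables as $table) {
    // The leading "." keeps each query relative to the current table.
    $headline = trim($xpath->evaluate("string(.//td[contains(@class, 'troopHeadline')]//a[@href])", $table));
    $arrival  = $xpath->evaluate("string(.//div[@class='at'])", $table);
    // Keep only the clock time; this sidesteps the &nbsp; that follows "at".
    $time = preg_match('/\d{1,2}:\d{2}:\d{2}/', $arrival, $m) ? $m[0] : null;
    $rows[] = ['headline' => $headline, 'arrival' => $time];
}

foreach ($rows as $row) {
    echo $row['headline'], ' ', $row['arrival'], PHP_EOL;
}
```

With this sample input the loop prints `Return from 01-soldier 00:43:10` and `Return from 0-New Hulk 00:57:05`. Querying per table keeps each headline paired with its own arrival time even if a table is missing one of the two pieces.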

(by Luigi, mickmackusa)

References

  1. Scrape HTML page with multiple <table> tags and extract text from specific <a> tag descendants (CC BY-SA 2.5/3.0/4.0)






