Python Looping through urls in csv file returns \ufeffhttps://


Problem description

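I'm new to Python, and I'm trying to loop through a list of URLs in a csv file and scrape each page's <title> with BeautifulSoup, which I then want to save to a file Headlines.csv. But I can't scrape the page title. If I use a variable holding a single url, like this: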

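import requests as req  # assumed: the question never shows how `req` was imported
from bs4 import BeautifulSoup

url = 'https://www.space.com/japan-hayabusa2-asteroid-samples-landing-date.html'

resp = req.get(url)
soup = BeautifulSoup(resp.text, 'lxml')

print(soup.title.text)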

it works just fine and I get the title: Japanese space capsule carrying pieces of asteroid Ryugu will land on Earth Dec. 6 | Space
But when I use a loop,

import csv
with open('urls_file2.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url)
        resp = req.get(url)
        soup = BeautifulSoup(resp.text, 'lxml')

        print(soup.title.text)

I get the following: ['\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']

and an error message.


Reference solutions

Method 1:

The URL you read from your file starts with a byte order mark (BOM), \ufeff. It looks like your file was saved with a UTF-8 signature, i.e. in the utf-8-sig encoding.

You need to read the file with encoding='utf-8-sig'.

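To make the effect of the encoding visible, here is a minimal sketch using only the standard library. The file name urls_demo.csv and the example.com URL are made up for the demonstration; the file is written with a BOM on purpose and then read back with both encodings.

```python
import os
import tempfile

# Write a small file the way some editors and spreadsheet tools do:
# UTF-8 with a leading BOM. The file name is a hypothetical demo name.
path = os.path.join(tempfile.mkdtemp(), 'urls_demo.csv')
with open(path, 'w', newline='', encoding='utf-8-sig') as f:
    f.write('https://example.com/page\n')

# Decoding as plain utf-8 keeps the BOM as the first character of the text.
with open(path, newline='', encoding='utf-8') as f:
    with_bom = f.readline().strip()

# Decoding as utf-8-sig drops the BOM; files without one are read unchanged.
with open(path, newline='', encoding='utf-8-sig') as f:
    without_bom = f.readline().strip()

print(repr(with_bom))     # '\ufeffhttps://example.com/page'
print(repr(without_bom))  # 'https://example.com/page'
```

Note that strip() does not remove the BOM, because \ufeff is not treated as whitespace, which is why the encoding has to do the work.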

Method 2:

As the previous answer already mentioned, you need to change the encoding to get rid of the "\ufeff".

The second issue is that when you read a CSV file, each row comes back as a list containing all the columns of that row. The keyword here is list. You are passing req.get a list instead of a string.

Based on the example you have given, I assume that your urls are in the first column of the csv. Python lists start at index 0, not 1, so to get the url you need to take index 0 of each row, which refers to the first column.
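The list-versus-string point can be checked without any file or network access. This is a small sketch with made-up data: the two example.com rows and the second column are hypothetical and only there to show the shape of what csv.reader yields.

```python
import csv
import io

# Two hypothetical rows; the second column only illustrates the list shape.
sample = io.StringIO('https://example.com/a,first\nhttps://example.com/b,second\n')

rows = list(csv.reader(sample))
print(rows[0])        # ['https://example.com/a', 'first']
print(type(rows[0]))  # <class 'list'>

# Index 0 selects the first column, which is a plain string.
urls = [row[0] for row in rows]
print(urls)           # ['https://example.com/a', 'https://example.com/b']
```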

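import csv

# utf-8-sig strips the leading BOM (\ufeff) if the file has one
with open('urls_file2.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for row in reader:
        # each row is a list of columns; the URL is in the first one
        print(row[0])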

You can add more columns to the CSV file and experiment to see how the results change. If you would like to refer to columns by name while reading each row, the standard library's csv.DictReader does that.
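Putting both fixes together, the original goal (fetch each title and save it to Headlines.csv) might look roughly like the sketch below. Neither answer gives this code: the function names read_urls, fetch_title and save_headlines, the 'url'/'title' header row, the 10-second timeout, and the skipping of blank rows are all assumptions made for illustration. It relies on the requests, beautifulsoup4 and lxml packages the question already uses.

```python
import csv


def read_urls(path):
    """Return the first column of every non-empty row as a list of URL strings."""
    # utf-8-sig strips a leading BOM if present and reads plain UTF-8 unchanged.
    with open(path, newline='', encoding='utf-8-sig') as f:
        return [row[0].strip() for row in csv.reader(f) if row and row[0].strip()]


def fetch_title(url):
    """Download a page and return its <title> text, or '' if it has none."""
    # Third-party imports are kept local so read_urls() can be tried without them.
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(url, timeout=10)  # timeout value is an arbitrary choice
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, 'lxml')
    return soup.title.get_text(strip=True) if soup.title else ''


def save_headlines(urls_path, out_path, get_title=fetch_title):
    """Write one (url, title) row per URL found in urls_path to out_path."""
    with open(out_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        writer.writerow(['url', 'title'])  # header row is an assumption
        for url in read_urls(urls_path):
            writer.writerow([url, get_title(url)])
```

Calling save_headlines('urls_file2.csv', 'Headlines.csv') would then produce the output file. The get_title parameter exists only so the file handling can be exercised with a stand-in function instead of real network requests.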

(by Gargamel, Timmy Chan, Raymond C.)

References

  1. Python Looping through urls in csv file returns \ufeffhttps:// (CC BY-SA 2.5/3.0/4.0)
