Problem description
Python: looping through URLs in a CSV file returns \ufeffhttps://
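I'm new to Python, and I am trying to loop through a list of URLs in a CSV file and use BeautifulSoup to grab the <title> of each page, which I then want to save to a file, Headlines.csv. But I am unable to grab the page title. If I use a variable with a single URL, as follows: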
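# Imports assumed from the names used below ("req" as an alias for requests)
import requests as req
from bs4 import BeautifulSoup

url = 'https://www.space.com/japan-hayabusa2-asteroid-samples-landing-date.html'
resp = req.get(url)
soup = BeautifulSoup(resp.text, 'lxml')
print(soup.title.text)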
it works just fine, and I get the title: Japanese space capsule carrying pieces of asteroid Ryugu will land on Earth Dec. 6 | Space
But when I use a loop,
import csv
with open('urls_file2.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for url in reader:
        print(url)
        resp = req.get(url)
        soup = BeautifulSoup(resp.text, 'lxml')
        print(soup.title.text)
I get the following: ['\ufeffhttps://www.foxnews.com/us/this-day-in-history-july-16']
followed by an error message.
Reference solutions
Method 1:
You have a byte order mark (\ufeff) at the start of the URL you parse from your file. It looks like the file was saved as UTF-8 with a signature, so its encoding is utf-8-sig rather than plain utf-8.
You need to read the file with encoding='utf-8-sig'.
The Python codecs documentation describes utf-8-sig in more detail.
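To see why the encoding matters, here is a minimal sketch that decodes the same bytes both ways. It builds the bytes in memory instead of reading urls_file2.csv, on the assumption that the file starts with a UTF-8 signature; the URL is the one from the question's output.

```python
import codecs

# Bytes as an editor that writes a UTF-8 signature (BOM) would save them.
# This stands in for the first line of urls_file2.csv (an assumption).
raw = codecs.BOM_UTF8 + b'https://www.foxnews.com/us/this-day-in-history-july-16\n'

plain = raw.decode('utf-8')      # the BOM survives as the character U+FEFF
clean = raw.decode('utf-8-sig')  # the BOM is recognised and dropped

print(repr(plain[:9]))  # '\ufeffhttps://'
print(repr(clean[:8]))  # 'https://'
```

Passing encoding='utf-8-sig' to open() applies the same decoding to the file, and it also reads files that have no BOM at all.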
Method 2:
As the previous answer already mentioned, the "\ufeff" means you need to change the encoding.
The second issue is that when you read a CSV file with csv.reader, each row comes back as a list containing all the columns of that row. The keyword here is list: you are passing req.get a list instead of a string.
Based on the example you gave, I assume your URLs are in the first column of the CSV. Python lists are indexed from 0, not 1, so to get the URL you take index 0 of each row, which is the first column.
import csv
with open('urls_file2.csv', newline='', encoding='utf-8-sig') as f:
    reader = csv.reader(f)
    for url in reader:
        # each row is a list; index 0 is the first column
        print(url[0])
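The snippet above only prints each URL. As a rough sketch of carrying both fixes through to the Headlines.csv file the question asks for, something like the following could work. The helper names (read_urls, fetch_title, save_headlines), the 10-second timeout, and the two-column output layout are illustrative assumptions, not part of the original answers.

```python
import csv


def read_urls(path):
    """Return the first column of every non-empty row, with any BOM stripped."""
    with open(path, newline='', encoding='utf-8-sig') as f:
        return [row[0] for row in csv.reader(f) if row]


def fetch_title(url):
    """Download a page and return its <title> text (needs requests, bs4, lxml)."""
    import requests
    from bs4 import BeautifulSoup

    resp = requests.get(url, timeout=10)  # timeout value is an assumption
    soup = BeautifulSoup(resp.text, 'lxml')
    return soup.title.text if soup.title else ''


def save_headlines(urls, out_path, get_title=fetch_title):
    """Write one (url, title) row per URL; get_title can be swapped for testing."""
    with open(out_path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        for url in urls:
            writer.writerow([url, get_title(url)])
```

With these in place, save_headlines(read_urls('urls_file2.csv'), 'Headlines.csv') would fetch each page in turn. Keeping the download in its own function means the CSV handling can be checked without network access.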
The Python tutorial covers lists in more detail. You can add more columns to the CSV file and experiment to see how the results appear. If you would like to refer to columns by name while reading each row, look at csv.DictReader.
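Referring to a column by name can be done with csv.DictReader, which treats the first row as a header. A small sketch, using an in-memory stand-in for the file; the column names "url" and "section" are made up for this example.

```python
import csv
import io

# Stand-in for a CSV file with a header row; the column names are hypothetical.
sample = io.StringIO(
    "url,section\n"
    "https://example.com/a,news\n"
    "https://example.com/b,science\n"
)

reader = csv.DictReader(sample)          # first row becomes the field names
urls = [row['url'] for row in reader]    # each row is a dict keyed by header
print(urls)  # ['https://example.com/a', 'https://example.com/b']
```

With a real file, open it with newline='' and encoding='utf-8-sig' as in the snippet above; with plain utf-8 a BOM would end up glued to the first header name, so row['url'] would raise a KeyError.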
(by Gargamel, Timmy Chan, Raymond C.)