Using a web scraper to extract geographic location information from web pages

By wuzhenzhen, 31 March, 2025


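Description: After searching for a species on the PPBC website, the geographic location information cannot be downloaded in bulk directly, so I chose to extract the locations from the web pages with code.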

1. Search for 羊踯躅 (Rhododendron molle), which leads to the following URL: https://ppbc.iplant.cn/sp/25162
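
The number at the end of the species URL is the same value that the script in step 2 passes as cid, so the paged request URL can be derived from the species URL. The following is a minimal sketch under that assumption; the helper names species_id_from_url and photo_page_url are hypothetical, and the query parameters are copied from the script rather than from any documented PPBC API.

```python
from urllib.parse import urlencode, urlparse


def species_id_from_url(species_url: str) -> int:
    """Return the trailing numeric ID of a species URL such as .../sp/25162."""
    path = urlparse(species_url).path
    return int(path.rstrip('/').split('/')[-1])


def photo_page_url(cid: int, page: int) -> str:
    """Build the paged photo-list URL (parameters taken from get_address.py)."""
    query = urlencode({'page': page, 'n': 2, 'group': 'sp', 'cid': cid})
    return 'https://ppbc.iplant.cn/ashx/getphotopage.ashx?' + query


cid = species_id_from_url('https://ppbc.iplant.cn/sp/25162')
print(cid)
print(photo_page_url(cid, 1))
```

To scrape another species, only the species URL should need to change, provided the site uses the same ID in both places.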

2. Use a Python script (get_address.py) to fetch the page data and deduplicate the results

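import requests
from bs4 import BeautifulSoup
import pandas as pd

# Paged endpoint queried for the photo list; cid matches the number in the step 1 URL
URL = 'https://ppbc.iplant.cn/ashx/getphotopage.ashx'

try:
    index = 1
    result = []
    while True:
        # Send a GET request for the current page; passing a timeout is what
        # makes the Timeout handler below reachable
        response = requests.get(
            URL,
            params={'page': index, 'n': 2, 'group': 'sp', 'cid': 25162},
            timeout=30,
        )
        response.raise_for_status()  # Check the HTTP status code
        html = response.text
        if len(html) == 0:
            break
        # Parse the HTML
        soup = BeautifulSoup(html, 'html.parser')
        items = soup.find_all('div', class_='item3 masonry_brick')
        if not items:
            # Stop when a page has no entries, so the loop cannot run forever
            break
        for item in items:
            # Extract the author name and the location text
            span = item.find('span')
            link = span.find('a') if span else None
            if link is None:
                continue
            username = link.text.strip()
            location = span.text.replace(username, '', 1).replace('@', '').strip()
            result.append({'author': username, 'location': location})
        index += 1

    # Deduplicate via hashable tuples, keeping first-seen order
    unique_tuples = dict.fromkeys(tuple(d.items()) for d in result)
    # Convert back to a list of dicts
    unique_dicts = [dict(t) for t in unique_tuples]

    print(unique_dicts)
    df = pd.DataFrame(unique_dicts)
    # Export to an Excel file (pandas needs openpyxl installed to write .xlsx)
    df.to_excel('output.xlsx', index=False)

except requests.exceptions.Timeout:
    print('Request timed out; please check your network connection')
except requests.exceptions.RequestException as err:
    print(f'Request failed: {err}')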
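
The parsing and deduplication steps of the script can be tried offline before hitting the site. The sketch below assumes each entry looks roughly like the SAMPLE_HTML fragment, which is a hypothetical reconstruction based only on the selectors the script uses; the real PPBC markup may differ. The function names parse_items and dedupe are also made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like what the script's selectors expect.
SAMPLE_HTML = '''
<div class="item3 masonry_brick"><span><a href="#">user_a</a><font>@</font>Place One</span></div>
<div class="item3 masonry_brick"><span><a href="#">user_a</a><font>@</font>Place One</span></div>
<div class="item3 masonry_brick"><span><a href="#">user_b</a><font>@</font>Place Two</span></div>
'''


def parse_items(html):
    """Return (author, location) tuples for every entry found in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    records = []
    for item in soup.find_all('div', class_='item3 masonry_brick'):
        span = item.find('span')
        link = span.find('a') if span else None
        if link is None:
            continue  # skip entries without an author link
        author = link.get_text(strip=True)
        # The span text is "<author>@<location>"; drop the author and the "@".
        location = span.get_text().replace(author, '', 1).replace('@', '').strip()
        records.append((author, location))
    return records


def dedupe(records):
    """Remove duplicates while keeping first-seen order."""
    return list(dict.fromkeys(records))


records = dedupe(parse_items(SAMPLE_HTML))
print(records)
```

Tuples are used here because they are hashable, which is the same reason the script converts each dict to a tuple of items before deduplicating.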

3. Run the script to obtain all of the geographic information

$ python get_address.py
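
Once the records are collected, a quick tally per location can show where the photos cluster. This is only a sketch: the function count_by_location and the sample records are hypothetical, and the key name is passed in as a parameter because it depends on the column name used in the script.

```python
from collections import Counter


def count_by_location(records, key):
    """Count how many records share each non-empty location value."""
    return Counter(r[key] for r in records if r.get(key))


# Hypothetical records in the same shape as the script's list of dicts.
sample = [
    {'author': 'user_a', 'location': 'Place One'},
    {'author': 'user_b', 'location': 'Place One'},
    {'author': 'user_c', 'location': 'Place Two'},
    {'author': 'user_d', 'location': ''},
]

counts = count_by_location(sample, 'location')
for place, n in counts.most_common():
    print(place, n)
```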