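Import the required packages¶

Only requests and BeautifulSoup are needed. Note that the 'xml' parser passed to BeautifulSoup below also requires lxml to be installed.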

In [1]:
import requests
from bs4 import BeautifulSoup

Write a helper function¶

It extracts every link (the text of each <loc> element) from a sitemap XML document.

In [2]:
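def get_sitemap_index(url, headers):
    # Fetch the sitemap and return the text of every <loc> element.
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()  # stop early on an HTTP error instead of parsing an error page
    # Pass bytes so the parser can honor the encoding declared in the XML.
    soup = BeautifulSoup(r.content, 'xml')
    return [x.text.strip() for x in soup.find_all('loc')]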
In [3]:
headers = {
  "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"
}
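
To make the extraction step easier to try without network access, here is a minimal sketch that pulls the <loc> values out of an inline document. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree so it runs with no extra packages; the sample XML, the example.com URLs, and the name extract_locs are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, shaped like a sitemap index.
SAMPLE = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-misc.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>"""


def extract_locs(xml_text):
    """Return the text of every <loc> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    # ElementTree tags look like '{namespace}loc', so compare only the local name.
    return [el.text.strip() for el in root.iter()
            if el.tag.rsplit('}', 1)[-1] == 'loc' and el.text]


locs = extract_locs(SAMPLE)
print(locs)
```

Comparing only the local tag name plays the same role as find_all('loc') in the helper above: it matches <loc> regardless of the namespace the sitemap declares.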

Fetch the sitemap root¶

The root index is the starting point for traversing the child sitemaps.

In [4]:
url = 'https://www.imtrq.com/sitemap.xml'
sitemap = get_sitemap_index(url, headers)

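Inspect the sitemap root¶

It is indeed an index that lists other sitemaps rather than pages; Baidu does not accept this kind of sitemap for submission.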

In [5]:
sitemap[:5]
Out[5]:
['https://www.imtrq.com/sitemap-misc.xml',
 'https://www.imtrq.com/sitemap-tax-category-1.xml',
 'https://www.imtrq.com/sitemap-pt-post-p1-2023-04.xml',
 'https://www.imtrq.com/sitemap-pt-post-p1-2023-03.xml',
 'https://www.imtrq.com/sitemap-pt-post-p1-2023-02.xml']
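
In the sitemaps protocol, an index file has a <sitemapindex> root element, while a file that lists pages has a <urlset> root. The sketch below checks which of the two a document is, again using the standard library rather than BeautifulSoup; the function name sitemap_kind and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def sitemap_kind(xml_text):
    """Return the local name of the root element, e.g. 'sitemapindex' or 'urlset'."""
    root = ET.fromstring(xml_text)
    # Drop the '{namespace}' prefix that ElementTree puts in front of the tag.
    return root.tag.rsplit('}', 1)[-1]


NS = 'http://www.sitemaps.org/schemas/sitemap/0.9'
# Hypothetical sample documents for illustration.
index_doc = (f'<sitemapindex xmlns="{NS}">'
             '<sitemap><loc>https://example.com/a.xml</loc></sitemap>'
             '</sitemapindex>')
urlset_doc = (f'<urlset xmlns="{NS}">'
              '<url><loc>https://example.com/archives/1</loc></url>'
              '</urlset>')

kinds = [sitemap_kind(index_doc), sitemap_kind(urlset_doc)]
print(kinds)
```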

Traverse the index list¶

Collect the links to all posts and pages.

In [6]:
li = []
for url in sitemap:
    # Each child sitemap contributes the page URLs it lists.
    li.extend(get_sitemap_index(url, headers))
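
If the same page can appear in more than one child sitemap, it may be worth dropping duplicates while collecting. The following is a sketch under that assumption, not part of the original notebook: flatten_sitemaps and the FAKE_SITE mapping are hypothetical names, and the fetch function is passed in so the example runs offline.

```python
def flatten_sitemaps(index_urls, fetch_locs):
    """Collect page URLs from each child sitemap, dropping duplicates but keeping order."""
    seen = set()
    pages = []
    for sitemap_url in index_urls:
        for page_url in fetch_locs(sitemap_url):
            if page_url not in seen:
                seen.add(page_url)
                pages.append(page_url)
    return pages


# Hypothetical stand-in for network access: child sitemap URL -> page URLs it lists.
FAKE_SITE = {
    'https://example.com/sitemap-a.xml': ['https://example.com/1', 'https://example.com/2'],
    'https://example.com/sitemap-b.xml': ['https://example.com/2', 'https://example.com/3'],
}

pages = flatten_sitemaps(list(FAKE_SITE), FAKE_SITE.__getitem__)
print(pages)
```

With the notebook's own helper, the call would look like flatten_sitemaps(sitemap, lambda u: get_sitemap_index(u, headers)).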

View the collected links¶

In [7]:
li[10:15]
Out[7]:
['https://www.imtrq.com/archives/3367',
 'https://www.imtrq.com/archives/3357',
 'https://www.imtrq.com/archives/3350',
 'https://www.imtrq.com/archives/3343',
 'https://www.imtrq.com/archives/3330']

Push to the Baidu API¶

In [8]:
api = 'http://data.zz.baidu.com/urls?site=https://www.imtrq.com&token=******'
# Send the URLs as a plain-text body, one URL per line.
r = requests.post(api, data='\n'.join(li), headers={'Content-Type': 'text/plain'})
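
For a larger site, it might be safer to send the links in several smaller requests rather than one. The sketch below only builds the newline-joined request bodies; the batch size of 2 is an arbitrary value chosen for the demo, and whether Baidu enforces a per-request limit is not stated in this notebook. The names chunked and build_payloads are hypothetical.

```python
def chunked(urls, size):
    """Yield consecutive slices of `urls`, each holding at most `size` items."""
    if size < 1:
        raise ValueError('size must be at least 1')
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


def build_payloads(urls, size):
    """Return one newline-joined request body per batch."""
    return ['\n'.join(batch) for batch in chunked(urls, size)]


# Hypothetical URLs for illustration.
demo = [f'https://example.com/archives/{i}' for i in range(5)]
payloads = build_payloads(demo, size=2)
print(len(payloads))
```

Each payload could then be posted the same way as the single request above, with data=payload.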

Check the result¶

In [9]:
r.json()
Out[9]:
{'remain': 98900, 'success': 275}
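
The response above reports how many URLs were accepted ('success') and what appears to be the remaining quota ('remain'). A small sketch for comparing that count with the number of URLs sent follows; summarize_push is a hypothetical helper, and submitted=275 is assumed for illustration because the notebook does not show len(li).

```python
def summarize_push(result, submitted):
    """Compare the API's reported 'success' count with the number of URLs sent."""
    accepted = result.get('success', 0)
    return {
        'submitted': submitted,
        'accepted': accepted,
        'not_accepted': submitted - accepted,
        # Interpreting 'remain' as the remaining quota is an assumption.
        'remaining_quota': result.get('remain'),
    }


summary = summarize_push({'remain': 98900, 'success': 275}, submitted=275)
print(summary)
```

In the notebook this would be called as summarize_push(r.json(), len(li)).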