DongHao's 个人在线文档库

网络爬虫

网络爬虫需要下载以下相关的库：

pip install requests
pip install beautifulsoup4
pip install lxml
pip install html5lib或者pip install requests-html

返回网站html也的文本代码信息：

python

import requests
url = "https://jww.zjgsu.edu.cn/2021/1224/c1331a111803/page.html"
res = requests.get(url)
print(res.text.encode("ISO-8859-1").decode("utf-8"))	

res = requests.get('http://www.baidu.com')
print(res)
# 返回200，表示请求网址成功，若为4xx，表示请求失败

response.encoding ：打印网页编码
response.text ：返回文本信息
response.content ：返回二进制数据
response.status_code ：返回响应状态码
response.url：返回访问的网址
response.headers ：返回http响应报头

requests-html全部功能只支持python3.6以及以后的版本

python

from requests-html import HTMLSession
session = HTMLSession()
url = 'https://www.dxsbb.com/news/7566.html'
r = session.get(url)
table = r.html.find('tbody>tr')
for row in table[:21]:
    l=row.text.split()
    s=''
    for i in l:
        s=s+'{0:^14}'.format(i)
    print(s)
    f = open(r"C:\Users\Asus\Desktop\111.txt","a+", encoding='UTF-8') 
    f.write(s + '\n')
f.close()

豆瓣爬取书籍排名：

python

import requests
from lxml import etree
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}
html = requests.get('https://book.douban.com/top250', headers = headers).text
res = etree.HTML(html)
names = res.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a/text()')
for name in names:
    print(name.strip())

其他爬取：（爬取古诗，分段）

python

import requests
from lxml import etree
#headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}
html = requests.get('https://so.gushiwen.cn/shiwenv_d16797ee39e4.aspx').text
res = etree.HTML(html)
names = res.xpath('//*[@id="contsond16797ee39e4"]/text()')
print(names)
name='\n'.join(names)
print(name)
f = open(r"C:\Users\Asus\Desktop\111.txt","a+", encoding='UTF-8')
f.write(name)
f.close()

爬取图片：

python

import requests
response = requests.get('https://github.com/favicon.ico')    # 图片链接
with open('favicon.ico','wb')as f:         # favicon.ico 图片名字，可修改
	f.write(response.content)

网络爬虫 ​

网络爬虫