网络爬虫
网络爬虫需要下载以下相关的库:
pip install requests
pip install beautifulsoup4
pip install lxml
pip install html5lib
或者pip install requests-html
返回网站html
也的文本代码信息:
python
import requests
url = "https://jww.zjgsu.edu.cn/2021/1224/c1331a111803/page.html"
res = requests.get(url)
print(res.text.encode("ISO-8859-1").decode("utf-8"))
res = requests.get('http://www.baidu.com')
print(res)
# 返回200,表示请求网址成功,若为4xx,表示请求失败
response.encoding
:打印网页编码response.text
:返回文本信息response.content
:返回二进制数据response.status_code
:返回响应状态码response.url
:返回访问的网址response.headers
:返回http
响应报头
requests-html
全部功能只支持python3.6
以及以后的版本
python
from requests-html import HTMLSession
session = HTMLSession()
url = 'https://www.dxsbb.com/news/7566.html'
r = session.get(url)
table = r.html.find('tbody>tr')
for row in table[:21]:
l=row.text.split()
s=''
for i in l:
s=s+'{0:^14}'.format(i)
print(s)
f = open(r"C:\Users\Asus\Desktop\111.txt","a+", encoding='UTF-8')
f.write(s + '\n')
f.close()
豆瓣爬取书籍排名:
python
import requests
from lxml import etree
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}
html = requests.get('https://book.douban.com/top250', headers = headers).text
res = etree.HTML(html)
names = res.xpath('//*[@id="content"]/div/div[1]/div/table/tr/td[2]/div[1]/a/text()')
for name in names:
print(name.strip())
其他爬取:(爬取古诗,分段)
python
import requests
from lxml import etree
#headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'}
html = requests.get('https://so.gushiwen.cn/shiwenv_d16797ee39e4.aspx').text
res = etree.HTML(html)
names = res.xpath('//*[@id="contsond16797ee39e4"]/text()')
print(names)
name='\n'.join(names)
print(name)
f = open(r"C:\Users\Asus\Desktop\111.txt","a+", encoding='UTF-8')
f.write(name)
f.close()
爬取图片:
python
import requests
response = requests.get('https://github.com/favicon.ico') # 图片链接
with open('favicon.ico','wb')as f: # favicon.ico 图片名字,可修改
f.write(response.content)