
Incomplete crawl results, and a Scrapy-based crawler walkthrough

Below is part of the code in question, reformatted (the whitespace inside the CSS selectors is reconstructed, since it was lost in the original post; driver, login() and start() come from parts of the post not shown):

from pyquery import PyQuery as pq
import pandas as pd
import time

def get_product():
    html = driver.page_source
    doc = pq(html)
    items = doc('#ctl00_main_gvAll_Control tbody tr').items()
    for item in items:
        product = {
            '序号': [item.find('td:nth-child(2)').text()],
            '标题': [item.find('td:nth-child(3) div nobr a').text()],
            '来源': [item.find('td:nth-child(4) div nobr').text()],
            '姓名': [item.find('td:nth-child(6)').text()],
            '时间': [item.find('td:nth-child(7)').text()],
        }
        # print(product)
        df = pd.DataFrame(product)
        df.to_csv('1111.csv', mode='a', encoding='utf_8_sig', header=0)

def next_page():
    for i in range(2, 6):
        s = driver.find_element_by_xpath("//*[@id='ctl00$main$pager_input']")
        s.clear()
        s.send_keys(i)
        driver.find_element_by_xpath("//*[@id='ctl00$main$pager_btn']").click()
        time.sleep(5)
        get_product()

def main():
    login()
    start()
    get_product()
    next_page()

if __name__ == '__main__':
    main()
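One likely cause of the incomplete data in the snippet above (an assumption on my part, not confirmed in the original post) is that the fixed time.sleep(5) can fire before the next page of the table has finished rendering, so page_source is read too early and rows are skipped. A sketch of next_page() that waits for the first row to actually change instead of sleeping blindly:

from selenium.webdriver.support.ui import WebDriverWait

def next_page():
    for i in range(2, 6):
        # remember the first row's serial number before paging
        old_first = driver.find_element_by_css_selector(
            '#ctl00_main_gvAll_Control tbody tr td:nth-child(2)').text
        s = driver.find_element_by_xpath("//*[@id='ctl00$main$pager_input']")
        s.clear()
        s.send_keys(i)
        driver.find_element_by_xpath("//*[@id='ctl00$main$pager_btn']").click()
        # wait (up to 20 s) until the first row's serial number changes
        WebDriverWait(driver, 20).until(
            lambda d: d.find_element_by_css_selector(
                '#ctl00_main_gvAll_Control tbody tr td:nth-child(2)').text != old_first)
        get_product()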

《 Using Python 3.6: a crawler approach based on the Scrapy framework 》

(I) Introduction
Scraping web data with a crawler has become a common skill. Point-and-click tools such as 八爪鱼 are convenient, but exporting the data costs money. In the long run it is worth sitting down and learning to write crawlers yourself, although there is a real learning curve: you need some programming background, and there are plenty of pitfalls, so it is not for everyone. This is my first crawler project; anyone who needs it is welcome to use it as a reference.

(II) Target URL (this is a JS-rendered page that cannot be parsed directly; the table is paged while the URL stays the same):
http://www.chinabond.com.cn/jsp/include/EJB/queryResult.jsp?pageNumber=1&queryType=0&sType=2&zqdm=&zqjc=&zqxz=07&eYear2=0000&bigPageNumber=1&bigPageLines=500&zqdmOrder=1&fxrqOrder=1&hkrOrder=1&qxrOrder=1&dqrOrder=1&ssltrOrder=1&zqqxOrder=1&fxfsOrder=1&xOrder=12345678&qxStart=0&qxEnd=0&sWhere=&wsYear=&weYear=&eWhere=&sEnd=0&fxksr=-00-00&fxjsr=-00-00&fxStart=-00-00&fxEnd=-00-00&dfStart=-00-00&dfEnd=-00-00&start=0&zqfxr=&fuxfs=&faxfs=00&zqxs=00&bzbh=&sYear=&sMonth=00&sDay=00&eYear=&eMonth=00&eDay=00&fxStartYear=&fxStartMonth=00&fxStartDay=00&fxEndYear=&fxEndMonth=00&fxEndDay=00&dfStartYear=&dfStartMonth=00&dfStartDay=00&dfEndYear=&dfEndMonth=00&dfEndDay=00&col=28%2C2%2C5%2C33%2C7%2C21%2C11%2C12%2C23%2C25
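A quick way to confirm that the table only exists after JavaScript rendering (a sketch; it assumes the requests package and a Selenium 3.x release that still ships PhantomJS support):

import requests
from selenium import webdriver

url = "http://www.chinabond.com.cn/jsp/include/EJB/queryResult.jsp?pageNumber=1&..."  # use the full query string shown above

raw = requests.get(url).text
print(raw.count("<tr"))  # the raw HTML cannot be parsed directly, so expect few or no data rows here

driver = webdriver.PhantomJS()  # assumes phantomjs.exe is on the PATH
driver.get(url)
print(len(driver.find_elements_by_xpath("//*[@id='bodyTable']/tbody/tr")))  # rows appear once the page is rendered
driver.quit()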

(III) Tools: Scrapy + Selenium + PhantomJS, on the Python 3.6 platform.

Python 3.6 -> PyCharm -> CMD -> Scrapy -> Selenium -> PhantomJS -> Chrome + XPath Helper -> MySQL -> Navicat Premium -> ODBC configuration -> application layer such as Stata
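A sketch of the environment setup in CMD (the exact packages and versions are my assumptions; PhantomJS and the find_element_by_xpath calls used below need Selenium 3.x, not 4):

pip3 install scrapy "selenium<4" pymysql
scrapy startproject myspider
cd myspider
scrapy genspider chinabond chinabond.com.cn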

(IV) Programming approach:

  1. Create a new Scrapy project and the main spider file that will parse the page;

  2. Do a first pass over the page with Chrome and the XPath Helper extension, decide which fields to capture, and obtain their XPath locations;

  3. Write items.py, which defines the fields to scrape and, in get_insert_sql(), the SQL used to write them to MySQL; note that the field names must match the columns of the MySQL table:

import scrapy

class MyspiderItem(scrapy.Item):
    # Fields to scrape
    id = scrapy.Field()          # serial number
    name = scrapy.Field()        # short name
    code = scrapy.Field()        # bond code
    issuer = scrapy.Field()      # issuer
    date = scrapy.Field()        # issue date
    amount = scrapy.Field()      # issue amount
    payway = scrapy.Field()      # interest payment method
    rate = scrapy.Field()        # interest rate
    deadline = scrapy.Field()    # term
    startdate = scrapy.Field()   # value date
    enddate = scrapy.Field()     # maturity date

    # SQL for writing into MySQL; names must match the chinabond table columns
    def get_insert_sql(self):
        insert_sql = """
            insert into chinabond(
                id, name, code, issuer, date, amount,
                payway, rate, deadline, startdate, enddate
            ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
        """
        params = (
            self["id"], self["name"], self["code"], self["issuer"],
            self["date"], self["amount"], self["payway"], self["rate"],
            self["deadline"], self["startdate"], self["enddate"],
        )
        return insert_sql, params

  4. Write the pipeline file, pipelines.py, which writes the items into MySQL (or another database):

from twisted.enterprise import adbapi
import pymysql
import pymysql.cursors

class MyspiderPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        # Asynchronous connection pool on top of pymysql
        dbpool = adbapi.ConnectionPool(
            "pymysql",
            host="localhost",
            db="mysql",
            user="root",
            password="123",
            charset="utf8mb4",
            cursorclass=pymysql.cursors.DictCursor,
            use_unicode=True
        )
        return cls(dbpool)

    def process_item(self, item, spider):
        # Run the insert in a worker thread from the pool
        self.dbpool.runInteraction(self.do_insert, item)
        return item

    def do_insert(self, cursor, item):
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
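The pipeline assumes the chinabond table already exists in MySQL. A minimal one-off sketch for creating it (the column types and lengths are assumptions to adjust to the real data; the connection settings mirror the pipeline above):

import pymysql

ddl = """
CREATE TABLE IF NOT EXISTS chinabond (
    id        VARCHAR(16),
    name      VARCHAR(64),
    code      VARCHAR(32),
    issuer    VARCHAR(128),
    date      VARCHAR(16),
    amount    VARCHAR(32),
    payway    VARCHAR(32),
    rate      VARCHAR(16),
    deadline  VARCHAR(16),
    startdate VARCHAR(16),
    enddate   VARCHAR(16)
) DEFAULT CHARSET=utf8mb4
"""

conn = pymysql.connect(host="localhost", user="root", password="123", db="mysql", charset="utf8mb4")
with conn.cursor() as cur:
    cur.execute(ddl)   # create the target table once, before the first crawl
conn.commit()
conn.close()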

  5. In settings.py, enable the pipeline and configure the request headers:

(1) The site may have anti-crawling measures. To avoid being identified as a bot, make the requests look like they come from an ordinary browser by enabling the default headers:

DEFAULT_REQUEST_HEADERS = {
    'host': 'www.chinabond.com.cn',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'accept-language': 'zh-CN,zh;q=0.9',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
}

(2) Enable the pipelines.py pipeline:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'myspider.pipelines.MyspiderPipeline': 300,
}
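Two other settings that are often worth setting for a target like this (the values here are illustrative assumptions, not taken from the original project):

ROBOTSTXT_OBEY = False  # the table itself is fetched through Selenium anyway
DOWNLOAD_DELAY = 2      # keep some spacing between requests to the site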

  6. Write the main spider file, D:\scrapy\myspider\myspider\spiders\chinabond.py, which parses the page and extracts the elements.

# -*- coding: utf-8 -*-
import scrapy
from myspider.items import MyspiderItem
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

class ChinabondSpider(scrapy.Spider):
    name = 'chinabond'
    allowed_domains = ['chinabond.com.cn']
    start_urls = ["http://www.chinabond.com.cn/jsp/include/EJB/queryResult.jsp?pageNumber=1&queryType=0&sType=2&zqdm=&zqjc=&zqxz=07&eYear2=0000&bigPageNumber=1&bigPageLines=500&zqdmOrder=1&fxrqOrder=1&hkrOrder=1&qxrOrder=1&dqrOrder=1&ssltrOrder=1&zqqxOrder=1&fxfsOrder=1&xOrder=12345678&qxStart=0&qxEnd=0&sWhere=&wsYear=&weYear=&eWhere=&sEnd=0&fxksr=-00-00&fxjsr=-00-00&fxStart=-00-00&fxEnd=-00-00&dfStart=-00-00&dfEnd=-00-00&start=0&zqfxr=&fuxfs=&faxfs=00&zqxs=00&bzbh=&sYear=&sMonth=00&sDay=00&eYear=&eMonth=00&eDay=00&fxStartYear=&fxStartMonth=00&fxStartDay=00&fxEndYear=&fxEndMonth=00&fxEndDay=00&dfStartYear=&dfStartMonth=00&dfStartDay=00&dfEndYear=&dfEndMonth=00&dfEndDay=00&col=28%2C2%2C5%2C33%2C7%2C21%2C11%2C12%2C23%2C25"]

    def parse(self, response):
        # Drive PhantomJS so the JS-generated table actually gets rendered
        driver = webdriver.PhantomJS(executable_path=r'C:\Python36\phantomjs-2.1.1-windows\bin\phantomjs.exe')
        url = self.start_urls[0]  # same query URL as above
        driver.get(url)

        page_n = len(driver.find_elements_by_xpath('//*[@id="sel"]/option'))  # total number of pages, as an integer

        for j in range(1, 3):  # page loop; 3 is only for testing, use range(1, page_n + 1) to crawl everything
            page_i = int(driver.find_element_by_xpath('//*[@id="nowpage"]').text)  # current page number, as an integer
            print("Now on page " + str(page_i) + " of " + str(page_n))

            for i in range(2, 22):  # data rows of the current page sit in tr 2-21, header excluded (tr = row, td = column)
                item = MyspiderItem()
                item['id'] = driver.find_element_by_xpath("//*[@id='bodyTable']/tbody/tr[" + str(i) + "]/td[1]").text
                item['name'] = driver.find_element_by_xpath("//*[@id='bodyTable']/tbody/tr[" + str(i) + "]/td[2]").text
                item['code'] = driver.find_element_by_xpath("//*[@id='bodyTable']/tbody/tr[" + str(i) + "]/td[3]").text
                item['issuer'] = driver.find_element_by_xpath("//*[@id='bodyTable']/tbody/tr[" + str(i) + "]/td[4]").text
                item['date'] = driver.find_element_by_xpath("//*[@id='bodyTable']/tbody/tr[" + str(i) + "]/td[5]").text
                item["amount"] = driver.find_element_by_xpath("//*[@id='bodyTable']/tbody/tr[" + str(i) + "]/td[6]").text
                item["payway"] = driver.find_element_by_xpath("//*[@id='bodyTable']/tbody/tr[" + str(i) + "]/td[7]").text
                item["rate"] = driver.find_element_by_xpath("//*[@id='bodyTable']/tbody/tr[" + str(i) + "]/td[9]").text
                item["deadline"] = driver.find_element_by_xpath("//*[@id='bodyTable']/tbody/tr[" + str(i) + "]/td[10]").text
                item["startdate"] = driver.find_element_by_xpath("//*[@id='bodyTable']/tbody/tr[" + str(i) + "]/td[11]").text
                item["enddate"] = driver.find_element_by_xpath("//*[@id='bodyTable']/tbody/tr[" + str(i) + "]/td[12]").text

                yield item  # hand the current row to the pipeline

            driver.find_element_by_xpath('//*[@id="xiayiye"]/a/img').click()  # click "next page"

            def load_ok(driver):  # the next page has finished loading once the displayed page number changes
                return int(driver.find_element_by_xpath('//*[@id="nowpage"]').text) != page_i

            WebDriverWait(driver, 20).until(load_ok)  # wait for the next page to load before looping again

        yield scrapy.Request(url, callback=self.parse)

Note: in this example the URL never changes when paging, and the table is rendered by JavaScript, so the page cannot be parsed directly; that is exactly why Selenium + PhantomJS is used. The final scrapy.Request would only matter if the URL actually changed; here the data is collected by the two nested loops instead. The load_ok check is essential, because each iteration must wait until the next page has finished loading before reading it.
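With items, pipeline, settings, and spider in place, the crawl is started from the project root in CMD:

scrapy crawl chinabond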

(V) Run it, and the job is done!

  1. Screenshot of the crawl in progress (164 pages, 3,270 records):


  2. Screenshot of the data loaded into MySQL:
  3. Result of importing the data into Stata through the ODBC channel
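Before switching to Stata, the load can also be sanity-checked from Python (a sketch; it assumes pandas and pymysql are installed and reuses the pipeline's connection settings):

import pandas as pd
import pymysql

conn = pymysql.connect(host="localhost", user="root", password="123", db="mysql", charset="utf8mb4")
print(pd.read_sql("SELECT COUNT(*) AS n FROM chinabond", conn))  # should match the number of records crawled
conn.close()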

(Appendix A) Installing Scrapy:

Python package downloads: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
Reference: http://blog.csdn.net/HHTNAN/article/details/77931782

How to install the Scrapy framework with pip3:


====================================

1. Open CMD, switch to the Python directory, and install/upgrade pip: python -m pip install --upgrade pip --force-reinstall

Note: after installation, pip3 lives in the Scripts directory; in CMD, switch to Python\Scripts and run pip3 -V to check the version.

2. In CMD, go into the Scripts directory and install the Scrapy framework with pip3: pip3 install scrapy
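A quick check that the installation succeeded:

scrapy version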
