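Although I call it a database, it is really just a set of text files holding some basic information.
Remember the crawler I wrote a few days ago? It had no real bugs, but because the parameters I configured were too simple and the website itself had problems, I ended up crawling for many days in a row and redoing the job several times. Add deduplication on top of that, and I am only now able to publish the data.
This has also strengthened my resolve to learn the Scrapy crawler framework; once I have learned it, I will rebuild the crawler with Scrapy.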
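For anyone curious what the deduplication step mentioned above might look like, here is a minimal sketch. It is not the code I actually ran: the function name `dedupe_rows`, the column names `firm_id` and `name`, and the sample rows are all made up for illustration, and it assumes the records fit in memory and that one or more columns identify a record uniquely.

```python
import csv
import io


def dedupe_rows(rows, key_fields):
    """Return rows with duplicates removed, keeping the first occurrence.

    Two rows count as duplicates when all of their key_fields match
    after surrounding whitespace is stripped.
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field].strip() for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Hypothetical sample standing in for the output of several crawl runs.
sample = io.StringIO(
    "firm_id,name\n"
    "001,Firm A\n"
    "002,Firm B\n"
    "001,Firm A\n"
)
rows = list(csv.DictReader(sample))
unique_rows = dedupe_rows(rows, ["firm_id"])
print(len(rows), len(unique_rows))
```

Keeping the first occurrence preserves the original crawl order, which makes it easier to compare the output of repeated runs.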
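Enough rambling, here is the data:
I don't feel like writing an introduction; if you need the data, just download it and take a look.
I made a show of writing a readme a while back, so I'll post that here for now.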
Source
Downloaded from the Chinese Institute of Certified Public Accountants Information Management System (see cicpa.org.cn).
Description
'Audit Firms' is divided into two groups (Headquarters and Branches) because the information provided for each differs.
'NwPer' lists the basic information of practitioners in Audit Firms.
'CPA' is the list of certified public accountants.
'Profile' is the profile of certified public accountants.
Code
Since the data were crawled by a Python bot, the source code is also available on my blog.
However, because the crawler was written with the Requests module and my coding skills are limited, bugs are unavoidable. I plan to rewrite it as a more powerful spider with the Scrapy framework someday.
Warning
The data contain information on over 9,000 audit firms and about 240,000 auditors, but completeness is not guaranteed.
Moreover, the data were downloaded at the beginning of 2020, so their timeliness should be taken into consideration.