
Design and Implementation of a Web Data Crawler


With the rapid development of information technology, the world has entered a highly information-driven state and online information resources are growing explosively. Traditional means of information retrieval are no longer able to meet the needs of users in different industries and roles. Because the web crawler is both a core component of the search engine and the foundation it is built on, it plays a particularly important part in improving search efficiency for Internet users. This paper first introduces the background and significance of the research, the current state of research in China and abroad, and the main contents of the paper. It then introduces the basic concept of the web crawler, the types of web crawler, and the search strategies a crawler system uses to extract and store web data. Next, it describes the design and implementation of the crawler system in detail, covering the characteristics of the Python language in which the crawler is written, the advantages of the PyCharm IDE, the urllib library, and the tkinter graphical interface. Using a crawler for the draw records of the Double Color Ball lottery as an example, the urllib library is combined with regular expressions to perform substring matching, data crawling, and subsequent storage. Finally, the crawl results are checked against manually collected data, the time spent by each approach is compared, and an objective conclusion is drawn.
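The combination of urllib and regular expressions described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the table markup, the `red`/`blue` class names, the helper names (`fetch`, `parse_draws`, `to_csv`), and the sample row are all assumptions made up for illustration, and a real results page would need its own patterns.

```python
import csv
import io
import re
import urllib.request

# Assumed (hypothetical) markup: one table row per draw, with a 7-digit
# issue number, six cells of class "red", and one cell of class "blue".
ROW_RE = re.compile(r"<tr>\s*<td>(\d{7})</td>(.*?)</tr>", re.S)
RED_RE = re.compile(r'<td class="red">(\d{2})</td>')
BLUE_RE = re.compile(r'<td class="blue">(\d{2})</td>')


def fetch(url, timeout=10):
    """Download a page with urllib and decode it to text."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def parse_draws(html):
    """Extract (issue, red_balls, blue_ball) tuples by substring matching."""
    draws = []
    for issue, cells in ROW_RE.findall(html):
        reds = RED_RE.findall(cells)
        blues = BLUE_RE.findall(cells)
        # Keep only complete rows: six red balls and one blue ball.
        if len(reds) == 6 and len(blues) == 1:
            draws.append((issue, reds, blues[0]))
    return draws


def to_csv(draws):
    """Serialize the parsed draws as CSV text for storage."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["issue", "red", "blue"])
    for issue, reds, blue in draws:
        writer.writerow([issue, " ".join(reds), blue])
    return buf.getvalue()


# Made-up sample input; the numbers are not real draw results.
SAMPLE = (
    '<tr><td>2017001</td>'
    '<td class="red">03</td><td class="red">08</td><td class="red">15</td>'
    '<td class="red">21</td><td class="red">27</td><td class="red">33</td>'
    '<td class="blue">07</td></tr>'
    '<tr><td>2017002</td><td class="red">05</td></tr>'  # incomplete row
)

if __name__ == "__main__":
    print(to_csv(parse_draws(SAMPLE)))
```

In practice, `fetch` would be called on the results page and its return value passed to `parse_draws`; the sample string stands in for that page so the parsing and storage steps can be run offline.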


