CN107689951A - Web page data crawling method, device, user terminal and readable storage medium
- Publication number
- CN107689951A CN201710619263.4A
- Authority
- CN
- China
- Prior art keywords
- crawled
- website
- data
- crawling
- server
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/08—Network architectures or network communication protocols for network security for authentication of entities
- H04L63/083—Network architectures or network communication protocols for network security for authentication of entities using passwords
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/04—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
- H04L63/0428—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/10—Network architectures or network communication protocols for network security for controlling access to devices or network resources
- H04L63/102—Entity profiles
Landscapes
- Engineering & Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Computer Hardware Design (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a web page data crawling method, device, user terminal and readable storage medium.
Background
At present, a large amount of valuable information on the Internet needs to be crawled to a server for analysis, for example to analyze user behavior. The server may, for instance, submit an account and password to a website to be crawled in order to log in to it, and then crawl the data stored on that website. However, because websites now enforce strict security mechanisms, crawling the information of too many accounts from the same IP address triggers the website's risk control mechanism, so that the users' accounts are blocked and the users can no longer use them.
Summary of the invention
In view of this, it is necessary to provide a web page data crawling method, device, user terminal and readable storage medium that address the problem of accounts being blocked when a server crawls the data of a website to be crawled.
A web page data crawling method, the method comprising:
receiving, through a login interface of a website to be crawled that is embedded in a client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with that account and password;
detecting whether the website to be crawled has been logged in to successfully;
when the website to be crawled has been logged in to successfully, judging whether the account of the client matches the account of the website to be crawled;
when the account of the client matches the account of the website to be crawled, crawling the data to be crawled from the website to be crawled;
sending the crawled data to a server.
In one embodiment, the step of crawling the data to be crawled from the website to be crawled includes:
sending a crawling script acquisition request to the server;
receiving the crawling script returned by the server in response to the crawling script acquisition request;
crawling the data to be crawled from the website to be crawled by means of the crawling script.
In one embodiment, before the step of sending the crawling script acquisition request to the server, the method further includes:
obtaining the time at which a crawling script was last received from the server;
when the difference between the time at which the crawling script was last received from the server and the current time is within a preset range, crawling the data to be crawled from the website to be crawled by means of the crawling script last received from the server; when that difference is not within the preset range, performing the step of sending the crawling script acquisition request to the server.
In one embodiment, the step of detecting whether the website to be crawled has been logged in to successfully includes:
detecting whether the URL of the current page displayed by the client has changed;
when the URL of the current page displayed by the client has changed, determining that the website to be crawled has been logged in to successfully;
when the URL of the current page displayed by the client has not changed, determining that the website to be crawled has not been logged in to successfully.
In one embodiment, the website to be crawled is a mailbox website;
the step of crawling the data to be crawled from the website to be crawled includes:
selecting, from the mailbox website, the emails whose titles correspond to the data to be crawled;
crawling the data of preset fields from the selected emails as the crawled data.
In one embodiment, the step of sending the crawled data to the server includes:
encrypting the crawled data;
packaging the encrypted data;
sending the packaged data to the server.
A web page data crawling device, the device comprising:
a login module, configured to receive, through a login interface of a website to be crawled that is embedded in a client, an input account and password corresponding to the website to be crawled, and to log in to the website to be crawled with that account and password;
a detection module, configured to detect whether the website to be crawled has been logged in to successfully;
a verification module, configured to judge, when the website to be crawled has been logged in to successfully, whether the account of the client matches the account of the website to be crawled;
a crawling module, configured to crawl the data to be crawled from the website to be crawled when the account of the client matches the account of the website to be crawled;
a sending module, configured to send the crawled data to a server.
In one embodiment, the sending module is further configured to send a crawling script acquisition request to the server;
the crawling module includes:
a receiving unit, configured to receive the crawling script returned by the server in response to the crawling script acquisition request;
a crawling unit, configured to crawl the data to be crawled from the website to be crawled by means of the crawling script.
A user terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.
A computer-readable storage medium on which a computer program is stored, wherein the computer program implements the steps of the above method when executed by a processor.
With the above web page data crawling method, device, user terminal and readable storage medium, the data to be crawled is crawled from the website to be crawled by the client. After the website to be crawled has been logged in to through the login interface embedded in the client, it is verified whether the account of the website to be crawled corresponds to the account of the client, which ensures that the crawled data belongs to the user of the client. The crawled data is then sent to the server for processing and analysis. This avoids the situation in which crawling the data on the server side triggers the risk control mechanism and causes the user's account to be locked.
Brief description of the drawings
Fig. 1 is a diagram of the application environment of the web page data crawling method in one embodiment;
Fig. 2 is a flowchart of the web page data crawling method in one embodiment;
Fig. 3 is a flowchart of step S208 of the embodiment shown in Fig. 2;
Fig. 4 is a diagram of the QQ mailbox login interface in one embodiment;
Fig. 5 is a diagram of the interface shown while bill data is being crawled in one embodiment;
Fig. 6 is a diagram of the interface shown after bill data has been crawled successfully in one embodiment;
Fig. 7 is another flowchart of step S208 of the embodiment shown in Fig. 2;
Fig. 8 is a flowchart of step S210 of the embodiment shown in Fig. 2;
Fig. 9 is a schematic structural diagram of the web page data crawling device in one embodiment;
Fig. 10 is a schematic structural diagram of the user terminal in one embodiment.
Detailed description
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
Before the embodiments of the present invention are described in detail, it should be noted that the embodiments reside primarily in combinations of steps and system components related to the web page data crawling method, device, user terminal and readable storage medium. Accordingly, the system components and method steps have been represented where appropriate by conventional symbols in the drawings, showing only those details that are pertinent to understanding the embodiments of the present invention, so as not to obscure the disclosure with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
In this document, relational terms such as left and right, up and down, front and back, and first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. The terms "comprise", "include" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or apparatus.
Referring to Fig. 1, Fig. 1 is a diagram of the application environment of the web page data crawling method in one embodiment. This embodiment includes a server and several user terminals, and the server can communicate with each of the user terminals. A client APP is installed on each user terminal, and the website to be crawled is embedded in the client APP. The user terminal may be a mobile phone, a tablet, a computer or a similar terminal. The client APP installed on the user terminal may be an APP from any APP provider in which the website to be crawled is embedded; for example, a mailbox login interface may be embedded in a client APP such as WeChat.
Referring to Fig. 2, in one embodiment a web page data crawling method is provided. This embodiment is illustrated by applying the method to the server in Fig. 1 above. A web page data crawling program runs on the server, and the web page data crawling method is implemented through that program. The method specifically includes the following steps:
S202: Through the login interface of the website to be crawled that is embedded in the client, receive the input account and password corresponding to the website to be crawled, and log in to the website to be crawled with that account and password.
Specifically, the client is an application, such as an APP, installed on the user terminal, in which the login interface of the website to be crawled is embedded. That login interface may be a mailbox login interface or an e-commerce login interface, for example the QQ mailbox, 126 mailbox, 163 mailbox, Taobao, Alipay, Jingdong or Vipshop login interface.
After the user has logged in to the client with the client account, the user opens the login interface of the website to be crawled and enters the account and password of that website, and can thus log in to the website to be crawled through the login interface embedded in the client. For example, a QQ mailbox login interface is embedded in the "Ping An One Account APP". The user first logs in to the "Ping An One Account APP", then opens the QQ mailbox login interface embedded in it and logs in to the QQ mailbox by entering the QQ mailbox account and password there.
S204: Detect whether the website to be crawled has been logged in to successfully.
Specifically, the data to be crawled can only be crawled after the website to be crawled has been logged in to successfully. Before crawling, it is therefore necessary to detect whether the login succeeded; if it did not, the data to be crawled cannot be crawled from the website.
S206: When the website to be crawled has been logged in to successfully, judge whether the account of the client matches the account of the website to be crawled.
Specifically, in some cases a user may use his or her own client to log in to another user's account on the website to be crawled. If that other user's account were crawled as well, the data finally crawled would not be the user's own, resulting in erroneous data. Therefore, to ensure before crawling that the account of the website to be crawled belongs to the user, it is judged whether the account of the client matches the account of the website to be crawled. For example, a unique identifier of the user, such as the user's ID card number, can be set in the user account of the "Ping An One Account APP", and a unique identifier of the user, such as the ID card number, can likewise be set in the QQ mailbox account. Only when the unique identifier of the "Ping An One Account APP" account matches the unique identifier of the QQ mailbox account does the method proceed to the next step and crawl the data to be crawled from the website to be crawled.
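The matching check in S206 can be sketched as a comparison of the unique identifiers bound to the two accounts. This is a minimal illustration under stated assumptions, not the method as claimed: the `Account` type, the `accounts_match` function and the whitespace and case normalization are hypothetical, and the text does not say how either identifier is actually obtained.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Account:
    """An account and the unique user identifier bound to it (hypothetical model)."""
    name: str
    unique_id: Optional[str]  # e.g. an ID card number; None when none is set


def accounts_match(client: Account, site: Account) -> bool:
    """Return True only when both accounts carry the same non-empty identifier."""
    if not client.unique_id or not site.unique_id:
        # A missing identifier cannot show that both accounts belong to one user.
        return False
    # Assumption: identifiers are compared ignoring surrounding spaces and case.
    return client.unique_id.strip().upper() == site.unique_id.strip().upper()
```

Treating a missing identifier as a mismatch keeps the check conservative: crawling proceeds only when ownership of the website account is positively established.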
S208: When the account of the client matches the account of the website to be crawled, crawl the data to be crawled from the website to be crawled.
Specifically, when the account of the client matches the account of the website to be crawled, this shows that the data to be crawled on that website is the user's own data, and the data can be crawled directly by the crawling program in the client. In this way each user's data is crawled on that user's own terminal, rather than all users' data being crawled on the server, which effectively prevents the risk control mechanism of the website to be crawled from locking the user's account on that website.
S210: Send the crawled data to the server.
Specifically, once the user terminal has crawled the data to be crawled, it can send the data to the server, so that the server can provide the user with corresponding services based on the data. For example, when the user terminal crawls a credit card bill from the QQ mailbox, the server can use the bill data to remind the user when a repayment is due, or can offer the user a repayment red envelope, for example a 5 yuan deduction red envelope when the user needs to repay 1,000 yuan.
With the above web page data crawling method, the data to be crawled is crawled from the website to be crawled by the client. After the website to be crawled has been logged in to through the login interface embedded in the client, it is verified whether the account of the website to be crawled corresponds to the account of the client, which ensures that the crawled data belongs to the user of the client. The crawled data is then sent to the server for processing and analysis. This avoids the situation in which crawling the data on the server side triggers the risk control mechanism and causes the user's account to be locked.
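The order of steps S202 to S210 can be sketched as a small driver in which each step is passed in as a callable. The function name `run_crawl` and its four parameters are hypothetical names introduced only for illustration; how each step is carried out is left to the embodiments.

```python
from typing import Callable, Optional


def run_crawl(
    login: Callable[[], bool],
    accounts_match: Callable[[], bool],
    crawl: Callable[[], dict],
    send_to_server: Callable[[dict], None],
) -> Optional[dict]:
    """Run steps S202 to S210 in order; return the data sent, or None if aborted."""
    if not login():            # S202 and S204: log in, then check that it succeeded
        return None
    if not accounts_match():   # S206: client account must match the website account
        return None
    data = crawl()             # S208: crawl on the user terminal
    send_to_server(data)       # S210: hand the crawled data to the server
    return data
```

Because a failed login or a mismatched account returns early, nothing is crawled or sent unless both checks pass.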
In one embodiment, referring to Fig. 3, Fig. 3 is a flowchart of step S208 of the embodiment shown in Fig. 2. Step S208, that is, the step of crawling the data to be crawled from the website to be crawled, may include:
S302: Send a crawling script acquisition request to the server.
Specifically, the crawling script is a script that runs on the user terminal and is used to crawl the data to be crawled from the website to be crawled. The crawling script is stored on the server, so it only needs to be modified on the server side, and the new crawling script can simply be downloaded from the server before the next crawl. Because it takes the form of a script, it occupies little space and is transmitted quickly. When the account of the client matches the account of the website to be crawled, the user terminal sends a crawling script acquisition request to the server. After receiving the request, the server looks up the crawling script, packages it and sends it to the corresponding client, which reduces the amount of data transmitted.
S304: Receive the crawling script returned by the server in response to the crawling script acquisition request.
Specifically, after the server has found the crawling script corresponding to the crawling script acquisition request, it sends the crawling script to the user terminal, and the user terminal can then use the crawling script to crawl the data to be crawled from the website to be crawled.
S306: Crawl the data to be crawled from the website to be crawled by means of the crawling script.
Specifically, the user terminal crawls the corresponding data with the crawling script downloaded from the server. Referring to Figs. 4 to 6, Fig. 4 is a diagram of the QQ mailbox login interface in one embodiment, Fig. 5 is a diagram of the interface shown while bill data is being crawled in one embodiment, and Fig. 6 is a diagram of the interface shown after bill data has been crawled successfully in one embodiment. The QQ mailbox interface is embedded in the client on the user terminal, and the user logs in to the QQ mailbox by entering the account and password in that interface, as in Fig. 4. After the QQ mailbox login has succeeded and the user terminal has determined that the account of the client matches the QQ mailbox account, the user terminal downloads the crawling script from the server and uses it to crawl the bill information in the QQ mailbox. As in Fig. 5, the progress of the crawl can be displayed; Fig. 5 shows that the QQ mailbox has been verified successfully, that the corresponding bill has been found, and that 64% of the bill has been crawled. Once the user terminal has crawled the data to be crawled, that is, the bill, it can notify the user that the crawl is complete, as in Fig. 6.
In the above embodiment, when the account of the client matches the account of the website to be crawled, a request for the script is sent to the server; after receiving it, the server packages the latest script and transmits it to the user terminal. This has two benefits. First, because the script is stored on the server, the crawling script only needs to be modified on the server; if the crawling script were instead delivered with the client installation package, every change to the script would require a new installation package to be issued, increasing how often the client has to be updated. Second, packaging the crawling script before sending it reduces the amount of data transmitted.
In one embodiment, before the step of sending the crawling script acquisition request to the server, the method may further include: obtaining the time at which a crawling script was last received from the server; when the difference between that time and the current time is within a preset range, crawling the data to be crawled from the website to be crawled by means of the crawling script last received from the server; and when the difference is not within the preset range, performing the step of sending the crawling script acquisition request to the server, that is, step S302.
Specifically, a preset range is set to prevent the user terminal from downloading the crawling script from the server repeatedly within a short time. As long as the difference between the time at which the crawling script was last obtained from the server and the current time is within the preset range, the user terminal does not need to download the crawling script from the server again. The preset range may be 30 minutes, 1 hour, 2 hours, 1 day, 1 week and so on, and is not limited here. For example, suppose the crawling script was downloaded from the server during the previous crawl at 9:30 am and the preset range is 2 hours. If the next crawl takes place at 10:30 am, the difference from 9:30 am is 1 hour, which is less than the 2-hour preset range, so the crawl at 10:30 am uses the crawling script last downloaded from the server and no new download is needed. If the next crawl instead takes place at 2:30 pm, the difference from 9:30 am is 5 hours, which exceeds the 2-hour preset range, so the crawling script has to be downloaded from the server again for the crawl at 2:30 pm.
In the above embodiment, after the account of the client has been matched with the account of the website to be crawled, the time at which the crawling script was last obtained is checked first. If the difference between that time and the current time is within the preset range, the crawling script stored on the user terminal is used directly and nothing is downloaded from the server. This avoids, for example, downloading the script every time a user logs in to the QQ mailbox to synchronize bills several times in one day, which would waste data traffic.
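The reuse rule for the crawling script amounts to a cache with a maximum age. The following is a minimal sketch under stated assumptions: the class name `ScriptCache`, the `fetch` callable standing in for steps S302 and S304, and the injectable `clock` are hypothetical, and a difference exactly equal to the preset range is treated here as still within it.

```python
import time
from typing import Callable, Optional


class ScriptCache:
    """Keep the last downloaded crawling script and reuse it while it is fresh."""

    def __init__(self, fetch: Callable[[], str], max_age_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch              # stands in for S302/S304 (download from server)
        self._max_age = max_age_seconds  # the "preset range"
        self._clock = clock              # injectable so the rule can be exercised
        self._script: Optional[str] = None
        self._fetched_at: Optional[float] = None

    def get(self) -> str:
        """Return the cached script if within the preset range, else download again."""
        now = self._clock()
        fresh = (
            self._script is not None
            and self._fetched_at is not None
            and now - self._fetched_at <= self._max_age
        )
        if not fresh:
            self._script = self._fetch()
            self._fetched_at = now
        return self._script
```

With a 2-hour range and a first download at 9:30 am, a call at 10:30 am reuses the stored script, while a call at 2:30 pm triggers a new download, matching the example in the text.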
In one embodiment, the step of detecting whether the website to be crawled has been logged in to successfully, that is, step S204 in the embodiment shown in Fig. 2, may include: detecting whether the URL of the current page displayed by the client has changed; when the URL of the current page displayed by the client has changed, determining that the website to be crawled has been logged in to successfully; and when the URL of the current page displayed by the client has not changed, determining that the website to be crawled has not been logged in to successfully.
Specifically, because different web pages have different URLs (Uniform Resource Locators), whether the website to be crawled has been logged in to successfully can be determined by detecting whether the URL of the page has changed. For example, the URL of the login interface of the website to be crawled may be A, and after a successful login the URL may become B. If the login fails, the client stays on the current login interface, that is, the URL is still A. Whether the login succeeded can therefore be determined simply by judging whether the URL has changed.
In the above embodiment, whether the website to be crawled has been logged in to successfully can be detected by checking whether the URL of the client's current interface has changed; the URL changes only when the login succeeds. When the login fails, the URL of the client's current interface stays the same and a corresponding login failure message is displayed.
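The URL comparison can be sketched with Python's standard `urllib.parse`. The function names and the normalization are assumptions added for illustration: the text only says "whether the URL has changed", while this sketch additionally ignores letter case in the host, a trailing slash and the fragment, so that a login page that merely appends something like `#error` is not mistaken for a successful login.

```python
from urllib.parse import urlsplit


def _normalize(url: str) -> tuple:
    """Reduce a URL to the parts compared here (the fragment is deliberately dropped)."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return (parts.scheme.lower(), parts.netloc.lower(), path, parts.query)


def login_succeeded(login_url: str, current_url: str) -> bool:
    """Treat the login as successful only if the page URL differs from the login URL."""
    return _normalize(login_url) != _normalize(current_url)
```

Here `login_url` plays the role of address A and `current_url` is the address shown by the client after the credentials are submitted; the example hosts used to exercise the function are placeholders.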
In one embodiment, the website to be crawled is a mailbox website. Referring to FIG. 7, FIG. 7 is another flowchart of step S208 in the embodiment shown in FIG. 2. Step S208, that is, the step of crawling the data to be crawled from the website to be crawled, may include:
S702: Select, from the mailbox website, the emails whose titles correspond to the data to be crawled.
Specifically, because a mailbox may store a large amount of data while the server is concerned only with the emails corresponding to the data to be crawled, those emails can first be selected from the mailbox according to the nature of the data to be crawled. For example, when bill data needs to be crawled, the emails whose titles relate to bills are crawled first.
S704: Crawl the data in preset fields from the selected emails as the crawled data.
Specifically, a bill email may hold a large amount of bill information, for example name, date, amount spent and payee, while the server may need only the name and the amount spent. In that case the user terminal crawls only the data in the name and amount fields from the selected emails and does not need to crawl any additional data.
In the above embodiment, the emails are first located by title. For example, the titles of the emails in the inbox, or of the emails received in the inbox during a certain period, can be traversed to determine the emails associated with credit card bills. When the user uses the client APP for the first time, the emails in the entire inbox need to be traversed. When it is not the first use, the time at which the server most recently obtained bills can be acquired, and only the emails received after that time need to be traversed. Once the emails whose titles relate to bills have been located, the content of the preset fields is obtained. For example, only the date, summary, income and expenditure may be obtained, which filters out useless information; alternatively, all information may be obtained, such as the balance and the counterparties of income and expenditure.
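Steps S702 and S704, together with the incremental traversal described above, can be sketched as follows. The `Mail` record, the keyword list and the field names are hypothetical and serve only as illustration; the text does not specify how titles are matched, so a case-insensitive substring match is assumed here.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, Iterable, List, Optional


@dataclass
class Mail:
    """Hypothetical in-memory view of one email in the inbox."""
    subject: str
    received: datetime
    fields: Dict[str, str] = field(default_factory=dict)


def select_mails(inbox: Iterable[Mail], keywords: List[str],
                 since: Optional[datetime] = None) -> List[Mail]:
    """S702: keep emails whose title matches a keyword.

    `since` is the time bills were last obtained; None means first use,
    in which case the whole inbox is traversed.
    """
    selected = []
    for mail in inbox:
        if since is not None and mail.received <= since:
            continue  # already handled in an earlier crawl
        title = mail.subject.lower()
        if any(keyword.lower() in title for keyword in keywords):
            selected.append(mail)
    return selected


def extract_fields(mail: Mail, preset_fields: List[str]) -> Dict[str, str]:
    """S704: keep only the preset fields and drop everything else."""
    return {name: mail.fields[name] for name in preset_fields if name in mail.fields}
```

A caller would first run `select_mails(inbox, ["bill"], since=last_fetch_time)` and then apply `extract_fields(mail, ["name", "amount"])` to each selected email, so that fields such as the payee never leave the user terminal.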
In one embodiment, referring to FIG. 8, FIG. 8 is a flowchart of step S210 in the embodiment shown in FIG. 2. Step S210, that is, the step of sending the crawled data to the server, may include:
S802: Encrypt the crawled data.
Specifically, because the crawled data involves the user's private information, it needs to be encrypted for transmission. Either a symmetric or an asymmetric encryption method may be used, which is not limited here. After the user terminal has crawled the data, it encrypts the data and then sends it to the server. On receiving the data, the server performs the corresponding decryption to obtain the crawled data.
S804: Package the encrypted data.
Specifically, to reduce the amount of data transmitted, the crawled data can be packaged and the packaged data sent to the server, which reduces the user's data traffic.
S806: Send the packaged data to the server.
Specifically, once the data has been packaged, it is sent to the server. At this point the user terminal can detect its current network environment. When the network is a Wi-Fi network, the packaged data is sent to the server. When the network is a mobile network, the packaged data is held back until the user terminal's network switches to a Wi-Fi network, and is then sent to the server, which reduces the user's data traffic.
In the above embodiment, when the crawled data is sent, the data is first encrypted and the encrypted data is then packaged. This both ensures the security of the data during transmission and reduces the amount of data transmitted.
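The order of operations in S802 to S806, including the deferral on mobile networks, can be sketched as follows. This is an illustration under stated assumptions. The text leaves the cipher open, so `encrypt` is passed in as a callable. The text also does not define the packaging format, so "packaging" is modeled here as bundling the encrypted records into one length-prefixed payload. The `Uploader` class and the `"wifi"` and `"mobile"` labels are hypothetical names.

```python
import json
import struct
from typing import Callable, Dict, List


def pack(records: List[bytes]) -> bytes:
    """S804: bundle records into one payload, each prefixed by a 4-byte length."""
    return b"".join(struct.pack(">I", len(r)) + r for r in records)


def unpack(payload: bytes) -> List[bytes]:
    """Server-side inverse of pack()."""
    records, offset = [], 0
    while offset < len(payload):
        (length,) = struct.unpack_from(">I", payload, offset)
        offset += 4
        records.append(payload[offset:offset + length])
        offset += length
    return records


class Uploader:
    def __init__(self, encrypt: Callable[[bytes], bytes],
                 send: Callable[[bytes], None]) -> None:
        self._encrypt = encrypt   # S802: supplied by the caller (assumption)
        self._send = send         # S806: e.g. an HTTPS upload (assumption)
        self._pending: List[bytes] = []

    def submit(self, records: List[Dict[str, str]], network_type: str) -> int:
        """Encrypt each record, pack them, then send if the network allows."""
        encrypted = [self._encrypt(json.dumps(r, sort_keys=True).encode("utf-8"))
                     for r in records]
        self._pending.append(pack(encrypted))
        return self.flush(network_type)

    def flush(self, network_type: str) -> int:
        """Send queued payloads only on Wi-Fi; return how many were sent."""
        if network_type != "wifi":
            return 0  # hold the payload until the terminal is back on Wi-Fi
        sent = 0
        while self._pending:
            self._send(self._pending.pop(0))
            sent += 1
        return sent
```

A caller would invoke `flush("wifi")` again when the terminal reports a network change. Any real deployment would pass a vetted cipher as `encrypt`; nothing in this sketch provides confidentiality on its own.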
Referring to FIG. 9, FIG. 9 is a schematic structural diagram of a web page data crawling device in an embodiment. The web page data crawling device includes:
A login module 100, configured to receive, through the login interface of the website to be crawled embedded in the client, an input account and password corresponding to the website to be crawled, and to log in to the website to be crawled with that account and password.
A detection module 200, configured to detect whether login to the website to be crawled has succeeded.
A verification module 300, configured to judge, when login to the website to be crawled has succeeded, whether the account of the client matches the account of the website to be crawled.
A crawling module 400, configured to crawl the data to be crawled from the website to be crawled when the account of the client matches the account of the website to be crawled.
A sending module 500, configured to send the crawled data to the server.
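How modules 100 to 500 hand off to one another can be sketched as follows. This is only an illustration of the control flow: each module is modeled as an injected callable, the class name `CrawlDevice` and the returned status strings are hypothetical, and stopping without further action on a login failure or account mismatch is an assumption, since the handling of those cases is not described here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CrawlDevice:
    login: Callable[[str, str], str]             # module 100: returns the URL shown after the attempt
    login_ok: Callable[[str], bool]              # module 200: login detection
    accounts_match: Callable[[str, str], bool]   # module 300: client account vs. site account
    crawl: Callable[[], List[dict]]              # module 400: crawl the target data
    send: Callable[[List[dict]], None]           # module 500: upload to the server

    def run(self, client_account: str, site_account: str, password: str) -> str:
        page_url = self.login(site_account, password)
        if not self.login_ok(page_url):
            return "login_failed"
        if not self.accounts_match(client_account, site_account):
            return "account_mismatch"   # assumption: stop without crawling
        self.send(self.crawl())
        return "sent"
```

Keeping each module behind a plain callable mirrors the functional division of the device and makes it straightforward to substitute, for example, a different login check without touching the other modules.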
In one embodiment, the sending module may further be configured to send a crawling script acquisition request to the server.
The crawling module may include:
A receiving unit, configured to receive the crawling script returned by the server in response to the crawling script acquisition request.
A crawling unit, configured to crawl the data to be crawled from the website to be crawled by means of the crawling script.
In one embodiment, the web page data crawling device may further include:
A time acquisition module, configured to acquire the time at which a crawling script returned by the server was last received.
A comparison module, configured to: when the difference between the time at which a crawling script returned by the server was last received and the current time is within a preset range, crawl the data to be crawled from the website to be crawled by means of the last received crawling script; and when that difference is not within the preset range, execute the step of sending a crawling script acquisition request to the server.
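The behavior of the time acquisition module and the comparison module amounts to a time-bounded cache of the crawling script, which can be sketched as follows. The class name `ScriptCache`, the `fetch_script` callable and the `max_age_seconds` parameter are hypothetical; the text says only that the difference must be "within a preset range".

```python
import time
from typing import Callable, Optional


class ScriptCache:
    def __init__(self, fetch_script: Callable[[], str], max_age_seconds: float,
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch_script        # sends the acquisition request to the server
        self._max_age = max_age_seconds   # the "preset range" (assumed to be a duration)
        self._clock = clock               # injectable so the logic can be tested
        self._script: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        """Reuse the last script while it is fresh; otherwise request a new one."""
        now = self._clock()
        if self._script is not None and now - self._fetched_at <= self._max_age:
            return self._script
        self._script = self._fetch()
        self._fetched_at = now
        return self._script
```

Reusing a recent script avoids a round trip to the server on every crawl, while the age limit bounds how long the client keeps running a script that the server may have updated since.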
In one embodiment, the detection module may further be configured to detect whether the URL of the current page displayed by the client has changed; if the URL of the current page displayed by the client has changed, login to the website to be crawled has succeeded; if the URL of the current page displayed by the client has not changed, login to the website to be crawled has failed.
In one embodiment, the website to be crawled is a mailbox website. The crawling module may further be configured to select, from the mailbox website, the emails whose titles correspond to the data to be crawled, and to crawl the data in preset fields from the selected emails as the crawled data.
In one embodiment, the sending module may include:
An encryption unit, configured to encrypt the crawled data.
A packaging unit, configured to package the encrypted data.
A sending unit, configured to send the packaged data to the server.
The modules and units involved in the web page data crawling device may be program segments divided by function. For the limitations on the web page data crawling device, refer to the limitations on the web page data crawling method above; they are not repeated here.
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of a user terminal in an embodiment. The user terminal may be a conventional server or any other computer device, and includes a memory, a processor, and a computer program stored in the memory and executable on the processor. The memory may include a non-volatile storage medium and an internal memory, and the computer program may be stored in the non-volatile storage medium. When executing the program, the processor implements the following steps: receiving, through the login interface of the website to be crawled embedded in the client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with that account and password; detecting whether login to the website to be crawled has succeeded; when login has succeeded, judging whether the account of the client matches the account of the website to be crawled; when the account of the client matches the account of the website to be crawled, crawling the data to be crawled from the website to be crawled; and sending the crawled data to the server.
In one embodiment, when executing the program the processor may further implement the following steps: sending a crawling script acquisition request to the server; receiving the crawling script returned by the server in response to the crawling script acquisition request; and crawling the data to be crawled from the website to be crawled by means of the crawling script.
In one embodiment, when executing the program the processor may further implement the following steps: acquiring the time at which a crawling script returned by the server was last received; when the difference between that time and the current time is within a preset range, crawling the data to be crawled from the website to be crawled by means of the last received crawling script; and when the difference is not within the preset range, executing the step of sending a crawling script acquisition request to the server.
In one embodiment, when executing the program the processor may further implement the following steps, in which the step of detecting whether login to the website to be crawled has succeeded includes: detecting whether the URL of the current page displayed by the client has changed; if the URL of the current page displayed by the client has changed, login to the website to be crawled has succeeded; if the URL of the current page displayed by the client has not changed, login to the website to be crawled has failed.
In one embodiment, the website to be crawled is a mailbox website. When executing the program the processor may further implement the following steps: selecting, from the mailbox website, the emails whose titles correspond to the data to be crawled; and crawling the data in preset fields from the selected emails as the crawled data.
In one embodiment, when executing the program the processor may further implement the following steps: encrypting the crawled data; packaging the encrypted data; and sending the packaged data to the server.
For the limitations on the user terminal, refer to the specific limitations on the web page data crawling method above; they are not repeated here.
Still referring to FIG. 8, a computer-readable storage medium is also provided on which a computer program is stored, such as the non-volatile storage medium shown in FIG. 8. When executed by a processor, the program implements the following steps: receiving, through the login interface of the website to be crawled embedded in the client, an input account and password corresponding to the website to be crawled, and logging in to the website to be crawled with that account and password; detecting whether login to the website to be crawled has succeeded; when login has succeeded, judging whether the account of the client matches the account of the website to be crawled; when the account of the client matches the account of the website to be crawled, crawling the data to be crawled from the website to be crawled; and sending the crawled data to the server.
In one embodiment, when executed by the processor the program may further implement the following steps: sending a crawling script acquisition request to the server; receiving the crawling script returned by the server in response to the crawling script acquisition request; and crawling the data to be crawled from the website to be crawled by means of the crawling script.
In one embodiment, when executed by the processor the program may further implement the following steps: acquiring the time at which a crawling script returned by the server was last received; when the difference between that time and the current time is within a preset range, crawling the data to be crawled from the website to be crawled by means of the last received crawling script; and when the difference is not within the preset range, executing the step of sending a crawling script acquisition request to the server.
In one embodiment, when executed by the processor the program may further implement the following steps, in which the step of detecting whether login to the website to be crawled has succeeded includes: detecting whether the URL of the current page displayed by the client has changed; if the URL of the current page displayed by the client has changed, login to the website to be crawled has succeeded; if the URL of the current page displayed by the client has not changed, login to the website to be crawled has failed.
In one embodiment, the website to be crawled is a mailbox website. When executed by the processor the program may further implement the following steps: selecting, from the mailbox website, the emails whose titles correspond to the data to be crawled; and crawling the data in preset fields from the selected emails as the crawled data.
In one embodiment, when executed by the processor the program may further implement the following steps: encrypting the crawled data; packaging the encrypted data; and sending the packaged data to the server.
For the limitations on the computer-readable storage medium, refer to the specific limitations on the web page data crawling method above; they are not repeated here.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or the like.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not every possible combination of these technical features is described; however, as long as a combination of these technical features involves no contradiction, it should be regarded as falling within the scope of this specification.
The above embodiments express only several implementations of the present invention. Their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.
Claims (10)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710619263.4A CN107689951A (en) | 2017-07-26 | 2017-07-26 | Web data crawling method, device, user terminal and readable storage medium storing program for executing |
PCT/CN2017/103932 WO2019019344A1 (en) | 2017-07-26 | 2017-09-28 | Webpage data crawling method and device, user terminal, and readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107689951A true CN107689951A (en) | 2018-02-13 |
Family ID: 61153095
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN201710619263.4A | Pending | CN107689951A (en) | 2017-07-26 | 2017-07-26 | Web data crawling method, device, user terminal and readable storage medium storing program for executing |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107689951A (en) |
WO (1) | WO2019019344A1 (en) |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109670100A (en) * | 2018-12-21 | 2019-04-23 | 第四范式(北京)技术有限公司 | A kind of page data grasping means and device |
CN109948020A (en) * | 2019-01-14 | 2019-06-28 | 北京三快在线科技有限公司 | Data capture method, device, system and readable storage medium storing program for executing |
CN110162682A (en) * | 2019-04-12 | 2019-08-23 | 深圳壹账通智能科技有限公司 | A kind of crawling method of network data, device, storage medium and terminal device |
CN110390043A (en) * | 2019-06-17 | 2019-10-29 | 深圳壹账通智能科技有限公司 | Method, device, terminal and storage medium for crawling web mailbox data |
CN110400080A (en) * | 2019-07-26 | 2019-11-01 | 浙江大搜车软件技术有限公司 | Examination data monitoring method, device, computer equipment and storage medium |
CN110677423A (en) * | 2019-09-30 | 2020-01-10 | 深圳前海环融联易信息科技服务有限公司 | Data acquisition method and device based on client agent side and computer equipment |
CN110691091A (en) * | 2019-09-30 | 2020-01-14 | 深圳前海环融联易信息科技服务有限公司 | Data acquisition method and device based on identity authentication and computer equipment |
CN110968755A (en) * | 2018-09-29 | 2020-04-07 | 北京国双科技有限公司 | Method and device for crawling data |
CN112989159A (en) * | 2019-12-16 | 2021-06-18 | 浙江大搜车软件技术有限公司 | Data acquisition method and device, computer equipment and storage medium |
CN114780822A (en) * | 2022-06-20 | 2022-07-22 | 云账户技术(天津)有限公司 | Method and device for crawling application program data, electronic equipment and storage medium |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113254744A (en) * | 2021-04-24 | 2021-08-13 | 中电长城网际系统应用广东有限公司 | Method for acquiring data information of security equipment by using web crawler technology |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102761843A (en) * | 2012-08-10 | 2012-10-31 | 上海洲信信息技术有限公司 | System and method for mobile terminal user to obtain mails and based on full-text search and WAPPUSH |
CN103365893A (en) * | 2012-03-31 | 2013-10-23 | 百度在线网络技术(北京)有限公司 | Method and device for searching individual information of user |
US20150295942A1 (en) * | 2012-12-26 | 2015-10-15 | Sinan TAO | Method and server for performing cloud detection for malicious information |
CN106021257A (en) * | 2015-12-31 | 2016-10-12 | 广州华多网络科技有限公司 | Method, device, and system for crawler to capture data supporting online programming |
CN106341313A (en) * | 2016-09-29 | 2017-01-18 | 北京小米移动软件有限公司 | Method and apparatus for obtaining billing information |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7913312B2 (en) * | 2002-09-13 | 2011-03-22 | Oracle America, Inc. | Embedded content requests in a rights locker system for digital content access control |
US9332035B2 (en) * | 2013-10-10 | 2016-05-03 | The Nielsen Company (Us), Llc | Methods and apparatus to measure exposure to streaming media |
CN106886547A (en) * | 2016-07-13 | 2017-06-23 | 阿里巴巴集团控股有限公司 | A kind of scenario generation method and device |
Also Published As
Publication number | Publication date |
---|---|
WO2019019344A1 (en) | 2019-01-31 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
TA01 | Transfer of patent application right | Effective date of registration: 20180528. Address after: 518000 Room 201, building A, 1 front Bay Road, Shenzhen Qianhai cooperation zone, Shenzhen, Guangdong. Applicant after: Shenzhen one ledger Intelligent Technology Co., Ltd. Address before: 200000 Xuhui District, Shanghai Kai Bin Road 166, 9, 10 level. Applicant before: Shanghai Financial Technologies Ltd |
REG | Reference to a national code | Ref country code: HK. Ref legal event code: DE. Ref document number: 1250858. Country of ref document: HK |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180213 |
REG | Reference to a national code | Ref country code: HK. Ref legal event code: WD. Ref document number: 1250858. Country of ref document: HK |