CN108229810A - Industry analysis system and method based on network information resource - Google Patents
- Publication number
- CN108229810A (CN201711475066.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- industry
- module
- network information
- analysis
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0637—Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Human Resources & Organizations (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Strategic Management (AREA)
- Educational Administration (AREA)
- Tourism & Hospitality (AREA)
- Health & Medical Sciences (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Marketing (AREA)
- Computational Linguistics (AREA)
- Development Economics (AREA)
- General Business, Economics & Management (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Primary Health Care (AREA)
- Data Mining & Analysis (AREA)
- Game Theory and Decision Science (AREA)
- Operations Research (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to the field of information analysis and proposes an industry analysis system based on network information resources, aiming to solve the problem that industry information analysis consumes large amounts of manpower and material resources and cannot be performed in real time. The system includes a data collection module, a data preprocessing module, a data analysis module and a front-end interaction module. The data collection module is configured to collect industry-related network information. The data preprocessing module is configured to structure the network information, fuse it with platform data, and build an industrial structure tree. The data analysis module is configured to analyze the platform data using natural language processing techniques and data mining algorithms, and to extract the data related to the keywords as interaction data. The front-end interaction module is configured to interact with the user terminal through the interaction data. The invention mines valuable data from massive network information and presents industry analysis results to users in real time.
Description
Technical field
The invention relates to the application of computer network information, specifically to data mining applied to network information resources, and in particular to an industry analysis system and method based on network information resources.
Background
With the rapid development of information technology, information data in every field has grown explosively, which brings great challenges and pressure to people working in these industries. Mining valuable industry information from this massive data, tracking changes in industry information in real time, understanding the upstream and downstream division of labor in an industry and the development trends of competitors, and thereby helping industry managers and decision-makers respond quickly and effectively to market changes, is of significant practical value.
Industry analysis is the result of systematically integrating and analyzing industry information. It is an important reference for enterprises seeking to discover business opportunities, follow the market, and assess investment risk. It is usually carried out by an enterprise's own staff or by a professional market research company, which collects relevant data and combines it with relevant industry experience to produce an industry analysis report. Because such a report can only be compiled after research, it consumes large amounts of manpower and material resources and cannot be produced in real time, which contrasts sharply with the rapidly changing information age.
Summary of the invention
To solve the above problem in the prior art, namely that industry analysis reports must be compiled after research, consume large amounts of manpower and material resources, and cannot be produced in real time, the present invention adopts the following technical solutions:
In a first aspect, the present application provides an industry analysis system based on network information resources. The system includes a data collection module, a data preprocessing module, a data analysis module and a front-end interaction module. The data collection module is configured to collect network information related to the industry the user follows. The data preprocessing module is configured to structure the network information, fuse it with preset platform data, and build a domain knowledge tree of the industrial structure together with the association relationships between the nodes of that tree. The data analysis module is configured to analyze the platform data and the domain knowledge tree using natural language processing methods and data mining algorithms, and to extract data related to the industry as interaction data. The front-end interaction module is configured to interact with the user terminal through the interaction data.
In some examples, the data collection module includes a vertical web crawler and an academic web crawler. The vertical web crawler is configured to crawl web page information from industry vertical websites, starting from preset first initial seed nodes and analyzing uniform resource locators. The academic web crawler is configured to crawl academic articles from academic websites, starting from preset second initial seed nodes.
In some examples, the data preprocessing module includes a data structuring submodule, a platform data submodule, a domain term extraction submodule and a domain knowledge tree submodule. The data structuring submodule is configured to perform structured analysis of the vertical web page information collected by the vertical web crawler. The platform data submodule is configured to store platform user data and the collected network information data, and to supply data to the data analysis module. The domain term extraction submodule is configured to extract domain-related terms from the academic articles crawled by the academic web crawler. The domain knowledge tree submodule is configured to combine the knowledge of domain experts with the extracted domain terms, organize those terms into a structure, build the domain knowledge tree of the industrial structure, and analyze the industrial association relationships between the nodes of the domain knowledge tree.
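As an illustration of the domain knowledge tree and the association relationships between its nodes, the following is a minimal sketch and not the structure used by the patent. It assumes each node carries a set of domain terms plus named upstream and downstream links. The class `IndustryNode`, the helper `link`, the function `build_robot_tree`, and every node name except the robot industry chain and "system integration" are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class IndustryNode:
    """One node of a (hypothetical) industrial-structure knowledge tree."""
    name: str
    terms: Set[str] = field(default_factory=set)        # domain terms attached to the node
    children: List["IndustryNode"] = field(default_factory=list)
    upstream: Set[str] = field(default_factory=set)     # names of supplier nodes
    downstream: Set[str] = field(default_factory=set)   # names of customer nodes

    def add_child(self, child: "IndustryNode") -> "IndustryNode":
        self.children.append(child)
        return child

    def find(self, name: str) -> Optional["IndustryNode"]:
        """Depth-first search for a node by name."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


def link(supplier: IndustryNode, customer: IndustryNode) -> None:
    """Record an upstream/downstream association between two nodes."""
    supplier.downstream.add(customer.name)
    customer.upstream.add(supplier.name)


def build_robot_tree() -> IndustryNode:
    """Build a tiny example tree; the node names are illustrative assumptions."""
    root = IndustryNode("robot industry chain")
    parts = root.add_child(IndustryNode("core components", {"servo", "reducer"}))
    body = root.add_child(IndustryNode("robot body", {"manipulator"}))
    integ = root.add_child(IndustryNode("system integration", {"production line"}))
    link(parts, body)
    link(body, integ)
    return root


if __name__ == "__main__":
    tree = build_robot_tree()
    node = tree.find("system integration")
    print(node.name, "upstream:", sorted(node.upstream))
```

Storing association links by node name keeps the tree itself a plain hierarchy while still allowing upstream and downstream lookups for any node.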
In some examples, the domain term extraction submodule is further configured to analyze the academic articles obtained by the academic web crawler, use text analysis methods to analyze word frequencies in article titles, keywords and abstracts, and extract domain-specific terms.
In some examples, the data analysis module includes an entity recognition submodule and a data mining submodule. The entity recognition submodule is configured to build entity recognition features through word segmentation, part-of-speech tagging and syntactic analysis, and to combine conditional random fields with rule-based methods to recognize the region entities, organization name entities and domain term entities contained in the platform data. The data mining submodule is configured to use supervised machine learning algorithms to associate the recognized entities with the domain knowledge tree, and to statistically analyze the associations among news data, company data and the domain knowledge tree, thereby analyzing how network information data is distributed across regions and across the nodes of the industry chain and how that distribution changes over time. It is further configured to infer the industry nodes a user follows from the user's operation data on the platform, and to use a content-based recommendation algorithm to recommend personalized news, companies and products to the user.
In some examples, the front-end interaction module includes a visualization submodule and a map submodule. The visualization submodule is configured to present the result data produced by the data analysis module to the user through a combination of the domain knowledge tree, maps, line charts, bar charts and lists. The map submodule is configured to present the user with a regional map of the selected region.
In a second aspect, the present application provides an industry analysis method based on network information resources. The method includes: collecting network information related to the industry the user follows; structuring the network information, fusing it with preset platform data, and building an industrial structure tree; analyzing the platform data using natural language processing techniques and data mining algorithms, and extracting data related to the industry as interaction data; and interacting with the user terminal through the interaction data.
In some examples, the industry-related network information includes web page information and academic articles, and collecting the network information related to the industry the user follows includes: starting from preset first initial seed nodes, using a vertical web crawler to crawl web page information from industry vertical websites by analyzing the uniform resource locators contained in the first initial seed nodes; and, starting from preset second initial seed nodes, using an academic web crawler to crawl academic articles from academic websites.
In some examples, structuring the network information, fusing it with preset platform data, and building the domain knowledge tree of the industrial structure includes: performing structured analysis of the vertical web page information collected by the vertical web crawler; extracting domain-related terms from the academic articles crawled by the academic web crawler; and combining the knowledge of domain experts with the extracted domain terms and key technologies, organizing them into a structure, building the industrial structure tree, and analyzing the industrial association relationships between the nodes of the tree.
In some examples, extracting domain-related terms from the academic articles crawled by the academic web crawler includes: analyzing the academic articles obtained by the academic web crawler, using text analysis algorithms to analyze word frequencies in article titles, keywords and abstracts, and extracting domain-specific terms.
In some examples, analyzing the platform data using natural language processing methods and data mining algorithms and extracting data related to the industry as interaction data includes: building entity recognition features through word segmentation, part-of-speech tagging and syntactic analysis, and combining conditional random fields with rule-based methods to recognize the region entities, organization name entities and domain term entities contained in the platform data; using supervised machine learning algorithms to associate the recognized entities with the domain knowledge tree, and statistically analyzing the associations among news data, company data and the domain knowledge tree, thereby analyzing how network information data is distributed across regions and across the nodes of the industry chain and how that distribution changes over time; and inferring the industry nodes a user follows from the user's data on the platform, and using a content-based recommendation algorithm to recommend personalized news, companies and products to the user.
In some examples, interacting with the user terminal through the interaction data includes: presenting the interaction data to the user through a combination of the domain knowledge tree, maps, line charts, bar charts and lists; and presenting the user with a map of the selected region.
In the industry analysis system and method based on network information resources provided by the present application, the data collection module collects information related to the user's industry, the data preprocessing module structures that information and builds the domain knowledge tree of the industry, the data analysis module analyzes and mines the preprocessed information to obtain analysis results for the industry, and the front-end interaction module interacts with the user. The system thereby mines valuable industry information from massive data, tracks changes in industry information in real time, reveals the upstream and downstream division of labor and competitor information, and helps industry managers and decision-makers respond quickly and effectively to market changes.
Description of drawings
FIG. 1 is a schematic structural diagram of an embodiment of the industry analysis system based on network information resources according to the present application;
FIG. 2 is a basic framework diagram of the process by which the vertical web crawler crawls web page information in an embodiment of the present application;
FIG. 3 is a schematic diagram of an exemplary application: the robot industry chain knowledge tree built by the domain knowledge tree submodule in an embodiment of the present application;
FIG. 4a is a schematic diagram of the upstream and downstream relationships of an industry node built in an industry chain;
FIG. 4b is a schematic diagram of the upstream and downstream relationships of the system integration industry node in the robot industry chain;
FIG. 5 is a schematic diagram of example results of word segmentation, part-of-speech tagging and syntactic analysis performed with a text analysis algorithm in an embodiment of the present application;
FIG. 6 is a schematic diagram of an embodiment of the industry analysis method based on network information resources of the present application.
Detailed description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. Those skilled in the art should understand that these embodiments only explain the technical principles of the present invention and are not intended to limit its scope of protection.
It should be noted that, where no conflict arises, the embodiments of the present application and the features of those embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
The industry analysis system based on network information resources of the present application may include a data collection module, a data preprocessing module, a data analysis module and a front-end interaction module. The data collection module is configured to collect network information related to the industry the user follows. The data preprocessing module is configured to structure the network information, fuse it with preset platform data, and build the domain knowledge tree of the industrial structure. The data analysis module is configured to analyze the platform data and the domain knowledge tree using natural language processing methods and data mining algorithms, and to extract data related to the keywords as interaction data. The front-end interaction module is configured to interact with the user terminal through the interaction data.
In this embodiment, the data collection module collects industry-related network information according to keywords or key information provided by the user. It may collect information related to the industry the user follows, such as information on enterprises in the same industry and on upstream and downstream enterprises. It may also collect information related to the development of that industry, such as its technological and academic frontiers.
The data preprocessing module preprocesses the network information. The preprocessing may structure the network information related to the industry the user follows, extracting classified information such as the companies in the industry, their products, their geographic distribution and their purchase requests. It may also build classified frontier information on the development of the industry, from which information such as industry development trends and technology development monitoring can be obtained.
The data analysis module analyzes and mines the structured data and classified information built by the data preprocessing module and, combined with the user's operation information on the platform, recommends products, companies, news and other information that may interest the user.
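The recommendation step can be illustrated with a minimal content-based sketch. It assumes, purely for illustration, that each viewed or candidate item is already reduced to a list of domain terms, and that similarity is measured by cosine similarity over term counts. The patent names a content-based recommendation algorithm but does not specify this representation or measure, and the function names `cosine` and `recommend` and the sample items are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(viewed: List[List[str]],
              candidates: Dict[str, List[str]],
              k: int = 2) -> List[Tuple[str, float]]:
    """Rank candidate items by similarity to a profile built from viewed items."""
    # The user profile is the pooled term counts of everything the user viewed.
    profile = Counter(term for item in viewed for term in item)
    scored = [(name, cosine(profile, Counter(terms)))
              for name, terms in candidates.items()]
    # Highest score first; ties are broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


if __name__ == "__main__":
    viewed_items = [["servo", "motor"], ["servo", "reducer"]]
    candidate_items = {
        "news_servo_prices": ["servo", "price"],
        "company_vision_co": ["vision", "camera"],
        "product_reducer_kit": ["reducer", "motor"],
    }
    for name, score in recommend(viewed_items, candidate_items):
        print(f"{name}: {score:.3f}")
```

In a fuller system the term lists would presumably come from the entity recognition step and the industry nodes inferred for the user; here they are hard-coded so the ranking logic can be read on its own.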
The front-end interaction module may interact with the user through an interactive interface that uses various charts to show information such as industry trends, the geographic distribution of the industry, upstream and downstream analysis, competitors and potential buyers. The user can obtain industry information intuitively, at any time and from anywhere.
The system provided by the above embodiment of the present application analyzes and mines information related to the industry the user follows and presents the resulting industry information to the user.
FIG. 1 shows an exemplary system structure of a specific embodiment of the industry analysis system based on network information resources to which the present application can be applied.
Specifically, as shown in FIG. 1, the data collection module, data preprocessing module, data analysis module and front-end interaction module of the industry analysis system implement data collection, data preprocessing, data analysis and front-end interaction, respectively.
The data collection module includes a vertical web crawler 101 and an academic web crawler 102. The vertical web crawler 101 is configured to crawl web page information from industry vertical websites, starting from preset first initial seed nodes and analyzing uniform resource locators (Uniform Resource Locator, URL). Specifically, representative websites may be selected according to the industry and used as the first initial seed nodes of the vertical web crawler. The vertical web crawler 101 analyzes the URLs of these websites to crawl web page information related to the industry the user follows. This web page information includes news about enterprises in the industry, information on institutions in the industry, products, purchase requests and so on.
The academic web crawler 102 is configured to crawl academic articles from academic websites, starting from preset second initial seed nodes. Here, academic conferences and academic journals may serve as the second initial seed nodes of the academic web crawler. Starting from these second initial seed nodes, the academic web crawler 102 crawls relevant academic articles from academic websites, academic journals or academic conferences to obtain information on the development frontier of the industry.
The vertical web crawler 101 and the academic web crawler 102 may run on a fixed schedule chosen according to actual needs, providing data for real-time industry analysis. For example, because industry information is updated frequently, the vertical web crawler 101 may run once every hour so that the information obtained is as close to real time as possible. Information in the knowledge or academic field is updated less often, so the academic web crawler 102 may run once a day or once a month.
As an example, FIG. 2 shows the process by which the vertical web crawler 101 crawls industry-related data.
Step 2.1: Select representative websites according to the industry the user follows and use them as the seed URLs of module 201, which stores the initial seeds of the web crawler;
Step 2.2: Push the seed URLs of module 201 into the to-be-crawled URL queue of module 202;
Step 2.3: Module 203 reads a URL from the to-be-crawled URL queue and filters it with the URL filter 204; specifically, the URL that was read may be parsed so that only URLs of news-related and company-related web pages are kept;
Step 2.4: The downloader of module 205 fetches the web page at each filtered URL from the vertical website, and module 206 saves the page content;
Step 2.5: Save the web page from module 206 into the web page database 209 and, at the same time, push the URL of each successfully crawled page into the crawled URL queue 208 of module 207;
Step 2.6: Use module 206 to extract URLs from the web page, filter out those that have already been crawled, and push the uncrawled URLs into module 202;
Step 2.7: Check whether the queue of module 202 still contains uncrawled web page URLs; if so, return to step 2.3; otherwise, the crawl of industry-related data ends.
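The crawl loop of steps 2.1 to 2.7 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the patented implementation: the downloader is passed in as a `fetch` function so the sketch runs without network access, links are extracted with a simple regular expression, and the URL filter keeps any URL containing the substring "news" or "company". The names `crawl`, `keep_url`, `fake_fetch`, `FAKE_SITE` and the `example.test` URLs are all hypothetical.

```python
import re
from collections import deque
from typing import Callable, Dict, Iterable, List, Set

LINK_RE = re.compile(r'href="([^"]+)"')


def keep_url(url: str, keywords: Iterable[str] = ("news", "company")) -> bool:
    """URL filter (204): keep only URLs that look news- or company-related."""
    return any(word in url for word in keywords)


def crawl(seeds: List[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """Crawl from the seed URLs and return {url: page content}."""
    to_crawl = deque(seeds)       # steps 2.1-2.2: to-be-crawled URL queue (202)
    crawled: Set[str] = set()     # crawled URL queue (208)
    pages: Dict[str, str] = {}    # stands in for the web page database (209)

    while to_crawl:               # step 2.7: stop when the queue is empty
        url = to_crawl.popleft()  # step 2.3: read a URL, then filter it
        if url in crawled or not keep_url(url):
            continue
        html = fetch(url)         # step 2.4: download the page
        pages[url] = html         # step 2.5: save page and mark URL as crawled
        crawled.add(url)
        for link in LINK_RE.findall(html):   # step 2.6: queue uncrawled links
            if link not in crawled:
                to_crawl.append(link)
    return pages


# A tiny in-memory "website" so the sketch can be run offline.
FAKE_SITE = {
    "https://example.test/news/index":
        '<a href="https://example.test/news/1">n1</a>'
        '<a href="https://example.test/about">about</a>',
    "https://example.test/news/1":
        '<a href="https://example.test/company/acme">acme</a>'
        '<a href="https://example.test/news/index">back</a>',
    "https://example.test/company/acme": "<p>Acme</p>",
}


def fake_fetch(url: str) -> str:
    return FAKE_SITE.get(url, "")


if __name__ == "__main__":
    for page_url in sorted(crawl(["https://example.test/news/index"], fake_fetch)):
        print(page_url)
```

Because step 2.3 applies the filter to every URL read from the queue, a seed URL must itself pass the filter in this sketch; a real deployment might exempt seeds or filter on page type instead of URL substrings.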
In this embodiment, the data preprocessing module includes a data structuring submodule 103, a platform data submodule 105, a domain term extraction submodule 104 and a domain knowledge tree submodule 106. The data structuring submodule 103 is configured to parse and structure the web page content crawled by the vertical web crawler 101 and to merge it with the platform data in the platform data submodule 105, providing the base data for further analysis. For the news web pages crawled by the vertical web crawler 101, the data structuring submodule 103 uses a web page parsing tool, such as BeautifulSoup or lxml, to extract the news title, publication time, page content and so on. As an example, Table 1 shows the news data table obtained by parsing the news web pages crawled by the vertical web crawler 101 with a web page parsing tool.
Table 1: News data table
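The news-page structuring described above can be sketched as follows. The text names BeautifulSoup and lxml as possible parsing tools; to stay self-contained, this sketch instead uses Python's standard-library `html.parser`, which is a substitution and not the tool named in the patent. It also assumes, hypothetically, that the title sits in `<h1>`, the publication time in `<time>` and the body in `<p>` elements; real pages would need site-specific rules. The names `NewsParser`, `parse_news` and `SAMPLE_PAGE` are illustrative.

```python
from html.parser import HTMLParser
from typing import Dict, Optional


class NewsParser(HTMLParser):
    """Collect the text of <h1>, <time> and <p> into a flat news record."""

    FIELDS = {"h1": "title", "time": "published", "p": "content"}

    def __init__(self) -> None:
        super().__init__()
        self.record: Dict[str, str] = {"title": "", "published": "", "content": ""}
        self._field: Optional[str] = None   # field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        self._field = self.FIELDS.get(tag)

    def handle_endtag(self, tag):
        self._field = None

    def handle_data(self, data):
        text = data.strip()
        if self._field and text:
            # Join successive chunks (e.g. several <p> elements) with a space.
            separator = " " if self.record[self._field] else ""
            self.record[self._field] += separator + text


def parse_news(html: str) -> Dict[str, str]:
    """Return {'title', 'published', 'content'} extracted from one page."""
    parser = NewsParser()
    parser.feed(html)
    parser.close()
    return parser.record


SAMPLE_PAGE = (
    "<html><body><h1>Robot maker opens new plant</h1>"
    "<time>2017-12-01</time>"
    "<p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)

if __name__ == "__main__":
    print(parse_news(SAMPLE_PAGE))
```

Text inside inline tags nested in a paragraph (for example `<b>`) is not captured by this simple parser, which is one reason a dedicated library is usually preferred in practice.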
上述垂直网络爬虫101获取的企业机构信息中包括与企业相关的公司的信息,获取公司网页内容,利用网页解析工具解析该公司网页内容,提取出上述公司网页内容所指示的公司的名称、地址、产品、求购和公司介绍等信息。作为示例,参考表2,表2示出了上述利用网页解析工具对上述垂直网络爬虫101所爬取的公司网页进行解析得到的公司数据表。The corporate organization information obtained by the above vertical web crawler 101 includes the information of companies related to the enterprise, obtains the content of the company's web page, uses a web page analysis tool to analyze the content of the company's web page, and extracts the company's name, address, Products, buy and company introduction and other information. As an example, refer to Table 2, which shows the company data table obtained by analyzing the company web pages crawled by the vertical web crawler 101 using the web page analysis tool.
Table 2: Company data table
The domain term extraction submodule 104 analyzes the academic articles obtained by the academic web crawler 102 with a web page parsing tool and extracts their domain terms. Because the title, keywords, and abstract of an academic article distill its core content, the analysis can start with the title, keywords, and abstract and then proceed to the body of the article as needed. The domain term extraction submodule 104 embeds several text analysis algorithms and uses them to analyze the academic articles. Specifically, the term frequency-inverse document frequency (TF-IDF) algorithm and latent Dirichlet allocation (LDA) can be used to extract keywords from the article text, and a clustering method can be used to analyze word frequencies in the titles, keywords, and abstracts; words whose occurrence count exceeds a set threshold are extracted as the domain terms of child nodes in the domain knowledge tree.
The term set formed by the domain terms of the child nodes in the domain knowledge tree can be used to analyze the relationship between network information data and the domain knowledge tree. As an example, Table 3 shows the data structure of an academic article; the text analysis algorithms analyze the articles based on the fields shown in that data structure.
Table 3: Academic article data table
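The term extraction described above can be sketched in a few lines of Python. The example below is a minimal illustration, assuming the titles, keywords, and abstracts have already been tokenized into word lists; it computes plain TF-IDF scores and applies the occurrence-count threshold. LDA and clustering are not shown, and the function names `tfidf_terms` and `frequent_terms` are hypothetical.

```python
import math
from collections import Counter


def tfidf_terms(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        if not doc:
            results.append([])
            continue
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


def frequent_terms(docs, min_count):
    """Terms whose total count across all documents exceeds min_count."""
    totals = Counter(term for doc in docs for term in doc)
    return sorted(term for term, n in totals.items() if n > min_count)
```

Note that a term occurring in every document receives a TF-IDF score of zero, so the two functions are complementary: TF-IDF surfaces terms that distinguish one article, while the threshold surfaces terms common to the field.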
The platform data submodule 105 is configured to provide base data and preprocessed data for the data analysis module. It stores several kinds of information, including user operation behavior on the platform, company products, purchase requests, company news, company information, and regional information. User operation behavior refers to actions a user performs on the system platform, such as browsing news, clicking on products, or posting requests; it is recorded to track user behavior and gives the algorithms data for analyzing user interests. Company products can be product information published on the platform by company users, such as product name, product introduction, product functions, and product parameters. Purchase requests can be purchase information posted on the platform by users, such as the name of the product sought, parameters, price range, and region restriction. Company news can be news published on the platform by company users, including the news title, author, and content.
Company information can be the registration information of company users on the system platform, such as company name, registered address, and main business. Regional information can be geographic information on China built into the system platform, including the full names, abbreviations, latitude and longitude coordinates, and areas of provinces and cities; it is used to analyze network information and to locate companies.
The domain knowledge tree submodule 106 is configured to combine expert knowledge with the extracted domain terms to construct the domain knowledge tree of the industrial structure and the association relations between its nodes. The submodule can build the domain knowledge tree of an industry from the data extracted for the industry in which the company user operates. To build the tree, the industry chain nodes are constructed first, namely the upstream, midstream, and downstream nodes of the chain. Next, the child nodes of the upstream, midstream, and downstream nodes are constructed from the web page information gathered by the web crawlers and from expert knowledge. Finally, each of these child nodes is treated as an intermediate node and its own child nodes are constructed in turn, yielding the domain knowledge tree for the industry chain of the company user's industry. As an example, FIG. 3 shows the domain knowledge tree of the robot industry chain built by the domain knowledge tree submodule 106.
The robot industry chain is divided into upstream, midstream, and downstream nodes. The upstream node represents suppliers and includes child nodes such as raw materials and parts. The downstream node represents after-sales service and applications and includes child nodes such as partners, agents, third-party services, and solutions. The midstream node represents the industry's main business and forms the trunk of the domain tree; it includes a robot body node and a robot integration node. The robot integration node has several levels of child nodes: for example, its child nodes include an intelligent robot node, which has an industrial robot child node, which in turn has a handling robot child node.
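The tree construction described above maps naturally onto a small recursive data structure. The following Python sketch is one possible representation, not the patent's implementation: the `TreeNode` class, its `terms` field, and the helper `build_robot_tree` are assumptions, while the node names follow the robot industry chain example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    name: str
    terms: set = field(default_factory=set)       # domain terms attached to this node
    children: list = field(default_factory=list)  # child TreeNode objects

    def add(self, name, terms=()):
        """Create a child node, attach it, and return it."""
        child = TreeNode(name, set(terms))
        self.children.append(child)
        return child

    def path_to(self, name, prefix=()):
        """Return the node names from this node down to `name`, or None."""
        path = [*prefix, self.name]
        if self.name == name:
            return path
        for child in self.children:
            found = child.path_to(name, path)
            if found:
                return found
        return None


def build_robot_tree():
    """Build the example tree: chain nodes first, then their descendants."""
    root = TreeNode("robot industry chain")
    upstream = root.add("upstream")
    upstream.add("raw materials")
    upstream.add("parts")
    midstream = root.add("midstream")
    midstream.add("robot body")
    integration = midstream.add("robot integration")
    intelligent = integration.add("intelligent robot")
    industrial = intelligent.add("industrial robot")
    industrial.add("handling robot")
    downstream = root.add("downstream")
    for name in ("partners", "agents", "third-party services", "solutions"):
        downstream.add(name)
    return root
```

Calling `build_robot_tree().path_to("handling robot")` walks from the root through the midstream trunk to the leaf, which is the kind of lookup a front end would need when a user selects a node.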
FIG. 4 is a schematic diagram of the upstream and downstream industry nodes of the robot industry chain constructed by the domain knowledge tree submodule 106. FIG. 4a shows the designed upstream and downstream relations of each node, and FIG. 4b shows the industry chain nodes of a specific example in the robot industry chain: when the industry node is "system integration", its upstream industry nodes include sensors, controllers, and so on, and its downstream industry nodes include third parties, agents, and so on.
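The upstream and downstream relations of FIG. 4 can be sketched as a pair of adjacency maps built from supplier-to-consumer edges. This is a minimal illustration under the assumption that the relations are stored as directed edges; the function name `build_chain` and the edge list are hypothetical, with node names taken from the "system integration" example.

```python
from collections import defaultdict


def build_chain(edges):
    """Build upstream and downstream lookup maps from (supplier, consumer) edges."""
    upstream, downstream = defaultdict(list), defaultdict(list)
    for supplier, consumer in edges:
        downstream[supplier].append(consumer)   # who this node supplies
        upstream[consumer].append(supplier)     # who supplies this node
    return upstream, downstream


# Example edges for the "system integration" node.
EDGES = [
    ("sensor", "system integration"),
    ("controller", "system integration"),
    ("system integration", "third party"),
    ("system integration", "agent"),
]
```

Storing both directions makes each lookup a single dictionary access, at the cost of keeping the two maps in sync when edges change.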
In this embodiment, the data analysis module includes an entity recognition submodule 107 and a data mining submodule 108. The entity recognition submodule 107 is configured to construct entity recognition features through word segmentation, part-of-speech tagging, and syntactic parsing, and to combine conditional random fields with rule-based methods to recognize the region entities, organization name entities, and domain term entities contained in the platform data. The data mining submodule 108 is configured to use a supervised machine learning algorithm to associate the recognized entities with the domain knowledge tree and to statistically analyze the associations between news data, company data, and the domain knowledge tree, thereby analyzing how network information data is distributed across regions and industry chain nodes and how that distribution changes over time. Based on a user's data on the platform, it infers the industry nodes the user follows and uses a content-based recommendation algorithm to recommend personalized news, companies, and products.
The entity recognition submodule 107 includes six units: word segmentation, part-of-speech tagging, syntactic parsing, region recognition, organization name recognition, and domain term recognition. Word segmentation, part-of-speech tagging, and syntactic parsing are used to construct the entity recognition features. Taking the sentence "机器人是自动执行工作的机器装置" ("a robot is a machine that performs work automatically") as an example, the result of segmentation, tagging, and parsing is shown in FIG. 5, and the entity recognition features extracted from it are shown in Table 4. Conditional random fields (CRF) and rule-based methods are then combined to find and recognize, for each piece of information, the region entities, organization name entities, and domain term entities it contains.
Table 4: Entity recognition features
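To make the feature construction concrete, the following Python sketch builds a per-token feature dictionary of the kind commonly fed to a CRF sequence labeler, plus a simple gazetteer rule for region entities. It is an illustration under stated assumptions: the input is already segmented and tagged, the segmentation and part-of-speech tags shown are illustrative guesses, the feature set is not the one in Table 4, and no CRF model is trained here. The names `token_features` and `tag_regions` are hypothetical.

```python
def token_features(tokens, i):
    """Feature dict for token i of a list of (word, pos_tag) pairs."""
    word, pos = tokens[i]
    feats = {"word": word, "pos": pos, "length": len(word)}
    if i > 0:
        feats["prev_word"], feats["prev_pos"] = tokens[i - 1]
    else:
        feats["BOS"] = True          # beginning of sentence
    if i < len(tokens) - 1:
        feats["next_word"], feats["next_pos"] = tokens[i + 1]
    else:
        feats["EOS"] = True          # end of sentence
    return feats


def tag_regions(tokens, gazetteer):
    """Rule-based pass: label words found in a region gazetteer as B-LOC, others O."""
    return ["B-LOC" if word in gazetteer else "O" for word, _ in tokens]


# Illustrative segmentation of the example sentence; the tags are assumptions.
SENTENCE = [("机器人", "n"), ("是", "v"), ("自动", "d"), ("执行", "v"),
            ("工作", "n"), ("的", "u"), ("机器", "n"), ("装置", "n")]
```

In a combined system, the rule-based labels could be added to each token's feature dictionary or used to override low-confidence model output; which of the two the source intends is not stated.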
The data mining submodule 108 uses a supervised learning algorithm to build the associations between recognized entities and industry nodes and analyzes the associations between news data, company data, and the domain knowledge tree. It compiles statistics on industry trend changes and the regional distribution of the industry and performs upstream and downstream analysis. By analyzing the data of a user or company on the platform, such as published products and purchase requests, it infers the industry nodes they follow and recommends products, companies, and news of interest to them.
In this embodiment, the front-end interaction module includes a visualization submodule and a map submodule. The visualization submodule is configured to present the result data produced by the data analysis module to the user through a combination of a domain knowledge tree 109, a line chart 111, a bar chart 112, and a list 113. The map submodule 110 is configured to present to the user a map of the selected area, which can serve as the regional map for that area.
The visualization submodule that interacts with the user presents the various analysis results through the domain knowledge tree 109, the line chart 111, the bar chart 112, and the list 113.
The domain knowledge tree 109 presents the domain knowledge tree structure of the industry the user follows, from which the user selects the industry node to view.
The map submodule 110 presents the regions of China's provinces and cities; when a province or city is selected, the view automatically switches to the map of that province.
The line chart 111 presents the trend over time of the news popularity of a given industry node in a given region.
The bar chart 112 presents the distribution of news popularity for a given industry node in a given region.
The list 113 presents upstream and downstream companies, competitors, and potential buyers, as well as recommended news and other information, in list form.
In the system provided by the above embodiments of the present application, the data acquisition module extracts information related to the user's industry from massive data, and the data preprocessing module structures the extracted information and builds the domain knowledge tree. The data analysis module analyzes and mines the processed information and, combined with expert knowledge, analyzes industry development trends to provide the user with industry analysis reports. The front-end interaction module exchanges information with the user and provides industry-related information. As a result, users can keep track of real-time changes at each node of the industry, understand the upstream and downstream division of labor and competitor information, and industry management or decision makers are assisted in responding quickly and effectively to market changes.
Referring to FIG. 6, the present application provides an industry analysis method based on network information resources. The method includes the following steps:
Step 601: collect network information related to the industry the user follows.
In this embodiment, the electronic device to which the present application is applied (which may be a server or an application platform) uses web crawlers to obtain industry-related network information from industry-related websites. Here, websites related to the industry the user follows may be the websites of companies in the user's own industry and in its upstream and downstream industries, or technical and academic forums or websites related to the industry. The web crawler may be a vertical web crawler or an academic web crawler. The vertical web crawler collects news, institution, product, and purchase request information from domain-related websites. The academic web crawler fetches relevant academic articles from the websites of domain-related academic conferences and journals. The network information may be the news, institution, product, and purchase request information, or the academic articles.
In some preferred embodiments, the network information related to the industry the user follows includes web page information and academic articles, and collecting it includes: starting from preset first initial seed nodes, using the vertical web crawler to fetch web page information from industry vertical websites by analyzing the uniform resource locators (URLs) of the first initial seed nodes; and starting from preset second initial seed nodes, using the academic web crawler to fetch academic articles from academic websites. Here, the first initial seed nodes are representative websites selected according to the industry and used as the crawler's starting points. The second initial seed nodes can be chosen from academic conferences and academic journals. The crawlers fetch the relevant web page information or academic articles by analyzing URLs.
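The seed-based crawling described above can be sketched as a breadth-first traversal that stays on the hosts of the seed URLs. The Python example below is a minimal sketch under stated assumptions: the page-fetching step is abstracted into a caller-supplied `fetch_links` function (hypothetical), the same-host restriction is one plausible reading of "vertical" crawling, and the URLs under `robots.example` are placeholders.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl restricted to the hosts of the seed URLs.

    fetch_links(url) is supplied by the caller and returns the href values
    found on that page; this sketch performs no network access itself.
    """
    allowed_hosts = {urlparse(seed).netloc for seed in seeds}
    queue, seen, visited = deque(seeds), set(seeds), []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for href in fetch_links(url):
            # Resolve relative links and drop any fragment.
            link = urljoin(url, href).split("#")[0]
            if urlparse(link).netloc in allowed_hosts and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Tiny in-memory "site" standing in for real pages.
PAGES = {
    "http://robots.example/": ["/news/1", "http://other.example/x", "/news/2#top"],
    "http://robots.example/news/1": ["/news/2", "/"],
    "http://robots.example/news/2": [],
}
```

A real crawler would additionally respect robots.txt, rate limits, and retry policies; those concerns are outside the scope of the text and are omitted here.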
Step 602: structure the network information, merge it with the preset platform data, and build the domain knowledge tree of the industrial structure.
In this embodiment, the server or application platform preprocesses the network information and builds the industrial structure tree. Here, the preprocessing can be a structured analysis of the vertical web page information collected by the vertical web crawler. It can also consist of extracting domain-related terms and key technologies from the academic articles fetched by the academic web crawler, organizing the extracted terms and key technologies into a structure with the help of domain expert knowledge, building the industrial structure tree, and analyzing the industrial associations between the tree nodes. Further, extracting industry-related terms and key technology information from the academic articles can be done by using text analysis algorithms to analyze word frequencies in the article titles, keywords, and abstracts and extracting the domain terms. The text analysis algorithms can be TF-IDF, LDA, clustering, or similar algorithms.
Step 603: analyze the platform data and the domain knowledge tree with natural language processing methods and data mining algorithms, and extract industry-related data as interaction data.
In this embodiment, natural language processing methods can be used to recognize region entities, domain term entities, and organization name entities in network information such as news, companies, products, and purchase requests. Data mining algorithms can then use the relations between the recognized domain term entities and the domain knowledge tree nodes to classify news, company, product, and purchase request information by knowledge node, and compile statistics by the region and publication time of that information. Changes in news popularity per knowledge node then provide a way to track industry trends.
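The statistics described above can be illustrated with a short Python sketch that assigns news items to knowledge nodes and counts them per node, region, and month. For simplicity it matches node terms by substring, which is a deliberately simpler stand-in for the entity recognition and supervised association the text describes. The record fields (`title`, `content`, `region`, `published`) and the function names `classify` and `heat_by_month` are assumptions.

```python
from collections import Counter


def classify(text, node_terms):
    """Return the nodes whose term set has at least one term occurring in text."""
    lowered = text.lower()
    return sorted(node for node, terms in node_terms.items()
                  if any(term in lowered for term in terms))


def heat_by_month(news, node_terms):
    """Count news items per (node, region, 'YYYY-MM') key."""
    heat = Counter()
    for item in news:
        month = item["published"][:7]          # 'YYYY-MM-DD' -> 'YYYY-MM'
        text = item["title"] + " " + item["content"]
        for node in classify(text, node_terms):
            heat[(node, item["region"], month)] += 1
    return heat
```

Plotting the counts for one node and region over successive months gives the kind of popularity trend a line chart would display, while summing over months per region gives a regional distribution.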
Further, analyzing the platform data with natural language processing methods and data mining algorithms and extracting industry-related data as interaction data includes: constructing entity recognition features through word segmentation, part-of-speech tagging, and syntactic parsing, and combining conditional random fields with rule-based methods to recognize the region entities, organization name entities, and domain term entities contained in the platform data; using a supervised machine learning algorithm to associate the recognized entities with the domain knowledge tree and statistically analyzing the associations between news data, company data, and the domain knowledge tree, thereby analyzing how network information data is distributed across regions and industry chain nodes and how that distribution changes over time; and, based on the user's data on the platform, inferring the industry nodes the user follows and using a content-based recommendation algorithm to recommend personalized news, companies, and products.
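As a hedged sketch of content-based recommendation, the Python example below builds a term-count profile from the text a user has published on the platform and ranks candidate items by cosine similarity to that profile. The source does not specify the similarity measure or the text representation, so both are assumptions, as are the names `cosine` and `recommend`.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def recommend(user_docs, candidates, top_k=2):
    """Rank candidate items (title -> text) against the user's own documents."""
    profile = Counter(word for doc in user_docs for word in doc.lower().split())
    scored = [(cosine(profile, Counter(text.lower().split())), title)
              for title, text in candidates.items()]
    # Highest similarity first; drop items that share no terms with the profile.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [title for score, title in ranked if score > 0][:top_k]
```

The same function can rank news, companies, or products, since each only needs a text description; weighting terms by TF-IDF instead of raw counts would be a natural refinement.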
Step 604: interact with the user terminal through the interaction data.
In this embodiment, information is exchanged with the user through an interactive application provided by the application platform. Here, the interactive application can be a visual application that displays the analysis results in the form of line charts, bar charts, and lists. Specifically:
A line chart presents the industry trend changes of the selected domain knowledge tree node within the selected region.
A bar chart presents the regional distribution of the selected domain knowledge tree node within the selected region.
Lists present the upstream and downstream enterprises of the selected domain knowledge tree node within the selected region, and recommend companies, products, and news of interest to the user.
The method provided by the above embodiments of the present application can extract useful information from massive data, present real-time changes at each node of the industry to the user, give insight into the upstream and downstream division of labor and into competitors, and assist industry management and decision makers in responding quickly and effectively to market changes.
The technical solutions of the present invention have now been described with reference to the preferred embodiments shown in the accompanying drawings. Those skilled in the art will readily understand, however, that the scope of protection of the present invention is not limited to these specific embodiments. Without departing from the principles of the present invention, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions resulting from such changes or substitutions all fall within the scope of protection of the present invention.
Claims (12)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711475066.6A CN108229810B (en) | 2017-12-29 | 2017-12-29 | Industry analysis system and method based on network information resources |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108229810A (en) | 2018-06-29 |
CN108229810B (en) | 2021-02-05 |
Family
ID=62646986
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711475066.6A Active CN108229810B (en) | 2017-12-29 | 2017-12-29 | Industry analysis system and method based on network information resources |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108229810B (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109255034A (en) * | 2018-08-08 | 2019-01-22 | 数据地平线(广州)科技有限公司 | A kind of domain knowledge map construction method based on industrial chain |
CN109299362A (en) * | 2018-09-21 | 2019-02-01 | 平安科技(深圳)有限公司 | Similar enterprise's recommended method, device, computer equipment and storage medium |
CN109543045A (en) * | 2018-11-15 | 2019-03-29 | 厦门笨鸟电子商务有限公司 | A kind of methods of exhibiting of whole world industrial chain |
CN110020092A (en) * | 2018-11-20 | 2019-07-16 | 皮商云集(厦门)科技有限公司 | Leather industry data center systems based on web crawlers |
CN110020226A (en) * | 2018-08-20 | 2019-07-16 | 中国平安人寿保险股份有限公司 | Method for exhibiting data, user equipment, storage medium and device based on big data |
CN110175239A (en) * | 2019-04-23 | 2019-08-27 | 成都数联铭品科技有限公司 | A kind of construction method and system of knowledge mapping |
CN110263233A (en) * | 2019-05-06 | 2019-09-20 | 平安科技(深圳)有限公司 | Enterprise's public sentiment base construction method, device, computer equipment and storage medium |
CN111275364A (en) * | 2020-03-28 | 2020-06-12 | 苏州中灏文化科技有限公司 | Regional collaborative manufacturing management service platform based on industrial map |
CN112464668A (en) * | 2020-11-26 | 2021-03-09 | 南京数脉动力信息技术有限公司 | Method and system for extracting dynamic information of smart home industry |
CN113326870A (en) * | 2021-05-11 | 2021-08-31 | 中科迅(深圳)科技有限公司 | Multi-platform tourism data fusion system based on big data |
CN113987146A (en) * | 2021-10-22 | 2022-01-28 | 国网江苏省电力有限公司镇江供电分公司 | A new type of intelligent question answering system dedicated to power intranet |
CN114202355A (en) * | 2021-11-22 | 2022-03-18 | 北京基智科技有限公司 | Business opportunity recommendation system and storage medium for enterprise service industry based on market cloud data |
CN114722210A (en) * | 2020-12-22 | 2022-07-08 | 京东科技控股股份有限公司 | A data processing method and device |
CN118569374A (en) * | 2024-07-09 | 2024-08-30 | 济南慧谷数字科技有限公司 | A graph system for investment promotion information in the construction industry chain |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6446065B1 (en) * | 1996-07-05 | 2002-09-03 | Hitachi, Ltd. | Document retrieval assisting method and system for the same and document retrieval service using the same |
CN103455636A (en) * | 2013-09-27 | 2013-12-18 | 浪潮齐鲁软件产业有限公司 | Automatic capturing and intelligent analyzing method based on Internet tax data |
CN104376406A (en) * | 2014-11-05 | 2015-02-25 | 上海计算机软件技术开发中心 | Enterprise innovation resource management and analysis system and method based on big data |
CN104573016A (en) * | 2015-01-12 | 2015-04-29 | 武汉泰迪智慧科技有限公司 | System and method for analyzing vertical public opinions based on industry |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109255034A (en) * | 2018-08-08 | 2019-01-22 | 数据地平线(广州)科技有限公司 | A kind of domain knowledge map construction method based on industrial chain |
CN110020226A (en) * | 2018-08-20 | 2019-07-16 | 中国平安人寿保险股份有限公司 | Method for exhibiting data, user equipment, storage medium and device based on big data |
CN110020226B (en) * | 2018-08-20 | 2023-07-21 | 中国平安人寿保险股份有限公司 | Big data-based data display method, user equipment, storage medium and device |
CN109299362B (en) * | 2018-09-21 | 2023-04-14 | 平安科技(深圳)有限公司 | Similar enterprise recommendation method and device, computer equipment and storage medium |
CN109299362A (en) * | 2018-09-21 | 2019-02-01 | 平安科技(深圳)有限公司 | Similar enterprise's recommended method, device, computer equipment and storage medium |
CN109543045A (en) * | 2018-11-15 | 2019-03-29 | 厦门笨鸟电子商务有限公司 | A kind of methods of exhibiting of whole world industrial chain |
CN110020092A (en) * | 2018-11-20 | 2019-07-16 | 皮商云集(厦门)科技有限公司 | Leather industry data center systems based on web crawlers |
CN110175239A (en) * | 2019-04-23 | 2019-08-27 | 成都数联铭品科技有限公司 | A kind of construction method and system of knowledge mapping |
CN110263233A (en) * | 2019-05-06 | 2019-09-20 | 平安科技(深圳)有限公司 | Enterprise's public sentiment base construction method, device, computer equipment and storage medium |
CN111275364A (en) * | 2020-03-28 | 2020-06-12 | 苏州中灏文化科技有限公司 | Regional collaborative manufacturing management service platform based on industrial map |
CN112464668A (en) * | 2020-11-26 | 2021-03-09 | 南京数脉动力信息技术有限公司 | Method and system for extracting dynamic information of smart home industry |
CN114722210A (en) * | 2020-12-22 | 2022-07-08 | 京东科技控股股份有限公司 | A data processing method and device |
CN113326870A (en) * | 2021-05-11 | 2021-08-31 | 中科迅(深圳)科技有限公司 | Multi-platform tourism data fusion system based on big data |
CN113326870B (en) * | 2021-05-11 | 2023-08-04 | 中科迅(深圳)科技有限公司 | Multi-platform travel data fusion system based on big data |
CN113987146B (en) * | 2021-10-22 | 2023-01-31 | 国网江苏省电力有限公司镇江供电分公司 | An intelligent question answering system dedicated to electric power intranet |
CN113987146A (en) * | 2021-10-22 | 2022-01-28 | 国网江苏省电力有限公司镇江供电分公司 | A new type of intelligent question answering system dedicated to power intranet |
CN114202355A (en) * | 2021-11-22 | 2022-03-18 | 北京基智科技有限公司 | Business opportunity recommendation system and storage medium for enterprise service industry based on market cloud data |
CN118569374A (en) * | 2024-07-09 | 2024-08-30 | 济南慧谷数字科技有限公司 | A graph system for investment promotion information in the construction industry chain |
Also Published As
Publication number | Publication date |
---|---|
CN108229810B (en) | 2021-02-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108229810A (en) | Industry analysis system and method based on network information resource | |
US20230114019A1 (en) | Method and apparatus for the semi-autonomous management, analysis and distribution of intellectual property assets between various entities | |
Vargiu et al. | Exploiting web scraping in a collaborative filtering-based approach to web advertising. | |
CN102270331B (en) | Network shopping navigating method based on visual search | |
US20130282693A1 (en) | Object oriented data and metadata based search | |
Siddiqui et al. | Web mining techniques in e-commerce applications | |
CN106202563A (en) | A kind of real time correlation evental news recommends method and system | |
CN104036038A (en) | News recommendation method and system | |
CN106991175B (en) | Customer information mining method, device, equipment and storage medium | |
Vijiyarani et al. | Research issues in web mining | |
Dias et al. | Automating the extraction of static content and dynamic behaviour from e-commerce websites | |
Dong et al. | Profiling users via their reviews: an extended systematic mapping study | |
Bhujbal et al. | News aggregation using web scraping news portals | |
CN106844588A (en) | A kind of analysis method and system of the user behavior data based on web crawlers | |
CN108153754A (en) | A kind of data processing method and its device | |
Talakokkula | A survey on web usage mining, applications and tools | |
Kumar | Web usage mining techniques and applications across industries | |
Alsini et al. | Data streaming | |
Sharma et al. | A hand to hand taxonomical survey on web mining | |
Jian-guo et al. | Web mining for electronic business application | |
Ding et al. | [Retracted] Clustering Merchants and Accurate Marketing of Products Using the Segmentation Tree Vector Space Model | |
Chou et al. | Accessing e-learners' knowledge for personalization in e-learning environment | |
Dong et al. | Based User Profiling: A Systematic Mapping Study | |
Schneider et al. | Leveraging web data harvesting for product recommendation systems: a comprehensive review of methodologies and use cases | |
Siqueira et al. | Leveraging analysis of user behavior from Web usage extraction over DOM-tree structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder |
Address after: 100190 No. 95 East Zhongguancun Road, Beijing, Haidian District; Patentee after: INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES; Patentee after: Zhongke (Luoyang) robot and intelligent equipment Research Institute; Address before: 100190 No. 95 East Zhongguancun Road, Beijing, Haidian District; Patentee before: INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES; Patentee before: INNOVATION INSTITUTE FOR ROBOT AND INTELLIGENT EQUIPMENT (LUOYANG), CASIA |
TR01 | Transfer of patent right |
Effective date of registration: 20250718; Address after: Luoyang District of Jianxi Dragon Yu Lu National University Science and Technology Park in Henan province Luoyang city 471000 Building No. 1 Room 201; Patentee after: Zhongke (Luoyang) robot and intelligent equipment Research Institute; Country or region after: China; Address before: 100190 Zhongguancun East Road, Beijing, No. 95, No.; Patentee before: INSTITUTE OF AUTOMATION, CHINESE ACADEMY OF SCIENCES; Country or region before: China; Patentee before: Zhongke (Luoyang) robot and intelligent equipment Research Institute |