WO2020068009A1 - A search engine and data warehouse system with vertical and thematic focus

Info

Publication number: WO2020068009A1
Application number: PCT/TR2018/050528
Authority: WO
Inventors: Mehmet Ali ERDAL
Original assignee: Metaform Bilisim Iletisim Danismanlik San Tic Ltd Sti
Current assignee: Metaform Bilisim Iletisim Danismanlik San Tic Ltd Sti
Priority date: 2018-09-26
Filing date: 2018-09-26
Publication date: 2020-04-02
Anticipated expiration: 2021-03-26

Abstract

The present invention relates to a search engine and data warehouse system (1) with vertical and thematic focus whereby corporate and/or professional users can control, customize and configure the evaluation parameters and algorithms applied to source data in the internet environment according to their needs, store the data in a special data warehouse, and browse the stored source sites and links visually in a relational and depth-wise way.

Description

A SEARCH ENGINE AND DATA WAREHOUSE SYSTEM WITH VERTICAL AND THEMATIC FOCUS

Technical Field

The present invention relates to a search engine and data warehouse system with vertical and thematic focus whereby users can control, customize and configure the evaluation parameters and algorithms applied to source data in the internet environment according to their needs, store the data in a special data warehouse, and browse the stored source websites and links visually in a relational and depth-wise way.

Background of the Invention

Search engines in use today, such as Google, Yandex and Bing, are offered as general-purpose services. Their users cannot control or customize the parameters and algorithms that evaluate the source data, nor can they store the data gathered in line with their needs in any database. Historically, only a small number of follower search engine services began operating worldwide, first under the leadership of Yahoo and then after Google came to make an overwhelming impression in the field. Such search engines are created as public services by companies that build their income models on serving the searches of personal and corporate users from every country, age group and profile.

Horizontal-type search engines have to answer searches in all fields and at every level for everyone, and they struggle to find what the user is looking for among the large amount of data obtained; unfortunately, their success in relational or neural searches is not high. The results of these search engines are presented as a list, so users still search manually, clicking the links in the list one by one to find the websites that contain the information they need. They also have to save the pages that contain the searched information somewhere before forgetting them, return to their last position in the list, continue clicking other links, and remember which ones they have already clicked. In addition, the large number of unrelated links listed for the search criteria makes it impossible for a person to reach the searched data at the rate at which links can be clicked per unit time.

The Chinese patent document no. CN107273499A discloses a data capture method based on a vertical search engine. The method disclosed in that document determines an association degree for each webpage by crawling and analysing the webpage. At the same time, it cannot store a webpage and a website in association with each other according to an association degree threshold. In that invention, collection efficiency and storage efficiency are improved by multi-threaded webpage crawling.

Therefore, there is a need for a structure that overcomes the above-stated problems; that can be identified by the units of corporate structures, such as research and development, strategy, sales, human resources and purchasing, according to their own needs; that collects data continuously by doing research in areas with vertical and thematic focus as well as in custom fields; and that listens to the requested websites within the specified scope and provides a copy of the data to a data center prepared for corporate structures.

Summary of the Invention

An objective of the present invention is to realize a search engine and data warehouse system with vertical and thematic focus whereby corporate and/or professional users can control, customize and configure the parameters and algorithms that evaluate the source data according to their needs and store the data in a database.

Another objective of the present invention is to realize a system which stores the data captured by the search engine system with vertical and thematic focus, in which web pages are searched, sorted, archived and browsed, by means of a data warehouse service built into the internet environment or a customer data center.

Another objective of the present invention is to realize a system whereby users can switch from one search engine data warehouse with vertical or thematic focus for which they are authorized to another one.

Another objective of the present invention is to realize a search engine and data warehouse system with vertical and thematic focus wherein the vertical, thematic or custom field required by corporate structures, and the scope thereof, are identified.

Another objective of the present invention is to realize a search engine and data warehouse system with vertical and thematic focus whereby a vertical field can be identified; which can operate as focused; and which, whether operating in the vertical field, as focused, or both, continuously listens to the websites determined by the users within the scope of the identified theme.

Another objective of the present invention is to realize a search engine and data warehouse system with vertical and thematic focus whereby the pages comprising data within the scope of the identified theme, and the domains on which those pages are published, are archived categorically and automatically together with new/old version information.

Another objective of the present invention is to realize a search engine and data warehouse system with vertical and thematic focus whereby the users who will make use of the outputs of each thematic search engine are identified according to their authorizations.

Another objective of the present invention is to realize a search engine and data warehouse system with vertical and thematic focus whereby the pages containing the data archived within the scope of the identified theme or themes, and the websites on which those pages are published, are examined in a depth-wise visual graphic interface.

Another objective of the present invention is to realize a search engine and data warehouse system with vertical and thematic focus whereby subject of interest (SOI) identification that will identify each thematic field can be realized.

Detailed Description of the Invention

“A Search Engine and Source Data System with Vertical and Thematic Focus” realized to fulfil the objectives of the present invention is shown in the figure attached, in which:

Figure 1 is a schematic view of the inventive search engine and source data system with vertical and thematic focus.

The components illustrated in the figure are individually numbered, where the numbers refer to the following:

1. System

2. Identification module

3. Indexing module

4. Index database

5. Calculation module

6. Database

The inventive search engine and data warehouse system with vertical and thematic focus (1), which performs thematic search, comprises:

- at least one identification module (2) wherein the theme is identified vertically and as focused and configurations are made, and which provides reports and statistics on the identified themes, displays the data in a relational structure and operates as a web-based module;

- at least one indexing module (3) which examines web pages according to the theme determined in the identification module (2), indexes the examined web pages, inserts the URLs (Uniform Resource Locators) of the web pages into a queue by using a sorting algorithm, and is in communication with an index database (4) wherein the indexed URLs are stored; and

- at least one calculation module (5) which calculates the relation level from the data of the URL in the queue according to the theme configurations, records the data in the database (6) and carries out tagging operations.

In a preferred embodiment of the invention, the identification module (2) is configured to recalculate the relation level according to user feedback. In addition, the identification module (2) can carry out theme-specific user identification.

The web pages to be listened to and indexed are entered on the identification module (2) according to users’ requests. Based on the identifications entered on the identification module (2), the websites and pages that contain only the data requested and needed by the users are automatically and continually delivered to the users, and the determined websites are listened to continuously. Users are thereby relieved of the workload of individually searching and examining millions of websites, which takes qualified users a great deal of time, deciding which of them contain the desired data, and then copying that data elsewhere.

The identification of the users who will make use of the outputs of each thematic search engine is carried out, according to their authorizations, on the identification module (2). The identification module (2) also presents, on a depth-wise visual graphic interface, the pages containing the data archived within the scope of the theme or themes identified for the institution or institutions, and the websites that publish those pages.

A subject of interest (SOI) identification that identifies each thematic field is also made on the identification module (2). For example, the SOIs for a thematic field with a TESLA focus may be “competitors, investments, products, executives, reports, suppliers, sales or partners”.
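
As an illustration only, a theme and its subjects of interest could be held in a small data structure such as the one sketched below. The sketch is written in Python for brevity, and every name in it (`Theme`, `focus`, `subjects_of_interest`, `monitored_sites`, `authorized_users`) is hypothetical rather than taken from the application.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Theme:
    """Hypothetical record for one thematic field entered on the identification module (2)."""
    name: str
    focus: Optional[str] = None  # a company, person or case; None means "operate without focus"
    subjects_of_interest: List[str] = field(default_factory=list)
    monitored_sites: List[str] = field(default_factory=list)
    authorized_users: Set[str] = field(default_factory=set)

    def is_focused(self) -> bool:
        return self.focus is not None

    def can_access(self, user: str) -> bool:
        """Only users authorized for this theme may use its outputs."""
        return user in self.authorized_users


# Example values follow the TESLA example in the text; the user name is invented.
tesla = Theme(
    name="TESLA",
    focus="TESLA",
    subjects_of_interest=["competitors", "investments", "products", "executives",
                          "reports", "suppliers", "sales", "partners"],
    authorized_users={"analyst_1"},
)
```

Keeping the authorized users on the theme itself is one simple way to reflect the theme-specific user identification described above; a real deployment would more likely delegate this to an access-control layer.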

In a preferred embodiment of the invention, the indexing module (3) is developed using Java and open-source technologies. Before enqueuing a page, the indexing module (3) first splits its URL and scales it. A preferred embodiment uses an algorithm developed specifically for this operation. After splitting the URL, the indexing module (3) separates it into word roots. In a preferred embodiment, the indexing module (3) performs this separation into word roots by using NLP (natural language processing) frameworks.
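
The application does not disclose the splitting and scaling algorithm. The following Python sketch shows one plausible reading of the steps described above: the URL is split into tokens, each token is reduced to a root, and the roots are scored against user-defined weights. The function names, the toy suffix-stripping rule (a stand-in for a real NLP stemmer) and the weights are all assumptions.

```python
import re
from urllib.parse import urlparse


def split_url(url):
    """Split a URL into lowercase word tokens taken from its host, path and query."""
    parsed = urlparse(url)
    text = " ".join([parsed.netloc, parsed.path, parsed.query])
    return [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def crude_root(token):
    """Toy stand-in for an NLP stemmer: strip a few common English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def scale_url(url, weights):
    """Sum the user-defined weights of the distinct word roots found in the URL."""
    roots = {crude_root(t) for t in split_url(url)}
    return sum(weights.get(r, 0.0) for r in roots)


# Hypothetical weights a user might assign for a theme.
weights = {"tesla": 2.0, "investment": 1.5}
score = scale_url("https://www.example.com/news/tesla-investments?year=2018", weights)
```

Here the URL tokens `tesla` and `investments` reduce to roots that both carry a weight, so the URL receives the sum of the two.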

In the invention, the indexing module (3) is configured so that scales can be determined per field and users can determine the related scales themselves. The indexing module (3) prioritizes the queue so that the URLs with the highest value are processed first. In a preferred embodiment, the indexing module (3) performs this prioritization with a priority queue algorithm. The indexing module (3) preferably uses a Java open-source HTML processor to access web pages. The indexing module (3) uses a pre-ranking algorithm so that ranking is made in order of importance.
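
A priority queue in which the highest-valued URL is processed first can be sketched as follows. This is a minimal illustration built on Python's standard `heapq` module; the class name `UrlQueue`, the duplicate check and the tie-breaking rule are assumptions, not details from the application.

```python
import heapq
import itertools


class UrlQueue:
    """Max-priority queue of URLs. heapq is a min-heap, so scores are stored negated."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: equal scores leave in insertion order
        self._seen = set()

    def push(self, url, score):
        if url in self._seen:  # assumption: a URL is enqueued at most once
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        """Remove and return the URL with the highest score."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

Pushing URLs with scores 1.0, 3.5 and 2.0 and then popping three times returns them in the order 3.5, 2.0, 1.0.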

Where it is desired to operate as focused in accordance with the identification, the indexing module (3) is configured to take into account, while examining the URLs, the requirement that the focus of interest must also be present in the examined page. Where it is desired to operate without focus in accordance with the identification, the indexing module (3) calculates the theme-relation levels of the data iteratively by using the related feedback algorithm.
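
The two modes can be illustrated with a short Python sketch. The feedback algorithm itself is not specified in the application, so the sketch substitutes a simple Rocchio-style weight update as a stand-in; the function names and the `beta` and `gamma` parameters are assumptions.

```python
from collections import Counter


def passes_focus(page_words, focus):
    """Focused mode: the focus term must occur in the page. With no focus, every page passes."""
    return focus is None or focus.lower() in {w.lower() for w in page_words}


def relation_level(page_words, weights):
    """Weighted count of theme terms in the page, normalized by page length."""
    if not page_words:
        return 0.0
    counts = Counter(w.lower() for w in page_words)
    return sum(weights.get(w, 0.0) * c for w, c in counts.items()) / len(page_words)


def apply_feedback(weights, relevant, irrelevant, beta=0.5, gamma=0.25):
    """One feedback iteration: raise the weights of terms seen in pages marked relevant,
    lower (but never below zero) the weights of terms seen in pages marked irrelevant."""
    updated = dict(weights)
    for page in relevant:
        for w in {x.lower() for x in page}:
            updated[w] = updated.get(w, 0.0) + beta / len(relevant)
    for page in irrelevant:
        for w in {x.lower() for x in page}:
            updated[w] = max(0.0, updated.get(w, 0.0) - gamma / len(irrelevant))
    return updated
```

Calling `apply_feedback` repeatedly, each time with the pages users have most recently marked, gives the iterative recalculation: a page's relation level rises as its terms accumulate positive feedback.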

In a preferred embodiment of the invention, the indexing module (3) archives the pages that comprise data within the scope of the theme defined in the identification module (2), and the domains on which these pages are published, in the index database (4) categorically and automatically, together with new/old version information.
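
One way to keep new/old version information is to store a new version of a page only when its content has changed. The Python sketch below uses an in-memory dictionary as a stand-in for the index database (4); the class names, the theme/domain/URL keying and the use of a SHA-256 digest to detect changes are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List
from urllib.parse import urlparse


@dataclass
class PageVersion:
    number: int   # 1 is the oldest version
    digest: str   # SHA-256 of the content, used to detect changes
    content: str


class Archive:
    """In-memory stand-in for the index database (4), keyed by theme, domain and URL."""

    def __init__(self) -> None:
        self._store: Dict[str, Dict[str, Dict[str, List[PageVersion]]]] = {}

    def save(self, theme: str, url: str, content: str) -> bool:
        """Archive the page; return True only if a new version was added."""
        domain = urlparse(url).netloc
        versions = self._store.setdefault(theme, {}).setdefault(domain, {}).setdefault(url, [])
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if versions and versions[-1].digest == digest:
            return False  # unchanged since the last capture
        versions.append(PageVersion(len(versions) + 1, digest, content))
        return True

    def history(self, theme: str, url: str) -> List[PageVersion]:
        """Return all stored versions of a page, oldest first."""
        domain = urlparse(url).netloc
        return list(self._store.get(theme, {}).get(domain, {}).get(url, []))

    def domains(self, theme: str) -> List[str]:
        """Return the domains archived under a theme, sorted by name."""
        return sorted(self._store.get(theme, {}))
```

Grouping by theme and then by domain mirrors the categorical archiving of both pages and the domains that publish them.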

In a preferred embodiment of the invention, the calculation module (5) is developed using Java and open-source technologies and is configured to listen to the queue and calculate the relation level by applying an algorithm to the data of the URL in the queue. The calculation module (5) takes the theme arrangement into account when calculating the relation level. It separates the data in the webpage body into word roots by using open-source natural language processing structures and calculates the relation level of the page. When tagging inside the body of the webpage, the calculation module (5) uses the word-tag pairs added for the related theme.
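
The flow described above (take an item from the queue, compute a relation level, tag the body, record the result) might look like the following Python sketch. It is illustrative only: the relation level here is simply the share of theme words in the body, word-root separation is omitted for brevity, and a `deque` and a plain dictionary stand in for the queue and the database (6).

```python
import re
from collections import deque


def tokenize(body):
    """Lowercase the body and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", body.lower())


def tag_body(body, word_tags):
    """Return each (word, tag) pair whose word occurs in the body, in order of first occurrence."""
    tags = []
    for word in tokenize(body):
        if word in word_tags and (word, word_tags[word]) not in tags:
            tags.append((word, word_tags[word]))
    return tags


def process_queue(queue, theme_words, word_tags, database):
    """Drain the queue: compute a relation level for each page and record it with its tags."""
    while queue:
        url, body = queue.popleft()
        words = tokenize(body)
        hits = sum(1 for w in words if w in theme_words)
        level = hits / len(words) if words else 0.0
        database[url] = {"relation_level": level, "tags": tag_body(body, word_tags)}


# Hypothetical input: one captured page, two theme words and their word-tag pairs.
queue = deque([("https://example.com/a", "Tesla opens new battery plant")])
database = {}
process_queue(queue, {"tesla", "battery"}, {"tesla": "company", "battery": "product"}, database)
```

After the call, `database` holds one record for the page with a relation level of two theme words out of five and the two matching word-tag pairs.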

In the inventive system (1), the calculation module (5) examines the intensities and frequencies of the users’ words as they occur in the webpages that are captured within the theme identified via the identification module (2) and enqueued by the indexing module (3), preferably by using artificial intelligence techniques. In a preferred embodiment, the calculation module (5) uses artificial intelligence techniques such as clustering and regression for this examination.
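
Word frequency and intensity can be computed directly from the page text. The Python sketch below treats "intensity" as density (occurrences divided by total words), which is an interpretation and not a definition from the application. It also includes a cosine similarity between word-count vectors, a measure commonly used as input to clustering; the clustering and regression steps themselves are not shown.

```python
import math
import re
from collections import Counter


def word_stats(text, theme_words):
    """Map each theme word to (frequency, density) within the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(w for w in words if w in theme_words)
    total = len(words) or 1  # avoid division by zero on empty text
    return {w: (counts[w], counts[w] / total) for w in theme_words}


def cosine(a, b):
    """Cosine similarity of two word-count vectors given as Counters."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Pages whose count vectors have a high cosine similarity use the theme vocabulary in similar proportions, which is one basis on which a clustering step could group them.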

In a preferred embodiment of the invention, the indexing module (3) and the database (6) exist together in a data warehouse structure. Thus, the relation level is continuously recalculated on the collected data according to the data input from the identification module (2), and the data can be updated so as to be processed in queue order. By means of the data warehouse structure, users can easily switch from one search engine data warehouse with vertical or thematic focus for which they are authorized to another.

The inventive search engine and data warehouse system with vertical and thematic focus (1) can be identified by corporate structures according to the needs of their units, such as research and development, strategy, sales, human resources and purchasing, and it collects data continuously by doing research in areas with vertical and thematic focus as well as in custom fields. In addition, the inventive system continuously listens to the requested websites within the specified scope and stores a copy of the data in a data center prepared for corporate structures.

With the inventive system (1), the vertical, thematic or custom field and scope required by the units, departments, customers or stakeholders of public institutions, private sector organizations, universities, institutes and non-governmental organizations, or by professional end users, can be identified. With the search engine and data warehouse system with vertical and thematic focus (1), a vertical field can be identified, for example cancer, artificial vision or laser. The system can also take a company, person or case as a basis and operate as focused in order to collect data in the required fields related thereto. Whether it operates in both ways or in only one of them, the websites indicated by the users are listened to continuously within the scope of the identified theme.

With the inventive search engine and data warehouse system with vertical and thematic focus (1), users can control, customize and configure the evaluation parameters and algorithms applied to source data in the internet environment according to their needs, store the data in a special data warehouse, and browse the stored source websites and links visually in a relational and depth-wise way.

It is possible to develop various embodiments of the inventive search engine system with vertical and thematic focus (1); the invention is not limited to the examples disclosed herein and is essentially as defined in the claims.

Claims

1. A search engine and data warehouse system with vertical and thematic focus (1) characterized by:

- at least one identification module (2) wherein the theme is identified vertically and as focused, configurations are realized and which provides reports and statistics of the identified themes, displays the data in a relational structure, operates as web based;

2. A search engine and data warehouse system with vertical and thematic focus (1) according to Claim 1; characterized by the identification module (2) which is configured in order to recalculate the relation level according to the user feedback.

3. A search engine and data warehouse system with vertical and thematic focus (1) according to Claim 1 or 2; characterized by the identification module (2) which carries out theme-specific user identification transaction.

4. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the identification module (2) whereon web pages aimed to be listened and indexed depending on users’ requests are entered.

5. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the identification module (2) whereon identification transaction is carried out by authorizations of the users who will make use of outputs of each thematic search engine.

6. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the identification module (2) which presents the pages where the data archived within the scope of the theme or the themes whereby the institution or the institutions are identified are located and the websites that publish the pages, on a depth-wise visual graphic interface.

7. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the identification module (2) wherein SOI (subject of interest) identification that will identify each thematic field is also made.

8. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the indexing module (3) which splits the URL at first and scales it before enqueuing the related page.

9. A search engine and data warehouse system with vertical and thematic focus (1) according to Claim 8; characterized by the indexing module (3) which is configured in order to separate the URL into word roots after splitting it.

10. A search engine and data warehouse system with vertical and thematic focus (1) according to Claim 8 or 9; characterized by the indexing module (3) which carries out the transaction of separating into word root by using NLP (natural language processing) frameworks.

11. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the indexing module (3) which is configured in order to ensure that scales can be determined according to fields and the users can determine the related scales by themselves.

12. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the indexing module (3) which is configured in order to take into consideration, while it is examining the URLs, the requirement that the focus of interest must be in the examined page as well, in cases where it is desired to operate as focused in accordance with the identification.

13. A search engine and data warehouse system with vertical and thematic focus (1) according to Claim 12; characterized by the indexing module (3) which calculates theme-relation levels of the data iteratively by using the related feedback algorithm in cases where it is desired to operate without focus in accordance with the identification.

14. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the indexing module (3) which archives the pages that comprise the data within the scope of the theme defined in the identification module (2) and the domains where these pages are published, in the index database (4) together with the new/old version information categorically and automatically.

15. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the calculation module (5) which is configured in order to be developed using Java and open sources and to calculate the relation level by using an algorithm on the data of the URL in the queue by listening to the queue.

16. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the calculation module (5) which separates the data in the webpage body into the word roots by using open-source natural language processing structures and calculates the relation level of the page.

17. A search engine and data warehouse system with vertical and thematic focus (1) according to Claim 16; characterized by the calculation module (5) which uses the word-tag duo added for the related theme while it carries out the tagging transaction inside the body of the webpage.

18. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the calculation module (5) which examines the word intensities and frequencies of the user that are used in the webpages enqueued by the indexing module (3) upon being captured inside the theme that is identified over the identification module (2), preferably by using artificial intelligence techniques.

19. A search engine and data warehouse system with vertical and thematic focus (1) according to Claim 18; characterized by the calculation module (5) which uses artificial intelligence techniques such as clustering and regression in a preferred embodiment for the examination transaction.

20. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the indexing module (3) which uses a pre-ranking algorithm in order that ranking is made according to order of importance while indexing the webpages gradually.

21. A search engine and data warehouse system with vertical and thematic focus (1) according to any of the preceding claims; characterized by the indexing module (3) and the database (6) which co-exist in a data warehouse structure.