CN119202133A

CN119202133A - Shared dimension optimization method, device, equipment and storage medium for data warehouse model

Info

Publication number: CN119202133A
Application number: CN202411362772.XA
Authority: CN
Inventors: 陈朝亮
Original assignee: Ping An Property and Casualty Insurance Company of China Ltd
Current assignee: Ping An Property and Casualty Insurance Company of China Ltd
Priority date: 2024-09-27
Filing date: 2024-09-27
Publication date: 2024-12-27

Abstract

The present invention relates to a shared dimension optimization method for a data warehouse model. By collecting business data tables, Chinese descriptions of fields are generated, and a text parsing method is used to disassemble and count the Chinese descriptions, mark the parts of speech, and generate parsing results. The parsing results are optimized, parts of speech that are not related to the shared dimensions are filtered, and time period parts of speech and measurement parts of speech are screened out. Statistical analysis is performed on the optimized parsing results to identify dimensions with common characteristics. The identified common dimensions are processed based on business logic to generate shared dimensions. Through the generated shared dimensions, the structure of the data warehouse model is adjusted, the structural design of the dimension table and the fact table is improved, and the data query efficiency is optimized. The present invention reduces the workload of manual research and data analysis, improves the efficiency of dimension identification and sharing through automated processing, shortens the cycle of data combing, reduces labor costs, and adapts to the data processing requirements of complex business scenarios.

Description

Shared dimension optimization method, device, equipment and storage medium for a data warehouse model

Technical Field

The invention relates to the technical fields of big data and financial technology, and in particular to a shared dimension optimization method, device, equipment and storage medium for a data warehouse model.

Background

In finance and other data-intensive industries, sorting out and sharing data dimensions is an important part of data analysis and business decision-making. Shared dimensions typically include main dimensions that are closely tied to the business (e.g., product, customer, institution), other analysis dimensions (e.g., business type, payment method), and labels that modify or qualify business characteristics (e.g., whether a purchase is a self-purchase, whether a vehicle is a new energy vehicle). These dimensions give enterprises a basis for data analysis and support cross-department data sharing and decision optimization.

However, existing dimension-combing methods face a number of challenges in practice. The traditional approach relies mainly on manually initiated business investigation and requirements analysis, or on manual analysis of data processing logic item by item. These methods have the following significant drawbacks:

Cross-department investigation is difficult. Sharing data dimensions requires cooperation and communication between departments, but cross-department data investigation is often hard to coordinate, and departments differ in their data standards and analysis requirements, so the investigation process is complex and inefficient.

The investigation and combing cycle is long. Manually investigating and analyzing whether dimensions can be shared takes a great deal of time and effort. When multiple business scenarios are involved, the volume of data to be processed manually is huge, the combing cycle is long, and it is difficult to respond to changing business requirements in time.

The workload of analyzing data processing logic is heavy. Existing methods require each piece of data processing logic to be analyzed manually, one by one. This demands a high level of technical expertise and a large amount of human resources, which makes combing data processing logic extremely time-consuming and costly.

These deficiencies make existing dimension-combing methods inefficient in complex data environments with rapidly changing business requirements, and hard to put into practice quickly. How to comb shared dimensions in a more intelligent way, improve the efficiency of cross-department cooperation and reduce manual effort has therefore become a problem to be solved urgently.

Disclosure of Invention

The main object of the invention is to provide a shared dimension optimization method, device, equipment and storage medium for a data warehouse model, in order to solve the technical problem that the prior art relies on manual business investigation and item-by-item analysis of data processing logic, and therefore struggles to identify efficiently which dimensions can be shared.

In order to achieve the above object, the present invention provides a shared dimension optimization method for a data warehouse model, comprising:

collecting business data tables, wherein the business data tables comprise a data list report and a data summary table, and generating Chinese descriptions of the fields in the business data tables;

disassembling and counting the Chinese descriptions by a text parsing method, tagging parts of speech, and generating a parsing result;

optimizing the parsing result, and filtering out parts of speech unrelated to the shared dimensions, wherein the unrelated parts of speech comprise time period parts of speech and measure parts of speech;

performing statistical analysis on the optimized parsing result, and identifying commonality dimensions;

processing each identified commonality dimension in combination with business logic to generate shared dimensions, and adjusting the structure of the data warehouse model based on the shared dimensions.

In one embodiment, performing statistical analysis on the optimized parsing result and identifying commonality dimensions includes:

performing word frequency statistics on the words in the parsing result, and determining the frequency of occurrence of each word;

screening out, from all the words and based on a set word frequency threshold, the high-frequency words whose frequency of occurrence is higher than the threshold;

matching the high-frequency words against the Chinese descriptions of the fields, and judging whether each high-frequency word has the same business meaning in a plurality of Chinese descriptions;

marking the high-frequency words that have the same business meaning as commonality dimensions.

In one embodiment, processing each identified commonality dimension in combination with business logic to generate shared dimensions includes:

classifying the identified commonality dimensions, analyzing how they are used and defined in a plurality of business scenarios, and extracting representative scenarios from the plurality of business scenarios;

in the representative scenarios, analyzing in detail the dimension name, data type and processing mode in each scenario, and comparing whether the usage of the commonality dimension differs between scenarios;

comparing and analyzing the differences in the commonality dimension, and adjusting the differing parts by establishing a unified standard, so as to generate the shared dimension.
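As a non-limiting illustration, the comparison and unification described above can be sketched in Python. The function names `compare_dimension_usage` and `unify`, the attribute keys, and the sample scenarios are hypothetical and do not appear in the original disclosure; the sketch assumes that each scenario's usage of a dimension has already been summarized as a name, a data type and a processing mode.

```python
def compare_dimension_usage(usages):
    """Return the attributes of a commonality dimension that differ between scenarios.

    usages: {scenario: {"name": ..., "data_type": ..., "processing": ...}}
    """
    differences = {}
    for attribute in ("name", "data_type", "processing"):
        values = {scenario: spec[attribute] for scenario, spec in usages.items()}
        if len(set(values.values())) > 1:
            differences[attribute] = values
    return differences


def unify(usages, standard):
    """Apply one agreed standard definition to every scenario."""
    return {scenario: dict(standard) for scenario in usages}


# Hypothetical usage of a "channel" dimension in two representative scenarios.
usages = {
    "underwriting": {"name": "channel", "data_type": "string", "processing": "code lookup"},
    "claims": {"name": "channel", "data_type": "int", "processing": "code lookup"},
}
differences = compare_dimension_usage(usages)
shared = unify(usages, {"name": "channel", "data_type": "string", "processing": "code lookup"})
```

In this sketch only the data type differs between the two scenarios, so only that attribute would need to be settled by the unified standard.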

In one embodiment, disassembling and counting the Chinese descriptions by a text parsing method, tagging parts of speech, and generating a parsing result includes:

performing word segmentation on the Chinese descriptions, so that continuous text is disassembled into independent words;

screening the segmented words, removing words whose length is smaller than a set threshold, and retaining the remaining words as candidate words;

performing part-of-speech tagging on the retained candidate words, so that each candidate word is tagged with a grammatical attribute that distinguishes different types of words;

performing word frequency statistics on the candidate words, and recording the frequency of occurrence of each candidate word in the Chinese descriptions;

generating a parsing result, wherein the parsing result comprises the candidate words and their corresponding word frequencies and part-of-speech tags.

In one embodiment, optimizing the parsing result and filtering out parts of speech unrelated to the shared dimensions, the unrelated parts of speech comprising time period parts of speech and measure parts of speech, includes:

classifying the time words in the parsing result, and dividing them into long-term, medium-term and short-term time dimensions according to the length of the time period;

evaluating how the time dimensions are used in business scenarios, and screening out a first target time dimension for business analysis based on the frequency with which each time dimension occurs in different business scenarios;

evaluating the cross-scenario applicability of the time dimensions, judging whether each time dimension is used in a plurality of business scenarios, and retaining a second target time dimension that is used in a plurality of business scenarios;

evaluating the historical data accumulation capability of the time dimensions, determining whether each time dimension supports cross-period data accumulation, and retaining a third target time dimension that supports cross-period accumulation;

evaluating the trend analysis capability of the time dimensions, and screening out a fourth target time dimension that supports long-term trend analysis according to how each time dimension performs in trend analysis;

retaining the time period parts of speech that conform to the target time dimensions, and filtering out the other time period parts of speech that are unrelated to the shared dimensions.

In one embodiment, adjusting the structure of the data warehouse model based on the shared dimensions includes:

analyzing the relation between the shared dimensions and the data warehouse model, determining the application scenarios of the shared dimensions in the data warehouse model, and identifying the existing dimensions that need to be replaced, merged or extended;

adjusting the dimension table structure of the data warehouse model, adding fields according to the definitions of the shared dimensions, and deleting duplicate dimensions;

reconstructing the fact table structure of the data warehouse model, and adjusting the fields and join modes in the fact table so that the shared dimensions are associated with the measures in the fact table;

updating the association between the dimension tables and the fact table, and optimizing how the shared dimensions are joined to the fact table.

In one embodiment, after adjusting the structure of the data warehouse model based on the shared dimensions, the method further includes:

importing business data into the adjusted data warehouse model, and verifying the integrity of the data and the accuracy of loading;

testing whether the adjusted data warehouse model structure meets the query requirements of various business scenarios, and evaluating the performance of the shared dimensions in practical application;

verifying the association between the dimension tables and the fact table in the data warehouse model, and evaluating data consistency in the adjusted data warehouse model;

evaluating the response efficiency of the adjusted data warehouse model for data queries in different scenarios;

generating a data verification report, recording the verification results, and evaluating the availability and performance of the adjusted data warehouse model.
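As a non-limiting illustration, part of this verification can be sketched with Python's built-in `sqlite3` module. The function `verify_model`, the table names `dim_customer` and `fact_transaction`, and the sample rows are hypothetical; the sketch covers only the row count and the association check between a fact table and a dimension table, not query response efficiency.

```python
import sqlite3


def verify_model(conn, fact_table, dim_table, key):
    """Return a small verification report for one fact/dimension pair.

    Table and column names are interpolated into the SQL text, so they are
    assumed to come from trusted configuration rather than user input.
    """
    rows = conn.execute(f"SELECT COUNT(*) FROM {fact_table}").fetchone()[0]
    # Fact rows whose key has no matching row in the dimension table.
    orphans = conn.execute(
        f"SELECT COUNT(*) FROM {fact_table} AS f "
        f"LEFT JOIN {dim_table} AS d ON d.{key} = f.{key} "
        f"WHERE d.{key} IS NULL"
    ).fetchone()[0]
    return {"fact_rows": rows, "orphan_fact_rows": orphans, "consistent": orphans == 0}


# Hypothetical adjusted model with one deliberately unmatched fact row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_type TEXT);
    CREATE TABLE fact_transaction (txn_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'frequent');
    INSERT INTO fact_transaction VALUES (10, 1, 100.0), (11, 2, 50.0);
""")
report = verify_model(conn, "fact_transaction", "dim_customer", "customer_id")
```

The returned dictionary can serve as one entry of the data verification report; a fuller check would add load counts per source table and query timings.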

Further, in order to achieve the above object, the present invention also provides shared dimension optimization equipment for a data warehouse model. The equipment includes a memory, a processor, and a shared dimension optimization program for a data warehouse model that is stored in the memory and can run on the processor. When executed by the processor, the program implements the steps of the shared dimension optimization method for a data warehouse model described above.

Further, in order to achieve the above object, the present invention also provides a computer storage medium on which a shared dimension optimization program for a data warehouse model is stored. When executed by a processor, the program implements the steps of the shared dimension optimization method for a data warehouse model described above.

The beneficial effects of the invention are as follows. The invention relates to a shared dimension optimization method for a data warehouse model. Business data tables are collected and Chinese descriptions of their fields are generated; a text parsing method is used to disassemble and count the Chinese descriptions and to tag parts of speech, producing a parsing result. The parsing result is optimized by filtering out parts of speech unrelated to the shared dimensions, namely time period parts of speech and measure parts of speech. Statistical analysis is performed on the optimized parsing result to identify dimensions with common characteristics. The identified commonality dimensions are processed based on business logic to generate shared dimensions. The structure of the data warehouse model is adjusted using the generated shared dimensions, which improves the structural design of the dimension tables and the fact table and optimizes data query efficiency. The invention reduces the workload of manual investigation and data analysis, improves the efficiency of dimension identification and sharing through automated processing, shortens the data-combing cycle, reduces labor costs, and adapts to the data processing requirements of complex business scenarios.

Drawings

The invention will be further described with reference to the accompanying drawings and examples, in which:

FIG. 1 is a flow chart of an embodiment of the shared dimension optimization method for a data warehouse model according to the present invention;

FIG. 2 is a schematic diagram of the functional modules of a preferred embodiment of the shared dimension optimization device for a data warehouse model according to the present invention;

FIG. 3 is a schematic structural diagram of the hardware operating environment of the equipment involved in an embodiment of the shared dimension optimization equipment for a data warehouse model according to the present invention.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Referring to FIG. 1, FIG. 1 is a flow chart of an embodiment of the shared dimension optimization method for a data warehouse model according to the present invention. It should be noted that although a logical order is shown in the flow chart, in some cases the steps shown or described may be performed in a different order from the one presented here.

As shown in FIG. 1, the shared dimension optimization method for a data warehouse model provided by the invention comprises the following steps:

S10, collecting business data tables, wherein the business data tables comprise a data list report and a data summary table, and generating Chinese descriptions of the fields in the business data tables;

In this embodiment, the business data tables relevant to the analysis are acquired from different business systems. The data tables contain various kinds of structured data, such as transaction data, user information and product information. The source of a data table may be an internal enterprise database, an ERP system, or an external collaboration platform. Data is collected through a data interface (API), batch import, or a real-time data stream, so as to ensure the integrity and timeliness of the data tables.

The data list report is a data set that records the details of business operations, such as customer transaction details and product sales details. The data summary table contains result data obtained by aggregating the detail data, such as daily sales amounts and customer distribution. Data list reports typically have a finer granularity, while data summary tables are used for higher-level business analysis. Both kinds of table are typically collected through SQL queries, ETL (extract, transform, load) tools, or other data analysis platforms.

Descriptions are then generated for the fields of the collected data tables. These fields may exist as codes, abbreviations or other forms, for which a natural-language Chinese description needs to be generated for subsequent analysis. The descriptions may be generated by automated scripts (e.g., combining translation tools or dictionaries of field names), by rule bases, or from data dictionaries and historical business rules.
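As a non-limiting illustration, dictionary-based description generation can be sketched in Python as follows. The dictionary contents, the underscore-separated field naming style, and the function name `describe_field` are hypothetical; in practice the entries would come from the enterprise data dictionary or a rule base.

```python
import re

# Hypothetical data dictionary mapping field-name tokens to Chinese terms.
TOKEN_DICTIONARY = {
    "cust": "客户",
    "acct": "账户",
    "bal": "余额",
    "prod": "产品",
    "type": "类型",
}


def describe_field(field_name):
    """Build a Chinese description by translating each token of a field name."""
    tokens = [t for t in re.split(r"[_\W]+", field_name.lower()) if t]
    # Unknown tokens are kept as-is so that gaps in the dictionary stay visible.
    return "".join(TOKEN_DICTIONARY.get(t, t) for t in tokens)


descriptions = {name: describe_field(name) for name in ["cust_acct_bal", "prod_type"]}
```

Keeping untranslated tokens in the output, rather than dropping them, makes it easy to find fields whose abbreviations still need a dictionary entry.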

In one embodiment, an enterprise data center is used to obtain, in real time through API interfaces, the data tables distributed across a plurality of business systems, including customer account information, transaction records and compliance review data. The data center uses a microservice architecture to ensure seamless data collection across the different business systems. In combination with a big data processing framework (e.g., Apache Kafka), real-time streaming of data is ensured and highly concurrent data requests can be supported.

By automatically collecting the service data table and generating Chinese description, errors and inconsistencies during manual description generation are avoided, and the efficiency and accuracy of data processing are greatly improved.

S20, decomposing and counting the Chinese description through a text analysis method, marking parts of speech, and generating an analysis result;

In this embodiment, the Chinese descriptions are processed with a text parsing algorithm based on natural language processing (NLP) techniques. Common methods include dictionary- or rule-based word segmentation tools (e.g., jieba) and deep learning models (e.g., pretrained models such as BERT). The core goal of text parsing is to break continuous text down into independent lexical units that are easier to handle.

Disassembly refers to the disassembly of complex sentences in the Chinese description into words by word segmentation techniques, e.g. "customer account balance" can be disassembled into "customer", "account" and "balance". After word segmentation, frequency statistics is carried out on each word, and the occurrence times of each word in different descriptions are recorded, so that subsequent analysis is facilitated.

Part-of-speech tagging is the lexical analysis of each word, marking its role in grammar (e.g., nouns, verbs, adjectives, etc.). Through part-of-speech tagging, the system can better understand the meaning of each vocabulary in business logic. Part-of-speech tagging may be accomplished using a statistical-based model or a rule-based system.

After the disassembly, statistics and part-of-speech tagging are completed, an analysis result containing all the vocabulary, word frequency and part-of-speech tagging is generated. These results will be used for subsequent commonality dimension identification and model tuning.

In one embodiment, word segmentation is performed on the Chinese descriptions using a widely used word segmentation library (e.g., the jieba library). The library is customized with a glossary of financial terms so that professional terms are segmented accurately. After segmentation, the resulting words are counted and ranked by frequency of occurrence, so that high-frequency words can be screened later. A rule-based part-of-speech tagging system then tags the words as nouns, verbs and so on, which keeps the tagging of financial terms accurate. Finally, a result set containing the words, their frequencies and their part-of-speech tags is generated for subsequent analysis.

By means of the text analysis method, complex Chinese descriptions are efficiently disassembled and parts of speech are marked, analysis results are automatically generated, manual intervention is reduced, accuracy and efficiency of data processing are improved, and high-quality input data are provided for subsequent commonality dimension identification.
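As a non-limiting illustration, the disassembly, counting and tagging of step S20 can be sketched in pure Python. The sketch deliberately substitutes a simple forward maximum matching segmenter and a small hypothetical lexicon for a real segmentation library such as jieba; the tag names ("n", "v", "t", "m") and the functions `segment` and `parse_descriptions` are assumptions made for the example.

```python
from collections import Counter

# Hypothetical lexicon: word -> part-of-speech tag
# (n = noun, v = verb, t = time word, m = measure word).
LEXICON = {
    "客户": "n", "账户": "n", "余额": "n", "产品": "n",
    "类型": "n", "销售": "v", "月": "t", "金额": "m",
}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Forward maximum matching: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if piece in LEXICON or size == 1:
                words.append(piece)
                i += size
                break
    return words


def parse_descriptions(descriptions, min_len=2):
    """Return {word: (frequency, pos_tag)} for words of at least min_len characters."""
    counts = Counter()
    for text in descriptions:
        counts.update(w for w in segment(text) if len(w) >= min_len)
    return {w: (n, LEXICON.get(w, "x")) for w, n in counts.items()}


parse_result = parse_descriptions(["客户账户余额", "客户类型", "月销售金额"])
```

Here "客户账户余额" ("customer account balance") is split into "客户", "账户" and "余额", matching the example given for this step, and the single-character word "月" is dropped by the length filter.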

S30, optimizing the parsing result, and filtering out parts of speech unrelated to the shared dimensions, wherein the unrelated parts of speech comprise time period parts of speech and measure parts of speech;

In this embodiment, the result generated by text parsing is optimized. The optimization targets the vocabulary data obtained from the preceding parsing and counting, and aims to improve the accuracy with which shared dimensions are identified. It reduces interference from noisy data so that words highly relevant to the business can be identified correctly. Which words need to be optimized can be defined by rule bases, manual settings, or automated models, and the vocabulary filtering rules can be adjusted dynamically according to business requirements.

In the parsing result, some words, such as those denoting time and measures, have no direct bearing on the construction of shared dimensions. Their presence may interfere with the identification of shared dimensions, so they need to be filtered out. The words are first tagged with parts of speech, and the words unrelated to shared dimensions are then removed according to business rules or part-of-speech classification rules. Common methods include automatic filtering algorithms based on part-of-speech tags and manually maintained part-of-speech libraries.

Time period parts of speech refer to time-related words such as "year", "month" and "day", which usually denote a time span in data analysis but do not directly contribute to shared dimensions. Measure parts of speech cover measure-related words such as "amount" and "number of times", which are tied to specific values and are not used as shared dimensions. These words can be predefined in a vocabulary library, recognized by a part-of-speech tagging tool, and filtered out automatically in combination with business rules.

By optimizing the analysis result and filtering the irrelevant parts of speech, the noise in the data is reduced, the recognition accuracy of the shared dimension is improved, and particularly in a big data analysis scene, the processing efficiency can be remarkably improved, and the manual intervention is reduced.
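As a non-limiting illustration, the filtering of step S30 can be sketched in Python. The tag names "t" (time period) and "m" (measure), the function `filter_parse_result`, the optional `keep_words` argument and the sample parsing result are assumptions made for the example.

```python
# Part-of-speech tags treated as unrelated to shared dimensions (assumed tag names).
EXCLUDED_TAGS = {"t", "m"}


def filter_parse_result(parse_result, keep_words=frozenset()):
    """Drop time period and measure words, except those explicitly retained."""
    return {
        word: (freq, tag)
        for word, (freq, tag) in parse_result.items()
        if tag not in EXCLUDED_TAGS or word in keep_words
    }


# Hypothetical parsing result: word -> (frequency, part-of-speech tag).
parse_result = {
    "客户": (12, "n"),
    "年度": (7, "t"),
    "金额": (9, "m"),
    "产品": (8, "n"),
}
filtered = filter_parse_result(parse_result)
```

The `keep_words` argument leaves room for time period words that business rules decide to retain, while all other time period and measure words are removed.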

S40, carrying out statistical analysis on the optimized analysis result, and identifying a commonality dimension;

In this embodiment, the parsing result that remains after the unrelated parts of speech have been filtered out retains high-quality words related to the shared dimensions. After this preliminary screening, the words have stronger business relevance and are suitable for further statistical analysis. Removing the irrelevant words from the result generated by text parsing yields a more compact word set, and this optimized result is the base data for the subsequent dimension identification.

Statistical analysis is performed on the quantity and distribution characteristics of the optimized words to determine which words appear frequently across a plurality of data sets and business scenarios, and commonality dimensions are then identified from among them. The statistical analysis may include word frequency statistics, analysis of the business relevance of words, and analysis of word combinations in context. Common technical means include word frequency statistics, n-gram models and the TF-IDF algorithm, which are used to calculate the frequency, weight and importance of words in business scenarios. The statistical analysis can also be combined with historical data to further screen out words with common characteristics.

Commonality dimensions refer to dimensions that occur frequently and have consistent business meaning within multiple business scenarios and datasets. Through statistical analysis of the vocabulary, it can be identified which dimensions are generic across multiple scenarios, which can be used as shared dimensions. The identification of the commonality dimension can be accomplished through frequency statistics and vocabulary business application scenario cross-validation. For example, certain words frequently appear in multiple data tables and have the same business meaning, and can be identified as common dimensions. In addition, the recognition accuracy of the commonality dimension can be further improved by combining a semantic analysis tool.

By carrying out statistical analysis on the optimized analysis result, the commonality dimension in a plurality of business scenes can be efficiently identified, and the complexity and time consumption of manual analysis are reduced. The dimension with common characteristics in a plurality of data tables can be automatically screened, the accuracy and consistency of data sharing are improved, data analysis and decision-making of cross-business departments are supported, the degree of automation of dimension identification is improved, and the data processing requirement under complex business scenes is met.
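As a non-limiting illustration, one simple form of this statistical analysis is to count in how many distinct tables each word occurs. The function `table_coverage`, the table names, the sample words and the minimum coverage of two tables are assumptions made for the example; word frequency, n-gram or TF-IDF statistics could be used alongside or instead.

```python
from collections import Counter


def table_coverage(words_by_table):
    """Count, for each word, the number of distinct tables whose fields mention it."""
    coverage = Counter()
    for words in words_by_table.values():
        coverage.update(set(words))  # set(): each table counts at most once per word
    return coverage


# Hypothetical, already filtered vocabulary per business table.
words_by_table = {
    "policy_detail": ["客户", "产品", "机构", "客户"],
    "claim_detail": ["客户", "机构", "渠道"],
    "sales_summary": ["产品", "机构"],
}
coverage = table_coverage(words_by_table)
# Words that occur in at least two tables are candidates for commonality dimensions.
candidates = {word for word, n in coverage.items() if n >= 2}
```

Counting tables rather than raw occurrences keeps a word that is repeated many times inside a single table from being mistaken for a cross-scenario dimension.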

S50, processing each identified commonality dimension in combination with business logic to generate shared dimensions, and adjusting the structure of the data warehouse model based on the shared dimensions.

In this embodiment, a commonality dimension is a dimension of general applicability obtained by statistical analysis of a plurality of data tables and business scenarios. Identifying these commonality dimensions involves word frequency analysis, business logic verification and similar steps, and finally yields the dimensions that can be used across scenarios as the basis for shared dimensions. The optimized and verified words are extracted from the parsing result and taken as the basis of the commonality dimensions; because these dimensions are relevant to the business across scenarios, they are easy to share.

The identified commonality dimensions are then further processed and refined according to specific business logic. The processing includes adjusting the definition of a dimension and adding or modifying dimension attributes so that the dimension matches business requirements. The dimensions can be processed with a rule base, by manual adjustment, or with automated tools, and can be extended or optimized according to different business requirements so that they reflect the key indicators of the business scenarios more accurately.

After processing, a shared dimension which can be commonly used in a plurality of business scenes is finally generated. The shared dimension is a dimension adjusted by business logic, has the capacity of cross-department and cross-system use, and can support unified data analysis and decision. The generation of the shared dimension can be realized through an automation tool or script, and the processed dimension is added into the data model, so that each business department can perform data analysis and sharing under the same dimension frame.

The structure of the data warehouse model is adjusted to better accommodate the generated shared dimensions. Introducing shared dimensions requires the dimension tables, fact tables and other structures of the data warehouse to be optimized so that cross-scenario data queries and analysis are supported. The structural adjustment also needs to take into account factors such as data consistency and query efficiency. The dimension table structure in the data warehouse model is modified, shared dimension fields are adjusted or added, and the way dimensions are joined to the fact table is optimized, which improves the flexibility and query efficiency of the data warehouse model.

In one embodiment, the identified commonality dimensions are adjusted and extended according to a business rule base. For the customer dimension, the customer type may be further refined according to the customer's transaction behavior, preferences and so on. The processed customer dimension is defined as a standard shared dimension and is applicable to a plurality of business scenarios involving customer segmentation analysis. A shared field "customer type" is added to the customer dimension table in the data warehouse model, and joining it to the fact table supports fast queries for customer behavior analysis. Processing dimensions based on business logic makes the shared dimensions more precise, and optimizing the data warehouse model improves the accuracy of customer behavior analysis.
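As a non-limiting illustration, the customer dimension example can be sketched with Python's built-in `sqlite3` module. The table names `dim_customer` and `fact_transaction`, the column names, the customer type values and the sample rows are hypothetical; a production data warehouse would use its own engine and schema conventions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE fact_transaction (
        txn_id INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        amount REAL
    );
    INSERT INTO dim_customer VALUES (1, 'A'), (2, 'B');
    INSERT INTO fact_transaction VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 30.0);

    -- Add the shared "customer type" field to the dimension table.
    ALTER TABLE dim_customer ADD COLUMN customer_type TEXT;
    UPDATE dim_customer SET customer_type = 'frequent' WHERE customer_id = 1;
    UPDATE dim_customer SET customer_type = 'occasional' WHERE customer_id = 2;
""")

# Query the fact table through the shared dimension.
totals = dict(conn.execute("""
    SELECT d.customer_type, SUM(f.amount)
    FROM fact_transaction AS f
    JOIN dim_customer AS d ON d.customer_id = f.customer_id
    GROUP BY d.customer_type
""").fetchall())
```

Because the new field lives in the dimension table, every fact table that already carries `customer_id` can be analyzed by customer type without changing its own columns.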

By processing the identified commonality dimensions in combination with business logic and generating shared dimensions, dimensions can be shared and used uniformly across departments, which improves the accuracy and consistency of data analysis. After its structure has been adjusted, the data warehouse model can adapt to more business scenarios, which improves data query efficiency and system flexibility.

The invention relates to a shared dimension optimization method for a data warehouse model. Business data tables are collected and Chinese descriptions of their fields are generated; a text parsing method is used to disassemble and count the Chinese descriptions and to tag parts of speech, producing a parsing result. The parsing result is optimized by filtering out parts of speech unrelated to the shared dimensions, namely time period parts of speech and measure parts of speech. Statistical analysis is performed on the optimized parsing result to identify dimensions with common characteristics. The identified commonality dimensions are processed based on business logic to generate shared dimensions. The structure of the data warehouse model is adjusted using the generated shared dimensions, which improves the structural design of the dimension tables and the fact table and optimizes data query efficiency. The invention reduces the workload of manual investigation and data analysis, improves the efficiency of dimension identification and sharing through automated processing, shortens the data-combing cycle, reduces labor costs, and adapts to the data processing requirements of complex business scenarios.

In one embodiment, the step S40 includes:

S401, performing word frequency statistics on the words in the parsing result, and determining the frequency of occurrence of each word;

S402, screening out, from all the words and based on a set word frequency threshold, the high-frequency words whose frequency of occurrence is higher than the threshold;

S403, matching the high-frequency words against the Chinese descriptions of the fields, and judging whether each high-frequency word has the same business meaning in a plurality of Chinese descriptions;

S404, marking the high-frequency words that have the same business meaning as commonality dimensions.

In this embodiment, word frequency statistics means counting the parsed words, that is, calculating the number of times each word appears in the data set or in the field descriptions. Counting word frequencies shows which words are used most often in the business data and provides the basis for the subsequent screening of high-frequency words. A statistical tool or script traverses all words in the parsing result and records the number of occurrences of each word; collections.Counter in Python or a similar tool can be used for quick counting.

The word frequency threshold is a preset value used to select words that occur frequently. Words whose frequency is above the threshold are treated as high-frequency words, and words below the threshold are ignored. This removes low-frequency words that occur only incidentally in the business and ensures that only words with strong business relevance are retained. After word frequency statistics are complete, the words whose frequency exceeds the preset threshold are selected. The screening result can be tuned by adjusting the word frequency threshold so that the selected words have higher business value.
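The counting and threshold screening of S401 and S402 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the word list, the function name high_frequency_words, and the threshold value are assumptions made for the example; only collections.Counter is named in the text.

```python
from collections import Counter

# Hypothetical parsing result: words already segmented from field descriptions.
words = ["客户", "名称", "客户", "编号", "产品", "名称", "客户", "类型", "产品", "代码"]


def high_frequency_words(tokens, threshold):
    """Return {word: count} for the words that occur more than `threshold` times."""
    counts = Counter(tokens)  # S401: count every word
    # S402: keep only the words whose frequency is higher than the threshold
    return {word: count for word, count in counts.items() if count > threshold}


print(high_frequency_words(words, threshold=1))
```

Raising or lowering the threshold argument is the adjustment described above: a higher value keeps fewer, more dominant words.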

The selected high-frequency words are then matched further to judge whether each of them has the same business meaning in the field descriptions of different data tables. The aim is to ensure that a word represents the same or a similar business concept in different business scenarios and therefore has the potential to become a shared dimension. A semantic matching or word-sense matching tool is applied to the Chinese field descriptions to judge whether a high-frequency word has the same business meaning in different field descriptions. Natural language processing (NLP) tools or regular expressions may be used for the matching.

When a high-frequency word has the same business meaning in multiple business scenarios or fields, the word may be marked as a common dimension. A common dimension is applicable across scenarios and can be shared and reused in different data tables and business scenarios. High-frequency words confirmed to have the same business meaning are marked as common dimensions by tagging or with a data modeling tool. These dimensions can serve as the basis for subsequent data analysis and business decisions.
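Steps S403 and S404 can be illustrated with the sketch below. It makes a strong simplifying assumption: a word is treated as having the same business meaning across tables whenever it occurs as a substring in field descriptions of at least min_tables different tables. Real semantic matching would need an NLP tool; the table names, descriptions, and the function mark_common_dimensions are hypothetical.

```python
# Hypothetical Chinese field descriptions, keyed by table name.
descriptions = {
    "policy": ["客户编号", "保单生效日期"],
    "claim": ["客户编号", "理赔金额"],
    "payment": ["客户名称", "支付渠道"],
}


def mark_common_dimensions(high_freq_words, table_descriptions, min_tables=2):
    """Mark words found in the field descriptions of at least `min_tables` tables.

    Substring matching is a crude stand-in for semantic matching (assumption).
    """
    common = []
    for word in high_freq_words:
        tables = sorted(
            name
            for name, fields in table_descriptions.items()
            if any(word in field for field in fields)
        )
        if len(tables) >= min_tables:
            common.append((word, tables))
    return common


print(mark_common_dimensions(["客户", "渠道"], descriptions))
```

Here "客户" occurs in all three tables and is marked, while "渠道" occurs only in the payment table and is not.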

In this embodiment, word frequency statistics and high-frequency word screening are performed on the filtered parsing result, so that representative high-frequency words in the business scenarios can be selected quickly; their business meanings are then matched, and they are finally marked as common dimensions. This step raises the degree of automation of dimension identification, reduces human intervention, and makes data processing and analysis more accurate.

In one embodiment, in S50, processing the identified common dimensions in combination with business logic to generate the shared dimensions includes:

S501, classifying the identified common dimensions, analyzing how the common dimensions are used and defined in a plurality of business scenarios, and extracting representative scenarios from the business scenarios;

S502, in the representative scenarios, analyzing in detail the dimension name, data type, and processing method in each scenario, and comparing whether the usage of a common dimension differs between scenarios;

S503, comparing and analyzing the differences of the common dimensions, and adjusting the differing parts by formulating a unified standard, so as to generate the shared dimensions.

In this embodiment, the preceding word frequency statistics and business meaning matching have already identified a number of dimensions with common characteristics. These dimensions are further classified by their business nature and usage scenarios, for example into customer dimensions, product dimensions, and time dimensions. The purpose of classification is to better understand how the dimensions are used in different business scenarios. The dimensions are classified according to business logic and the attributes of the dimensions. They may be divided into primary dimensions (e.g., customer, product) and secondary dimensions (e.g., time, region) based on business or system requirements. The classification process may be rule-driven, or a classification model may be used to classify automatically.

The way the common dimensions are used in different business scenarios is analyzed, including the usage frequency of each dimension, the data processing logic, and the definition in actual applications. The purpose is to understand in depth how a dimension is applied in multiple scenarios, so as to ensure the generality and accuracy of the shared dimensions. The specific usage details of the common dimensions are analyzed across the business scenarios: data such as system logs and query records reveal how each dimension is used, from which its cross-scenario applicability is assessed.

The specific scenarios that best represent the different business scenarios are extracted for further dimension analysis. For example, in a banking system, the credit card business, loan business, and payment business may be selected as representative scenarios in which the usage of dimensions is analyzed. Representative scenarios are determined by analyzing business flows and data flows. Business-critical scenarios may be selected with data analysis tools or manually.

In the representative scenarios, the name, data type, and processing method of each common dimension are analyzed in detail. This analysis shows whether the definition of the dimension is consistent across business scenarios and whether it is processed in the same way. The name, data type, and actual usage of the dimension in each scenario are checked by means of data queries, system logs, and the like. Inconsistent parts are found by comparing the dimension definitions under different scenarios.

The points of difference are found through comparative analysis of the dimension definitions in the business scenarios. These differences may appear in the naming of a dimension, its data type, or its processing logic. An analysis tool or script compares the differences of each dimension between scenarios and generates a difference analysis report that identifies the parts needing adjustment and unification.
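The comparison of dimension definitions across representative scenarios can be sketched with a small script. The scenario names, attribute names, and the function difference_report below are hypothetical and only illustrate the kind of difference analysis report described above.

```python
# Hypothetical definitions of one common dimension in three representative scenarios.
definitions = {
    "credit_card": {"name": "cust_id", "data_type": "VARCHAR(32)", "processing": "trim"},
    "loan": {"name": "customer_id", "data_type": "VARCHAR(32)", "processing": "trim"},
    "payment": {"name": "cust_id", "data_type": "BIGINT", "processing": "trim"},
}


def difference_report(scenario_definitions):
    """Return, per attribute, the per-scenario values for attributes that are not uniform."""
    report = {}
    attributes = {attr for d in scenario_definitions.values() for attr in d}
    for attr in sorted(attributes):
        values = {scenario: d.get(attr) for scenario, d in scenario_definitions.items()}
        if len(set(values.values())) > 1:  # more than one distinct value: a difference
            report[attr] = values
    return report


report = difference_report(definitions)
for attribute, values in report.items():
    print(attribute, values)
```

In this example the name and data type differ between scenarios and would need a unified standard, while the processing method is already consistent and is left out of the report.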

Based on the result of the difference analysis, the dimensions are adjusted by formulating a unified business standard, and shared dimensions that can be used across scenarios are finally generated. The unified standard covers the specifications for dimension names, data types, and business processing logic. It is formulated by combining the business requirements, the technical specifications, and the difference analysis report. Adjusting the definition and usage of the dimensions in the different scenarios ensures that the shared dimensions are applicable to every business scenario.

In this embodiment, the common dimensions are classified, analyzed, and adjusted for differences, which effectively improves the generality of the dimensions so that they can be shared across business scenarios, reduces duplicate definitions and inconsistencies in data analysis, and improves the efficiency and accuracy of data analysis.

In one embodiment, the step S20 includes:

S201, performing word segmentation on the Chinese descriptions, splitting continuous text into independent words;

S202, screening the segmented words, removing words whose length is smaller than a set threshold, and retaining the other words as candidate words;

S203, performing part-of-speech tagging on the retained candidate words, tagging each candidate word with its grammatical attribute so as to distinguish different types of words;

S204, performing word frequency statistics on the candidate words, and recording the occurrence frequency of each candidate word in the Chinese descriptions;

S205, generating the parsing result, wherein the parsing result comprises the candidate words and their corresponding word frequencies and part-of-speech tags.

In this embodiment, word segmentation is one of the basic steps of natural language processing; its main purpose is to split continuous Chinese text into separate words. Unlike English, Chinese text has no natural spaces, so word boundaries must be identified with a word segmentation tool. The Chinese description text may be segmented with common Chinese word segmentation tools (e.g., jieba, NLPIR). Such a tool recognizes individual words and splits the text into them on the basis of a dictionary and a statistical model.
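To make the idea of dictionary-based segmentation concrete, the sketch below implements forward maximum matching, a simple classical technique. It is not the algorithm used by jieba or NLPIR, which also rely on statistical models; the toy dictionary and the function name forward_max_match are assumptions for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` greedily, preferring the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a dictionary word is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


toy_dictionary = {"客户", "编号", "保单", "生效", "日期"}
print(forward_max_match("客户编号", toy_dictionary))
print(forward_max_match("保单生效日期", toy_dictionary))
```

Characters that are not covered by the dictionary come out as single-character tokens, which is one reason a later step filters out very short words.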

The segmented words usually include some meaningless short tokens, such as punctuation marks or single letters. These tokens contribute nothing to business analysis, so a word length threshold is set and words shorter than the threshold are excluded. For example, with a threshold of 2 characters, a program selects the words that meet the length requirement and retains them as candidate words, while shorter words are filtered out to reduce interference from noisy data.

Part-of-speech tagging attaches a grammatical attribute, such as noun, verb, or adjective, to each word. It clarifies the role each word plays in a sentence and thus facilitates subsequent semantic analysis and the judgment of business relevance. Common part-of-speech tagging tools (e.g., Stanford NLP, HanLP) can assign a part-of-speech tag to each word according to its context. Based on a pre-trained model, the tool recognizes the grammatical attribute of a word and outputs its part-of-speech information.

Word frequency statistics record how often each candidate word appears in the text, which is used to judge the importance of the word. Words that occur more frequently are generally more relevant to the business scenario and provide a basis for the subsequent identification of common dimensions. The occurrence frequency of each word is counted programmatically. collections.Counter in Python or a similar tool can count word frequencies quickly and supports the subsequent screening of high-frequency words.

The parsing result is the final output of word segmentation and part-of-speech tagging. It comprises the list of candidate words together with the word frequency and part-of-speech tag of each word, and it is the input to subsequent steps such as common dimension identification. The segmented words and their corresponding word frequencies and part-of-speech information are integrated into a structured result, such as a vocabulary table, and exported in a format usable for subsequent analysis (e.g., JSON or CSV).
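Steps S202 to S205 can be combined into a short sketch that filters short tokens, counts the rest, attaches a tag, and exports JSON. The part-of-speech lookup table is a hypothetical stand-in for a real tagger such as HanLP, and the tag names, field names, and function build_parsing_result are likewise assumptions.

```python
import json
from collections import Counter

# Hypothetical part-of-speech lexicon; a real tagger would infer tags from context.
POS_LEXICON = {"客户": "noun", "编号": "noun", "年度": "time", "金额": "measure"}


def build_parsing_result(tokens, min_length=2):
    """Filter short tokens, count the candidates, and attach a part-of-speech tag to each."""
    candidates = [t for t in tokens if len(t) >= min_length]  # S202: length filter
    counts = Counter(candidates)                              # S204: word frequencies
    return [
        {"word": word, "frequency": freq, "pos": POS_LEXICON.get(word, "unknown")}  # S203
        for word, freq in counts.most_common()
    ]


tokens = ["客户", "编号", "的", "客户", "年度", "金额"]
result = build_parsing_result(tokens)  # S205: structured parsing result
print(json.dumps(result, ensure_ascii=False, indent=2))
```

Keeping the tag next to each word makes the later filtering of time-period and measure words a simple selection on the pos field.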

In this embodiment, through automated word segmentation, word screening, and part-of-speech tagging, the generated parsing result accurately reflects the word structure and usage frequency in the business data, provides a high-quality basis for subsequent dimension identification and data analysis, reduces the errors of manual processing, and improves efficiency.

In one embodiment, the step S30 includes:

S301, classifying the time words in the parsing result, dividing them into long-term, medium-term, and short-term time dimensions according to the length of the time period;

S302, evaluating how the time dimensions are used in the business scenarios, and selecting a first target time dimension for business analysis based on the occurrence frequency of the time dimensions in different business scenarios;

S303, evaluating the cross-scenario applicability of the time dimensions, judging whether each time dimension is used in a plurality of business scenarios, and retaining a second target time dimension that is used in a plurality of business scenarios;

S304, evaluating the historical data accumulation capability of the time dimensions, determining whether each time dimension supports data accumulation across time periods, and retaining a third target time dimension that supports cross-period accumulation;

S305, evaluating the trend analysis capability of the time dimensions, and selecting a fourth target time dimension that supports long-term trend analysis according to the performance of the time dimensions in trend analysis;

S306, retaining the time-period words that conform to the target time dimensions, and filtering out the other time-period words that are unrelated to the shared dimensions.

In this embodiment, all time-related words (e.g., "year", "month", "day") are extracted from the parsing result and classified into three categories, long-term, medium-term, and short-term, according to the length of the time period. The long-term dimension typically covers year and quarter, the medium-term dimension covers month and week, and the short-term dimension covers day and hour. The time words may be classified by a rule base or by an automated algorithm, which assigns them to the time dimension categories according to time span or the requirements of the business scenario.
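A rule-base classification of time words as described above might look like the following sketch. The rule table TIME_RULES and the function classify_time_words are hypothetical; the assignment of year and quarter to long-term, month and week to medium-term, and day and hour to short-term follows the text.

```python
# Hypothetical rule base mapping time words to a period class.
TIME_RULES = {
    "long": {"年", "年度", "季度"},
    "medium": {"月", "月份", "周"},
    "short": {"日", "天", "小时"},
}


def classify_time_words(words):
    """Group time words into long-, medium- and short-term classes; ignore other words."""
    classes = {"long": [], "medium": [], "short": []}
    for word in words:
        for period, vocabulary in TIME_RULES.items():
            if word in vocabulary:
                classes[period].append(word)
                break
    return classes


print(classify_time_words(["年度", "月份", "小时", "客户"]))
```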

A usage frequency analysis is performed on each time dimension to evaluate its importance in each business scenario. The key point is to identify which time dimensions are used most often in business analysis, so as to select the time dimensions that have business value. The usage frequency of each time dimension in each business scenario is counted using data such as system logs and query records.

A first target time dimension for business analysis is selected according to the occurrence frequency of the time dimensions in the different business scenarios. The first target time dimension is the time dimension that is used most frequently in a specific business scenario and is most relevant to business analysis. Based on usage frequency, it is selected with a preset screening threshold, which ensures that business analysis focuses on the high-frequency time dimensions.

The cross-scenario applicability of each time dimension is analyzed, that is, whether a given time dimension can be used in a plurality of business scenarios. A time dimension that can be applied in multiple scenarios has a higher sharing value. The usage data of the time dimension in each business scenario are analyzed to judge whether it appears in a plurality of business analysis scenarios, and a threshold is set to select the time dimensions with strong cross-scenario applicability.

The time dimension used across scenarios, namely the second target time dimension, is retained. This screening ensures that the dimension is useful not only in one scenario but in a plurality of business scenarios, and therefore has a higher sharing value. The time dimensions applicable to a plurality of business scenarios are selected through cross-scenario usage frequency analysis and retained as the second target time dimension.

Whether a time dimension supports data accumulation across time periods is analyzed, so as to ensure that data are comparable between time periods. For example, the year and quarter dimensions allow data to be accumulated across years, while the "day" dimension is better suited to short-term analysis. The accumulation period of the historical data is analyzed to evaluate whether the time dimension supports cross-period accumulation of historical data, and the time dimensions capable of long-term data accumulation are selected.

The time dimensions that support cross-period accumulation, namely the third target time dimension, are retained; they are suitable for long-period data analysis and historical comparison. They are selected by analyzing the validity of historical data accumulation.

The performance of the time dimensions in trend analysis is analyzed to determine which of them are suitable for long-term trend analysis; for example, the year and quarter dimensions are better suited to long-term trend analysis. Data trend analysis is used to judge whether a time dimension can support long-term trend analysis, ensuring that it can be used for the trend judgments behind business decisions.

The time dimension capable of supporting long-term trend analysis, namely the fourth target time dimension, is selected. By analyzing trend data, the time dimensions that can show long-term data trends are selected, ensuring that they can be used for long-term trend prediction.

The target time dimensions that pass the evaluations (the first through fourth target time dimensions) are ultimately retained, ensuring that these time dimensions are highly relevant to business analysis and can be used as cross-scenario shared dimensions. The time dimensions meeting the above screening criteria are marked and kept in the vocabulary as part of the shared dimensions.
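The four evaluations of S302 to S305 can be expressed as a single filter, sketched below. The text does not state whether the four criteria are combined by intersection or by union; this sketch assumes a time dimension is retained only if it passes all four. The TimeDimension record, its fields, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TimeDimension:
    name: str
    usage_count: int                # occurrences across business scenarios (S302)
    scenario_count: int             # number of distinct scenarios using it (S303)
    supports_accumulation: bool     # cross-period data accumulation (S304)
    supports_long_term_trend: bool  # long-term trend analysis (S305)


def retain_time_dimensions(dimensions, min_usage=10, min_scenarios=2):
    """Return the names of dimensions that pass all four evaluations (assumed: intersection)."""
    return [
        d.name
        for d in dimensions
        if d.usage_count >= min_usage
        and d.scenario_count >= min_scenarios
        and d.supports_accumulation
        and d.supports_long_term_trend
    ]


candidates = [
    TimeDimension("year", 40, 3, True, True),
    TimeDimension("quarter", 25, 2, True, True),
    TimeDimension("hour", 50, 1, False, False),
    TimeDimension("week", 5, 2, True, False),
]
print(retain_time_dimensions(candidates))
```

Under these assumed thresholds the year and quarter dimensions are retained, and the hour and week dimensions would be filtered out in S306.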

Time-period words that are unrelated to the shared dimensions are filtered out; these are words that are used infrequently in data analysis or that lack cross-scenario applicability or historical data accumulation capability. Based on the screening result, the irrelevant time-period words are filtered out automatically or through a rule base, which improves analysis efficiency and the quality of the shared dimensions.

In this embodiment, the parsing result is optimized and the time dimensions are finely screened and analyzed, so that the time dimensions relevant to the business are retained, irrelevant time-period words are filtered out, and the shareability of the dimensions and the efficiency of data analysis are improved. Classification, screening, and evaluation of the time dimensions are processed automatically, which reduces manual intervention and helps ensure the accuracy of business analysis.

In one embodiment, in S50, adjusting the structure of the data warehouse model based on the shared dimensions includes:

S501, analyzing the relationship between the shared dimensions and the data warehouse model, determining the application scenarios of the shared dimensions in the data warehouse model, and identifying the existing dimensions that need to be replaced, merged, or extended;

S502, adjusting the dimension table structure of the data warehouse model, adding fields according to the definitions of the shared dimensions, and deleting duplicate dimensions;

S503, reconstructing the fact table structure of the data warehouse model, and adjusting the fields and join methods in the fact tables so that the shared dimensions are associated with the measures in the fact tables;

S504, updating the association between the dimension tables and the fact tables, and optimizing the way the shared dimensions are joined to the fact tables.

In this embodiment, the compatibility of the shared dimensions with the existing data warehouse model and the application of the shared dimensions in different business scenarios are analyzed. It is necessary to identify whether dimensions in the existing dimension tables overlap with the shared dimensions or can be optimized by them. The shared dimensions are compared with the existing dimensions of the data warehouse model by automated script or manual analysis, and the specific application scenarios of the shared dimensions in the data warehouse model are analyzed. For example, a customer dimension or product dimension may already exist in the existing model but need to be merged or optimized.

After the relationship between the shared dimensions and the data warehouse model has been analyzed, it is determined which existing dimensions need to be replaced, merged, or extended to better support the application of the shared dimensions. The aim is to optimize the structure of the existing dimensions through the shared dimensions, reduce redundancy, and improve the efficiency of the data warehouse model. A structural analysis is performed on the dimension tables in the data warehouse model and, in combination with business requirements, it is confirmed which existing dimensions duplicate the shared dimensions or need further optimization, and a specific merging or extension strategy is formulated.

The dimension tables in the data warehouse model are adjusted according to the definitions of the shared dimensions. This may require adding shared dimension fields and deleting existing duplicate dimensions. Simplifying the dimension table structure reduces data redundancy and improves the query efficiency and consistency of the model. The metadata definitions of the data warehouse model are modified, and SQL or ETL tools are used to ensure that the newly added shared dimension fields integrate seamlessly into the existing model and that redundant dimension fields are removed.
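Adding shared dimension fields to an existing dimension table with SQL can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for a warehouse engine. The table dim_customer, the field names, and the helper add_missing_columns are hypothetical; the sketch also assumes table and column names come from trusted metadata, since they are interpolated into the SQL text. Removing duplicate fields is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, cust_name TEXT)")


def add_missing_columns(connection, table, shared_fields):
    """Add each shared-dimension field that the dimension table does not have yet."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    existing = {row[1] for row in connection.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in shared_fields.items():
        if name not in existing:
            connection.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type}")
            added.append(name)
    return added


added = add_missing_columns(conn, "dim_customer", {"cust_name": "TEXT", "customer_type": "TEXT"})
print(added)
```

Checking the existing columns first makes the adjustment safe to run repeatedly, which is convenient when the same script is applied to several dimension tables.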

The fact table is the core data table in the data warehouse model and stores the measures of the business data. The fact table structure is adjusted to ensure that the shared dimensions are correctly associated with the measures in the fact table. This includes adjusting the field definitions and join methods in the fact table to support efficient application of the shared dimensions. Modifying the field definitions of the fact table ensures that the measures can be joined correctly to the shared dimensions, and the join method is optimized to improve query performance. This operation usually needs to be performed together with the adjustment of the dimension tables to ensure the consistency and integrity of the data.

After the dimension tables and fact tables have been adjusted, the association between them needs to be updated. The aim is to optimize the way dimension tables and fact tables are joined and to improve the performance of data query and analysis. Adjusting the join conditions in the SQL queries and the data model ensures that the dimension tables and fact tables are associated correctly and improves the performance of cross-table queries. Database optimization tools or SQL tuning may be used to improve join efficiency.
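The association between a fact table and a shared dimension table can be illustrated with a minimal star-schema sketch in sqlite3. The table and column names are hypothetical, and the index on the join column is only one example of the SQL tuning mentioned above; whether it helps in practice depends on the engine and the data volume.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE fact_sales (
        sale_id INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        amount REAL
    );
    -- Index the foreign-key column that is used in the join condition.
    CREATE INDEX idx_fact_sales_customer ON fact_sales(customer_key);
    INSERT INTO dim_customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 1, 50.0), (3, 2, 70.0);
""")

# Join the measure in the fact table to the shared dimension and aggregate per customer.
rows = conn.execute("""
    SELECT d.customer_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS d ON d.customer_key = f.customer_key
    GROUP BY d.customer_name
    ORDER BY d.customer_name
""").fetchall()
print(rows)
```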

In this embodiment, adjusting the data warehouse model based on the shared dimensions optimizes the structure of its dimension tables and fact tables, reduces duplicate dimensions, and improves data consistency. Introducing the shared dimensions makes the data warehouse model more uniform across business scenarios and improves data query performance and analysis efficiency.

In one embodiment, in S50, after adjusting the structure of the data warehouse model based on the shared dimensions, the method further includes:

S601, importing business data into the adjusted data warehouse model, and verifying the integrity of the data and the accuracy of loading;

S602, testing whether the adjusted data warehouse model structure meets the query requirements of the various business scenarios, and evaluating the performance of the shared dimensions in actual applications;

S603, verifying the association between the dimension tables and the fact tables in the data warehouse model, and evaluating data consistency in the adjusted data warehouse model;

S604, evaluating the response efficiency of the adjusted data warehouse model for data queries in different scenarios;

S605, generating a data verification report, recording the verification results, and evaluating the usability and performance of the adjusted data warehouse model.

In this embodiment, after the structure of the data warehouse model has been adjusted, business data (such as transaction records and customer information) are imported into the new model to ensure that all data can be loaded completely and accurately. This step verifies that all shared dimensions and measure data are mapped correctly into the new data warehouse model without loss or error. The data are imported through an ETL tool, and the system automatically checks for anomalies in the loading process, such as data loss or format inconsistencies, and generates an integrity report.

The adjusted data warehouse model needs to undergo query tests in actual business scenarios to evaluate its applicability in different business scenarios (such as customer analysis and sales data analysis). The focus of the test is whether the shared dimensions can effectively support the query requirements of multiple scenarios. The model is tested with SQL query statements commonly used in the business scenarios to verify that it returns results efficiently and accurately in actual queries.

After the model has been adjusted, it must be ensured that the fact tables and dimension tables are associated correctly and that data consistency is not affected. Data consistency means that the information in a dimension table (e.g., customer information) corresponds correctly and without conflict to the measure data in a fact table (e.g., sales). A data consistency checking tool checks whether the association between the dimension tables and the fact tables is complete, and consistency is confirmed by comparing the data in the data source with the data in the data warehouse.
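One common form of the consistency check described above is to look for fact rows whose dimension key has no match in the dimension table. The sketch below does this with a LEFT JOIN in sqlite3; the tables, sample rows, and the function find_orphan_facts are hypothetical.

```python
import sqlite3


def find_orphan_facts(connection):
    """Return fact rows whose customer_key has no matching row in the dimension table."""
    return connection.execute("""
        SELECT f.sale_id, f.customer_key
        FROM fact_sales AS f
        LEFT JOIN dim_customer AS d ON d.customer_key = f.customer_key
        WHERE d.customer_key IS NULL
        ORDER BY f.sale_id
    """).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, customer_key INTEGER, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'Alice');
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (2, 9, 30.0);
""")
print(find_orphan_facts(conn))
```

An empty result indicates that every fact row can be joined to the shared dimension; any returned row points to a broken association that should be fixed before the model is used.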

The adjusted data warehouse model needs to undergo query performance tests to ensure that the response time of data queries meets the business requirements of the different business scenarios. The impact of the shared dimensions on query efficiency and the overall performance of the data warehouse model are evaluated. Complex queries from the different business scenarios are executed and their response times are measured. Evaluation and optimization are carried out with performance optimization techniques (e.g., index optimization, query tuning).

After data loading, query requirement testing, data consistency verification, and response efficiency evaluation have been completed, a data verification report is generated that records the results of all verification steps. The report describes in detail the usability and performance of the adjusted data warehouse model and points out potential optimization points or improvement suggestions. The verification report is generated automatically and contains the results of data loading, query testing, data consistency checking, and performance evaluation, which serve as the basis for subsequent optimization.
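Measuring query response time and collecting the results into a verification report can be sketched as follows. The report fields, the one-second limit, and the functions timed_query and build_report are assumptions for illustration; a real report would also include the loading and consistency results.

```python
import json
import sqlite3
import time


def timed_query(connection, sql):
    """Run a query and return (rows, elapsed_seconds)."""
    start = time.perf_counter()
    rows = connection.execute(sql).fetchall()
    return rows, time.perf_counter() - start


def build_report(connection, queries, max_seconds=1.0):
    """Run each named query and record its row count, elapsed time, and pass/fail status."""
    report = []
    for name, sql in queries.items():
        rows, elapsed = timed_query(connection, sql)
        report.append({
            "check": name,
            "rows": len(rows),
            "seconds": round(elapsed, 6),
            "passed": elapsed <= max_seconds,  # hypothetical response-time requirement
        })
    return report


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", [(i, float(i)) for i in range(1000)])
report = build_report(conn, {"total_sales": "SELECT SUM(amount) FROM fact_sales"})
print(json.dumps(report, indent=2))
```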

In this embodiment, the adjusted data warehouse model undergoes complete data import, query requirement testing, data consistency verification, and performance evaluation, which helps ensure that the model runs efficiently and accurately in different business scenarios. The generated data verification report provides a reliable basis for subsequent model optimization and business analysis, and reduces the business risk caused by inconsistent data or insufficient query performance.

The invention further provides a shared dimension optimization device for a data warehouse model. Referring to fig. 2, fig. 2 is a schematic functional block diagram of a preferred embodiment of the shared dimension optimization device for a data warehouse model. The shared dimension optimization device for a data warehouse model comprises:

a data acquisition module, used for collecting business data tables, wherein the business data tables comprise data list reports and data summary tables, and for generating Chinese descriptions of the fields in the business data tables;

a text parsing module, used for segmenting and counting the Chinese descriptions by a text parsing method, tagging parts of speech, and generating a parsing result;

a data filtering module, used for optimizing the parsing result and filtering out parts of speech unrelated to the shared dimensions, wherein the unrelated parts of speech comprise time-period parts of speech and measure parts of speech;

a dimension identification module, used for performing statistical analysis on the optimized parsing result and identifying common dimensions;

a shared dimension generation and adjustment module, used for processing each identified common dimension in combination with business logic to generate shared dimensions, and for adjusting the structure of the data warehouse model based on the shared dimensions.

The specific implementation of the shared dimension optimization device for a data warehouse model is basically the same as the embodiments of the shared dimension optimization method for a data warehouse model described above, and is not repeated here.

The invention also provides a shared dimension optimization device of the several-bin model, which can comprise a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005, as shown in FIG. 3. Wherein the communication bus 1002 is used to enable connected communication between these components. The user interface 1003 may include a Display, an input unit such as a Keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface, a wireless interface. The network interface 1004 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface). The memory 1005 may be a high-speed RAM memory or a stable memory (non-volatile memory), such as a disk memory. The memory 1005 may also optionally be a storage device separate from the processor 1001 described above.

Those skilled in the art will appreciate that the hardware architecture of the shared dimension optimization device of the several-bin model shown in fig. 3 does not constitute a definition of the shared dimension optimization device of the several-bin model, and may include more or fewer components than shown, or may combine certain components, or may be a different arrangement of components.

As shown in fig. 3, a memory 1005, which is a storage medium, may include an operating system, a network communication module, a user interface module, and a shared dimension optimization program of a several-bin model. The operating system is a program for managing and controlling shared dimension optimization equipment and software resources of the several-bin model, and supports the operation of a network communication module, a user interface module, a shared dimension optimization program of the several-bin model and other programs or software, wherein the network communication module is used for managing and controlling a network interface 1004, and the user interface module is used for managing and controlling a user interface 1003.

In the hardware structure of the shared dimension optimization device of the several-bin model shown in fig. 3, the network interface 1004 is mainly used for connecting a background server and performing data communication with the background server, the user interface 1003 is mainly used for connecting a client and performing data communication with the client, and the processor 1001 may call the shared dimension optimization program of the several-bin model stored in the memory 1005 and perform the same operation as the shared dimension optimization method of the several-bin model.

The specific implementation manner of the shared dimension optimizing device of the several-bin model is basically the same as that of each embodiment of the shared dimension optimizing method of the several-bin model, and is not repeated here.

In addition, an embodiment of the invention also provides a computer storage medium on which a shared dimension optimization program for the data warehouse model is stored; when executed by a processor, the program implements the steps of the shared dimension optimization method for the data warehouse model.

The specific implementation of the computer storage medium is basically the same as that of the embodiments of the shared dimension optimization method for the data warehouse model, and is not repeated here.

It should be noted that, if a software tool or component from outside the company appears in the embodiments of the present application, it is presented merely by way of example and does not represent actual use.

Claims

1. A shared dimension optimization method for a data warehouse model, characterized by comprising the following steps:

2. The shared dimension optimization method for a data warehouse model according to claim 1, wherein performing statistical analysis on the filtered parsing result to identify commonality dimensions comprises:

3. The shared dimension optimization method for a data warehouse model according to claim 1, wherein processing each identified commonality dimension in combination with business logic to generate a shared dimension comprises:

4. The shared dimension optimization method for a data warehouse model according to claim 1, wherein disassembling and counting the Chinese descriptions by a text parsing method and labeling parts of speech to generate a parsing result comprises:

5. The shared dimension optimization method for a data warehouse model according to claim 1, wherein optimizing the parsing result and filtering out parts of speech irrelevant to the shared dimension, the irrelevant parts of speech including time-period parts of speech and metric parts of speech, comprises:

6. The shared dimension optimization method for a data warehouse model according to claim 1, wherein adjusting the structure of the data warehouse model based on the shared dimension comprises:

7. The shared dimension optimization method for a data warehouse model according to claim 1, further comprising, after adjusting the structure of the data warehouse model based on the shared dimension:

8. A shared dimension optimization device for a data warehouse model, characterized in that the shared dimension optimization device for the data warehouse model comprises:

and a shared dimension generating and adjusting module, configured to process each identified commonality dimension in combination with business logic to generate a shared dimension, and to adjust the structure of the data warehouse model based on the shared dimension.

9. A shared dimension optimization device for a data warehouse model, characterized in that the device comprises a memory, a processor, and a shared dimension optimization program for the data warehouse model that is stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the shared dimension optimization method for a data warehouse model according to any one of claims 1-7.

10. A computer storage medium, characterized in that a shared dimension optimization program for a data warehouse model is stored on the storage medium, wherein the program, when executed by a processor, implements the steps of the shared dimension optimization method for a data warehouse model according to any one of claims 1-7.