Disclosure of Invention
The application provides a big data intelligent warehouse management system based on data coding, which can solve the problems of high computational complexity and low algorithm efficiency of the existing data coding when carrying out cluster analysis on various information in the intelligent warehouse management system.
In order to solve the technical problems, the first technical scheme adopted by the application is as follows: provided is a big data intelligent warehouse management system based on data coding, comprising:
the data acquisition module is used for acquiring data to be processed, wherein the data to be processed comprises a plurality of data information sequences, and each data information sequence corresponds to a management parameter of each commodity;
the coding module is used for coding each data information sequence in the data to be processed, so as to obtain an actual coding result corresponding to each data information sequence;
the representative character determining module is used for determining representative characters of each data information sequence based on the difference between the actual coding result and the ideal coding result;
and the clustering module is used for clustering the data to be processed based on the representative characters of each data information sequence, so as to obtain a clustering result.
In an alternative embodiment, the encoding module is configured to:
coding the data information sequences by utilizing the character arrangement mode of each data information sequence in the data to be processed, so as to obtain an actual coding result corresponding to each data information sequence;
and coding the data information sequences by utilizing the character dictionary sequence of each data information sequence in the data to be processed, so as to obtain an ideal coding result corresponding to each data information sequence.
In an alternative embodiment, the encoding module includes:
the first coding module is used for determining the arrangement mode combination of all characters in each data information sequence by utilizing a full arrangement algorithm, the arrangement mode combination comprises a plurality of character sequences, each character sequence represents a character arrangement mode, the data information sequence is coded by utilizing a BWT coding mode based on the plurality of character sequences in the arrangement mode combination so as to obtain a plurality of first coding results, and the plurality of first coding results form the actual coding result;
the second coding module is used for determining the character dictionary sequence of each data information sequence in the data to be processed, and coding the data information sequences by utilizing a BWT coding mode based on the dictionary character sequences so as to obtain a plurality of second coding results, wherein the ideal coding results are formed by the plurality of second coding results; the character dictionary sequence comprises a plurality of dictionary sequences, and the first coding result corresponds to the second coding result one by one.
In an alternative embodiment, the representative character determination module includes:
the difference calculation module is used for determining the difference of each character in each data information sequence based on the difference between the actual coding result and the ideal coding result of each data information sequence;
and the character determining module is used for determining the representative characters of the data information sequence based on the difference of each character.
In an alternative embodiment, the difference calculating module is configured to:
determining the comprehensive difference of each character in each data information sequence based on the difference between the actual encoding result and the ideal encoding result of each data information sequence;
calculating the difference of each character in the data information sequence based on the comprehensive difference of each character and the frequency with which the character appears in the data information sequence.
In an alternative embodiment, the difference calculating module is configured to:
calculating the difference between the coding distance of each character of each first coding result in the actual coding result and the dictionary distance of each character of a second coding result corresponding to the first coding result in the ideal coding result, taking the ratio of the absolute value of the calculated difference to a larger value as the distance difference, averaging all the calculated distance differences, and taking the calculated average as the comprehensive difference of each character in each data information sequence; the larger value is the larger value in the coding distance of each character of each first coding result in the actual coding result and the dictionary distance of each character of a second coding result corresponding to the first coding result in the ideal coding result.
In an alternative embodiment, the difference calculating module is configured to:
calculating the dictionary distance of each character of the second coding result based on the dictionary character distance sequence corresponding to the second coding result; wherein each element in the dictionary character distance sequence is a dictionary distance between two characters.
In an alternative embodiment, the difference calculating module is configured to:
calculating the sum of average distances between the current character and all reference characters in the first coding result, and calculating the distance between the current character and the reference characters based on the calculated sum and the occurrence times of the current character in the first coding result, so as to obtain a coding character distance sequence of the first coding result, wherein each element in the coding character distance sequence is the coding distance between two characters;
and calculating the coding distance of each character of the first coding result based on the coding character distance sequence corresponding to the first coding result.
In an alternative embodiment, the character determining module is configured to:
normalizing the difference of each character;
taking the characters with the difference smaller than a preset value after normalization processing as candidate characters;
and determining the representative character of the data information sequence based on the frequency of the candidate character and the difference after normalization processing.
In an alternative embodiment, the character determining module is further configured to:
calculating the ratio of the frequency of each candidate character to the difference after normalization processing;
and taking the candidate character with the ratio larger than 1 as a representative character of the data information sequence.
The application has the beneficial effects that the big data intelligent warehouse management system based on the data coding, which is different from the prior art, comprises: the data acquisition module is used for acquiring data to be processed, wherein the data to be processed comprises a plurality of data information sequences, and each data information sequence corresponds to a management parameter of each commodity; the coding module is used for coding each data information sequence in the data to be processed, so as to obtain an actual coding result corresponding to each data information sequence; the representative character determining module is used for determining representative characters of each data information sequence based on the difference between the actual coding result and the ideal coding result; and the clustering module is used for clustering the data to be processed based on the representative characters of each data information sequence, so as to obtain a clustering result. According to the method, different data information sequences are clustered through the representative characters representing the data information sequences, so that the calculated amount can be reduced, the algorithm efficiency is improved, and the stability and the accuracy of the clustering result are higher.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and fully below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of protection of the application.
The present application will be described in detail with reference to the accompanying drawings and examples.
Referring to fig. 1, fig. 1 is a schematic structural diagram of an embodiment of a big data intelligent warehouse management system based on data encoding according to the present application.
Specifically, the big data intelligent warehouse management system 100 based on data coding provided by the application comprises a data acquisition module 10, a coding module 20, a representative character determining module 30 and a clustering module 40.
The data acquisition module 10 is configured to acquire data to be processed, where the data to be processed includes a plurality of data information sequences, and each data information sequence corresponds to a management parameter of each commodity. The management parameters include high-dimensional information such as the source, destination, transportation mode, transportation time, vehicle scheduling information and the like of the commodity.
The encoding module 20 is configured to encode each data information sequence in the data to be processed, so as to obtain an actual encoding result corresponding to each data information sequence. Specifically, the encoding module 20 is configured to encode the data information sequences by utilizing the character arrangement mode of each data information sequence in the data to be processed, so as to obtain the actual encoding result corresponding to each data information sequence, and to encode the data information sequences by utilizing the character dictionary sequence of each data information sequence in the data to be processed, so as to obtain an ideal encoding result corresponding to each data information sequence.
In one embodiment, referring to fig. 2, the encoding module 20 includes: a first encoding module 21 and a second encoding module 22. The first encoding module 21 is configured to determine an arrangement combination of all characters in each data information sequence by using a full permutation algorithm, where the arrangement combination includes a plurality of character sequences, each character sequence represents a character arrangement, and encode the data information sequence by using a BWT encoding method based on the plurality of character sequences in the arrangement combination to obtain a plurality of first encoding results, where the plurality of first encoding results form the actual encoding result. The second encoding module 22 is configured to determine a character dictionary sequence of each data information sequence in the data to be processed, encode the data information sequence by using a BWT encoding manner based on the dictionary character sequence, so as to obtain a plurality of second encoding results, where the plurality of second encoding results form the ideal encoding result; the character dictionary sequence comprises a plurality of dictionary sequences, and the first coding result corresponds to the second coding result one by one.
BWT coding rearranges the order of the characters; the number of characters is the same before and after coding. In a data information sequence after BWT coding, occurrences of the same character are not necessarily adjacent, but characters that are close to each other in the dictionary character sequence can still end up close together in the coded data. In other words, only the positions of the characters change, while the relative distance between characters that are close in dictionary order changes little. The smaller the change in relative distance, the closer the actual coding result is to the ideal coding result, where the ideal coding result is one in which characters that are close in dictionary order remain close after the data information sequence is coded. BWT coding is not fully ideal, because characters adjacent in the dictionary are not necessarily placed together. Therefore, the smaller the difference between the actual and ideal BWT coding results for a character, the more closely that character stays grouped when the order is rearranged, and the better it represents the nature of the data itself.
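The coding step described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes that "BWT coding based on a character sequence" means sorting the rotations of the data information sequence under the character order given by that sequence, and that no end-of-string sentinel is appended (so the character count is unchanged, as stated above, but the output alone is not invertible). The function names `bwt` and `encode_under_all_orderings` are hypothetical.

```python
from itertools import permutations


def bwt(text, order):
    """Return the last column of the rotations of `text`, sorted under `order`.

    `order` lists the distinct characters of `text` from smallest to largest.
    Assumption: no sentinel is added, so len(output) == len(text).
    """
    rank = {ch: i for i, ch in enumerate(order)}
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort(key=lambda rot: [rank[ch] for ch in rot])
    return "".join(rot[-1] for rot in rotations)


def encode_under_all_orderings(text):
    """Map every ordering of the distinct characters of `text` to its BWT output."""
    alphabet = sorted(set(text))
    return {"".join(p): bwt(text, p) for p in permutations(alphabet)}


results = encode_under_all_orderings("abbaca")
print(len(results))    # 6 orderings of the characters a, b, c
print(results["abc"])  # cabbaa
```

The full permutation grows factorially with the number of distinct characters, so a sketch like this is only practical for small alphabets.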
The present application thus provides a representative character determination module 30, the representative character determination module 30 being adapted to determine a representative character for each data information sequence based on the difference between the actual encoding result and the ideal encoding result. The clustering module 40 is configured to cluster the data to be processed based on the representative character of each data information sequence, so as to obtain a clustering result.
In one embodiment, referring to fig. 3, the representative character determining module 30 includes a difference calculating module 31 and a character determining module 32. The difference calculating module 31 is used for determining the difference of each character in each data information sequence based on the difference between the actual coding result and the ideal coding result of each data information sequence. In a specific embodiment, the difference calculating module 31 is configured to determine the comprehensive difference of each character in each data information sequence based on the difference between the actual coding result and the ideal coding result of each data information sequence, and to calculate the difference of each character in the data information sequence based on the comprehensive difference of each character and the frequency with which the character appears in the data information sequence.
Specifically, the difference calculating module 31 is configured to: calculating the difference between the coding distance of each character of each first coding result in the actual coding result and the dictionary distance of each character of a second coding result corresponding to the first coding result in the ideal coding result, taking the ratio of the absolute value of the calculated difference to a larger value as the distance difference, averaging all the calculated distance differences, and taking the calculated average as the comprehensive difference of each character in each data information sequence. The larger value is the larger value of the coding distance of each character of each first coding result in the actual coding result and the dictionary distance of each character of the second coding result corresponding to the first coding result in the ideal coding result.
First, the difference calculating module 31 is configured to calculate a dictionary distance of each character of the second encoding result based on a dictionary character distance sequence corresponding to the second encoding result; wherein each element in the dictionary character distance sequence is a dictionary distance between two characters.
In a specific embodiment, assume three characters a, b and c. The dictionary character distance sequence corresponding to the second coding result of these three characters is [d(a, b), d(a, c), d(b, c)], where d(x, y) denotes the distance between the two characters in the parentheses. With d(a, b) = 1, d(a, c) = 2 and d(b, c) = 1, the dictionary character distance sequence is [1, 2, 1]. The dictionary distance of each character of the second coding result can be calculated from the dictionary character distance sequence by the following formula (1):
$d(a) = \frac{d(a,b) + d(a,c)}{2} = 1.5$;
$d(b) = \frac{d(a,b) + d(b,c)}{2} = 1$;
$d(c) = \frac{d(a,c) + d(b,c)}{2} = 1.5$. (1)
The calculated values are then normalized to obtain the distances of the characters a, b and c, namely the dictionary distance of each character of the second coding result.
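The calculation of formula (1) can be sketched as follows. The sketch assumes that the dictionary distance between two characters is the difference of their positions in the dictionary order (which reproduces d(a, b) = 1, d(a, c) = 2, d(b, c) = 1) and that the per-character distance is the mean of all pairwise distances involving that character. The normalization step is omitted because the text does not specify how it is performed. The function names are hypothetical.

```python
from itertools import combinations


def dictionary_pair_distances(order):
    """Pairwise distances between characters, taken as rank differences in `order`."""
    pos = {ch: i for i, ch in enumerate(order)}
    return {(x, y): abs(pos[x] - pos[y]) for x, y in combinations(order, 2)}


def per_character_distance(pair_dist):
    """Average, for each character, the pairwise distances it takes part in."""
    totals, counts = {}, {}
    for (x, y), d in pair_dist.items():
        for ch in (x, y):
            totals[ch] = totals.get(ch, 0.0) + d
            counts[ch] = counts.get(ch, 0) + 1
    return {ch: totals[ch] / counts[ch] for ch in totals}


pairs = dictionary_pair_distances("abc")
print(list(pairs.values()))           # [1, 2, 1]
print(per_character_distance(pairs))  # {'a': 1.5, 'b': 1.0, 'c': 1.5}
```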
Further, the difference calculating module 31 is configured to calculate a sum of average distances between a current character and all reference characters in the first encoding result, calculate a distance between the current character and the reference characters based on the calculated sum and the number of times the current character appears in the first encoding result, and further obtain an encoding character distance sequence of the first encoding result, where each element in the encoding character distance sequence is an encoding distance between two characters; and calculating the coding distance of each character of the first coding result based on the coding character distance sequence corresponding to the first coding result.
In one embodiment, for example, the first coding result is abbaca, and the coded character distance sequence is expressed as [d(a, b), d(a, c), d(b, c)], where $d(a,b) = \frac{1}{3}\left(\bar{d}_1 + \bar{d}_2 + \bar{d}_3\right)$. The 3 in $\frac{1}{3}$ represents the three occurrences of a in the first coding result, i.e. the number of times the character a appears in the first coding result. Of the three summed elements in the parentheses, the first element $\bar{d}_1$ represents the average distance between the first a (taken as the current character) and all characters b (taken as the reference characters) in the actual coding result, the second element $\bar{d}_2$ represents the average distance between the second a (taken as the current character) and all characters b (taken as the reference characters), and so on. The sum $\bar{d}_1 + \bar{d}_2 + \bar{d}_3$ represents the sum of the average distances between the current character a and all reference characters b. The distance between the current character and the reference character, i.e. d(a, b), is calculated from this sum and the number of times the current character appears in the first coding result. d(a, c) and d(b, c) can be calculated in the same way, and d(a, b), d(a, c) and d(b, c) form the coded character distance sequence. The coding distance of each character in the first coding result, i.e. d(a), d(b) and d(c), is then calculated from the coded character distance sequence in the manner shown in formula (1) above, which is not repeated here.
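The coded character distance described above can be sketched as follows. The sketch assumes that the distance between two character occurrences is the absolute difference of their positions in the first coding result; the text does not state the distance measure explicitly, so this is an illustrative choice. The function name `coded_pair_distance` is hypothetical.

```python
def coded_pair_distance(encoded, current, reference):
    """Distance d(current, reference) within one coding result.

    For every occurrence of `current`, average its positional distance to all
    occurrences of `reference`; then divide the sum of these averages by the
    number of occurrences of `current`.
    Assumption: distance = absolute difference of string positions.
    """
    cur = [i for i, ch in enumerate(encoded) if ch == current]
    ref = [i for i, ch in enumerate(encoded) if ch == reference]
    if not cur or not ref:
        raise ValueError("both characters must occur in the coding result")
    total = sum(sum(abs(i - j) for j in ref) / len(ref) for i in cur)
    return total / len(cur)


first_result = "abbaca"
sequence = [
    coded_pair_distance(first_result, "a", "b"),
    coded_pair_distance(first_result, "a", "c"),
    coded_pair_distance(first_result, "b", "c"),
]
print(sequence)  # d(a, b) = 6.5 / 3, d(a, c) = 2.0, d(b, c) = 2.5
```

Under this assumption the three per-occurrence averages for d(a, b) are 1.5, 1.5 and 3.5, so d(a, b) = 6.5 / 3. Note that the measure as described is not symmetric: d(b, a) divides by the number of occurrences of b instead.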
Through the above process, the coding distance of each character of the first coding result and the dictionary distance of each character of the second coding result are obtained. The difference calculating module 31 then calculates the difference between the two and takes the ratio of the absolute value of this difference to the larger of the two values as the distance difference. The same calculation is performed for every first coding result and its corresponding second coding result, giving a distance difference for each pair. All the calculated distance differences are averaged to obtain the comprehensive difference of each character in each data information sequence.
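The distance difference and its average can be sketched as follows for a single character. The input values are illustrative only, and the handling of the case where both distances are zero is an assumption, since the text does not cover it. The function names are hypothetical.

```python
def distance_difference(coding_distance, dictionary_distance):
    """|coding - dictionary| divided by the larger of the two values."""
    larger = max(coding_distance, dictionary_distance)
    if larger == 0:
        return 0.0  # assumption: identical zero distances count as no difference
    return abs(coding_distance - dictionary_distance) / larger


def comprehensive_difference(coding_distances, dictionary_distances):
    """Average the distance difference of one character over all coding results.

    The two lists are aligned: element k of each list comes from the k-th
    first coding result and its corresponding second coding result.
    """
    diffs = [
        distance_difference(c, d)
        for c, d in zip(coding_distances, dictionary_distances)
    ]
    return sum(diffs) / len(diffs)


print(distance_difference(2.0, 1.5))                    # 0.25
print(comprehensive_difference([2.0, 1.0], [1.5, 1.0]))  # 0.125
```

Dividing by the larger value keeps each distance difference in the range 0 to 1, which makes the values comparable across coding results before they are averaged.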
The difference of each character in the data information sequence is calculated based on the comprehensive difference of each character and the frequency with which the character appears in the data information sequence. Specifically, the calculation formula for the difference of each character takes as inputs the frequency of occurrence of the i-th character in the data information sequence and the comprehensive difference of the i-th character in the data information sequence, where n denotes the number of characters in the data information sequence.
One term of the formula represents the difference weight.
The character determining module 32 is configured to determine a representative character of the data information sequence based on the variability of each character.
Specifically, for each data information sequence, the characters whose comprehensive difference remains small under the different dictionary sequences can represent the character distribution characteristics of that sequence. The comprehensive difference of each character is therefore obtained first and then combined with the character frequency to obtain the difference of each character, which indicates how representative the character is of the data information sequence. The representative characters of the data information sequence are obtained from the difference of each character, and different pieces of data are clustered by their representative characters.
Specifically, the character determining module 32 normalizes the difference of each character, takes the characters whose normalized difference is smaller than a preset value (for example, 0.6) as candidate characters, and determines the representative characters of the data information sequence based on the frequency of the candidate characters and their normalized difference. In one embodiment, the character determining module 32 calculates the ratio of the frequency of each candidate character to its normalized difference, and takes the candidate characters whose ratio is larger than 1 as representative characters of the data information sequence.
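The selection of representative characters can be sketched as follows. Several details are assumptions because the text leaves them open: normalization is done by dividing by the largest difference, the frequency is taken as a relative frequency, and a candidate whose normalized difference is exactly zero is accepted directly to avoid dividing by zero. The function name and the sample values are hypothetical.

```python
def representative_characters(frequency, difference, threshold=0.6):
    """Pick representative characters from per-character frequency and difference.

    Assumptions: max-scaling normalization; relative frequencies; a zero
    normalized difference passes the ratio test automatically.
    """
    peak = max(difference.values())
    normalized = {
        ch: (d / peak if peak > 0 else 0.0) for ch, d in difference.items()
    }
    # Candidates: normalized difference below the preset value.
    candidates = [ch for ch, d in normalized.items() if d < threshold]
    # Representatives: frequency / normalized difference greater than 1.
    chosen = []
    for ch in candidates:
        d = normalized[ch]
        if d == 0 or frequency[ch] / d > 1:
            chosen.append(ch)
    return chosen


freq = {"a": 0.5, "b": 0.3, "c": 0.2}
diff = {"a": 0.2, "b": 0.5, "c": 1.0}
print(representative_characters(freq, diff))  # ['a']
```

In this example b passes the threshold but is dropped by the ratio test (0.3 / 0.5 is below 1), while c fails the threshold.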
The clustering module 40 is configured to cluster the data to be processed based on the representative character of each data information sequence, so as to obtain a clustering result.
Specifically, after the clustering module 40 obtains the representative character of each piece of data, different pieces of data are clustered through the representative character, for example, a hierarchical clustering method is adopted to obtain a clustering result. In the big data intelligent warehouse management system, articles with similar attributes can be stored together according to the clustering result, so that the articles with the same category are stored together.
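A hierarchical clustering over representative characters can be sketched as follows. The text names hierarchical clustering only as an example and does not specify a distance measure or linkage, so the Jaccard distance between sets of representative characters, single linkage, and the stopping threshold of 0.5 are all assumptions chosen for illustration. The function names are hypothetical.

```python
def jaccard_distance(a, b):
    """1 - |a & b| / |a | b| for two sets of representative characters."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_by_representatives(rep_sets, max_distance=0.5):
    """Agglomerative single-linkage clustering; returns lists of item indices."""
    clusters = [[i] for i in range(len(rep_sets))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = min(
                    jaccard_distance(rep_sets[i], rep_sets[j])
                    for i in clusters[x]
                    for j in clusters[y]
                )
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > max_distance:
            break  # the closest pair of clusters is too far apart to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


reps = [{"a", "b"}, {"a", "b", "c"}, {"x"}, {"x", "y"}]
print(cluster_by_representatives(reps))  # [[0, 1], [2, 3]]
```

Because each data information sequence is reduced to a small set of representative characters, the pairwise comparison works on short sets rather than on the full sequences.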
In the present application, the difference between the actual coding result and the ideal coding result is calculated for each character in each piece of data, giving the difference of each character under each dictionary sequence. Combining all dictionary sequences gives the comprehensive difference of each character in each piece of data, from which the representative characters of each piece of data are obtained. Clustering by representative characters reduces the amount of calculation while making the clustering result more stable and accurate. Storing articles according to the resulting categories improves the reliability of the storage result and facilitates subsequent operations such as querying and article sorting.
The foregoing describes only embodiments of the present application and does not thereby limit its scope. All equivalent structures or equivalent processes derived from the description and drawings of the present application, or applied directly or indirectly in other related technical fields, likewise fall within the scope of protection of the present application.