Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the invention provides a method, a device and a system for monitoring spam short messages. For each obtained short message, a short message set corresponding to the content of the short message is determined, and the calling number and the called number of the short message are added to that set. The spam identification process is carried out only when the number of short messages sent in a short message set is greater than or equal to a set first threshold. In this way, the large number of short message records sent by ordinary users can be excluded, which saves system resources and improves analysis efficiency.
For a short message set that enters the spam identification process, whether the short messages in the set are spam is determined according to the propagation track of the short messages in the set. This identification mode can effectively reveal the association among different spam sending numbers, so that a batch of spam sending numbers can be identified at one time.
As shown in fig. 1, a method for monitoring spam messages in an embodiment of the present invention includes:
101. acquiring a short message;
102. determining a short message set corresponding to the content according to the content of the short message, and adding a calling number and a called number of the short message in the short message set;
103. when the number of short messages sent in the short message set is greater than or equal to a set first threshold, determining whether the short messages in the short message set are spam messages according to the propagation track of the short messages in the short message set.
The first threshold may be set as required, such as 500, 1000, 2000, 10000, etc.
In an embodiment of the present invention, in step 101, the short message may be obtained by using a real-time obtaining method or a non-real-time obtaining method:
The real-time acquisition mode comprises:
interfacing with a short message data source device through the Short Message Peer-to-Peer protocol (SMPP) or a private protocol to obtain the messages to be analyzed in real time; the short message data source device may be a short message service center or a short message gateway.
Specific examples are as follows:
1) The spam short message monitoring system collects short messages from the short message service center in real time through a real-time interface, which may be an SMPP interface or the like.
2) An SPM (Signal Process Machine, a signaling processing device) captures Signaling System No. 7 (SS7) short message signaling on the network link, and then sends the short messages to the spam short message monitoring system through a customized Transmission Control Protocol (TCP)/Internet Protocol (IP) interface.
In addition, the non-real-time acquisition mode comprises:
obtaining short message ticket data from the short message data source device through a non-real-time interface for analysis, where the non-real-time interface may be an FTP (File Transfer Protocol) interface or the like.
Specifically, the spam monitoring system can collect MO (mobile originated) tickets from the short message service center through the FTP interface, so as to realize non-real-time acquisition of short message data.
Optionally, after the short message is acquired, the content, the calling number, the called number and the sending time of the short message may be extracted, and the content may be subjected to anti-interference processing; in particular, special characters in the content of the short message may be removed.
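The extraction and anti-interference step can be sketched as follows. This is only an illustration: the record layout, the field names (`content`, `calling`, `called`, `time`) and the rule that "special characters" means every character that is not a letter or digit are assumptions, not details given in the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShortMessage:
    """Fields extracted from one acquired short message (names are hypothetical)."""
    content: str
    calling_number: str
    called_number: str
    send_time: str


def strip_special_characters(content: str) -> str:
    # Assumption: anything that is not a letter or digit counts as "special".
    # str.isalnum() also accepts non-Latin letters such as CJK characters.
    return "".join(ch for ch in content if ch.isalnum())


def preprocess(raw: dict) -> ShortMessage:
    """Extract the four fields and remove special characters from the content."""
    return ShortMessage(
        content=strip_special_characters(raw["content"]),
        calling_number=raw["calling"],
        called_number=raw["called"],
        send_time=raw["time"],
    )
```

With this rule, variants of the same text that differ only in inserted punctuation or spacing reduce to the same content string before any content value is computed.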
In an optional embodiment of the present invention, for the content of the extracted short message or the content of the short message from which the special characters are removed, the step 102 determines a short message set corresponding to the content of the short message according to the content of the short message, including:
compressing the content of the short message by using a data compression algorithm to obtain a short message content value;
and determining a short message set corresponding to the content of the short message according to the short message content value.
Specifically, the data compression algorithm in the embodiment of the present invention may be a hash algorithm such as TianlHash, StrHash, ELFHash or Hflp. Compressing the content of the short message then means calculating a hash value of the content according to that algorithm, and the hash value may be used directly as the short message content value.
Further optionally, determining, according to the content of the short message, a short message set corresponding to the content specifically includes:
searching a corresponding short message set according to the message content value; if the corresponding short message set is found according to the message content value, taking the found short message set as the short message set corresponding to the content; and if the corresponding short message set is not found according to the content value, generating the short message set corresponding to the content. In the embodiment of the invention, each short message content value has a corresponding short message set.
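The find-or-create lookup described above can be sketched as follows. As an assumption for illustration, the content value is computed with `zlib.crc32` from the Python standard library, which stands in for the hash algorithms named in the embodiment; the dictionary `sets` and the function names are likewise hypothetical.

```python
import zlib
from typing import Dict, List, Tuple

# One element of a short message set: (calling number, called number).
Element = Tuple[str, str]


def content_value(content: str) -> int:
    # Stand-in hash (CRC-32); the embodiment names other hash algorithms.
    return zlib.crc32(content.encode("utf-8"))


def find_or_create_set(sets: Dict[int, List[Element]], content: str) -> List[Element]:
    """Return the short message set for this content, creating it if absent."""
    key = content_value(content)
    if key not in sets:
        sets[key] = []  # no set found for this content value: generate one
    return sets[key]
```

Because the lookup is keyed by the content value, every short message with the same (preprocessed) content lands in the same set regardless of which number sent it.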
Specifically, in an embodiment of the present invention, the data structure in the short message set may be represented as the following table:
Optionally, in another embodiment of the present invention, the table above may further record the sending amount of the short message; each time an element is added to the table, the sending amount is increased by 1. It is to be understood that the sending amount may also be stored elsewhere, for example in another table indexed by the short message content value, which stores the correspondence between each short message content value and its sending amount.
Optionally, in step 102, adding the calling number and the called number of the short message to the short message set, including:
and adding an element in the short message set, wherein the calling number and the called number are used as information of the element.
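A short message set that stores one element per short message and keeps the sending amount alongside could look like the sketch below. The class name, the field names and the default threshold of 500 (one of the example values mentioned for the first threshold) are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

FIRST_THRESHOLD = 500  # example value; the first threshold is set as required


@dataclass
class ShortMessageSet:
    content_value: int
    elements: List[Tuple[str, str]] = field(default_factory=list)
    send_count: int = 0  # sending amount of this content

    def add(self, calling_number: str, called_number: str) -> None:
        """Add one element and increase the sending amount by 1."""
        self.elements.append((calling_number, called_number))
        self.send_count += 1

    def needs_identification(self, first_threshold: int = FIRST_THRESHOLD) -> bool:
        """True once the sending amount reaches the first threshold."""
        return self.send_count >= first_threshold
```

Sets that never reach the threshold are never analyzed, which is how records from ordinary users are kept out of the identification process.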
In an alternative embodiment of the present invention, step 103 may comprise:
counting the number of calling numbers with out-degree greater than 0 in the short message set; calculating the ratio of the number of calling numbers with out-degree greater than 0 to the number of all numbers in the short message set, where all numbers comprise the calling numbers and the called numbers; and when the ratio is smaller than a set second threshold, determining that the short messages in the short message set are suspected spam messages and that the calling numbers with out-degree greater than 0 are suspected spam sending numbers. Specifically, the number of calling numbers with out-degree greater than 0 may be counted as follows: a calling number set is created; the elements in the short message set are extracted in sequence, and for each element it is determined whether its calling number is already stored in the calling number set; if so, the out-degree of that calling number in the calling number set is increased by 1; if not, the calling number is added to the calling number set with its out-degree set to 1. After all elements in the short message set have been traversed, the number of calling numbers with out-degree greater than 0 can be determined from the calling number set.
The number of calling numbers with out-degree greater than 0 in the short message set can be denoted t, and the ratio can be expressed as r = t/T, where T is the number of all numbers. The second threshold is an empirical value and may be set as desired, such as 1%, 5% or 10%. When r is smaller than the second threshold, the short messages in the short message set are determined to be suspected spam messages, and the calling numbers with out-degree greater than 0 are determined to be suspected spam sending numbers.
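The counting procedure and the ratio test can be sketched in a few lines. The function names and the representation of an element as a `(calling, called)` tuple are assumptions; the logic follows the steps described above.

```python
from typing import Dict, Iterable, List, Tuple

Element = Tuple[str, str]  # (calling number, called number)


def count_out_degrees(elements: Iterable[Element]) -> Dict[str, int]:
    """Build the calling number set, mapping each calling number to its out-degree."""
    out_degree: Dict[str, int] = {}
    for calling, _called in elements:
        if calling in out_degree:
            out_degree[calling] += 1   # already stored: out-degree + 1
        else:
            out_degree[calling] = 1    # first occurrence: out-degree = 1
    return out_degree


def suspected_spam_senders(elements: List[Element], second_threshold: float) -> List[str]:
    """Return the suspected sending numbers, or an empty list if r >= threshold."""
    out_degree = count_out_degrees(elements)
    all_numbers = {number for pair in elements for number in pair}
    if not all_numbers:
        return []
    t = len(out_degree)     # calling numbers with out-degree > 0
    T = len(all_numbers)    # all numbers, calling and called
    r = t / T
    return sorted(out_degree) if r < second_threshold else []
```

For one number sending to 100 distinct recipients, t = 1 and T = 101, so r is just under 1% and the sender is flagged at a 5% threshold; a short forwarding chain in which every number also sends gives r = 1 and is not flagged.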
A popular short message, such as a greeting or a joke, is usually forwarded continuously by users, so the sending amount of that content can easily reach the set first threshold. Fig. 2 illustrates the propagation trajectory of a popular short message: after 13800000000 sends a short message to 13800000001, 13800000002 and 13800000003, 13800000002 forwards it to 13800000003 and 13800000004, and 13800000004, after receiving it, forwards it to 13800000003.
However, a user who receives a spam short message usually does not forward it. The calling numbers of such a short message are few and the called numbers are many, so the propagation track is relatively simple; that is, the ratio of calling numbers to all numbers is usually smaller than the set second threshold. Fig. 3 illustrates a propagation trajectory of spam short messages: after 13700000000 sends a spam short message to 13700000001, 13700000002, 13700000003, 13700000004 and 13700000005, none of these five numbers forwards it; likewise, after 13800000000 sends a spam short message to 13800000001, 13800000002, 13800000003, 13800000004 and 13800000005, none of these five numbers forwards it. In this case, 13700000000 and 13800000000 are suspected spam sending numbers. Because many short messages sent by Service Providers (SPs), companies and the like follow the same propagation track as spam short messages but cannot be regarded as spam, a white list may be set so that short messages sent by SPs, companies and the like are not put into a short message set for processing.
Therefore, the identification method can effectively discover the association relation among different spam short message sending numbers, and realize the effect of identifying a batch of spam short message sending numbers at one time.
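As a worked illustration, the ratio r = t/T can be evaluated for the two trajectories described for Fig. 2 and Fig. 3. The third line is a hypothetical scaled-up case (two senders, 500 distinct recipients each) added only to show how the ratio behaves at volumes comparable to the first threshold; it is not taken from the figures.

```latex
\begin{align*}
\text{Fig. 2 (popular message):}\quad & t = 3,\; T = 5,     & r &= 3/5 = 0.6 \\
\text{Fig. 3 (spam message):}\quad    & t = 2,\; T = 12,    & r &= 2/12 \approx 0.167 \\
\text{Hypothetical large case:}\quad  & t = 2,\; T = 1002,  & r &= 2/1002 \approx 0.002
\end{align*}
```

In Fig. 2 the senders are 13800000000, 13800000002 and 13800000004 among five numbers; in Fig. 3 the senders are 13700000000 and 13800000000 among twelve numbers. The toy graph of Fig. 3 is too small to fall below thresholds such as 1% to 10%, but the ratio drops well below them once each sender reaches hundreds of recipients, whereas forwarding keeps the ratio high for popular messages.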
Specifically, in the embodiment of the present invention, the number of calling numbers with out-degree greater than 0 in the short message set may be counted from the short message propagation trajectory of the set, which may be represented by a directed graph G(V, E), where V is the set of all numbers appearing in the elements of the short message set (each element consisting of a calling number and a called number), and E is the set of all short messages sent;
$n = |V|$ denotes the number of all numbers in V, and $e = |E|$ denotes the number of short messages sent in E;
a directed edge $(i, j)$ from number i to number j indicates that number i sent a short message to number j, where i and j are taken from the set V;
$d_i^{in}$ denotes the in-degree of number i, that is, the number of short messages for which i is the called number; $d_i^{out}$ denotes the out-degree of number i, that is, the number of short messages for which i is the calling number; and $\sum_{i=1}^{n} d_i^{in} = \sum_{i=1}^{n} d_i^{out} = e$;
For a directed edge (i, j), number i is adjacent to number j, and number j is adjacent from number i. An adjacency list records the set of other numbers that a given number is adjacent to or adjacent from; in the embodiment of the present invention it is used to represent the set of numbers that a given number has called or been called by.
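The degree definitions can be checked on the Fig. 2 trajectory with a short sketch. The edge list below transcribes the sends described for Fig. 2; the variable and function names are assumptions.

```python
from collections import Counter
from typing import List, Tuple

# Directed edges (calling, called) of the Fig. 2 propagation trajectory.
FIG2_EDGES: List[Tuple[str, str]] = [
    ("13800000000", "13800000001"),
    ("13800000000", "13800000002"),
    ("13800000000", "13800000003"),
    ("13800000002", "13800000003"),
    ("13800000002", "13800000004"),
    ("13800000004", "13800000003"),
]


def degrees(edges: List[Tuple[str, str]]) -> Tuple[Counter, Counter]:
    """Return (in_degree, out_degree) counters keyed by number."""
    out_degree = Counter(calling for calling, _ in edges)
    in_degree = Counter(called for _, called in edges)
    return in_degree, out_degree


in_degree, out_degree = degrees(FIG2_EDGES)
# Each edge contributes exactly one unit of out-degree and one of in-degree,
# so both sums equal the number of short messages e.
total_in = sum(in_degree.values())
total_out = sum(out_degree.values())
```

Here both totals equal 6, the number of short messages in the trajectory, which is the identity stated for the directed graph.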
In the foregoing embodiment, counting the number of calling numbers with out-degree greater than 0 in the short message set is described using the out-degree of the numbers as an example. If the adjacency list is applied in that scenario, then when the calling number in the information of the current element is already included in the calling number set, the out-degree of that calling number is increased by 1 in the calling number set and the called number is added to its adjacency list; if not, the calling number in the information of the current element is added to the calling number set, its out-degree is set to 1, and the called number is added to its adjacency list;
Specifically, in the embodiment of the present invention, the data structure of each element in the set V, stored as an adjacency linked list, may be described as follows:
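One possible shape for such an adjacency-list element is sketched below. The class name `NumberNode`, its fields and the function `build_calling_set` are assumptions made for illustration and are not taken from the embodiment; the update rule follows the procedure described above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class NumberNode:
    """Hypothetical adjacency-list element for one calling number."""
    number: str
    out_degree: int = 0
    adjacent_to: List[str] = field(default_factory=list)  # called numbers


def build_calling_set(elements: Iterable[Tuple[str, str]]) -> Dict[str, NumberNode]:
    calling_set: Dict[str, NumberNode] = {}
    for calling, called in elements:
        node = calling_set.get(calling)
        if node is None:
            # Calling number not yet in the set: add it (out-degree becomes 1 below).
            node = NumberNode(calling)
            calling_set[calling] = node
        node.out_degree += 1
        node.adjacent_to.append(called)  # record the called number
    return calling_set
```

Keeping the called numbers next to each calling number preserves the propagation track itself, not only the counts, so the same structure can serve later analysis of who sent to whom.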
Since popular short messages are forwarded among different users while recipients of spam short messages generally do not forward them, step 103 may, in another alternative embodiment of the present invention, comprise:
counting the number of called numbers with in-degree greater than 0 in the short message set;
calculating the ratio of the number of called numbers with in-degree greater than 0 to the number of all numbers in the short message set, where all numbers comprise the calling numbers and the called numbers;
and when the ratio is larger than a set third threshold, determining that the short messages in the short message set are suspected spam messages, and that the calling numbers that sent the short messages in the short message set are suspected spam sending numbers. The third threshold is also an empirical value and may be set as needed, such as 99%, 95% or 90%.
In an optional embodiment of the present invention, the step 102 of adding the calling number and the called number of the short message to the short message set includes:
adding an element in the short message set, and taking the calling number and the called number as information of the element;
counting the number of called numbers with in-degree greater than 0 in the short message set comprises:
sequentially acquiring the information of the elements in the short message set;
judging whether the called number in the information of the current element is included in a called number set; if so, increasing by 1 the in-degree of that called number in the called number set; if not, adding the called number in the information of the current element to the called number set and setting its in-degree to 1.
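The in-degree variant can be sketched in the same way. The function names and the `(calling, called)` tuple representation are assumptions; the rule is implemented literally as stated, comparing the share of called numbers among all numbers with the third threshold.

```python
from typing import Dict, Iterable, List, Tuple

Element = Tuple[str, str]  # (calling number, called number)


def count_in_degrees(elements: Iterable[Element]) -> Dict[str, int]:
    """Build the called number set, mapping each called number to its in-degree."""
    in_degree: Dict[str, int] = {}
    for _calling, called in elements:
        if called in in_degree:
            in_degree[called] += 1   # already stored: in-degree + 1
        else:
            in_degree[called] = 1    # first occurrence: in-degree = 1
    return in_degree


def is_suspected_spam_set(elements: List[Element], third_threshold: float) -> bool:
    """True when called numbers / all numbers exceeds the third threshold."""
    in_degree = count_in_degrees(elements)
    all_numbers = {number for pair in elements for number in pair}
    if not all_numbers:
        return False
    ratio = len(in_degree) / len(all_numbers)
    return ratio > third_threshold
```

For one number sending to 100 distinct recipients the ratio is 100/101, about 99%, which exceeds a 95% threshold; for the Fig. 2 trajectory the ratio is 4/5 and the set is not flagged at that threshold.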
It should be noted that this embodiment describes the identification process of the short message taking the in-degree of the numbers as an example; the principle is the same as in the embodiment that uses the out-degree of the numbers, so the details are not repeated here, and reference may be made to the specific scheme of that embodiment.
In an optional embodiment of the present invention, when it is determined that the short message in the message set is a spam message according to the propagation trajectory of the short message in the message set, the method may further include:
processing the spam short message and the number for sending the spam short message;
specifically, the general suspected spam short message processing method includes at least one of the following three methods:
1) adding the number judged to have sent the suspected spam short message to a blacklist, and synchronizing the blacklist to external systems, such as an SMSC (Short Message Service Center), a BOSS (Business and Operation Support System) and the like;
2) sending the number judged to send the suspected spam short message, the content of the short message, the sending quantity of the short message and other information to a manual auditing platform for manual secondary confirmation;
3) adding keywords from the corresponding short message content for interception.
As shown in fig. 4, a monitoring device for spam messages according to an embodiment of the present invention includes:
a message acquisition module 21, configured to acquire a short message;
a data preprocessing module 22, configured to determine a short message set corresponding to the content according to the content of the short message, and add a calling number and a called number of the short message to the short message set;
a message identification module 23, configured to determine, when the number of short messages sent in the short message set is greater than or equal to a set first threshold, whether the short messages in the short message set are spam messages according to the propagation track of the short messages in the short message set.
Optionally, the message acquisition module 21 may be specifically configured to:
interface with a short message data source device through a real-time interface and obtain the messages to be analyzed in real time; or,
obtain the short message ticket data from the short message data source device in non-real time through a non-real-time interface for analysis.
Further optionally, the message acquisition module sends the acquired short message data packets to the data preprocessing module. When the short message traffic is large and the processing capacity of a single server is limited, the data preprocessing module needs to be deployed in a distributed, clustered manner. In that case, the traditional load balancing mode that distributes by calling number cannot guarantee that short messages with the same content but different calling numbers are distributed to the same server, so the scheme may provide two load balancing modes:
One is a load balancing mode that distributes according to the content length of the short message. For example, short messages whose content is shorter than 20 bytes are sent to server 1, those of 20 to 39 bytes to server 2, those of 40 to 70 bytes to server 3, and those longer than 70 bytes to server 4;
The other is to convert the content of the short message into a short message content value (such as an integer) by a certain algorithm and then apply a traditional load balancing mode, for example balancing according to the trailing digits of the short message content value.
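Both load balancing modes can be sketched as routing functions. Several details are assumptions for illustration: the content length is measured on the UTF-8 encoding, `zlib.crc32` stands in for the unspecified conversion algorithm, and the remainder modulo the number of servers stands in for "trailing digits" of the content value.

```python
import zlib


def server_by_length(content: str) -> int:
    """Mode 1: pick server 1 to 4 from the content length in bytes."""
    length = len(content.encode("utf-8"))  # assumption: UTF-8 byte length
    if length < 20:
        return 1
    if length < 40:
        return 2
    if length <= 70:
        return 3
    return 4


def server_by_content_value(content: str, server_count: int) -> int:
    """Mode 2: pick a server from the content value (stand-in hash: CRC-32)."""
    value = zlib.crc32(content.encode("utf-8"))
    return value % server_count + 1  # servers are numbered from 1
```

In both modes the routing decision depends only on the content, so short messages with identical content reach the same server even when their calling numbers differ.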
In an embodiment of the present invention, the message identification module 23 is specifically configured to: counting the number of calling numbers with out degree greater than 0 in the short message set; calculating the ratio of the number of the calling numbers with the out degree greater than 0 to all the numbers in the short message set, wherein all the numbers comprise the calling numbers and the called numbers; and when the ratio is smaller than a set second threshold value, determining that the short messages in the short message set are suspected spam messages, and the calling numbers with the outgoing degree larger than 0 are suspected spam message sending numbers.
In an embodiment of the present invention, the data preprocessing module 22 adds the calling number and the called number of the short message in the short message set, which may specifically include: adding an element in the short message set, and taking the calling number and the called number as information of the element;
the message identification module 23 counts the number of the calling numbers with out degree greater than 0 in the short message set, including: sequentially acquiring information of elements in the short message set; judging whether the calling number in the information of the current element is included in the calling number set; if yes, adding 1 to the out degree of the calling number in the information of the current element in the calling number set; if not, adding the calling number in the information of the current element in the calling number set, and setting the out degree of the calling number in the information of the current element as 1.
In another embodiment of the present invention, the message identification module 23 is specifically configured to: count the number of called numbers with in-degree greater than 0 in the short message set; calculate the ratio of the number of called numbers with in-degree greater than 0 to the number of all numbers in the short message set, where all numbers comprise the calling numbers and the called numbers; and when the ratio is larger than a set third threshold, determine that the short messages in the short message set are suspected spam messages, and that the calling numbers that sent the short messages in the short message set are suspected spam sending numbers.
In another embodiment of the present invention, the data preprocessing module 22 adds the calling number and the called number of the short message to the short message set, specifically including: adding an element in the short message set, and taking the calling number and the called number as information of the element;
the message identification module 23 counting the number of called numbers with in-degree greater than 0 in the short message set includes: sequentially acquiring the information of the elements in the short message set; judging whether the called number in the information of the current element is included in a called number set; if so, increasing by 1 the in-degree of that called number in the called number set; if not, adding the called number in the information of the current element to the called number set and setting its in-degree to 1.
In an optional embodiment of the present invention, the determining, by the data preprocessing module 22, a short message set corresponding to the content of the short message according to the content of the short message specifically includes: compressing the content of the short message by using a data compression algorithm to obtain a short message content value; and determining a short message set corresponding to the content of the short message according to the short message content value.
In an optional embodiment of the present invention, the data preprocessing module 22 determines, according to the content of the short message, a short message set corresponding to the content, further comprising: searching a corresponding short message set according to the message content value; if the corresponding short message set is found according to the message content value, taking the found short message set as the short message set corresponding to the content; and if the corresponding short message set is not found according to the content value, generating the short message set corresponding to the content.
It should be noted that the embodiment of the spam short message monitoring apparatus in the present invention is directly obtained based on the embodiment of the method, and includes the same or corresponding technical solutions of the embodiment of the method, wherein there is a correspondence between each module and each step in the embodiment of the method in the embodiment of the present invention, and reference may be specifically made to the related description of the embodiment of the method.
As shown in fig. 5, a spam monitoring system according to an embodiment of the present invention includes:
a short message data source device 31, configured to provide short message data to the spam short message monitoring device for spam identification, and to receive the spam identification result;
and the spam short message monitoring device 32 provided by the embodiment of the present invention.
The method, the device and the system for monitoring spam short messages provided by the embodiments of the invention can solve the problem that a sender who evades system monitoring by reducing the sending volume of each single number cannot be discovered in time; moreover, by analyzing the association between the sending numbers and the receiving numbers of short messages with the same or similar content, a batch of spam sending numbers can be identified at one time.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.