CN104750663B - Method and device for recognizing garbled text in a page - Google Patents
- Publication number
- CN104750663B (application number CN201310737443.4A)
- Authority
- CN
- China
- Prior art keywords
- text
- coded format
- characteristic information
- character
- page
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The application provides a method and device for recognizing garbled text in a page. In the embodiments, a first encoding format of a first text to be recognized in the page is obtained. The first text is then converted into a second text in a second encoding format, according to the correspondence between characters in the second encoding format and characters in other encoding formats. The second text is in turn converted into a third text, according to the correspondence between characters in the second encoding format and characters in the first encoding format. Whether the first text contains garbled text can then be determined from the third text and the first text. No operator needs to take part in the recognition process, so the operation is simple and accurate, which improves the efficiency and reliability of garbled-text recognition.
Description
[Technical field]
This application relates to World Wide Web (Web) page processing technology, and in particular to a method and device for recognizing garbled text in a page.
[Background]
A World Wide Web (Web) page may include display blocks, each composed of one or more HyperText Markup Language (HTML) tags and referred to as a page element, for example text, labels, hyperlinks, buttons, input boxes, drop-down boxes, and so on. Text in a Web page can become garbled for reasons such as how the page is parsed. In the prior art, an operator has to inspect Web pages one by one to find out whether the text in each page is garbled.
However, this existing way of recognizing garbled text takes a long time and is error-prone, which reduces the efficiency and reliability of garbled-text recognition.
[Summary of the invention]
Aspects of the application provide a method and device for recognizing garbled text in a page, so as to improve the efficiency and reliability of garbled-text recognition.
One aspect of the application provides a method for recognizing garbled text in a page, comprising:
obtaining a first encoding format of a first text to be recognized in the page;
converting the first text into a second text according to the correspondence between characters in a second encoding format and characters in other encoding formats, the encoding format of the second text being the second encoding format;
converting the second text into a third text according to the correspondence between characters in the second encoding format and characters in the first encoding format;
determining, according to the third text and the first text, whether garbled text exists in the first text.
In the above aspect and any possible implementation, an implementation is further provided in which the second encoding format includes the Unicode encoding format.
In the above aspect and any possible implementation, an implementation is further provided in which determining, according to the third text and the first text, whether garbled text exists in the first text comprises:
comparing the third text with the first text;
if the third text is inconsistent with the first text, determining that garbled text exists in the first text; or
if the third text is consistent with the first text, determining that no garbled text exists in the first text.
In the above aspect and any possible implementation, an implementation is further provided in which comparing the third text with the first text comprises:
extracting characteristic information of the third text and characteristic information of the first text;
comparing the characteristic information of the third text with the characteristic information of the first text;
if the characteristic information of the third text differs from that of the first text, the third text is inconsistent with the first text; or
if the characteristic information of the third text is the same as that of the first text, the third text is consistent with the first text.
In the above aspect and any possible implementation, an implementation is further provided in which the characteristic information includes an MD5 value.
Another aspect of the application provides a device for recognizing garbled text in a page, comprising:
an acquiring unit, configured to obtain a first encoding format of a first text to be recognized in the page;
a converting unit, configured to convert the first text into a second text according to the correspondence between characters in a second encoding format and characters in other encoding formats, the encoding format of the second text being the second encoding format;
the converting unit being further configured to convert the second text into a third text according to the correspondence between characters in the second encoding format and characters in the first encoding format;
a determining unit, configured to determine, according to the third text and the first text, whether garbled text exists in the first text.
In the above aspect and any possible implementation, an implementation is further provided in which the second encoding format includes the Unicode encoding format.
In the above aspect and any possible implementation, an implementation is further provided in which the determining unit is specifically configured to:
compare the third text with the first text;
if the third text is inconsistent with the first text, determine that garbled text exists in the first text; or
if the third text is consistent with the first text, determine that no garbled text exists in the first text.
In the above aspect and any possible implementation, an implementation is further provided in which the determining unit is specifically configured to:
extract characteristic information of the third text and characteristic information of the first text;
compare the characteristic information of the third text with the characteristic information of the first text;
if the characteristic information of the third text differs from that of the first text, the third text is inconsistent with the first text; or
if the characteristic information of the third text is the same as that of the first text, the third text is consistent with the first text.
In the above aspect and any possible implementation, an implementation is further provided in which the characteristic information includes an MD5 value.
As can be seen from the above technical solutions, the embodiments of the application obtain a first encoding format of a first text to be recognized in the page; convert the first text into a second text in a second encoding format, according to the correspondence between characters in the second encoding format and characters in other encoding formats; and then convert the second text into a third text, according to the correspondence between characters in the second encoding format and characters in the first encoding format. Whether garbled text exists in the first text can then be determined from the third text and the first text. No operator needs to take part in the recognition process, so the operation is simple and accurate, which improves the efficiency and reliability of garbled-text recognition.
[Brief description of the drawings]
To explain the technical solutions in the embodiments of the application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow diagram of a method for recognizing garbled text in a page, provided by one embodiment of the application;
Fig. 2 is a structural diagram of a device for recognizing garbled text in a page, provided by another embodiment of the application.
[Detailed description]
To make the purposes, technical solutions, and advantages of the embodiments of the application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in the application without creative effort fall within the protection scope of the application.
It can be understood that the page referred to in the application may be a web page written in HyperText Markup Language (HTML), also referred to as a Web page.
It should be noted that the terminals referred to in the embodiments of the application may include, but are not limited to, a mobile phone, a personal digital assistant (PDA), a wireless handheld device, a wireless netbook, a personal computer (PC), a portable computer, an MP3 player, an MP4 player, and so on.
In addition, the term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" herein generally indicates an "or" relationship between the objects before and after it.
Fig. 1 is a flow diagram of the method for recognizing garbled text in a page provided by one embodiment of the application. As shown in Fig. 1:
101. Obtain the first encoding format of the first text to be recognized in the page.
The first encoding format can be any text encoding available in the prior art, for example GBK, UTF-8, or GB2312; this embodiment places no particular limitation on it.
GBK is one of the Chinese character encoding standards. Its full name is the "Chinese Internal Code Specification". The letters "GBK" are the pinyin initials of the Chinese words for "national standard" (guobiao) and "extension" (kuozhan), so it is also known as the Chinese character national standard extension code.
UTF is the abbreviation of "UCS Transformation Format", which can be translated as the Unicode character set transformation format.
Optionally, in one possible implementation of this embodiment, in 101 the first encoding format of the first text to be recognized in the page may be obtained from information related to the page.
For example, from the page's META tag "<meta http-equiv="Content-Type" content="text/html; charset=gb2312">", the first encoding format of the first text to be recognized in the page can be obtained as GB2312.
As another example, from the definition "@charset "UTF-8"" in the page's Cascading Style Sheet (CSS) file, the first encoding format of the first text to be recognized in the page can be obtained as UTF-8.
As yet another example, the first encoding format of the first text to be recognized in the page can be obtained from the website to which the page belongs. For instance, Baidu uses GB2312 encoding and Google uses UTF-8 encoding.
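The three sources of the first encoding format listed above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the regular expressions, and the SITE_DEFAULTS table are assumptions made for the example, and the two site entries simply restate the examples given in the text.

```python
import re

# Hypothetical patterns; real pages may declare their charset in other ways.
META_CHARSET = re.compile(r'<meta[^>]+charset\s*=\s*["\']?\s*([\w-]+)', re.IGNORECASE)
CSS_CHARSET = re.compile(r'@charset\s+["\']([\w-]+)["\']', re.IGNORECASE)

# Assumed per-site defaults, restating the two examples from the text.
SITE_DEFAULTS = {"baidu": "gb2312", "google": "utf-8"}


def first_encoding_format(html="", css="", site=""):
    """Return the declared encoding name in lower case, or None if none is found."""
    # Try the META tag first, then the CSS @charset rule.
    match = META_CHARSET.search(html) or CSS_CHARSET.search(css)
    if match:
        return match.group(1).lower()
    # Fall back to the site the page belongs to.
    return SITE_DEFAULTS.get(site.lower())
```

The order of the lookups (META tag, then CSS, then site) is a design choice of this sketch; the text does not prescribe any priority among the three sources.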
102. Convert the first text into a second text according to the correspondence between characters in a second encoding format and characters in other encoding formats; the encoding format of the second text is the second encoding format.
Optionally, in one possible implementation of this embodiment, the second encoding format may include, but is not limited to, Unicode. Unicode (known in Chinese as the universal code, international code, unified code, or single code) defines a unique code, that is, an integer, for each character rather than for each glyph, for example a unique binary code.
During the conversion, if a character in the first text has a corresponding character in the second encoding format, it is converted to that character. If a character in the first text has no corresponding character in the second encoding format, a preconfigured operation is performed, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
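As a sketch of step 102, Python's built-in codecs can stand in for the correspondence tables described above. It assumes the first text is available as raw bytes; the function name to_unicode and the on_error parameter are hypothetical. The "ignore" and "replace" error handlers correspond to the two preconfigured operations mentioned in the text (discarding the character, or substituting a preset replacement character, here U+FFFD).

```python
def to_unicode(first_text: bytes, first_encoding: str, on_error: str = "replace") -> str:
    """Convert the first text (bytes in the first encoding format) to Unicode.

    on_error="ignore" discards bytes that have no Unicode counterpart;
    on_error="replace" substitutes U+FFFD for them.
    """
    return first_text.decode(first_encoding, errors=on_error)
```

For example, to_unicode(b"abc\xff", "utf-8", "ignore") drops the invalid byte, while the default policy keeps a visible replacement character in its place.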
103. Convert the second text into a third text according to the correspondence between characters in the second encoding format and characters in the first encoding format.
During the conversion, if a character in the second text has a corresponding character in the first encoding format, it is converted to that character. If a character in the second text has no corresponding character in the first encoding format, a preconfigured operation is performed, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
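Step 103 is the reverse conversion. The sketch below again relies on Python's built-in codecs as a stand-in for the correspondence tables; from_unicode and on_error are hypothetical names. With on_error="replace", Python substitutes "?" for a character that has no counterpart in the target encoding; with "ignore", the character is discarded.

```python
def from_unicode(second_text: str, first_encoding: str, on_error: str = "replace") -> bytes:
    """Convert the second text (Unicode) back into the first encoding format.

    on_error="ignore" discards characters the first encoding cannot represent;
    on_error="replace" substitutes "?" for them.
    """
    return second_text.encode(first_encoding, errors=on_error)
```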
104. Determine, according to the third text and the first text, whether garbled text exists in the first text.
Optionally, in one possible implementation of this embodiment, in 104 the third text may be compared with the first text. If the third text is inconsistent with the first text, it can be determined that garbled text exists in the first text; if the third text is consistent with the first text, it can be determined that no garbled text exists in the first text.
Specifically, many methods can be used to compare the two texts, that is, the third text and the first text.
For example, the two texts can be matched character by character to judge whether each pair of characters is the same.
As another example, characteristic information of the third text and of the first text can be extracted, for example a Message Digest Algorithm 5 (MD5) value, and the two pieces of characteristic information compared. If the characteristic information of the third text differs from that of the first text, the third text is inconsistent with the first text; if it is the same, the third text is consistent with the first text.
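The two comparison strategies described above can be sketched as follows, assuming both texts are held as bytes in the first encoding format. The names md5_of and is_consistent and the use_md5 flag are hypothetical; hashlib.md5 is the Python standard-library MD5 implementation.

```python
import hashlib


def md5_of(text: bytes) -> str:
    """Characteristic information of a text: its MD5 value as a hex string."""
    return hashlib.md5(text).hexdigest()


def is_consistent(first_text: bytes, third_text: bytes, use_md5: bool = False) -> bool:
    """Return True if the third text is consistent with the first text."""
    if use_md5:
        # Compare the characteristic information instead of the texts themselves.
        return md5_of(first_text) == md5_of(third_text)
    # Direct comparison of the two texts.
    return first_text == third_text
```

Comparing digests is mainly useful when the two texts are stored or transmitted separately and only a short fingerprint is kept; for two texts already in memory, the direct comparison gives the same answer.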
It should be noted that steps 101 to 104 may be performed by a recognition device, for example a Web page editor, which may be located in a local client to perform offline recognition, or in a network-side server to perform online recognition; this embodiment places no limitation on this.
It can be understood that the client may be an application installed on a terminal, or a web page in a browser; any form that can implement page processing is acceptable, and this embodiment places no limitation on this.
Existing recognition methods require an operator to inspect Web pages one by one to find out whether the text in each page is garbled. Checking pages for garbled text by hand, however, tends to cause two problems.
First, it is very inefficient, especially for somewhat larger websites, which can have hundreds of thousands of subpages that an operator cannot check one by one.
Second, manual recognition easily misses garbled text in a page, for example when the page contains a lot of text and only a little of it is garbled, which is hard for an operator to spot by eye.
With the technical solution provided by this embodiment, no operator needs to take part, the operation is simple, and accuracy is high.
In this embodiment, a first encoding format of a first text to be recognized in the page is obtained; the first text is converted into a second text in a second encoding format, according to the correspondence between characters in the second encoding format and characters in other encoding formats; and the second text is then converted into a third text, according to the correspondence between characters in the second encoding format and characters in the first encoding format. Whether garbled text exists in the first text can then be determined from the third text and the first text. No operator needs to take part in the recognition process, so the operation is simple and accurate, which improves the efficiency and reliability of garbled-text recognition.
In addition, with the technical solution provided by the application, garbled text appearing in the text of a page can be recognized automatically and in real time.
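Putting steps 102 to 104 together, the round trip described in this embodiment can be sketched in a few lines. This is an illustration under stated assumptions rather than the patented implementation: the first text is assumed to be raw bytes, Python's codecs stand in for the correspondence tables, the replacement policy is fixed to "replace", and the page names in the usage part are invented.

```python
def has_garbled_text(first_text: bytes, first_encoding: str) -> bool:
    """Round-trip check: first text -> Unicode -> first encoding, then compare."""
    # Step 102: first text -> second text (Unicode); undecodable bytes become U+FFFD.
    second_text = first_text.decode(first_encoding, errors="replace")
    # Step 103: second text -> third text, back in the first encoding format.
    third_text = second_text.encode(first_encoding, errors="replace")
    # Step 104: any difference between the two is treated as garbled text.
    return third_text != first_text


# Usage: check a batch of pages without an operator (hypothetical page names).
pages = {
    "clean.html": ("\u4f60\u597d".encode("gbk"), "gbk"),
    "mislabelled.html": ("\u4f60\u597d".encode("gbk"), "utf-8"),
}
flagged = [name for name, (text, enc) in pages.items() if has_garbled_text(text, enc)]
```

In this sketch, only bytes that are invalid in the declared encoding change during the round trip, so a byte sequence that happens to be valid in that encoding would generally pass unchanged.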
It should be noted that, for brevity, the foregoing method embodiments are described as a series of combined actions, but a person skilled in the art should understand that the application is not limited by the described order of actions, because according to the application some steps may be performed in another order or simultaneously. A person skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and that the actions and modules involved are not necessarily required by the application.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.
Fig. 2 is a structural diagram of the device for recognizing garbled text in a page provided by another embodiment of the application. As shown in Fig. 2, the device of this embodiment may include an acquiring unit 21, a converting unit 22, and a determining unit 23. The acquiring unit 21 is configured to obtain a first encoding format of a first text to be recognized in the page. The converting unit 22 is configured to convert the first text into a second text according to the correspondence between characters in a second encoding format and characters in other encoding formats, the encoding format of the second text being the second encoding format. The converting unit 22 is further configured to convert the second text into a third text according to the correspondence between characters in the second encoding format and characters in the first encoding format. The determining unit 23 is configured to determine, according to the third text and the first text, whether garbled text exists in the first text.
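The division into three units can be sketched as three small classes wired together by a device object. The class names, the Page type, and the site-to-encoding table are all assumptions made for this illustration; as the text notes later, the division into units is only logical, and an actual implementation may split the work differently.

```python
from dataclasses import dataclass


@dataclass
class Page:
    site: str          # website the page belongs to
    first_text: bytes  # first text to be recognized, as raw bytes


class AcquiringUnit:
    """Unit 21: obtains the first encoding format (here, by site only)."""

    def __init__(self, site_encodings):
        self.site_encodings = site_encodings

    def acquire(self, page):
        return self.site_encodings[page.site]


class ConvertingUnit:
    """Unit 22: performs both conversions."""

    def to_second(self, first_text, first_encoding):
        return first_text.decode(first_encoding, errors="replace")

    def to_third(self, second_text, first_encoding):
        return second_text.encode(first_encoding, errors="replace")


class DeterminingUnit:
    """Unit 23: decides whether garbled text exists."""

    def has_garbled_text(self, first_text, third_text):
        return first_text != third_text


class RecognitionDevice:
    """Wires units 21, 22, and 23 together."""

    def __init__(self, acquiring, converting, determining):
        self.acquiring = acquiring
        self.converting = converting
        self.determining = determining

    def recognize(self, page):
        encoding = self.acquiring.acquire(page)
        second = self.converting.to_second(page.first_text, encoding)
        third = self.converting.to_third(second, encoding)
        return self.determining.has_garbled_text(page.first_text, third)
```

A device built as RecognitionDevice(AcquiringUnit({"example": "utf-8"}), ConvertingUnit(), DeterminingUnit()) can then be called with recognize(page) for each page to be checked.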
The first encoding format can be any text encoding available in the prior art, for example GBK, UTF-8, or GB2312; this embodiment places no particular limitation on it.
GBK is one of the Chinese character encoding standards. Its full name is the "Chinese Internal Code Specification". The letters "GBK" are the pinyin initials of the Chinese words for "national standard" (guobiao) and "extension" (kuozhan), so it is also known as the Chinese character national standard extension code.
UTF is the abbreviation of "UCS Transformation Format", which can be translated as the Unicode character set transformation format.
Optionally, in one possible implementation of this embodiment, the acquiring unit 21 may obtain the first encoding format of the first text to be recognized in the page from information related to the page.
For example, from the page's META tag "<meta http-equiv="Content-Type" content="text/html; charset=gb2312">", the acquiring unit 21 can obtain the first encoding format of the first text to be recognized in the page as GB2312.
As another example, from the definition "@charset "UTF-8"" in the page's Cascading Style Sheet (CSS) file, the acquiring unit 21 can obtain the first encoding format of the first text to be recognized in the page as UTF-8.
As yet another example, the acquiring unit 21 can obtain the first encoding format of the first text to be recognized in the page from the website to which the page belongs. For instance, Baidu uses GB2312 encoding and Google uses UTF-8 encoding.
Optionally, in one possible implementation of this embodiment, the second encoding format may include, but is not limited to, Unicode. Unicode (known in Chinese as the universal code, international code, unified code, or single code) defines a unique code, that is, an integer, for each character rather than for each glyph, for example a unique binary code.
Specifically, when the converting unit 22 performs the first conversion, if a character in the first text has a corresponding character in the second encoding format, it is converted to that character. If a character in the first text has no corresponding character in the second encoding format, a preconfigured operation is performed, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
Specifically, when the converting unit 22 performs the second conversion, if a character in the second text has a corresponding character in the first encoding format, it is converted to that character. If a character in the second text has no corresponding character in the first encoding format, a preconfigured operation is performed, for example discarding the character or substituting a preset replacement character; this embodiment places no particular limitation on this.
Optionally, in one possible implementation of this embodiment, the determining unit 23 may be configured to compare the third text with the first text. If the third text is inconsistent with the first text, it can be determined that garbled text exists in the first text; if the third text is consistent with the first text, it can be determined that no garbled text exists in the first text.
Specifically, the determining unit 23 can use many methods to compare the two texts, that is, the third text and the first text.
For example, the determining unit 23 can match the two texts character by character to judge whether each pair of characters is the same.
As another example, the determining unit 23 can extract characteristic information of the third text and of the first text, for example a Message Digest Algorithm 5 (MD5) value, and compare the two pieces of characteristic information. If the characteristic information of the third text differs from that of the first text, the third text is inconsistent with the first text; if it is the same, the third text is consistent with the first text.
It should be noted that the device for recognizing garbled text in a page provided by this embodiment, for example a Web page editor, may be located in a local client to perform offline recognition, or in a network-side server to perform online recognition; this embodiment places no limitation on this.
It can be understood that the client may be an application installed on a terminal, or a web page in a browser; any form that can implement page processing is acceptable, and this embodiment places no limitation on this.
Existing recognition devices require an operator to inspect Web pages one by one to find out whether the text in each page is garbled. Checking pages for garbled text by hand, however, tends to cause two problems.
First, it is very inefficient, especially for somewhat larger websites, which can have hundreds of thousands of subpages that an operator cannot check one by one.
Second, manual recognition easily misses garbled text in a page, for example when the page contains a lot of text and only a little of it is garbled, which is hard for an operator to spot by eye.
With the technical solution provided by this embodiment, no operator needs to take part, the operation is simple, and accuracy is high.
In this embodiment, the acquiring unit obtains a first encoding format of a first text to be recognized in the page; the converting unit converts the first text into a second text in a second encoding format, according to the correspondence between characters in the second encoding format and characters in other encoding formats, and then converts the second text into a third text, according to the correspondence between characters in the second encoding format and characters in the first encoding format. The determining unit can then determine, from the third text and the first text, whether garbled text exists in the first text. No operator needs to take part in the recognition process, so the operation is simple and accurate, which improves the efficiency and reliability of garbled-text recognition.
In addition, with the technical solution provided by the application, garbled text appearing in the text of a page can be recognized automatically and in real time.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system, device, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in the application, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware, or in hardware plus software functional units.
An integrated unit implemented in the form of a software functional unit may be stored in a computer-readable storage medium. The software functional unit is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to perform some of the steps of the methods of the embodiments of the application. The storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.
Claims (8)
1. A method for identifying garbled text (messy code) in a page, characterized by comprising:
obtaining a first encoding format of a first text to be identified in the page;
converting the first text into a second text according to a correspondence between characters of a second encoding format and characters of encoding formats other than the second encoding format, the encoding format of the second text being the second encoding format;
converting the second text into a third text according to a correspondence between characters of the second encoding format and characters of the first encoding format;
determining, according to the third text and the first text, whether garbled text exists in the first text;
wherein the first encoding format is the GBK encoding format, the UTF-8 encoding format, or the GB2312 encoding format, and the second encoding format is the Unicode encoding format.
2. The method according to claim 1, characterized in that determining, according to the third text and the first text, whether garbled text exists in the first text comprises:
comparing the third text with the first text;
if the third text is inconsistent with the first text, determining that garbled text exists in the first text; or
if the third text is consistent with the first text, determining that no garbled text exists in the first text.
3. The method according to claim 2, characterized in that comparing the third text with the first text comprises:
extracting characteristic information of the third text and characteristic information of the first text;
comparing the characteristic information of the third text with the characteristic information of the first text;
if the characteristic information of the third text differs from the characteristic information of the first text, the third text is inconsistent with the first text; or
if the characteristic information of the third text is identical to the characteristic information of the first text, the third text is consistent with the first text.
4. The method according to any one of claims 1 to 3, characterized in that the characteristic information comprises an MD5 value.
5. A device for identifying garbled text (messy code) in a page, characterized by comprising:
an acquiring unit, configured to obtain a first encoding format of a first text to be identified in the page;
a converting unit, configured to convert the first text into a second text according to a correspondence between characters of a second encoding format and characters of encoding formats other than the second encoding format, the encoding format of the second text being the second encoding format;
the converting unit being further configured to convert the second text into a third text according to a correspondence between characters of the second encoding format and characters of the first encoding format;
a determination unit, configured to determine, according to the third text and the first text, whether garbled text exists in the first text;
wherein the first encoding format is the GBK encoding format, the UTF-8 encoding format, or the GB2312 encoding format, and the second encoding format is the Unicode encoding format.
6. The device according to claim 5, characterized in that the determination unit is specifically configured to compare the third text with the first text;
if the third text is inconsistent with the first text, determine that garbled text exists in the first text; or
if the third text is consistent with the first text, determine that no garbled text exists in the first text.
7. The device according to claim 6, characterized in that the determination unit is specifically configured to extract characteristic information of the third text and characteristic information of the first text;
compare the characteristic information of the third text with the characteristic information of the first text;
if the characteristic information of the third text differs from the characteristic information of the first text, the third text is inconsistent with the first text; or
if the characteristic information of the third text is identical to the characteristic information of the first text, the third text is consistent with the first text.
8. The device according to any one of claims 5 to 7, characterized in that the characteristic information comprises an MD5 value.
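The round-trip check recited in the claims can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `md5_hex` and `has_garbled_text` are hypothetical, the first text is assumed to arrive as raw bytes together with the name of its first encoding format, Python's built-in codecs stand in for the character correspondence tables, and undecodable bytes are assumed to be substituted (`errors="replace"`) rather than raising an error.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 value of the bytes, used here as the characteristic information."""
    return hashlib.md5(data).hexdigest()


def has_garbled_text(first_text: bytes, first_encoding: str) -> bool:
    """Round-trip the first text through Unicode and compare MD5 values.

    Hypothetical helper: first_encoding is assumed to be a Python codec
    name such as "gbk", "utf-8", or "gb2312".
    """
    # First text -> second text: first encoding format to Unicode.
    # Bytes with no valid mapping become U+FFFD instead of raising.
    second_text = first_text.decode(first_encoding, errors="replace")

    # Second text -> third text: Unicode back to the first encoding format.
    third_text = second_text.encode(first_encoding, errors="replace")

    # Differing MD5 values mean the round trip changed the bytes,
    # which is treated as garbled text being present.
    return md5_hex(third_text) != md5_hex(first_text)


if __name__ == "__main__":
    gbk_bytes = "中文".encode("gbk")
    print(has_garbled_text(gbk_bytes, "gbk"))    # False: bytes survive the round trip
    print(has_garbled_text(gbk_bytes, "utf-8"))  # True: GBK bytes are not valid UTF-8
```

In this sketch a round trip flags only bytes that have no valid mapping in the declared encoding; comparing the MD5 values rather than the full byte strings mirrors the characteristic-information comparison of claims 3 and 4.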
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201310737443.4A CN104750663B (en) | 2013-12-27 | 2013-12-27 | The recognition methods of text messy code and device in the page |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201310737443.4A CN104750663B (en) | 2013-12-27 | 2013-12-27 | The recognition methods of text messy code and device in the page |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN104750663A CN104750663A (en) | 2015-07-01 |
| CN104750663B true CN104750663B (en) | 2019-05-28 |
Family
ID=53590375
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201310737443.4A Active CN104750663B (en) | 2013-12-27 | 2013-12-27 | The recognition methods of text messy code and device in the page |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN104750663B (en) |
Families Citing this family (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105279247A (en) * | 2015-09-30 | 2016-01-27 | 北京奇虎科技有限公司 | Expression library generation method and device |
| CN106598689A (en) * | 2016-12-20 | 2017-04-26 | 绿金在线电子商务有限公司 | Universal Chinese coding method |
| CN108271041B (en) * | 2016-12-30 | 2021-01-22 | 北京国双科技有限公司 | Garbled code processing method and device |
| CN110728115B (en) * | 2018-07-17 | 2024-01-26 | 珠海金山办公软件有限公司 | Method, device and electronic equipment for identifying garbled characters in document content |
| CN111259628B (en) * | 2020-02-18 | 2021-09-28 | 北京金堤科技有限公司 | Webpage information extraction method and device, electronic equipment and storage medium |
| CN113595683A (en) * | 2021-07-07 | 2021-11-02 | 西安震有信通科技有限公司 | Conversion processing method, device, terminal and medium based on various encoding files |
| CN115348232B (en) * | 2022-08-10 | 2024-04-19 | 中国建设银行股份有限公司 | Decoding method, decoding device, electronic equipment, medium and product |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101110072A (en) * | 2007-08-21 | 2008-01-23 | 无敌科技(西安)有限公司 | Device and method for automatic identifying literal code |
| CN101551792A (en) * | 2008-04-03 | 2009-10-07 | 鸿富锦精密工业(深圳)有限公司 | Messy code recovery system and method |
| JP2010128672A (en) * | 2008-11-26 | 2010-06-10 | Kyocera Corp | Electronic apparatus and character conversion method |
| CN103150293A (en) * | 2011-12-06 | 2013-06-12 | 富泰华工业(深圳)有限公司 | Electronic device with messy code recovery function and messy code recovery method |
Also Published As
| Publication number | Publication date |
|---|---|
| CN104750663A (en) | 2015-07-01 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN104750663B (en) | The recognition methods of text messy code and device in the page | |
| US10620945B2 (en) | API specification generation | |
| US11055373B2 (en) | Method and apparatus for generating information | |
| JP2023036681A (en) | Task processing method, processing device, electronic equipment, storage medium, and computer program | |
| CN101526963A (en) | Method for identifying web page coding, device and terminal equipment | |
| CN112269862B (en) | Text role labeling method, device, electronic device and storage medium | |
| US20190163699A1 (en) | Method and apparatus for information interaction | |
| WO2016124074A1 (en) | Information processing method, client, server and computer storage medium | |
| CN104112002A (en) | Form adaption method, device and system | |
| CN110007906B (en) | Script file processing method, device and server | |
| WO2014154033A1 (en) | Method and apparatus for extracting web page content | |
| CN108595468A (en) | A kind of acquisition methods of web data, device, server, terminal and system | |
| CN109492177B (en) | web page blocking method based on web page semantic structure | |
| CN107153716B (en) | Webpage content extraction method and device | |
| CN104267953A (en) | Control and method for importing Word test questions based on browser | |
| CN104978325B (en) | A kind of web page processing method, device and user terminal | |
| CN109828759A (en) | Code compiling method, device, computer installation and storage medium | |
| CN106294480A (en) | A file format conversion method, device and test question importing system | |
| CN118170378A (en) | Page generation method, device, electronic device, storage medium and program product | |
| US20180205630A1 (en) | System and method for automated generation of web decoding templates | |
| CN114398138A (en) | Interface generation method and device, computer equipment and storage medium | |
| CN119127152A (en) | API code generation method, device, equipment, medium and program product | |
| CN111783482A (en) | Text translation method and device, computer equipment and storage medium | |
| CN102073694A (en) | Original translated text multi-page checking method | |
| CN113792232B (en) | Page feature calculation method, page feature calculation device, electronic equipment, page feature calculation medium and page feature calculation program product |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| C06 | Publication | ||
| PB01 | Publication | ||
| C10 | Entry into substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant | ||
| TR01 | Transfer of patent right | ||
| TR01 | Transfer of patent right | Effective date of registration: 2024-04-02. Patentee before: ALIBABA GROUP HOLDING Ltd. (country or region: Cayman Islands; address: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands). Patentee after: Alibaba Singapore Holdings Ltd. (country or region: Singapore; address: Singapore). |