Summary of the invention: The present invention addresses these problems. Its purpose is to classify files and to query them by content without opening the files to examine their content, and to make the query system simpler to use. The technique supplements established file query systems: it uses classification to improve the recall and precision of file queries, remedies the shortcomings of traditional file query techniques, enables subject-oriented and content-oriented queries, uncovers data hidden from ordinary query systems, and proposes a new mode of application for file queries.
The content and technical scheme of the present invention are as follows:
The file search method of the present invention comprises three classified query methods, based on file extension, directory and query frequency respectively. Together they constitute a complete classification-based file search technique.
1. Classified query by file format, based on the filename
To analyze the types of match strings that users submit, we collected statistics on the match strings entered in 840,000 queries to an FTP search engine and obtained the distribution of match-string types shown in Fig. 1. In the figure, I is the proportion of queries consisting of a single keyword, II the proportion consisting of an extension only, and III the proportion giving a full filename. As Fig. 1 shows, most users enter only a keyword and cannot supply a specific extension. For ordinary users, extensions are hard to understand. A movie file, for example, may have the extension ".rm", ".mpeg", ".dat" and so on; requiring an extension just to find a film would deter ordinary users from the query system. However, if the user gives no extension and the whole database is searched, many results do not meet the user's needs. A query for the download address of a program may, for instance, return the download address of that program's source code, so precision is low. What an ordinary user usually wants is therefore a file of a certain type rather than a file with a particular extension: a user looking for music may not care whether the file is ".mp3" or ".au". Even a user who knows the extensions must specify several of them to find every download address of a song, or many addresses may be missed; this is tedious for the user and awkward to implement.
To relieve ordinary users of the burden of remembering extensions, and to allow file queries within a broad class, all files can be divided into a few simple file format types. When querying, the user specifies only the file format type needed, not a concrete extension. By common sense, the file format types can be divided into broad classes such as image, sound, video, compressed archive, document, program, source code, directory and "other". The query system assigns each file format class a number and defines a large set of "well-known extensions" belonging to it. Because the file format is distinguished by the file's extension, and the query system cannot open each file to detect its actual format, the "well-known extensions" serve as the criterion for file format classification. A "well-known extension" reflects the common understanding of which type of file an extension denotes; for example, ".doc", ".ppt", ".txt" and ".pdf" belong to the document type. If a file uses ".doc" as its extension but is not in the generally accepted ".doc" format, the system does not consider this case. Where an extension belongs to several classes, its most common class is used. When the query system obtains a file entry, it derives the file format class from the extension and stores it in the attributes of the file entry. When a user queries for files of a specified file format type, the type number selected by the user is compared with the type number in the file attributes, and the results of filename matching are filtered down to those that also belong to the specified file format type. File format classification is illustrated in Fig. 2, in which I shows some files before classification by file format and II shows the three classes into which they are divided afterwards: music, video and document.
In short, the file format classified query method based on file extension uses the extension of the filename as the criterion for classifying files. Files are divided into format types, each corresponding to a number of extensions; the file format types include document, video, audio, image, program, source code and directory. An extension is assigned the file type it is generally understood to denote; where an extension belongs to several classes, its most common class is used. When the query system obtains a file entry, it derives the file format class from the extension and stores it in the attributes of the file entry. When a user queries for files of a specified file format type, the type number selected by the user is compared with the type number in the file entry attributes, and the results of filename matching are filtered down to those whose file format type also matches.
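The classification and filtering steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the type numbers, the small extension table, and the names `classify`, `make_entry` and `search` are hypothetical and do not come from the original system, and substring matching stands in for whatever filename matching the query system actually uses.

```python
import os

# Hypothetical type numbers; the text only says each class is given a number.
TYPE_IDS = {"other": 0, "image": 1, "audio": 2, "video": 3, "document": 4}

# A tiny, illustrative table of "well-known extensions" (one class per extension).
WELL_KNOWN_EXTENSIONS = {
    "document": [".doc", ".ppt", ".txt", ".pdf"],
    "audio": [".mp3", ".au"],
    "video": [".rm", ".mpeg", ".dat"],
    "image": [".jpg"],
}
EXTENSION_TO_TYPE = {
    ext: TYPE_IDS[name]
    for name, extensions in WELL_KNOWN_EXTENSIONS.items()
    for ext in extensions
}


def classify(filename):
    """Map a filename to a type number using only its extension."""
    extension = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_TYPE.get(extension, TYPE_IDS["other"])


def make_entry(filename):
    """Build a file entry whose attributes store the type number at index time."""
    return {"name": filename, "type": classify(filename)}


def search(entries, keyword, type_id=None):
    """Match the keyword against filenames, then filter by type number if given."""
    keyword = keyword.lower()
    hits = [e for e in entries if keyword in e["name"].lower()]
    if type_id is not None:
        hits = [e for e in hits if e["type"] == type_id]
    return hits
```

Calling `search(entries, "luxun", TYPE_IDS["document"])` on entries built with `make_entry` would keep only the keyword hits whose stored type number is the document type, whatever their concrete extension. Storing the type number in the entry at index time means the query only compares integers and never looks at extensions again.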
2. Classified query by file content, based on the directory
In a query system based on filenames, it is impractical to read (or download) every file to analyze its content, so all analysis of file content must rely on the filename. Although filenames generally reflect the content of a file, we found that the filenames of many multimedia files (video, audio and image files) do not. Video files are often named a.rm, b.rm rather than "movie name.rm". For audio files, on the one hand the filename may be the song title without the singer's name, whereas a user may need both, since querying all songs by a singer is very common; on the other hand, as with movie files, audio ripped from a CD may be named track0.mp3, track1.mp3 and so on, names that say nothing about the music. Image files are often named with numbers, for example 1.jpg, 2.jpg and so on, because images usually come in series, and giving each of many similar images an individually meaningful name is very troublesome. We analyzed the filenames of 8,642,123 multimedia files and obtained the filename characteristics shown in Table [1]. As Table [1] shows, this property of multimedia filenames hinders normal multimedia file queries.
Table [1]
| File type (by file format class) | Proportion of all files | Proportion with filenames that do not reflect file content |
| Video file | 1.08% | 7.35% |
| Audio file | 14.99% | 2.17% |
| Image file | 8.16% | 6.73% |
To solve the problem of filenames that do not reflect file content, first consider the role of file system directories. Most operating systems adopt a tree-structured directory because it provides a powerful means of classification: the name of each directory reflects the content or attributes of the files and subdirectories beneath it. In particular, for directories containing multimedia files of the kind described above, the directory name generally reflects the content of the multimedia files in it. But although the parent directory of a multimedia file generally reflects the file's content, a keyword query may return many directory names, and the user must enter the directories one by one to learn whether the files inside are really what is wanted. This slow operation cancels out the fast query capability of a search engine. How can the user determine whether a directory holds the required files without entering each one? The solution is to match the query string against the name of the directory containing a multimedia file together with the filename, which resolves the problem easily. Using the file format type numbers produced by the file format classification above, the filename of each file entry of audio, video or image type is merged with the name of its parent directory, if any, and treated as a single unit. The query system uses this unit when building the index, when processing user queries and when displaying results; when the download link is finally output, the correctness of the link must of course still be guaranteed. Classification by file format and file content is illustrated in Fig. 3, in which I shows an FTP file list before classification by file format and file content and II the file list after classification: for video and audio files, the filename is merged with the parent directory name, and the file path is stored separately as a file attribute.
In the classified query by file content based on the directory, the filename of a file and the name of its parent directory are merged and used as a single unit for querying; a hit occurs when either the filename or the parent directory name matches the match string. This method is used for queries on multimedia files, including audio, video and image files: when a user searches for such files, the name of the directory containing the file is matched against the query string together with the filename. The query system treats the filename and its parent directory name as a single unit when building the index, when querying and when displaying results.
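The merging of a multimedia filename with its parent directory name can be sketched as follows. This is a minimal illustration, assuming FTP-style paths with "/" separators and type names given as plain strings; the names `index_text` and `search` and the sample paths are hypothetical, and substring matching stands in for the real match procedure.

```python
import posixpath

# File format types for which the parent directory name is merged in.
MULTIMEDIA_TYPES = {"audio", "video", "image"}


def index_text(path, file_type):
    """Return the text that is indexed and matched for one file entry.

    For multimedia files the parent directory name and the filename are
    merged into one unit; for all other files only the filename is used.
    """
    directory, name = posixpath.split(path)
    parent = posixpath.basename(directory)
    if file_type in MULTIMEDIA_TYPES and parent:
        return parent + "/" + name
    return name


def search(entries, keyword):
    """Return the full paths of entries whose indexed text contains the keyword.

    The full path is kept separately, so the download link stays correct.
    """
    keyword = keyword.lower()
    return [
        path
        for path, file_type in entries
        if keyword in index_text(path, file_type).lower()
    ]
```

With entries such as `("/pub/tv/Tokyo Love Story/tls01.rm", "video")`, a search for "tokyo love story" would return the video file itself even though its filename does not contain the keyword, while a document in a directory of the same name would not be matched through its directory.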
3. Classified query based on query frequency
Naive users with no knowledge of searching often issue poor search requests that cannot return the information they need, yet they make up the overwhelming majority of Internet users, and this has never changed. Analysis of user query logs leads to the conclusion that most users think: I cannot express what I am looking for, but I will recognize it when I see it. A search engine that offers only an input box and many complicated forms may leave ordinary users at a loss. One characteristic of an FTP search engine is that the range of keywords users search for is fairly limited: of the more than 90,000 queries in our statistics, only some 5,000 were distinct. If popular queries are made into shortcuts, so that a single click returns the query results for, say, a piece of software, then the user no longer has to tell the search engine what he wants; instead, the search engine tells the user what is available.
A shortcut is defined as the URL link of a query, identified by a name. Once the search engine has the file format classification and file content classification functions, a system of query shortcuts becomes feasible, because a shortcut that makes full use of both capabilities yields very accurate and comprehensive results.
As the number of shortcuts grows, presenting all of them to the user would make any one shortcut very hard to find, so the shortcuts must be classified. A two-level query classification is appropriate. The first level resembles the file format classification, for example film, music, program and document. The second level classifies by content within the first-level class: action, romance and so on under film; system, compression, games and so on under program. Once this two-level shortcut system is in place, users and administrators add frequently issued queries to each class as shortcuts. A CGI program records the number of clicks on each shortcut, and the shortcut entries of a class are output in descending order of clicks, so the user can see the current ranking of, for example, software in that class. The shortcuts under some classes default to a specific file format; for example, shortcuts in the film class default to the video file format type. Shortcuts are thus combined automatically with the file classification function, which ensures their accuracy. The logic of the shortcut system is illustrated in Fig. 4, in which 1 is the display of the shortcut list, 2 is the display of the shortcuts within a class, 3 is a query through the query URL corresponding to a shortcut, 4 is a user registering a new shortcut, 5 is the administrator filtering user-registered shortcuts, 6 is the administrator managing existing shortcuts, and 7 is the shortcut database.
In the classified query based on query frequency, commonly used query URLs are classified at two levels: the first level is the file format class, and the second level classifies by content within that class. A program can also record the number of clicks on each shortcut, so that when all shortcut entries of a class are displayed they are output in order of clicks, which provides a query ranking for that class.
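The two-level shortcut classification with click counting might look like the sketch below. It is an in-memory illustration only: the original system uses a CGI program and a shortcut database, whereas the class `ShortcutDirectory`, its method names and the example query URLs here are assumptions made for the example.

```python
from collections import defaultdict


class ShortcutDirectory:
    """Shortcuts grouped by (first-level class, second-level class)."""

    def __init__(self):
        # (level1, level2) -> {shortcut name: {"url": ..., "clicks": ...}}
        self._shortcuts = defaultdict(dict)

    def add(self, level1, level2, name, query_url):
        """Register a shortcut: a name that stands for one query URL."""
        self._shortcuts[(level1, level2)][name] = {"url": query_url, "clicks": 0}

    def click(self, level1, level2, name):
        """Record one click and return the query URL to run."""
        shortcut = self._shortcuts[(level1, level2)][name]
        shortcut["clicks"] += 1
        return shortcut["url"]

    def ranking(self, level1, level2):
        """List (name, clicks) for one class in descending order of clicks."""
        items = self._shortcuts.get((level1, level2), {})
        ranked = sorted(items, key=lambda n: items[n]["clicks"], reverse=True)
        return [(name, items[name]["clicks"]) for name in ranked]
```

A shortcut in the film class would carry a query URL that already fixes the video file format type (for instance a hypothetical `/search?q=...&type=video`), so the file format restriction is applied without the user choosing it. The user-registration and administrator-filtering steps of Fig. 4 are left out of this sketch.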
Of the three methods above, the second, the classified query by file content based on the directory, can be used alone or in combination with the other two for finding multimedia files. That is, queries can be classified by directory-based file content: when the user specifies a search for multimedia files, the query system returns the files whose filename, or whose parent directory name, matches the query keyword.
The other two methods, the file format classified query based on the filename and the classified query based on query frequency, can also be used in combination. Using the file format classification based on file extension, the user enters two search conditions, a filename keyword and a file format, and the query system outputs the files that satisfy both. Using the classification based on query frequency, the user can also select the required file in each class from the list of frequently searched files that the query system provides, ordered by query frequency.
Embodiment:
The invention is further described below with reference to an embodiment.
The "Tianwang" ("day net") FTP search engine has been a project of Peking University's computer science and technology department in the field of networks and distributed systems since 1999. At present, Peking University's "Tianwang" FTP search engine covers more than 3,000 sites nationwide, holds 13,000,000 FTP file entries, and is a powerful FTP search engine that uses the classified file search technique based on filename, directory and query frequency. The average query currently takes about 200 milliseconds, and the engine serves about 100,000 queries per day, a number that keeps rising.
1. Results of the classified query by file format, based on the filename
In the query of Fig. 5, the user entered only the keyword "Lu xun" and chose to search within the document type; the results are files in various formats (.txt, .doc and .htm) whose filenames contain "Lu xun". That is, the user obtains the desired results by searching within a type, without specifying an extension. Had the user not specified a type, many of the results might not be what was wanted, the user would have to page through them to find files of the desired type, and precision would be low. In this example the user usually does not care whether a file is in .txt or .doc format; if the system relied on the user to supply an extension, the results might not include all the files with similar content.
2. Classified query by file content, based on the directory
In the query of Fig. 6, the user entered the keyword "Tokyo Love Story". Most filenames in the results do not contain "Tokyo Love Story" but have the form tls0?.rm; that is, the filenames do not reflect the file content. Because the parent directory name contains "Tokyo Love Story", the directory-based file content classified query allows these files to be found. Otherwise, the user might see only some directories containing "Tokyo Love Story" and would have to enter each one to learn whether the files in it are what is wanted.
3. Classified query based on query frequency
Fig. 7 and Fig. 8 show, respectively, the class page of the query classification and the shortcut page of one class (the "swordsman" class under "film, cartoon"). The class page makes it easy to find the shortcuts of a particular class; the shortcut page lists commonly used queries, and the user obtains the results with a single click, without typing anything.
The advantages and positive effects of the present invention are:
Compared with existing filename-oriented query techniques, the classified file search technique based on filename, directory and query frequency has the following advantages and positive effects:
1. The precision of the file query system is greatly improved. With the file format classified query based on the filename, one general-purpose file search engine in effect becomes several topical search engines. The user can search within a specified type without worrying about extensions. Especially when filename matching returns a very large number of results, showing only the results of one type greatly reduces the number of pages the user must turn and improves query efficiency. For example, a filename-oriented query for documentation on C++builder returns 237 hits when no extension is specified and only 7 hits when the .doc extension is specified. With the file format classified query restricted to the document type, there are 19 hits: the results contain no superfluous files (such as C++builder program files) and yet include the needed documents in all their formats.
2. The recall of the query system is improved. With the file format classified query based on the filename and the file content classified query based on the directory, the number of hits in multimedia searches rises substantially, and many files named with numbers or sequence numbers can be found. Queries for TV series, singers, albums and picture collections all become convenient and intuitive. This improvement turns the query system from a general file query system into a powerful tool that centers on multimedia queries, chiefly for films and music, while retaining general queries.
3. The query system becomes simpler and easier to use. Classifying queries and setting up the shortcut system greatly encourages ordinary users to use the file query system. Because the query classification is built on the file format and file content classification techniques, the various complex query options (file format type, size limits and so on) are hidden in the query URL of each shortcut. For the many users who do not know exactly what they want (for example, someone who wants to watch an action film and does not mind which one), or who are unsure of the name of the software they want (for example, someone looking for NetAnts who does not know that the software is named netant), using the query system becomes a matter of choosing rather than specifying. Once shortcuts are in use, they account for the majority of all queries, because the shortcuts the system provides cover the queries most users need. And because the match strings of shortcuts are fixed, the cache hit rate of a query system with caching rises greatly: most queries obtain their results from the cache in a very short time, which further improves query efficiency.
4. It upgrades and importantly supplements filename-oriented query techniques. The classified file search technique based on filename, directory and query frequency does not replace traditional file query techniques; it upgrades and supplements them, because it does not propose a new way of matching filenames but builds on filename matching and attribute filtering. An existing filename-oriented query system can be turned into a classification-based query system through partial modification and addition, while keeping the old filename-oriented query function. The technique gives a filename-oriented query system the ability to serve subject-oriented queries and to uncover hidden data, and the manually maintained query classification designed for ordinary users makes the query system more accessible and more readily accepted by users.
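The caching effect mentioned under advantage 3 can be illustrated with a small sketch: when most queries come from shortcuts, the same match strings recur, so a result cache is hit most of the time. The example below uses Python's standard `functools.lru_cache` purely as a stand-in for whatever cache the query system uses; the file list and the function name `cached_search` are hypothetical.

```python
from functools import lru_cache

# Toy file list standing in for the real index (hypothetical data).
FILES = ("netants.zip", "netants_src.tar", "luxun.txt")


@lru_cache(maxsize=1024)
def cached_search(match_string):
    """Run the (here trivial) filename match; repeated strings hit the cache."""
    return tuple(name for name in FILES if match_string in name)
```

If the same shortcut is clicked five times, `cached_search` computes the result once and serves the remaining four calls from the cache, which `cached_search.cache_info()` reports as one miss and four hits.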
The present invention can be applied in related fields such as FTP search engines, MP3 search tools, local file search and library resource retrieval.