CN103218453A - Method and device for splitting file - Google Patents
Method and device for splitting file
- Publication number
- CN103218453A
- Authority
- CN
- China
- Prior art keywords
- file
- directory
- content
- splitting
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and a device for splitting a file. The device comprises a pre-processing module, a positioning module, a searching module and a splitting module. The pre-processing module pre-processes the directory structure of the original file; the positioning module locates the directory order and directory level of the original file; the searching module searches the content file for the content to be cut and locates the position of the directory entry at which that content begins; and the splitting module cuts the extracted content of each sub-directory, pastes the cut content into a new empty file, and stores it in a database in the form of a web page. The device splits files quickly and with good results, and avoids the problems of disordered file layout and system crashes.
Description
Technical field
The present invention relates to management technology for large data objects and belongs to the field of intelligent information processing within the discipline of computer science and technology.
Background art
A key technique in managing large data objects is how to split a data file so that it can be managed and searched intelligently. File splitting methods usually split sequentially, but for a large data object, for example one whose size reaches the gigabyte (GB) range, sequential splitting is very inefficient and can even exhaust system memory and crash the system. This is because the data file must be opened, copied, pasted, saved, uploaded and otherwise operated on in memory, which consumes a large amount of system memory.
Summary of the invention
The technical problem to be solved by the present invention is that existing file splitting methods are very inefficient and can even exhaust system memory and crash the system.
To solve the above technical problem, the present invention adopts the following technical solution: a file splitting method comprising the following steps: 1) pre-processing the directory structure of the file to standardize it; 2) locating the file directory using a two-pointer technique and obtaining the number of directory entries in the data file; 3) cutting successively from the end of the file toward the beginning of the document, extracting segments directory entry by directory entry according to the order of the file directory, cutting the content of each sub-directory, pasting the cut content into a new empty file, and saving it in a database in the form of a web page.
The traditional file splitting method splits sequentially: it extracts segments directory entry by directory entry in the order of the file directory, cuts the content of each sub-directory, pastes the cut content into a new empty file, and saves it in a database in the form of a web page. The present invention splits the data file according to the file's directory structure, takes the sub-directory as the basic unit for storing and managing the file, and performs cut operations on the file. The advantage of cutting is that, as splitting proceeds, the memory occupied by the original file gradually shrinks and splitting becomes steadily faster. In addition, the present invention works in reverse order, cutting from the end of the file, so that the remaining file content is never moved. This avoids the disordered file layout and system crashes that sequential splitting may cause, and yields a satisfactory splitting result.
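The benefit of cutting from the tail can be illustrated with a small sketch. It is not taken from the patent: the document is modelled as a Python list of paragraphs, `heading_idx` is a hypothetical list of heading positions computed once up front, and both function names are made up for the illustration. Cutting in reverse keeps every not-yet-used position valid, whereas cutting forward with the same precomputed positions goes wrong because each cut shifts the content behind it.

```python
def cut_reverse(paragraphs, heading_idx):
    """Cut segments from the tail; earlier positions never shift."""
    doc = list(paragraphs)
    pieces = []
    for start in reversed(heading_idx):
        pieces.append(doc[start:])   # everything from this heading to the current end
        del doc[start:]              # the "cut": the original shrinks from the tail
    pieces.reverse()                 # restore document order for readability
    return pieces, doc


def cut_forward_stale(paragraphs, heading_idx):
    """Cut segments from the front while reusing the precomputed positions."""
    doc = list(paragraphs)
    bounds = list(heading_idx) + [len(doc)]
    pieces = []
    for start, end in zip(bounds, bounds[1:]):
        pieces.append(doc[start:end])
        del doc[start:end]           # later content moves, so later bounds are stale
    return pieces, doc


sample = ["H1", "a", "b", "H2", "c", "H3", "d"]
positions = [0, 3, 5]
good_pieces, good_rest = cut_reverse(sample, positions)
bad_pieces, bad_rest = cut_forward_stale(sample, positions)
```

With the sample above, the reverse version returns the three sections intact and leaves nothing behind, while the forward version with stale positions mis-cuts the second and third sections. A forward splitter can of course recompute positions after every cut; the sketch only shows why the reverse order makes that bookkeeping unnecessary.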
As an improvement of the present invention, the two pointers in step 2) are the pointer Count and the pointer Catalog(). Count is the sequence number of the directory entry in the file, and its initial value is the largest directory number in the file. Catalog() is an array that holds the directory level of the directory entry being split.
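One possible way to hold the two pointers is sketched below. The patent does not say how Catalog() is indexed, so the layout here is an assumption: the level of each entry is stored under that entry's sequence number, and the class name `SplitPointers` and the method `record_split` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SplitPointers:
    """Count and Catalog() as described in the text (the indexing is an assumption)."""
    count: int                                   # starts at the largest directory number
    catalog: List[int] = field(default_factory=list)

    def __post_init__(self):
        if not self.catalog:
            # Slot 0 is unused so that sequence numbers can be used as indices.
            self.catalog = [0] * (self.count + 1)

    def record_split(self, level: int) -> int:
        """Store the level of the entry being split, then move Count to the previous entry."""
        seq = self.count
        self.catalog[seq] = level
        self.count -= 1
        return seq


pointers = SplitPointers(count=3)
first = pointers.record_split(level=1)   # last entry in the file, a first-level entry
second = pointers.record_split(level=2)  # the entry before it, a second-level entry
```

Because splitting runs from the end of the file, Count counts down, so the value left in Count at any moment is the number of entries still waiting to be split.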
A file splitting device comprises: a pre-processing module for pre-processing the directory structure of the original file to standardize it; a positioning module for locating the directory order and directory level of the original file; a searching module for searching the content file for the file content to be cut and locating the position of the directory entry at which that content begins; and a splitting module for cutting the content of each extracted sub-directory segment, pasting the cut content into a new empty file, and saving it in a database in the form of a web page.
The advantages of the present invention are that splitting is fast, it does not cause disordered file layout or system crashes, and the splitting result is good.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the present invention.
Detailed description
The device of the present invention comprises the following modules:
A pre-processing module for pre-processing the directory structure of the original file to standardize it;
A positioning module for locating the directory order and directory level of the original file;
A searching module for searching the content file for the file content to be cut and locating the position of the directory entry at which that content begins;
A splitting module for cutting the content of each extracted sub-directory segment, pasting the cut content into a new empty file, and saving it in a database in the form of a web page.
The method of the present invention comprises the following steps:
1) Pre-process the directory structure of the file to standardize it;
2) Locate the file directory using a two-pointer technique and obtain the number of directory entries in the data file. The two pointers are the pointer Count and the pointer Catalog(). Count is the sequence number of the directory entry in the file, and its initial value is the largest directory number in the file. Catalog() is an array that holds the directory level of the directory entry being split;
3) Cut successively from the end of the file toward the beginning of the document, extracting segments directory entry by directory entry according to the order of the file directory; cut the content of each sub-directory, paste the cut content into a new empty file, and save it in a database in the form of a web page.
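The final part of step 3), saving each cut segment as a web page in a database, could look roughly like the following. The patent names neither a database nor a page layout, so everything here is an assumption made for illustration: SQLite via Python's standard `sqlite3` module, a hypothetical table `pieces`, and a minimal HTML page whose title is the heading paragraph.

```python
import html
import sqlite3


def open_store(path=":memory:"):
    """Open the database and create the (hypothetical) pieces table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pieces ("
        "seq INTEGER PRIMARY KEY, level INTEGER NOT NULL, page TEXT NOT NULL)"
    )
    return conn


def to_web_page(paragraphs):
    """Render one cut segment as a minimal HTML page; the first paragraph is the heading."""
    title = html.escape(paragraphs[0]) if paragraphs else ""
    body = "\n".join("<p>" + html.escape(p) + "</p>" for p in paragraphs[1:])
    return (
        "<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\">"
        "<title>" + title + "</title></head>\n"
        "<body><h1>" + title + "</h1>\n" + body + "\n</body></html>"
    )


def save_piece(conn, seq, level, paragraphs):
    """Store one segment under its directory sequence number and level."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO pieces (seq, level, page) VALUES (?, ?, ?)",
            (seq, level, to_web_page(paragraphs)),
        )


store = open_store()
save_piece(store, seq=2, level=1, paragraphs=["2 Method", "a < b"])
saved_level, saved_page = store.execute(
    "SELECT level, page FROM pieces WHERE seq = 2"
).fetchone()
```

Escaping the text with `html.escape` keeps characters such as `<` from being interpreted as markup when the stored page is later displayed.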
As shown in Fig. 1, the present invention can directly split a Word file that has a standard directory structure. If the directory structure of the Word file is non-standard, it must first be pre-processed and standardized before the method of the present invention is used to split it. The method of the present invention uses a two-pointer technique to locate the file directory. The pointer Count is the sequence number of the directory entry in the file, and its initial value is the largest directory number in the file. The pointer Catalog() is an array that holds the directory level of the entry to be split, such as first-level directory or second-level directory. Start is the starting position of the content to be split, and End is its ending position. The splitting flow of the present invention is as follows:
Step 1: Obtain the number of directory entries in the data file and assign it to the pointer Count;
Step 2: Set the split start position Start to the initial value 0;
Step 3: Set the split end position End to its initial value, pointing to the end of the file;
Step 4: Obtain the number of paragraphs in the data file, add one to it, and assign the result to the variable i;
Step 5: Subtract one from i and assign the result back to i;
Step 6: Determine whether paragraph i is a directory entry of the file; if so, go to step 7; if not, go to step 5;
Step 7: Obtain the starting position of paragraph i and assign it to Start;
Step 8: Cut the content between Start and End;
Step 9: Save the cut content as a web page and store it in the database;
Step 10: Subtract one from the directory count and store the result in the pointer Count;
Step 11: Save Count to the database;
Step 12: Save the level of this directory entry in the array Catalog() and save it to the database;
Step 13: Set End = Start;
Step 14: Determine whether i is greater than 0; if so, go to step 5; if not, go to step 15;
Step 15: The algorithm ends.
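The fifteen steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the patent's implementation: the document is a list of paragraph strings rather than a Word file, a directory entry is recognised by a hypothetical numbering pattern such as "1" or "1.1" at the start of a paragraph, positions are 0-based, and the database writes of steps 9, 11 and 12 are replaced by appending to a list. The loop condition also folds the bounds check of step 14 into the scan of steps 5 and 6, so the scan cannot run past the first paragraph.

```python
import re

# Assumed heading format: "1 Title", "1.1 Title", "2.3.1 Title", ...
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+\S")


def heading_level(paragraph):
    """Return the directory level of a paragraph, or 0 if it is not a directory entry."""
    match = HEADING.match(paragraph)
    return match.group(1).count(".") + 1 if match else 0


def split_from_tail(paragraphs):
    doc = list(paragraphs)
    count = sum(1 for p in doc if heading_level(p))  # Step 1
    end = len(doc)                                   # Step 3: End points at the file end
    i = len(doc)                                     # Step 4: one past the last paragraph
    records = []                                     # stands in for the database
    while i > 0:                                     # Step 14
        i -= 1                                       # Step 5
        level = heading_level(doc[i])                # Step 6
        if not level:
            continue
        start = i                                    # Step 7
        piece = doc[start:end]                       # Step 8: cut Start..End
        del doc[start:end]
        records.append((count, level, piece))        # Steps 9, 11, 12
        count -= 1                                   # Step 10
        end = start                                  # Step 13
    return records, doc                              # Step 15


text = ["Preface", "1 Intro", "text a", "1.1 Scope", "text b", "2 Method", "text c"]
records, remainder = split_from_tail(text)
```

Each record carries the sequence number taken from Count, the level that would go into Catalog(), and the cut paragraphs. Content that precedes the first directory entry, such as the "Preface" paragraph in the sample, is left in the original; the source does not say how such content should be handled.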
Claims (3)
1. A file splitting method, characterized by comprising the following steps:
1) pre-processing the directory structure of the file to standardize it;
2) locating the file directory using a two-pointer technique and obtaining the number of directory entries in the data file;
3) cutting successively from the end of the file toward the beginning of the document, extracting segments directory entry by directory entry according to the order of the file directory, cutting the content of each sub-directory, pasting the cut content into a new empty file, and saving it in a database in the form of a web page.
2. The file splitting method according to claim 1, characterized in that the two pointers in step 2) are the pointer Count and the pointer Catalog(); Count is the sequence number of the directory entry in the file, and its initial value is the largest directory number in the file; Catalog() is an array that holds the directory level of the directory entry being split.
3. A device used by the file splitting method according to any one of claims 1-2, comprising:
a pre-processing module for pre-processing the directory structure of the original file to standardize it;
a positioning module for locating the directory order and directory level of the original file;
a searching module for searching the content file for the file content to be cut and locating the position of the directory entry at which that content begins;
a splitting module for cutting the content of each extracted sub-directory segment, pasting the cut content into a new empty file, and saving it in a database in the form of a web page.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2013101549863A CN103218453A (en) | 2013-04-28 | 2013-04-28 | Method and device for splitting file |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103218453A (en) | 2013-07-24 |
Family
ID=48816240
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2013101549863A Pending CN103218453A (en) | 2013-04-28 | 2013-04-28 | Method and device for splitting file |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103218453A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112597422A (en) * | 2020-12-30 | 2021-04-02 | 深圳市世强元件网络有限公司 | PDF file segmentation method and PDF file loading method in webpage |
CN112651988A (en) * | 2021-01-13 | 2021-04-13 | 重庆大学 | Finger-shaped image segmentation, finger-shaped plate dislocation and fastener abnormality detection method based on double-pointer positioning |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA2326805A1 (en) * | 2000-11-24 | 2002-05-24 | Ibm Canada Limited-Ibm Canada Limitee | Method and apparatus for deleting data in a database |
CN1853166A (en) * | 2003-09-30 | 2006-10-25 | 英特尔公司 | Method and apparatus for thread management of multithreading |
CN101128820A (en) * | 2004-12-30 | 2008-02-20 | 谷歌公司 | Document Segmentation Based on Visual Gap |
CN101692239A (en) * | 2009-10-19 | 2010-04-07 | 浙江大学 | Method for distributing metadata of distributed type file system |
CN102143215A (en) * | 2011-01-20 | 2011-08-03 | 中国人民解放军理工大学 | Network-based PB level cloud storage system and processing method thereof |
CN102262658A (en) * | 2011-07-13 | 2011-11-30 | 东北大学 | Method for extracting web data from bottom to top based on entity |
CN102819599A (en) * | 2012-08-15 | 2012-12-12 | 华数传媒网络有限公司 | Method for constructing hierarchical catalogue based on consistent hashing data distribution |
Non-Patent Citations (2)
Title |
---|
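Mu Fei et al.: "Metadata management method based on locating directories" (基于定位目录的元数据管理方法), Journal of Tsinghua University (《清华大学学报》), 15 August 2009 (2009-08-15) * |
Gao Liangcai et al.: "A book table-of-contents recognition method based on clustering techniques" (一种基于聚类技术的图书目录识别方法), Journal of Peking University (《北京大学学报》), 20 July 2010 (2010-07-20) * |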
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103605758B (en) | The method and device that a kind of mobile terminal document is searched | |
US9846702B2 (en) | Indexing of file in a hadoop cluster | |
CN102169507A (en) | Distributed real-time search engine | |
US11003845B2 (en) | Systems and methods for reduced memory usage when processing spreadsheet files | |
CN106874481B (en) | Method and system for reading metadata information of distributed file system | |
US9842158B2 (en) | Clustering web pages on a search engine results page | |
CN107844493B (en) | File association method and system | |
CN103778202A (en) | Enterprise electronic document managing server side and system | |
CN102930060A (en) | Method and device for performing fast indexing of database | |
CN102411617A (en) | Method for storing and inquiring mass URLs | |
US8818971B1 (en) | Processing bulk deletions in distributed databases | |
CN104778182A (en) | Data import method and system based on HBase (Hadoop Database) | |
US11210134B2 (en) | Atomic execution unit for object storage | |
CN107066503A (en) | The method and device of magnanimity metadata burst distribution | |
CN102708148A (en) | Duplication eliminating method based on multidimensional lattice data spatial model | |
CN104462349A (en) | File processing method and file processing device | |
CN103218453A (en) | Method and device for splitting file | |
US20130304871A1 (en) | Continually Updating a Channel of Aggregated and Curated Media Content Using Metadata | |
CN102799661A (en) | Method and system for implementing semantic retrieval on electronic files | |
CN102831181B (en) | Directory refreshing method for cache files | |
CN102521383A (en) | Method for storing and accessing mass files in distributed system | |
CN104252537A (en) | Index fragmentation method based on mail characteristics | |
US8700583B1 (en) | Dynamic tiermaps for large online databases | |
CN103853832A (en) | Customizable data capturing method in full-text retrieval system | |
US20130297576A1 (en) | Efficient in-place preservation of content across content sources |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20130724 |
RJ01 | Rejection of invention patent application after publication |