WO2009087999A1 - Table of contents structure specifying device - Google Patents
- Publication number
- WO2009087999A1 (application PCT/JP2009/050045)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- slide
- contents
- slides
- format
- extracting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/34—Browsing; Visualisation therefor
- G06F16/345—Summarisation for human users
Definitions
- The present invention relates to a table of contents structure specifying device for specifying the table of contents structure of a presentation document, a table of contents structure specifying method, and a recording medium therefor.
- Presentation documents are increasingly common as electronic documents.
- A presentation document is a so-called electronic picture-story show: a document in which information to be conveyed to readers and audiences is collected on a plurality of slides. Each slide has a title, explanatory text about the information to be conveyed, illustrations, charts, and the like.
- A typical example of software for creating a presentation document is Microsoft PowerPoint (registered trademark).
- Presentation documents are a collection of necessary information in a compact form, and are highly valuable as materials. Further, since one topic is explained by one or a plurality of slides, a document is often modularized for each topic. Therefore, when a document having a similar content to the document is created, the presentation document is easy to reuse.
- Presentation documents often have a hierarchical structure so that their contents can be conveyed logically and clearly.
- A format in which an outline is explained on one slide and the details are explained on subsequent slides is often used.
- The subsequent slides that explain the details correspond to chapters 1.1, 1.2, and so on.
- A chapter is defined as a set of slides related to a certain content.
- A chapter can take a nested structure, containing further chapters.
- The nested structure that occurs between slides is called a relational structure. This corresponds to a case where one content is explained as a whole and is then divided into finer topics.
- The structure of a modularized and hierarchized presentation document is called a table of contents structure.
- The table of contents structure corresponds to the table of contents in a general document.
- The table of contents is the most condensed summary of a document. By browsing it, a reader can understand the flow of the story and which content is important. In addition, the table of contents makes chapter breaks clear, so that documents can be managed and used in fine units. For example, by indexing a document by chapter, the part that describes the content a searcher needs can be found easily even in a very long document. Further, even when part of the data stored in a device is damaged by a device failure or the like, a situation in which the whole document becomes unusable can be avoided. A presentation document can obtain the same benefits once its table of contents structure is specified.
- Microsoft Word (registered trademark) has a function that automatically generates a table of contents from the character strings on lines designated by the document creator. It is desirable for the creator to specify the table of contents structure as the document is written. However, it is difficult to perform such processing on presentation documents that already exist.
- An example of a technique for specifying a table of contents structure is described in Patent Document 1.
- The “image processing apparatus, image reading apparatus, and program” described in Patent Document 1 specifies, among the image data of a document including a plurality of pages, the image data corresponding to the page on which the table of contents is written. Then, from the image data, the page number corresponding to each headline in the table of contents is associated with the headline on that page.
- Another example of a technique for specifying a table of contents structure is described in Patent Document 2.
- The “method for analyzing an electronic document to be searched and electronic document registration system” described in Patent Document 2 identifies, in the body text, the part corresponding to each heading extracted from the table of contents. The body text can thereby be divided, and the divided text data can be registered as search units.
- Another example of a technique for specifying a table of contents structure is described in Patent Document 3.
- The “slide structuring device” described in Patent Document 3 extracts keywords that serve as chapter numbers, such as “Chapter 1”, “Unit 2”, and “(3)”, from the character strings in a slide, and thereby identifies the table of contents structure.
- Patent Document 4 describes a technique for printing a structured document in a manner in which chapter search is easy, and preventing printed matter from other print jobs from being mixed.
- The technique of Patent Document 4 analyzes an HTML document file to generate an abstract tree structure, and determines the print position of each drawing object.
- With these techniques, specifying the table of contents structure requires either a page on which the table of contents is written or chapter numbers and the like explicitly written in the document. Also, only a fixed table of contents structure can be specified, such as a fixed hierarchy level for slides having a certain feature. Furthermore, the table of contents structure cannot be specified by integrating a plurality of information sources.
- Patent Document 1 and Patent Document 2 use a table of contents page for specifying a table of contents structure, and therefore cannot be applied to a presentation document without a table of contents page.
- In Patent Document 1, various analysis techniques such as image processing and OCR must be integrated, so high accuracy is required of each analysis technique.
- The depth (level) of a hierarchy is determined by a rule. Since the correspondence between rule and level is one-to-one, if the item corresponding to “Chapter 1” is defined as Level 1, “Chapter 1” is determined to be Level 1 in every document. However, how chapters are organized differs from creator to creator: one person uses “Chapter 1” as level 1 and another uses “Chapter 1” as level 2. That is, the table of contents structure is determined relative to the overall format of the document, and some documents cannot be handled by a method that uses fixed rules.
- Patent Document 4 since the chapter number is extracted from the character string in the slide and the table of contents structure is specified, the table of contents page does not need to exist.
- Presentation documents are often used together with verbal explanations. For this reason, it is sufficient for a slide to let the viewer understand that the topic has changed, and chapter numbers and the like are not always written explicitly on the slide.
- The technique of Patent Document 4 is originally intended for structured HTML documents.
- In the techniques above, the table of contents structure is specified based on a particular kind of information, such as a table of contents page, a chapter number in a title, or a characteristic character string.
- The purpose of the present invention is to specify a table of contents structure suitable for a document based on a plurality of pieces of information, such as formats and character strings in a presentation document, without depending only on a page on which the table of contents is written, chapter numbers, and the like.
- a table-of-contents structure specifying device extracts a relational structure between slides of a presentation document composed of a plurality of slides including objects that are text boxes, tabular forms, vector graphics, or images.
- a table-of-contents structure specifying device that acquires a predetermined format for specifying a relational structure between slides from characteristics of a format of an object included in the slide, and includes a plurality of consecutive slides included in the presentation document
- the hierarchy determined by the relative relationship structure between the slides that serve as the basis for the subset It integrates the hierarchy in presentation documents, characterized in that it comprises, an integrated means for determining the respective slide of the hierarchy constituting the presentation document.
- the table-of-contents structure specifying method extracts a relational structure between slides of a presentation document composed of a plurality of slides including an object which is a text box, a table format, a vector graphic or an image.
- a table of contents structure specifying method wherein a predetermined format for specifying a relational structure between the slides is acquired from characteristics of a format of an object included in the slide, and is composed of a plurality of continuous slides included in the presentation document
- the hierarchy determined by the relative relationship structure between and the slides that serve as the basis for the subset Serial integrates the hierarchical presentation document, characterized in that it comprises, an integrated determining each slide hierarchy constituting the presentation document.
- a recording medium is for extracting a relational structure between slides of a presentation document composed of a plurality of slides including an object which is a text box, a table format, a vector graphic or an image.
- a partial structure extracting means for extracting a relative relation structure between the slides of the subset by searching for slides conforming to the predetermined format in a predetermined order;
- the hierarchy determined by the relational structure and the presentation of the slide that is the basis of the subset
- A table of contents structure suitable for a document can be specified based on a plurality of pieces of information, such as formats and character strings in the presentation document, even when no table of contents information is explicitly given.
- FIG. 10 is an explanatory diagram illustrating an example of a table of contents specification process according to Embodiment 2.
- FIG. 10 is an explanatory diagram showing another example of the specification process of the table of contents structure according to the second embodiment.
- FIG. 10 is an explanatory diagram illustrating an example of a display mode in which a table of contents structure is hierarchized in Embodiment 2.
- FIG. 10 is an explanatory diagram illustrating another example of a display mode in which a table of contents structure is hierarchized in Embodiment 2.
- FIG. 10 is an explanatory diagram illustrating an example of a display mode in which the table of contents structure according to Embodiment 2 is used as an index.
- A block diagram showing an example of the hardware configuration of the table of contents structure specifying device according to an embodiment of the present invention.
- Prepare a table of contents page to make the whole document easier to understand.
- Assign chapter numbers to the titles of each slide, such as “1. Introduction” and “1-1. Background”, so that the chapter structure can be understood.
- Insert a slide that contains only a title in the middle of the presentation document to make the separation between slides clear.
- Assign serial numbers to titles, as in “Case 1” and “Case 2”, so that the connection between slides can be understood.
- Enumerate the contents to be explained in advance, then arrange the slides so that each enumerated item is detailed in order and the story advances logically.
- In these ways, the creator of a presentation document builds a table of contents structure into it, either explicitly or implicitly. Conversely, the table of contents structure can therefore be specified by extracting such information from the presentation document.
- The table of contents structure specifying device is characterized by identifying the chapter structure in finer units within one chapter, recursively repeating the process of specifying the relational structure between those chapters, and thereby specifying the table of contents structure.
- To identify chapters, five pieces of information are used: the presence of a cover slide, the presence of a table of contents slide, the presence of title numbers, the presence of headline slides, and the presence of a partial hierarchy.
- The table of contents structures obtained from each piece of information can be integrated into one table of contents structure without conflict. Not all of the information is required, however: if at least one of these information sources is available, the table of contents structure can be specified.
- The relational structure between chapters is expressed by a depth indicating the hierarchy level, a flag indicating that the chapter has changed, and the like.
- FIG. 1 is a block diagram showing an example of the configuration of a table of contents structure specifying device according to the present invention.
- The table of contents structure specifying device 10 includes a configuration information extraction unit 101, a cover extraction unit 102, a table of contents equivalent information extraction unit 103, a segment extraction unit 104, a partial hierarchy extraction unit 105, and an extraction information combining unit 106.
- The present embodiment also includes a document database 301 and an output unit 302.
- The table of contents structure specifying device 10 is realized by, for example, a computer (an information processing apparatus such as a personal computer) that operates according to a program.
- FIG. 2 is a flowchart showing an example of the operation for specifying the table of contents structure performed using the table of contents structure specifying device 10.
- The processing from step S101 to step S106 shown in FIG. 2 subdivides chapters and specifies the relational structure between them.
- By this processing, the table of contents structure is appropriately identified.
- The components of the table of contents structure specifying device 10 are described below.
- The table of contents structure is expressed by the hierarchy depth of each slide, arranged in slide order, and by a flag for each slide, also in slide order, indicating whether the slide corresponds to a topic break.
- Presentation documents are assumed to be recorded in a format from which information can be extracted, such as the text included in each slide, the position of each text, character decoration information (color, font type, font size), lines, diagrams, and tables. XML is a typical example of such a format.
- Elements constituting a slide, such as text, vector graphics, images, and tables, are called objects.
- The document database 301 records the presentation documents for which a table of contents structure is to be specified. Registration of a presentation document in the document database 301 may be performed by the creator of each presentation document, or may be performed automatically by a crawler or the like.
- The configuration information extraction unit 101 selects a presentation document from the document database 301 and extracts configuration information of each slide in the document. This process corresponds to step S101 in FIG.
- Configuration information is a collection of attributes, extracted from various viewpoints, for each object included in each slide.
- An object includes attribute information such as whether its text is the title of a slide or is written as a bulleted list.
- The attribute information can be referred to from all the components in the table of contents structure specifying device.
- The configuration information stores an identification code (material ID) of the presentation document, a slide number indicating the position of the slide in the document, an object ID for identifying each object included in the slide, the position, type, and background color of the object, its text, attribute information of the text, and the font size, type, and color.
- The position information of an object is stored in the form (upper left x coordinate, upper left y coordinate), (lower right x coordinate, lower right y coordinate), and the color information is stored as an RGB value.
- The configuration information is shown in a table format for easy viewing, but other formats such as XML may be used.
- When the character size or font color changes partway through the text of one object, the text may be split at the change point and stored separately. Even when stored in this way, the original text can be reproduced by collecting the texts that have the same object ID.
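- The reassembly of split text by object ID can be illustrated with a minimal sketch. The record layout (a list of `(object_id, text)` pairs kept in their stored order) and the function name `reassemble_texts` are assumptions made for this example, not part of the described device.

```python
def reassemble_texts(fragments):
    """Join text fragments that share an object ID.

    `fragments` is a hypothetical list of (object_id, text) pairs,
    assumed to be stored in the order the text originally appeared.
    """
    texts = {}
    for object_id, text in fragments:
        # Append each fragment to whatever was already collected for this ID.
        texts[object_id] = texts.get(object_id, "") + text
    return texts


fragments = [("001", "Intro"), ("001", "duction"), ("002", "Body text")]
restored = reassemble_texts(fragments)
```

- Here `restored["001"]` is the rejoined string for object 001, regardless of how many pieces it was split into.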
- In this example, configuration information for the first slide of the document whose material ID is P001 is stored.
- The slide has four objects, three of which have text.
- The object with object ID 001 contains the text “Introduction”. Its background color is white and its characters are black. The configuration information also shows that the text is used as a title.
- Text attribute information can be extracted by using XML tags.
- The template function of the presentation creation tool gives tags such as <title></title> to text corresponding to a title.
- The function also gives a tag such as <itemize></itemize> to a character string written using a bulleted template.
- The function provides a tag such as <item></item> for each character string listed within that tag. From these tags, the configuration information extraction unit 101 extracts attribute information of each text.
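- As a minimal sketch of tag-based attribute extraction, the following uses Python's standard `xml.etree.ElementTree`. The enclosing `<slide>` element, the attribute labels "title" and "bullet", and the function name are assumptions for illustration; real presentation file formats use different, more complex schemas.

```python
import xml.etree.ElementTree as ET


def extract_text_attributes(slide_xml):
    """Return (text, attribute) pairs for title and bullet items in a slide."""
    root = ET.fromstring(slide_xml)
    records = []
    # iter() walks the element tree in document order.
    for elem in root.iter():
        if elem.tag == "title":
            records.append((elem.text or "", "title"))
        elif elem.tag == "item":
            records.append((elem.text or "", "bullet"))
    return records


# Hypothetical slide markup using the tag names mentioned in the text.
slide_xml = (
    "<slide><title>Introduction</title>"
    "<itemize><item>Background</item><item>Goal</item></itemize></slide>"
)
attributes = extract_text_attributes(slide_xml)
```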
- Attribute information can be obtained from the characteristics of the format of the presentation document even if there is no tag information. For example, the following characteristics may be used to obtain title attribute information.
- A typical example of a title format in a presentation document is shown in FIG. 4, where “Title text” 501 is the title of the slide.
- The title has characteristics such as (i) being written in a relatively large font within the slide, (ii) being written at the top of the slide, and (iii) being located above the boundary line 502.
- The boundary line is a horizontally long line or figure used to separate the title from the body part 503, in which explanatory text and charts are placed. Boundary lines are often used in presentation document templates. The title is extracted, for example, by a process that regards a text as the title when its font size is larger than that of the other text on the slide and it lies within a predetermined upper percentage of the slide.
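- The font-size and position heuristic can be sketched as follows. The `TextObject` fields, the function name, and the default threshold of 20% are assumptions for this example; the source does not give a concrete threshold value.

```python
from dataclasses import dataclass


@dataclass
class TextObject:
    text: str
    font_size: float
    top_y: float  # upper left y coordinate, with 0 at the top of the slide


def find_title(objects, slide_height, top_fraction=0.2):
    """Return the text regarded as the title, or None if no text qualifies.

    A text qualifies when its font is strictly larger than every other text
    on the slide and it starts within the upper `top_fraction` of the slide.
    """
    if not objects:
        return None
    largest = max(objects, key=lambda o: o.font_size)
    others = [o for o in objects if o is not largest]
    strictly_largest = all(largest.font_size > o.font_size for o in others)
    in_upper_area = largest.top_y <= slide_height * top_fraction
    return largest.text if strictly_largest and in_upper_area else None


slide = [TextObject("Introduction", 32, 20), TextObject("Some body text", 18, 200)]
title = find_title(slide, slide_height=540)
```

- A boundary-line check (characteristic iii) could be added as a further condition when line objects are available in the configuration information.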
- The cover extraction unit 102 extracts a cover slide from each presentation document based on the configuration information, and outputs the result to the table of contents equivalent information extraction unit 103 as a table of contents structure specification table. This processing corresponds to S102 in FIG.
- The cover slide is the slide on which the overall title of the presentation document is written.
- The cover slide is used to specify the simplest table of contents structure, consisting of the cover slide and the other slides.
- A document in which a plurality of presentation documents are combined may have a plurality of slides corresponding to cover slides.
- The cover extraction unit 102 therefore extracts zero or more cover slides.
- The cover slide is the root of the table of contents structure when that structure is expressed as a tree.
- FIG. 5 shows an example of a cover slide.
- The cover slide has features such as (i) a title in the center of the slide and (ii) a date, a person's name, and an affiliation. Therefore, for example, if a slide has these features and is the first slide, it is determined to be a cover slide.
- A dictionary is required to recognize dates, person names, and affiliations. The dictionary may be held inside the cover extraction unit 102, or a separate dictionary storage device may be provided.
- Presentation document creation tools generally have a cover template. Therefore, if the first slide uses a cover template, it may be extracted as a cover slide.
- The cover extraction unit 102 collects the extracted cover slides in a table of contents structure specification table.
- In the table of contents structure specification table, the clues to the table of contents structure extracted by each extracting means in the table of contents structure specifying device 10 are recorded in sequence.
- An example of the table of contents structure specification table after the cover is extracted is shown in FIG.
- The table of contents structure specification table of FIG. 6 includes a “#” column that stores slide numbers and a delimiter flag column that stores a flag indicating whether the content of each slide is separated from that of the previous slide.
- It also has a cover column that stores the depth of each slide as specified by the cover extraction unit 102.
- The example shown in FIG. 6 is the table of contents structure specification table produced by the cover extraction unit 102 for a presentation document of 16 slides whose first slide is a cover slide.
- The cover extraction unit 102 gives “0” to the cover column in the row of the cover slide and “1” to the cover column in the rows of the other slides.
- “0” and “1” represent the depth of the hierarchy in the table of contents structure.
- The example shown in FIG. 6 indicates that the cover is the root and the other slides are in the first layer.
- The cover extraction unit 102 gives “1” to the delimiter flag column in the rows of the cover slide and of the slide that follows it, and gives “0”, as the initial setting, to the delimiter flag column in the rows of the other slides.
- When the cover extraction unit 102 cannot extract a cover slide, it gives “1” to the cover column in the rows of all the slides in the table of contents structure specification table. In that case, the cover extraction unit 102 gives “1” to the delimiter flag column of the first slide and “0” to the delimiter flag column of the other slides.
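- The initialization of the table by the cover extraction step can be sketched as follows. The dictionary-based row layout and the function name are assumptions for this example, and the sketch handles at most one cover slide even though the text allows several.

```python
def init_structure_table(num_slides, cover_slide=None):
    """Build the initial table rows after cover extraction.

    `cover_slide` is the 1-based slide number of the cover, or None when
    no cover slide was found. Each row holds the slide number ("#"),
    the delimiter flag, and the depth recorded in the cover column.
    """
    rows = []
    for number in range(1, num_slides + 1):
        if cover_slide is None:
            # No cover: every slide has depth 1, only the first slide is a delimiter.
            rows.append({"#": number, "delimiter": 1 if number == 1 else 0, "cover": 1})
        else:
            is_cover = number == cover_slide
            follows_cover = number == cover_slide + 1
            rows.append({
                "#": number,
                "delimiter": 1 if (is_cover or follows_cover) else 0,
                "cover": 0 if is_cover else 1,
            })
    return rows


table = init_structure_table(16, cover_slide=1)
```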
- In each extraction stage of the table of contents structure specifying device 10, each portion where the delimiter flag column of the table of contents structure specification table reads “10...0” (one or more 0s) is subdivided.
- In this example, the portion 504 from slide 2 to slide 16 is processed by the table of contents equivalent information extraction unit 103.
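- Finding the portions whose delimiter flags read “10...0” can be sketched as a simple scan. The function name and the use of 0-based inclusive index pairs are assumptions for this example.

```python
def find_segments(flags):
    """Return (start, end) index pairs, 0-based and inclusive, for every run
    consisting of a 1 followed by one or more 0s in the delimiter flags."""
    segments = []
    i = 0
    n = len(flags)
    while i < n:
        if flags[i] == 1:
            j = i + 1
            while j < n and flags[j] == 0:
                j += 1
            if j - i >= 2:  # at least one 0 followed the 1
                segments.append((i, j - 1))
            i = j
        else:
            i += 1
    return segments


# Delimiter flags for 16 slides whose first slide is a cover slide.
flags = [1, 1] + [0] * 14
segments = find_segments(flags)
```

- With these flags the only segment runs from index 1 to index 15, that is, from slide 2 to slide 16.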
- The table of contents equivalent information extraction unit 103 obtains the table of contents structure specification table after cover extraction from the cover extraction unit 102, adds the extraction result of the table of contents equivalent information to the table, and outputs the table to the segment extraction unit 104.
- This table of contents equivalent information extraction process corresponds to step S103 in FIG.
- the table of contents equivalent information is correct information regarding the table of contents structure (tree structure) specified in the presentation document.
- Representative information sources from which table of contents equivalent information can be extracted include (i) a table of contents slide and (ii) a title number.
- The table of contents equivalent information extraction unit 103 specifies the table of contents structure from the information of the table of contents slide and the title numbers. When both pieces of information are available, the table of contents equivalent information extraction unit 103 first specifies the table of contents structure from the table of contents slide and then from the title numbers.
- A table of contents slide is a slide on which the table of contents of a presentation document is written. In the table of contents, the headings of the chapters included in the presentation document (hereinafter referred to as chapter names) are often written in bullets. Furthermore, a slide having a chapter name as its title (hereinafter referred to as a chapter name slide) generally appears after the table of contents slide. The depth of indentation in the bullets corresponds to the hierarchy of chapters. Therefore, the table of contents equivalent information extraction unit 103 can specify the table of contents structure of the entire presentation document by extracting the indentation depths and the chapter name slides.
- FIG. 7 shows an example of a table of contents slide and a chapter title slide.
- “ABC” and “JKL” are in a parallel relationship in the first hierarchy, and “DEF” and “GHI” are located in the second hierarchy having “ABC” as a parent.
- The table of contents equivalent information extraction unit 103 extracts table of contents equivalent information in which the hierarchy depth is “1” for the slide group describing the topic “ABC”, “2” for the slide groups describing the topics “DEF” and “GHI”, and “1” for the slide group describing the topic “JKL”.
- The title text string is used to identify the table of contents slide.
- Titles likely to be used for a table of contents slide are registered in the table of contents equivalent information extraction unit 103, and a slide whose title matches one of the registered character strings is identified as the table of contents slide.
- Character strings such as “table of contents”, “outline”, and “Table of contents” are registered in the table of contents equivalent information extraction unit 103.
- Words corresponding to “table of contents” in other languages may also be registered in the table of contents equivalent information extraction unit 103.
- Alternatively, a slide may be identified as the table of contents slide on condition that all the text in the slide is written in bullets and the slide is in the first half of the presentation document. With this process, the table of contents slide is extracted even when the title does not contain a character string such as “table of contents”.
- For each slide, whether it contains text written in bullets can be determined from the attribute information in the configuration information. Whether the slide is in the first half of the presentation document can be determined from the slide numbers of the slides that share the same document ID in the configuration information.
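- The two ways of identifying a table of contents slide can be combined in a short sketch. The registered title set, the function name, the parameter layout, and the reading of “first half” as a slide number no greater than half the slide count are assumptions for this example.

```python
# Hypothetical registered titles; the text mentions "table of contents" and "outline".
REGISTERED_TOC_TITLES = {"table of contents", "outline"}


def is_toc_slide(title, body_is_bullet, slide_number, total_slides):
    """Decide whether a slide is a table of contents slide.

    `body_is_bullet` holds one boolean per non-title text on the slide,
    True when that text is written as a bullet item.
    """
    # Rule 1: the title matches a registered character string.
    if title.strip().lower() in REGISTERED_TOC_TITLES:
        return True
    # Rule 2: all text is bulleted and the slide is in the first half.
    all_bullets = bool(body_is_bullet) and all(body_is_bullet)
    in_first_half = slide_number <= total_slides / 2
    return all_bullets and in_first_half


by_title = is_toc_slide("Outline", [], 9, 10)
by_layout = is_toc_slide("Agenda", [True, True, True], 2, 16)
```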
- A chapter name slide can be identified by a match between a chapter name in the table of contents slide and the title text of a later slide. When a plurality of slides match one chapter name, only the slide with the smallest slide number is determined to be the chapter name slide.
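- The use of indentation depth and chapter name slides can be sketched as follows. The input layout (indent level 0 for the outermost bullets), the function name, and the choice to carry a chapter's depth forward until the next chapter name slide are assumptions for this example.

```python
def assign_depths_from_toc(toc_bullets, slide_titles, toc_index):
    """Assign a hierarchy depth to each slide after the table of contents slide.

    toc_bullets:  (indent_level, chapter_name) pairs from the contents slide.
    slide_titles: slide titles in slide order (0-based list).
    toc_index:    0-based index of the table of contents slide.
    Slides before the first chapter name slide keep None.
    """
    chapter_starts = {}
    for indent, name in toc_bullets:
        # Only the first matching slide after the contents slide counts.
        for i in range(toc_index + 1, len(slide_titles)):
            if slide_titles[i] == name:
                chapter_starts[i] = indent + 1
                break
    depths = [None] * len(slide_titles)
    current = None
    for i in range(toc_index + 1, len(slide_titles)):
        current = chapter_starts.get(i, current)
        depths[i] = current
    return depths


bullets = [(0, "ABC"), (1, "DEF"), (1, "GHI"), (0, "JKL")]
titles = ["Contents", "ABC", "DEF", "DEF", "GHI", "JKL", "ABC"]
depths = assign_depths_from_toc(bullets, titles, toc_index=0)
```

- In this example the second slide titled “ABC” is not treated as a chapter name slide, because an earlier slide already matched that chapter name.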
- FIG. 8 shows a presentation document in which a plurality of table of contents slides are inserted. If the number of table of contents slides matches the number of chapter names in the table of contents slide, it can be estimated that a table of contents slide is inserted at each chapter change.
- A slide group 508 from the first table of contents slide 505 to the next table of contents slide 506 is determined to be a slide group describing a topic related to the first chapter name “ABC” 507, and the slide group from the table of contents slide 506 to the next table of contents slide 509 is determined to be a slide group describing a topic related to the second chapter name “DEF”.
- When the table of contents structure is expressed as a hierarchical structure, the table of contents slide is placed at the same level as the shallowest chapter name slide in the hierarchy.
- Alternatively, the chapter name slides may be arranged as a subtree of the table of contents slide.
- In that case, 1 may be added to all values in the table of contents column for the slides other than the table of contents slide in the slide group being processed.
- The user of the table of contents structure specifying device may decide the positional relationship between the table of contents slide and the chapter name slides as desired.
- The title number is a chapter number assigned to a slide title, as in “1. Introduction” or “2.2. Search method”. It expresses the hierarchical structure of the document directly and is useful table of contents equivalent information. For each slide, the number given before the title is extracted. An example of the title number is shown in FIG. In FIG. 10, there are two slides after the first slide, and after the second slide, there are slides related to chapters 2.1 and 2.2. Title numbering has many variations, such as “Chapter 1”, “1-1”, “(1)”, and “Step 1”. Commonly used formats, such as “Chapter *” and “*-*”, are prepared as templates, and title numbers are extracted by pattern matching against them.
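- Template-based pattern matching of title numbers can be sketched with regular expressions. The specific patterns and the function name are assumptions for this example and cover only the formats named in the text; a plain leading number followed by a space (for instance a year) would also match, so a real system would need stricter templates.

```python
import re

# Hypothetical templates for the formats mentioned in the text.
TITLE_NUMBER_PATTERNS = [
    re.compile(r"^Chapter\s+(\d+)"),          # "Chapter 1"
    re.compile(r"^Step\s+(\d+)"),             # "Step 1"
    re.compile(r"^\((\d+)\)"),                # "(1)"
    re.compile(r"^(\d+(?:[.-]\d+)*)[.\s]"),   # "1.", "2.2.", "1-1."
]


def extract_title_number(title):
    """Return the title number found at the start of a title, or None."""
    stripped = title.strip()
    for pattern in TITLE_NUMBER_PATTERNS:
        match = pattern.match(stripped)
        if match:
            return match.group(1)
    return None


numbers = [extract_title_number(t) for t in
           ["1. Introduction", "2.2. Search method", "1-1. Background", "Summary"]]
```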
- The table of contents equivalent information extraction unit 103 adds a table of contents column to the table of contents structure specification table and records in it the relative change in hierarchy depth between slides obtained by processes (i) and (ii). That is, it gives “0” to the table of contents column in the row of a slide corresponding to a chapter “*”, which is in the first layer of the table of contents structure, “+1” for a chapter “*.*” in the second layer, and “+2” for a chapter “*.*.*” in the third layer.
- Following this rule, the table of contents equivalent information extraction unit 103 gives “+(N-1)” to the table of contents column for a chapter in the N-th layer.
- The table of contents equivalent information extraction unit 103 gives “1” to the delimiter flag column in the row of each slide that marks a topic break in the table of contents structure specification table. Since the table of contents slide is considered to be a different topic from the preceding slides, the unit gives “1” to the delimiter flag column of the table of contents slide. Since a chapter name slide is the beginning of a new topic, the unit also gives “1” to the delimiter flag column in the row of each chapter name slide. For slides having title numbers, the unit gives “1” to the delimiter flag column in the row of each slide at which the title number changes.
- FIG. 11 is an example of a table of contents structure specification table of a presentation document composed of 16 slides.
- The left side of FIG. 11 is an example of configuration information that collects, for each slide, the text information whose attribute information is “title”. Configuration information unnecessary for the explanation is omitted.
- The slide group described in this example is composed of three chapters, and each chapter has a hierarchical structure.
- Both kinds of table of contents equivalent information, (i) and (ii), may be obtained at the same time.
- In that case, the process based on the table of contents slide in (i) is performed first, and the process for the title number in (ii) is then applied to each portion where the delimiter flag column of the table of contents structure specification table reads “10...0” (one or more 0s).
- the following describes how to obtain the depth of the hierarchy based on the title number when both the table of contents slide and the title number are obtained.
- Step 1 The title number of the first slide of a new chapter is extracted.
- Step 2 If a title number exists at Step 1, the depth of that title number is set as “d”. If there is no title number, d is set to the depth of the title number of the first slide in the chapter that has a title number, minus 1. Note that the depth of a title number is the number of fields into which it is divided to express the hierarchy, such as 2 for “1.2”, 3 for “2-3-1”, and 4 for “1.2.4.12”.
- Step 3 The depth “D” of the title number is obtained for every slide with a title number in the new chapter.
- Step 4 “D - d” is added to the table of contents column (in which the depth of the hierarchy obtained from the table of contents slide is recorded) of the row of each slide having a title number.
- Step 5 For a slide without a title number, the nearest preceding slide that has a title number is detected. Then the same value as the table of contents column of the detected slide's row is given to the table of contents column of the row of the slide without a title number.
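- Steps 1 to 5 can be sketched for a single chapter as follows. The function name `chapter_offsets` and the input representation (one title number depth per slide, `None` when the slide has no title number) are assumptions for illustration. The text does not say what to do with leading slides that have no numbered slide before them; this sketch assumes they receive 0.

```python
def chapter_offsets(title_number_depths):
    """Return, per slide of one chapter, the value "D - d" to add to the
    table of contents column.

    title_number_depths: depth of each slide's title number ("1.2" -> 2),
    or None for a slide without a title number.
    """
    numbered = [depth for depth in title_number_depths if depth is not None]
    if not numbered:
        # No title numbers in this chapter: nothing to add (assumption).
        return [0] * len(title_number_depths)

    first = title_number_depths[0]
    # Steps 1-2: d is the depth of the first slide's title number, or the
    # depth of the first numbered slide minus 1 when the first slide has none.
    base = first if first is not None else numbered[0] - 1

    offsets = []
    inherited = 0  # assumed value for leading slides without a title number
    for depth in title_number_depths:
        if depth is not None:
            inherited = depth - base  # Steps 3-4: D - d
        offsets.append(inherited)     # Step 5: copy from nearest numbered slide
    return offsets
```

- For a chapter whose slides carry the numbers “2”, “2.1”, none, “2.2”, and “2.2.1”, the input is `[1, 2, None, 2, 3]` and the sketch returns `[0, 1, 1, 1, 2]`.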
- The method of (ii) may be used as is to assign the delimiter flag column of each slide.
- the slide group (chapter) divided by the chapter name slide is further subdivided by the title number.
- The entire presentation document is often divided into several large chapters, one for each chapter name on the table of contents slide, and title numbers may be given within each chapter.
- Because independent title numbering may be used within each chapter, an unnatural table of contents structure may be obtained if the processing for the title number is performed first.
- (i) A table of contents slide and (ii) a title number are examples of table of contents equivalent information. If other information about the table of contents is explicitly described, that information may be used and stored in the same format.
- In the example of FIG. 11, the table of contents equivalent information extraction unit 103 can extract most of the table of contents structure, because the table of contents equivalent information is written on each slide without omission, as is apparent from the text strings of the configuration information. In practice, however, few documents describe table of contents equivalent information without omission.
- FIG. 12 shows the table of contents structure specification table after extracting the table of contents equivalent information when the table of contents equivalent information is not obtained.
- Slide numbers 2 to 16, indicated by reference numeral 512, are processed by the segment extraction unit 104.
- Specifying the table of contents structure using the chapter names of a table of contents slide together with chapter name slides, and specifying it using title numbers, are also performed in the related technology.
- the table of contents structure can be specified by combining them.
- the table of contents structure is specified by the subsequent processing even in the situation shown in FIG. 12 where the table of contents equivalent information cannot be obtained.
- When the table of contents column of the table of contents structure specification table is all zero except for the top slide, the values need not be stored, which saves memory.
- the presentation document shown in FIG. 13 is composed of 16 slides.
- the “# number” written on the side of each slide represents the slide number.
- This presentation document has a cover slide but does not include table of contents equivalent information. Therefore, when the processing in the table of contents equivalent information extraction unit 103 is completed, the table of contents structure specification table is as shown in FIG. 12.
- the segment extraction unit 104 acquires a table of contents structure specification table from the table of contents equivalent information extraction unit 103, and extracts a segment that becomes a topic break. Then, the segment extraction unit 104 adds the result to the table of contents structure specification table and outputs the obtained table of contents structure specification table to the partial hierarchy extraction unit 105.
- the segment extraction process corresponds to step S104 in FIG.
- a segment is a group of slides that form one chapter according to the characteristics of the format.
- a headline slide is a typical format for identifying a segment.
- a headline slide is a slide in which only the title has substantial contents in a presentation document.
- An example of a headline slide is shown in FIG. 14. In FIG. 14, the slide including only the title “XYZ” is a headline slide. Headline slides are often inserted when the topic changes substantially. Therefore, one segment extends from a headline slide up to the next headline slide or the last slide, and the slide representing the segment is the headline slide.
- slide number 2 and slide number 10 are headline slides.
- the headline slide identification method is almost the same as the cover slide identification method. However, since information such as name, affiliation, and date is rarely included in the headline slide, such information is not applied when specifying the headline slide. Note that if the headline slide is extracted at the time of cover slide extraction, the segment extraction unit 104 does not need to extract the headline slide again.
- the segment extraction unit 104 identifies the table of contents structure using the headline slide as an information source.
- the segment extraction unit 104 identifies the relational structure between the headline slide that is the base point of the segment and other slides.
- two methods of expressing the relational structure obtained by the segment extraction unit 104 are described.
- the headline slide corresponds to the base point of the segment.
- the headline slide and the subsequent slide group are slides that explain the content of that chapter. Therefore, the headline slide and the subsequent slide group can be regarded as the same hierarchy in the table of contents structure expressed as a hierarchical structure.
- Alternatively, since the headline slide represents the slide group up to the next headline slide, it can also be regarded as the parent of the slide group following it in the table of contents structure.
- The former is called a parallel pattern, and the latter is called a hierarchical pattern.
- The extraction means of the cover extraction unit 102, the table of contents equivalent information extraction unit 103, and the partial hierarchy extraction unit 105 have the same degree of freedom as the segment extraction means in how they express the relational structure between the slide serving as the base point and the other slides.
- In this description, the cover and the other slides in the cover extraction unit 102 follow the hierarchical pattern, and the table of contents slide and the chapter name slides in the table of contents equivalent information extraction unit 103 follow the parallel pattern. The partial hierarchy extraction unit 105 described later also specifies the table of contents structure based on the hierarchical pattern.
- FIGS. 15 and 16 show tables in which a segment column is added to the table of contents structure specification table shown in FIG. 12, which is obtained from the presentation document of FIG. 13. FIG. 15 shows the parallel pattern storage method, and FIG. 16 shows the hierarchical pattern storage method.
- In the parallel pattern shown in FIG. 15, the segment extraction unit 104 gives “0” to the segment column of each slide row to be processed in the table of contents structure specification table. A blank segment column means that the segment extraction unit 104 does not process that slide. Further, since the topic changes at each segment, the segment extraction unit 104 gives “1” to the delimiter flag column of the headline slides (slide numbers 2 and 10) in the table of contents structure specification table.
- In the hierarchical pattern, the segment extraction unit 104 gives “0” to the segment column of the headline slide row in order to hierarchize the headline slide and its subordinate slides, and gives “+1” to the segment column of the rows of the subordinate slide group. Since this value represents the relative change in the depth of the hierarchy, “+1” indicates that the slide group subordinate to the headline slide is one level deeper than the headline slide.
- In this case, the segment extraction unit 104 gives “1” to the delimiter flag column of the rows of the headline slide and the slide following it in the table of contents structure specification table.
- That is, “1” is given to the delimiter flag columns of the rows of slide numbers 2 and 3 and of slide numbers 10 and 11.
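- The two storage methods can be compared with a short sketch. The function name `segment_columns` and the list-based table representation are assumptions for illustration; only the values (“0”, “+1”, and the delimiter flags) come from the description above, and only the flags set by this step are returned.

```python
def segment_columns(num_slides, headline_slides, pattern="parallel"):
    """Return (segment, delimiter) lists indexed by slide number - 1.

    None in the segment list means the slide is not processed, i.e. it
    comes before the first headline slide.
    """
    headlines = set(headline_slides)
    segment = [None] * num_slides
    delimiter = [0] * num_slides
    if not headlines:
        return segment, delimiter

    for slide in range(min(headlines), num_slides + 1):
        index = slide - 1
        if slide in headlines:
            segment[index] = 0        # headline slide is the base point
            delimiter[index] = 1      # topic changes at the headline slide
        elif pattern == "hierarchical":
            segment[index] = 1        # one level deeper than the headline
            if slide - 1 in headlines:
                delimiter[index] = 1  # slide right after the headline
        else:
            segment[index] = 0        # parallel: same level as the headline
    return segment, delimiter
```

- For the 16-slide document with headline slides 2 and 10, the parallel pattern flags slides 2 and 10, while the hierarchical pattern flags slides 2, 3, 10, and 11.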
- a plurality of headline slides may exist continuously.
- the table of contents structure is specified by the following processing.
- Step 1 The maximum number of consecutive headline slides within the chapter to be processed is extracted.
- Step 2 Among the headline slides that continue for that maximum number, only the first slide is regarded as the headline slide, and the segment extraction unit 104 performs its processing. However, even in the parallel pattern, when the slide following that first headline slide is also a headline slide, “1” is exceptionally given to the delimiter flag column of the row of the latter headline slide.
- Step 3 In the table of contents structure specification table after the processing of the segment extraction unit 104 in Step 2, each portion where the delimiter flag column reads “10...0” (one or more 0s) is regarded as a new chapter and becomes the target of extraction processing.
- Step 4 Step 1 to Step 3 are recursively repeated for the chapter to be processed.
- Alternatively, the hierarchical relationship of consecutive headline slides may be determined in advance as follows to specify the table of contents structure. If the first headline slide 513 represents the title of Chapter 1, the second headline slide 514 is considered to represent the title of Chapter 1.1. Therefore, in the table of contents structure specification table, the segment extraction unit 104 gives “0” to the segment column of the first headline slide 513, “+1” to the segment column of the second headline slide 514, and “+1” or “+2” to the segment column of the rows of the slide group subordinate to the headline slide 514. Whether “+1” or “+2” is given to the subordinate slide group depends on which of the parallel pattern and the hierarchical pattern is used.
- rules other than those described above may be used as rules for increasing the accuracy of segment extraction when there are two consecutive headline slides.
- the partial hierarchy extraction unit 105 acquires the table of contents structure specification table from the segment extraction unit 104, and extracts partial hierarchical relationships existing between slides. Then, the partial hierarchy extraction unit 105 adds the result to the table of contents structure specification table and outputs the table of contents structure specification table to the extraction information combining unit 106.
- the partial hierarchy extraction process corresponds to step S105 in FIG.
- The slides targeted by the partial hierarchy extraction processing are only those in portions where the delimiter flag column of the table of contents structure specification table obtained up to the segment extraction unit 104 reads “10...0”. For this reason, when the preceding extraction means have set many 1s in the delimiter flag column, the amount of calculation is greatly reduced compared with applying the processing to all slides.
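- Locating the “10...0” portions of the delimiter flag column can be sketched as below. The function name `chapter_spans` and the list-of-integers representation of the column are assumptions for illustration.

```python
def chapter_spans(delimiter_flags):
    """Return 1-based inclusive (start, end) slide ranges whose delimiter
    flags read 1 followed by one or more 0s."""
    starts = [i for i, flag in enumerate(delimiter_flags) if flag == 1]
    spans = []
    for position, start in enumerate(starts):
        # A span runs until just before the next 1, or to the last slide.
        if position + 1 < len(starts):
            end = starts[position + 1] - 1
        else:
            end = len(delimiter_flags) - 1
        if end > start:  # needs at least one 0 after the 1
            spans.append((start + 1, end + 1))
    return spans
```

- For the flags `[1, 1, 0, 0, 1, 0, 1]`, the sketch returns the slide ranges (2, 4) and (5, 6); the isolated 1s at slides 1 and 7 are skipped.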
- A partial table of contents slide is a typical feature used for extracting a partial hierarchical structure from a presentation document.
- the partial table of contents slide is a slide including titles of a plurality of slides after the slide.
- the text appearing as the title in the later slide is called “subheading”, and the slide having the subheading as the title is called the subheading slide.
- FIG. 18 shows an example of the relationship between the partial table of contents slide and the subheading slides.
- The example on the left side of FIG. 18 is similar to the relationship between the table of contents slide and the chapter name slides in the table of contents equivalent information extraction unit 103.
- The example on the left side of FIG. 18 represents a presentation document in which there is a partial table of contents slide 518 having the subheadings “ABC”, “DEF”, and “GHI”, with the subheading slides located among the following slides.
- Because the subheadings are itemized in the partial table of contents slide, they are in a parallel relationship. Therefore, it is presumed that the subheading slides in which those topics are described are also located in the same hierarchy in the table of contents structure.
- The right side of FIG. 18 shows another example of the partial table of contents.
- This is a presentation document in which a partial table of contents slide 519 including the subheadings “ABC”, “DEF”, and “GHI” exists, with the subheading slides located among the following slides.
- The partial table of contents slide 519 differs from the slide on the left side of FIG. 18 in that its subheadings are not organized text information such as bullets.
- Even so, the subheadings are presumed to have a parallel relationship. That is, the subheading slides are presumed to be in a parallel relationship in the example on the right as well.
- the partial hierarchy may be extracted by the following method.
- the extraction method of the partial table of contents slide and the subheading slide for the example on the left side of FIG. 18 will be described below.
- First, the configuration information is referred to, and the text whose attribute information is “itemized” is extracted. Then each itemized character string in this text is extracted as a subheading.
- The slide from which the subheadings are extracted is a partial table of contents slide candidate. If a subheading slide exists for a subheading, the candidate is confirmed as a partial table of contents slide and is extracted together with the subheading slide.
- Next, for the example on the right side of FIG. 18, the configuration information is referred to, and a group of texts for which a similar emphasis expression is used within one slide is extracted as subheadings.
- similar emphasis expressions include the use of the same emphasis object and special colors for fonts and background colors. Objects that are easily used for emphasis are registered in the partial hierarchy extraction unit 105 in advance.
- Regarding text color information, if texts are painted in colors that are not often used in the presentation document, it is determined that a similar emphasis expression is used even if the colors differ. For example, assume that text A and text B are painted red and blue, respectively, and that red and blue are colors not often used in this presentation document. In that case, it is determined that similar emphasis expressions are used for text A and text B. This is because, in presentation documents, when equivalent pieces of information are compared, the text color of each is often changed for emphasis.
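- The rare-color rule can be sketched as follows. The text does not define “not often used”, so the threshold `max_share` (a color counts as rare when it covers at most 10 percent of the text runs) is an assumption, as are the function names and the representation of colors as strings.

```python
from collections import Counter


def rare_colors(text_colors, max_share=0.1):
    """Return the colors that account for at most max_share of all text runs.

    text_colors: one color name per text run in the whole document.
    max_share is a hypothetical threshold, not taken from the source.
    """
    counts = Counter(text_colors)
    total = len(text_colors)
    return {color for color, n in counts.items() if n / total <= max_share}


def similarly_emphasized(color_a, color_b, text_colors, max_share=0.1):
    """Two texts count as similarly emphasized when both use rare colors,
    even if the two colors differ."""
    rare = rare_colors(text_colors, max_share)
    return color_a in rare and color_b in rare
```

- In a document with 18 black text runs, one red, and one blue, red and blue are both rare, so a red text and a blue text are treated as using a similar emphasis expression.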
- The slide from which the subheadings are extracted in this way is a partial table of contents slide candidate. If a subheading slide exists for a subheading, the candidate is confirmed as a partial table of contents slide, and the subheading slide is extracted.
- Subheading slides are not necessarily found for all subheadings. Therefore, if the number of subheadings that have a corresponding subheading slide is at least a predetermined number, or at least a predetermined percentage of all the subheadings, it is determined that the subheading slides corresponding to the subheadings of the partial table of contents slide have been found.
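- The acceptance test for a partial table of contents candidate can be sketched as below. The threshold symbols in the source are lost, so `min_count` and `min_ratio` are hypothetical parameters with arbitrary defaults, and matching a subheading to a later slide by exact title equality is also an assumption.

```python
def is_partial_toc(subheadings, later_titles, min_count=2, min_ratio=0.5):
    """Decide whether a candidate slide is a partial table of contents slide.

    subheadings: strings extracted from the candidate slide.
    later_titles: titles of the slides that follow the candidate.
    min_count, min_ratio: hypothetical thresholds (count or fraction of
    subheadings that must have a matching subheading slide).
    """
    if not subheadings:
        return False
    titles = set(later_titles)
    found = sum(1 for subheading in subheadings if subheading in titles)
    return found >= min_count or found / len(subheadings) >= min_ratio
```

- With the subheadings “ABC”, “DEF”, and “GHI” and following titles “ABC”, “DEF”, and “Other”, two of three subheadings are matched, so the candidate is accepted under the default thresholds.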
- The above method is an algorithm that first finds partial table of contents candidates. However, similar results are obtained with an algorithm that first extracts the slide titles and then detects, among the slides preceding those from which the titles were extracted, a single slide containing itemized text or similarly emphasized text that includes the title group.
- the partial hierarchy extraction unit 105 adds a partial hierarchy column to the table of contents structure specifying table, and records the extraction result of the partial hierarchy in this partial hierarchy column.
- the value to be recorded is the relative change in the depth of the hierarchy caused by the extraction of the partial hierarchy.
- “0” is given to the partial hierarchy column of the row of the partial table of contents slide of the table of contents structure specification table. Since the subheading slide that is a child of the partial table of contents slide is one level deeper than the partial table of contents slide, “+1” is given to the partial hierarchy column of the subheading slide of the table of contents structure specification table.
- FIG. 19 and FIG. 20 show tables in which partial hierarchy columns are added to the table of contents structure specification table for the presentation document shown in FIG.
- FIG. 19 shows a partial hierarchy extraction result when the parallel pattern is adopted in the segment extraction unit 104
- FIG. 20 shows a partial hierarchy extraction result when the hierarchy pattern is adopted.
- Between FIG. 19 and FIG. 20, the segments differ, so the slide groups from which the partial hierarchies are extracted also differ.
- The presentation document shown in FIG. 13 has two slides having a partial table of contents.
- One is slide number 4, which includes bullets, and the other is slide number 11, which includes character strings with the same decoration.
- Slides 5, 6, 7, and 8 are the subheading slides corresponding to slide number 4.
- Slides 12, 14, 15, and 16 are the subheading slides corresponding to slide number 11. Therefore, in the table of contents structure specification tables shown in FIG. 19 and FIG. 20, “+1” is given to the partial hierarchy column of the rows of these subheading slide groups.
- Slide number 13 has no title. When such a slide is processed by the partial hierarchy extraction unit 105, the value of the partial hierarchy column of the row of the preceding slide, here slide number 12, is copied to the partial hierarchy column of the row of the slide without a title. Since slide numbers 5, 7, and 8 and slide numbers 12, 14, and 15 are subheading slides, “1” is given to the delimiter flag column of the rows of those slides in the table of contents structure specification table.
- the partial table of contents is an example in which a partial hierarchical structure is extracted, and other information may be used as long as a partial hierarchical relationship between slides can be extracted.
- the extraction information combining unit 106 acquires the table of contents structure specification table from the partial hierarchy extraction unit 105 and specifies the table of contents structure. Then, the extracted information combining unit 106 outputs the result to the output unit 302. The extracted information combining process corresponds to step S106 in FIG.
- The extracted information combining unit 106 adds, for each slide, the values of the cover column, table of contents column, segment column, and partial hierarchy column of the table of contents structure specification table extracted by the respective information extraction units, and thereby identifies the final depth of the hierarchy of each slide. Further, the extracted information combining unit 106 adds a hierarchy depth column to the table of contents structure specification table and records the calculated depth in that column.
- the base layer is extracted by the cover extraction unit 102, and the relative change in the depth of the layer is sequentially identified by the subsequent table of contents equivalent information extraction unit 103, the segment extraction unit 104, and the partial layer extraction unit 105.
- the table of contents structure can be identified by the addition process.
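- The addition process can be sketched as follows. The dictionary-based rows, the key names, and the sample values (including a cover column of 1 for non-cover slides) are assumptions for illustration; blank cells are represented by `None` and treated as 0.

```python
# Hypothetical column keys for one row of the specification table.
COLUMNS = ("cover", "toc", "segment", "partial")


def add_depth_column(rows):
    """Add a "depth" entry to each row: the sum of the four columns,
    counting a blank (None or missing) cell as 0."""
    for row in rows:
        row["depth"] = sum(row.get(name) or 0 for name in COLUMNS)
    return rows


# Illustrative rows in the hierarchical pattern (values are assumed).
rows = add_depth_column([
    {"slide": 2, "cover": 1, "toc": 0, "segment": 0, "partial": None},
    {"slide": 4, "cover": 1, "toc": 0, "segment": 1, "partial": 0},
    {"slide": 5, "cover": 1, "toc": 0, "segment": 1, "partial": 1},
])
```

- In this illustration, the headline slide 2 ends at depth 1, the partial table of contents slide 4 at depth 2, and its subheading slide 5 at depth 3.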
- FIGS. 21 and 22 show the results of specifying the depth of the hierarchy from the table of contents structure specification tables of FIGS. 19 and 20, respectively.
- the separator flag column and the depth column of the hierarchy are extracted from the table of contents structure specifying table, and the result is summarized as the table of contents structure. It can be seen from the depth column of the table of contents structure in FIGS. 21 and 22 whether each slide belongs to a large chapter or a small chapter.
- the delimiter flag string indicates when a new chapter starts.
- The cover extraction unit 102 and the table of contents equivalent information extraction unit 103 need not all be provided. However, if the cover extraction unit 102 is not provided, whichever of the table of contents equivalent information extraction unit 103, the segment extraction unit 104, and the partial hierarchy extraction unit 105 operates first after the configuration information extraction unit 101 generates a table of contents structure specification table in which all cover columns are 1.
- the output unit 302 displays the table of contents structure specification table specified by the table of contents structure specifying device 10 using, for example, a display device.
- the output unit 302 is realized by a CPU of an information processing apparatus that operates according to a program.
- the extraction result may be output as a file or may be output using a printing machine or the like.
- FIG. 23 and FIG. 24 show an example of the mode of output, and show the results of displaying the table of contents structure of FIG. 21 and FIG. 22 as a tree structure graph, respectively.
- a node in the figure represents one slide, and “# number” in the node represents a slide number.
- the tree structure of FIG. 23 or FIG. 24 is created according to the following rules.
- Each slide is a child of the nearest preceding slide whose depth column value is smaller than the slide's own depth column value.
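- This rule can be sketched with a stack of candidate ancestors. The function name `build_parents` and the input format, a list of (slide number, depth) pairs in slide order, are assumptions for illustration.

```python
def build_parents(slides):
    """Map each slide number to its parent slide number (None for a root).

    slides: list of (slide_number, depth) pairs in presentation order.
    The parent is the nearest preceding slide with a smaller depth.
    """
    parents = {}
    ancestors = []  # stack of (slide_number, depth) with increasing depth
    for number, depth in slides:
        # Discard candidates that are not shallower than this slide.
        while ancestors and ancestors[-1][1] >= depth:
            ancestors.pop()
        parents[number] = ancestors[-1][0] if ancestors else None
        ancestors.append((number, depth))
    return parents
```

- For depths `[(1, 0), (2, 1), (3, 1), (4, 1), (5, 2), (6, 2), (9, 1), (10, 1)]`, slides 5 and 6 become children of slide 4, and slides 2, 3, 4, 9, and 10 become children of slide 1.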
- In FIG. 23, in which the parallel pattern is adopted in the segment extraction unit 104, the headline slide and its subordinate slide group are not expressed in a hierarchical relationship within the broken-line frames (520, 521).
- For example, slide number 9 and slide number 10 are in the same hierarchy. However, since “1” is given to the delimiter flag column of the row of slide number 10 in the table of contents structure specification table of FIG. 21, it can be seen that a new topic starts from slide number 10. Therefore, the output unit 302 can determine that the contents of slide number 9 and slide number 10 are different.
- To show this, a separator line may be displayed between the nodes of those slides, or the broken-line frames (520, 521) of FIG. 23 may be displayed.
- In FIG. 24, in which the hierarchical pattern is adopted in the segment extraction unit 104, the portions corresponding to the broken-line frames (520, 521) in FIG. 23 are expressed in a hierarchical relationship, as indicated by the broken-line frames (522, 523). This is because the headline slide and its subordinate slide group are hierarchized. Because FIG. 24 has more hierarchy levels than FIG. 23, the division of the topic between slide number 9 and slide number 10 can be understood from the hierarchical relationship alone.
- An example of another output mode is shown in FIG. 25. In FIG. 25, the results are displayed in an index format based on the obtained table of contents structure.
- the left figure 524 of FIG. 25 is the display result for FIG. 21, and the right figure 525 is the display result for FIG.
- A slide whose row has “1” in the hierarchy depth column and “1” in the delimiter flag column of the table of contents structure specification table is regarded as the beginning of a chapter.
- the slide numbers 2 and 10 regarded as the beginning of the chapter are given chapter numbers “Chapter 1” and “Chapter 2”, respectively.
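- Assigning chapter numbers for the index display can be sketched as follows. The row format and the function name `chapter_labels` are assumptions for illustration; the condition (depth 1 and delimiter flag 1) follows the description above.

```python
def chapter_labels(rows):
    """Return {slide_number: "Chapter n"} for slides that begin a chapter.

    rows: dicts with "slide", "depth" and "delimiter" keys, in slide order.
    A slide begins a chapter when its depth is 1 and its delimiter flag is 1.
    """
    labels = {}
    chapter = 0
    for row in rows:
        if row["depth"] == 1 and row["delimiter"] == 1:
            chapter += 1
            labels[row["slide"]] = f"Chapter {chapter}"
    return labels
```

- With slides 2 and 10 at depth 1 and flagged as delimiters, the sketch labels them “Chapter 1” and “Chapter 2”.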
- the slide is not output.
- The user of the table of contents structure specifying device 10 can browse the table of contents structure in the familiar manner used in general books, and can therefore easily find the slides needed.
- FIG. 26 is a block diagram illustrating an example of a configuration of a table of contents structure specifying device according to the second embodiment.
- the continuous slide extraction unit 107 acquires a table of contents structure specification table from the table of contents equivalent information extraction unit 103 and extracts continuous slides. Then, the continuous slide extraction unit 107 adds the result to the table of contents structure specification table and outputs the result to the segment extraction unit 104.
- “Continuous slides” refers to a group of slides in the same chapter whose titles are identical or partly identical.
- Typical features of continuous slides are: (i) the same title continues, (ii) titles are numbered consecutively, and (iii) slides titled “continue” or “Cont'd”, or slides without titles, follow.
- The continuous slide extraction unit 107 adds a continuous column to the table of contents structure specification table and records the continuous slides extracted by the processes (i) to (iii) in the continuous column.
- the recording method is as follows. In the table of contents structure specification table, “slide number of the head of the continuous slide” is given to the continuous column of the row of the slide other than the head slide of the continuous slide. However, this is an example of a method for recording continuous columns in the table of contents structure specification table, and the recording method is not limited to this as long as it is understood that slides are continuous.
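- The recording method can be sketched as follows. Rule (ii) is approximated here by comparing titles after stripping a trailing counter such as “(2)”; that normalization, the function names, and the set `CONTINUATION_TITLES` are assumptions for illustration, and chapter boundaries are ignored for brevity.

```python
import re

# Titles that mark a continuation; taken from the examples in the text.
CONTINUATION_TITLES = {"continue", "cont'd"}


def strip_counter(title):
    """Drop a trailing counter such as " (2)" or " 2" (hypothetical rule)."""
    return re.sub(r"\s*\(?\d+\)?\s*$", "", title).strip()


def continuous_column(titles):
    """Return the continuous column: for each non-head continuous slide,
    the 1-based slide number of the head of its run, otherwise None.

    titles: slide titles in order; None or "" for a slide without a title.
    """
    column = [None] * len(titles)
    head = 0  # 0-based index of the head of the current run
    for i in range(1, len(titles)):
        current, head_title = titles[i], titles[head]
        # Rule (iii): no title, or a title that only says "continue".
        untitled = not current or current.strip().lower() in CONTINUATION_TITLES
        # Rules (i) and (ii): same title once trailing counters are removed.
        same_base = (bool(current) and bool(head_title)
                     and strip_counter(current) != ""
                     and strip_counter(current) == strip_counter(head_title))
        if untitled or same_base:
            column[i] = head + 1
        else:
            head = i
    return column
```

- For the titles “Cover”, “Method (1)”, “Method (2)”, “Results”, no title, “Summary”, “Summary”, the sketch records 2 for slide 3, 4 for slide 5, and 6 for slide 7.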
- FIG. 28 shows the table of contents structure specification table obtained when continuous slides are extracted from the presentation document shown in FIG. 13.
- Slide numbers 5 and 6, slide numbers 12 and 13, and slide numbers 15 and 16 are extracted as continuous slides by the processes (i) to (iii).
- Slide numbers 5 and 6 have consecutively numbered titles, slide number 13 does not have a title, and slide numbers 15 and 16 have the same title in succession.
- Therefore, in the table of contents structure specification table, “#5” is given to the continuous column of the row of slide number 6, “#12” to that of slide number 13, and “#15” to that of slide number 16.
- a continuous slide is regarded as one slide. Then, the table of contents structure is specified by performing the same processing as in the first embodiment.
- the partial hierarchy extraction unit 105 needs to refer to the title of the slide in order to specify the subtitle slide. Therefore, when a continuous slide is regarded as one slide, a representative title in the continuous slide is required.
- The representative title is obtained by the following process. (1) A character string common to the titles of the slides in the continuous slides is used as the representative title. (2) When the continuous slides arise from slides with no title or with a title corresponding to “continuation”, the title of the first slide in the continuous slides is used as the representative title.
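- A sketch of the representative title selection follows. The “common character string” of rule (1) is approximated by the common prefix of the titles with trailing punctuation trimmed; that approximation, the function name, and the `CONTINUATION_TITLES` set are assumptions for illustration.

```python
import os

# Titles that only mark a continuation; taken from the examples in the text.
CONTINUATION_TITLES = {"continue", "cont'd"}


def representative_title(titles):
    """Pick one title for a run of continuous slides.

    titles: titles of the slides in the run, head slide first;
    None or "" for a slide without a title.
    """
    real = [t for t in titles
            if t and t.strip().lower() not in CONTINUATION_TITLES]
    if len(real) <= 1:
        # Rule (2): the other slides are untitled or say "continue".
        return titles[0]
    # Rule (1), approximated by the common prefix of the real titles.
    common = os.path.commonprefix(real).rstrip(" (-:.")
    return common or titles[0]
```

- For example, “Method (1)” and “Method (2)” give “Method”, while a run consisting of “Results” followed by an untitled slide gives “Results”.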
- FIG. 29 and FIG. 30 show the table of contents structure specification table and the table of contents structure obtained from the table of contents structure specification table of FIG. 28 through the segment extraction unit 104, the partial hierarchy extraction unit 105, and the extraction information combining unit 106. In FIG. 29, the parallel pattern is used in the segment extraction unit 104, and in FIG. 30, the hierarchical pattern is used. Both FIG. 29 and FIG. 30 show the segment column, partial hierarchy column, and hierarchy depth column of the rows of the slides that are continuous slides, in the table of contents structure specification table and in the table of contents structure.
- the output unit 302 displays a table of contents structure specifying table including information on continuous slides specified by the table of contents structure specifying device 10 using, for example, a display device.
- FIG. 31 and FIG. 32 are the output states of the tree structure graphs, as in FIG. 23 and FIG. 24, and are the results of displaying the table of contents structures of FIG. 29 and FIG. 30, respectively.
- “# number” of each node represents a slide number
- “# number- # number” represents a continuous slide.
- The table of contents structure is hierarchized, and the continuous slides are displayed together, so the hierarchical relationship between the slides is easier to grasp than in FIGS. 23 and 24.
- the title of each slide may be displayed instead of the slide number. At this time, the representative title may be used as a node of the continuous slide.
- FIG. 33 shows the result displayed in the same manner of output as in FIG.
- the obtained table of contents structure is displayed in an index format.
- The left figure of FIG. 33 shows the result of displaying the table of contents structure of FIG. 29, and the right figure 530 of FIG. 33 shows the result for the table of contents structure of FIG. 30.
- the same title appears continuously as an index.
- In FIG. 30, since continuous slides can be regarded as one slide, duplication of titles can be eliminated by listing a plurality of slide numbers under one index entry. Therefore, compared with FIG. 25, the result shown in FIG. 33 reads more naturally as an index. As described above, the extraction of continuous slides is particularly effective at the time of output.
- FIG. 34 is a block diagram illustrating an example of a hardware configuration of the table of contents structure specifying device 10 illustrated in FIG. 1 or FIG. 26.
- the table of contents structure specifying device 10 includes a control unit 31, a main storage unit 32, an external storage unit 33, an operation unit 34, a display unit 35, and an input unit 36.
- the main storage unit 32, the external storage unit 33, the operation unit 34, the display unit 35, and the input unit 36 are all connected to the control unit 31 via the internal bus 30.
- the control unit 31 includes a CPU (Central Processing Unit) and the like.
- the control unit 31 executes the process of the table of contents structure specifying device 10 described above in accordance with the table of contents structure specifying program 500 stored in the external storage unit 33.
- the main storage unit 32 is composed of RAM (Random-Access Memory) or the like.
- the main storage unit 32 loads the table of contents structure specifying program 500 stored in the external storage unit 33 and is used as a work area of the control unit 31.
- the configuration information in FIG. 3, the table of contents structure specification table in FIG. 6, and the data in the table of contents structure in FIG. 21 are configured as a storage area structured in the main storage unit 32.
- the external storage unit 33 includes a non-volatile memory such as a flash memory, a hard disk, a DVD-RAM (Digital Versatile Disc Random-Access Memory), or a DVD-RW (Digital Versatile Disc ReWritable).
- the external storage unit 33 stores in advance the table of contents structure specifying program 500 for causing the control unit 31 to perform the above processing. Further, in accordance with an instruction from the control unit 31, the external storage unit 33 supplies the data it stores to the control unit 31 and stores the data supplied from the control unit 31.
- the document database 301 in FIG. 1 or FIG. 26 is configured in the external storage unit 33. When the table of contents structure specifying process is performed, a part of the data is stored in the main storage unit 32 and used for the operation of the control unit 31.
- the document database 301 may be configured by a device different from the hardware of the table of contents structure specifying device 10 and connected to the table of contents structure specifying device 10 via a network. Further, the document database 301 may be supplied by the above-described storage medium that can be connected to the external storage unit 33.
- the operation unit 34 includes a keyboard, a pointing device such as a mouse, and an interface device that connects the keyboard and the pointing device to the internal bus 30.
- a command or the like for designating a target document for specifying the table of contents is input via the operation unit 34 and supplied to the control unit 31.
- the display unit 35 includes, for example, a CRT (Cathode Ray Tube) or an LCD (Liquid Crystal Display) and a circuit that drives it, and uses them to display the table of contents structure, for example in an index format or as a tree-structure diagram.
- the input unit 36 includes, for example, a network interface, inputs target document data from the external document database 301, and supplies the data to the control unit 31.
- the table of contents structure specifying program 500 is executed by processing using the control unit 31, the main storage unit 32, the external storage unit 33, the operation unit 34, the display unit 35, the input unit 36, and the like as resources.
- even when a presentation document carries no explicit table of contents information, the table of contents structure specifying device 10 of the present invention can specify a table of contents structure suited to the document, based on a plurality of information sources such as formats and character strings in the presentation document.
- the table of contents structure specifying device 10 can specify the table of contents structure of a presentation document having an arbitrary hierarchical structure. Furthermore, since the table of contents structure specifying device 10 can extract continuous slides and display them together, the structure of the presentation document can be easily grasped.
- the partial structure extracting means may include segment extracting means that extracts a slide including only a text box having a predetermined format as a headline slide, and extracts, as a segment, the group of slides from one headline slide until just before the next headline slide including a text box of a format similar to the text box of that headline slide, or until the last slide.
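The segment extraction described above can be sketched as below. This is a simplified illustration under stated assumptions: the `TextBox` and `Slide` classes, the `fmt` string that stands in for a text box format, and the use of exact format equality in place of "similar format" are all hypothetical choices for the example, not details given in the source.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    fmt: str   # hypothetical format key, e.g. "center-bold-44pt"
    text: str


@dataclass
class Slide:
    number: int
    boxes: list  # list of TextBox


def is_headline(slide, headline_fmt):
    """A headline slide contains only text boxes of the predetermined format."""
    return bool(slide.boxes) and all(box.fmt == headline_fmt for box in slide.boxes)


def extract_segments(slides, headline_fmt):
    """Group slides into segments, each led by a headline slide.

    A segment runs from a headline slide until just before the next
    headline slide, or until the last slide. The headline slide is kept
    as the representative of its segment. Slides that precede the first
    headline slide belong to no segment in this sketch.
    """
    segments = []
    current = None
    for slide in slides:
        if is_headline(slide, headline_fmt):
            current = {"representative": slide.number, "members": []}
            segments.append(current)
        elif current is not None:
            current["members"].append(slide.number)
    return segments
```

A real format comparison would likely tolerate small differences (font size, position); exact equality is used here only to keep the sketch short.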
- the partial structure extracting means may include partial table of contents extracting means that treats, as a partial table of contents slide, a slide whose text box other than the title contains a subheading, that is, a character string included in the title of another slide (a title being a text box of a predetermined format), and that extracts the partial table of contents slide and the subheading slides, that is, the slides whose titles include the subheadings, as a relative relation structure of table of contents and content among the slides of the subset.
- the partial table of contents extracting means extracts a character string included in a text box with only bullets or a character string of a text box having a common format as the subheading.
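The partial table of contents extraction can be illustrated with the following sketch. The dictionary-based slide representation (`number`, `title`, `bullets`), the function name, and the choice to search only the slides that follow the candidate slide are assumptions made for this example.

```python
def extract_partial_toc(slides):
    """Find partial table of contents slides and their subheading slides.

    slides: list of dicts with keys "number", "title" and "bullets"
            (hypothetical shape; "bullets" holds the lines of the
            non-title text boxes).
    Returns {toc_slide_number: {subheading: [subheading slide numbers]}}.
    """
    result = {}
    for position, slide in enumerate(slides):
        children = {}
        for subheading in slide["bullets"]:
            # A subheading slide is a later slide whose title contains
            # the subheading string.
            hits = [
                later["number"]
                for later in slides[position + 1:]
                if subheading and subheading in later["title"]
            ]
            if hits:
                children[subheading] = hits
        if children:
            result[slide["number"]] = children
    return result
```

A slide whose bullet lines match no later title is simply not reported as a partial table of contents slide.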
- the partial structure extracting means may include multi-table of contents slide extracting means that operates when the presentation document contains a plurality of slides including the same bulleted text box and the number of those slides matches the number of lines of the bulleted character string; it associates each slide group sandwiched between the slides including the same bulleted text box with the character string of one line of the bulleted list, and extracts a relative relational structure of the slide groups in accordance with each line of the bulleted list.
- the multi-table of contents slide extracting means associates each slide group sandwiched between slides including the same bulleted text box with the character string of a line of the bulleted list, based on the difference in format of each line in the same bulleted text box.
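A minimal sketch of the multi-table of contents idea follows, assuming a hypothetical slide representation in which `bullets` is the list of lines of a bulleted text box (or `None`). It pairs the k-th slide group with the k-th line by position only; the format-based association (for example, a highlighted line on each repeated slide) is not modeled here.

```python
from collections import defaultdict


def extract_multi_toc(slides):
    """Associate slide groups with the lines of a repeated bulleted list.

    slides: list of dicts with keys "number" and "bullets"
            (hypothetical shape for this sketch).
    Returns {line: [slide numbers in the group for that line]} for the
    first bulleted list that repeats exactly as many times as it has
    lines, or {} if there is none.
    """
    positions = defaultdict(list)
    for position, slide in enumerate(slides):
        if slide["bullets"]:
            positions[tuple(slide["bullets"])].append(position)

    for lines, found in positions.items():
        # The list must repeat, and the repeat count must match its line count.
        if len(found) > 1 and len(found) == len(lines):
            bounds = found + [len(slides)]
            return {
                line: [s["number"] for s in slides[start + 1:end]]
                for line, start, end in zip(lines, bounds, bounds[1:])
            }
    return {}
```

Each group consists of the slides sandwiched between one occurrence of the repeated list and the next occurrence, or the end of the document for the last line.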
- a cover extraction unit may be provided that identifies a cover slide in the presentation document using the slide format characteristics, and extracts the identified cover slide and other slides as a relation structure between the cover and the text.
- table of contents equivalent information extracting means may be provided that detects information corresponding to a table of contents using the format of the text boxes of the slides and/or their character string information, and specifies the slide including the table of contents and/or the slides corresponding to its heading items.
- the table of contents equivalent information extraction unit specifies a slide including the table of contents and a title slide whose title is a headline included in the table of contents.
- continuous slide extracting means may be provided that, when adjacent slides in the presentation document have text boxes of the same format and part or all of the character strings included in them are identical, extracts the identical character string in the same-format text boxes of the adjacent slides as a representative title of the adjacent slides.
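The continuous slide extraction can be sketched as below. This is an illustration under assumptions: slides are given as hypothetical `(number, fmt, title)` tuples, "part of the character string is identical" is approximated by a shared leading substring, and `min_len` is an arbitrary threshold introduced to avoid merging on one or two coincidental characters.

```python
from os.path import commonprefix


def extract_continuous(slides, min_len=3):
    """Merge adjacent slides whose same-format titles share a common part.

    slides: list of (number, fmt, title) tuples in slide order
            (hypothetical shape for this sketch).
    Returns a list of (representative_title, [slide numbers]).
    """
    groups = []  # each item: (representative title, format, member numbers)
    for number, fmt, title in slides:
        if groups:
            rep, group_fmt, members = groups[-1]
            # commonprefix compares character by character; trailing
            # separators such as " (" are trimmed from the shared part.
            shared = commonprefix([rep, title]).rstrip(" (-:")
            if fmt == group_fmt and len(shared) >= min_len:
                groups[-1] = (shared, group_fmt, members + [number])
                continue
        groups.append((title, fmt, [number]))
    return [(rep, members) for rep, _, members in groups]
```

A prefix test is a deliberate simplification: unrelated titles that happen to begin alike would also be merged, so a practical version would need a stricter similarity rule.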
- the partial structure extracting step may include a segment extraction step that extracts a slide including only a text box having a predetermined format as a headline slide, extracts, as a segment, the group of slides from one headline slide until just before the next headline slide including a text box of a format similar to the text box of that headline slide, or until the last slide, and takes the one headline slide as the representative slide of the segment.
- the partial structure extracting step may treat, as a partial table of contents slide, a slide whose text box other than the title contains a subheading, that is, a character string included in the title of another slide (a title being a text box of a predetermined format).
- a character string included in a text box with only bullets or a character string of a text box having a common format is extracted as the subheading.
- the partial structure extracting step may include a multi-table of contents slide extraction step that operates when the presentation document contains a plurality of slides including the same bulleted text box and the number of those slides matches the number of lines of the bulleted character string; it associates each slide group sandwiched between the slides including the same bulleted text box with the character string of one line of the bulleted list, and extracts a relative relational structure of the slide groups in accordance with each line of the bulleted list.
- the multi-table of contents slide extraction step associates each slide group sandwiched between slides including the same bulleted text box with the character string of a line of the bulleted list, based on the difference in format of each line in the same bulleted text box.
- the table of contents structure specifying method may include an output step of displaying the hierarchy of each slide determined in the integration step in a tree structure having each slide as a node.
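The tree output can be sketched as follows, assuming the integration step has already assigned a hierarchy depth to each slide. The input shape (a list of `(label, depth)` pairs in slide order), the synthetic `document` root node, and both function names are assumptions for this example.

```python
def build_tree(levels):
    """Build a tree from (label, depth) pairs given in slide order.

    depth 0 is the top of the hierarchy; a synthetic root node named
    "document" (an assumption of this sketch) holds the top-level slides.
    """
    root = {"label": "document", "children": []}
    stack = [(-1, root)]  # (depth, node) path from the root to the last node
    for label, depth in levels:
        node = {"label": label, "children": []}
        # Climb back up until the top of the stack is a valid parent.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root


def render(node, indent=0):
    """Return the tree as a list of text lines, two spaces per level."""
    lines = ["  " * indent + node["label"]]
    for child in node["children"]:
        lines.extend(render(child, indent + 1))
    return lines
```

Labels can be slide numbers, slide-number ranges for continuous slides, or slide titles, depending on what the output should show.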
- the table of contents structure specifying method may include a cover extraction step that specifies a cover slide in the presentation document using a slide format feature, and extracts the specified cover slide and the other slides as a cover-text relational structure.
- the table of contents structure specifying method may include a table of contents equivalent information extraction step that detects, for the entire presentation document, information corresponding to a table of contents using the format of the text boxes of the slides and/or their character string information, and specifies the slide including the table of contents and/or the slides corresponding to its heading items.
- the table of contents equivalent information extraction step specifies a slide including the table of contents and a title slide whose title is a headline included in the table of contents.
- the table of contents structure specifying method may be configured to handle the case where adjacent slides in the presentation document have text boxes of the same format and part or all of the character strings included in those text boxes are identical.
- the central part that performs the table of contents structure specifying process, including the control unit 31, the main storage unit 32, the external storage unit 33, the operation unit 34, the internal bus 30, and the like, can be realized using a normal computer system rather than a dedicated system.
- for example, a computer program for executing the above operation may be stored and distributed on a computer-readable recording medium (flexible disk, CD-ROM, DVD-ROM, etc.), and the table of contents structure specifying device that performs the above-described processing may be configured by installing the computer program in a computer.
- the table of contents structure specifying device may also be configured by storing the computer program in a storage device included in a server device on a communication network such as the Internet and having a normal computer system download it.
- a computer program can be superimposed on a carrier wave and distributed via a communication network.
- the computer program may be posted on a bulletin board (BBS, Bulletin Board System) on a communication network, and the computer program may be distributed via the network.
- the table of contents structure specifying device may be configured such that the computer program is started and executed in the same manner as other application programs under the control of the OS, whereby the above-described processing is executed.
- the present invention is suitably applied to uses such as a table of contents structure extraction service and a document correction service in companies and the like.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention relates to a table of contents structure specifying device comprising: a segment extraction unit (104) and/or a partial hierarchy extraction unit (105) that acquires a predetermined document format for specifying a relational structure of slides from the characteristic document format of an object contained in a slide of a presentation document comprising a plurality of slides containing an object such as a text box, and searches, in a predetermined order, a subset composed of a plurality of continuous slides contained in the presentation document for a slide consistent with the predetermined document format, so as to extract a relative relational structure between the slides of the subset; and an extracted information coupling unit (106) that adds the hierarchy determined by the relative relational structure between the slides of the subset to the hierarchy, within the presentation document, of the slide serving as the reference of the subset, so as to determine the hierarchy of each slide composing the presentation document.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2009548918A JP5446877B2 (ja) | 2008-01-11 | 2009-01-06 | 目次構造特定装置 |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2008003964 | 2008-01-11 | ||
| JP2008-003964 | 2008-01-11 |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2009087999A1 (fr) | 2009-07-16 |
Family
ID=40853112
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2009/050045 Ceased WO2009087999A1 (fr) | 2008-01-11 | 2009-01-06 | Dispositif de spécification de structure d'index |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JP5446877B2 (fr) |
| WO (1) | WO2009087999A1 (fr) |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2010019349A3 (fr) * | 2008-08-11 | 2010-04-15 | Microsoft Corporation | Sections d'une présentation ayant des propriétés pouvant être définies par l'utilisateur |
| JP2012018596A (ja) * | 2010-07-09 | 2012-01-26 | Konica Minolta Business Technologies Inc | プレゼンテーション支援装置 |
| JP2018084936A (ja) * | 2016-11-22 | 2018-05-31 | 株式会社インタラクティブソリューションズ | スライド情報管理装置、スライド情報管理システム、スライド情報管理装置の制御方法及びスライド情報管理装置の制御プログラム |
| CN109670047A (zh) * | 2018-11-19 | 2019-04-23 | 内蒙古大学 | 一种抽象笔记生成方法、计算机装置及可读存储介质 |
| CN110704573A (zh) * | 2019-09-04 | 2020-01-17 | 平安科技(深圳)有限公司 | 目录存储方法、装置、计算机设备及存储介质 |
| US10620795B2 (en) | 2013-03-14 | 2020-04-14 | RELX Inc. | Computer program products and methods for displaying digital looseleaf content |
| CN116227441A (zh) * | 2022-12-23 | 2023-06-06 | 北京彩漩科技有限公司 | 一种pptx文件的切分、合并方法和装置 |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2006134036A (ja) * | 2004-11-05 | 2006-05-25 | Matsushita Electric Ind Co Ltd | スライド構造化装置 |
-
2009
- 2009-01-06 WO PCT/JP2009/050045 patent/WO2009087999A1/fr not_active Ceased
- 2009-01-06 JP JP2009548918A patent/JP5446877B2/ja not_active Expired - Fee Related
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2006134036A (ja) * | 2004-11-05 | 2006-05-25 | Matsushita Electric Ind Co Ltd | スライド構造化装置 |
Non-Patent Citations (1)
| Title |
|---|
| "Information Processing Society of Japan 70th National Meeting, 13 March, 2008", 13 March 2008, article YASUTAKA YAMAMOTO ET AL.: "Shanai Bunsho Kensaku System (4) -Segment Overlay ni yoru Presentation Shiryo kara no Mokuji Kozo Tokutei", pages: 1-451 - 1-452 * |
Cited By (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10423301B2 (en) | 2008-08-11 | 2019-09-24 | Microsoft Technology Licensing, Llc | Sections of a presentation having user-definable properties |
| US8108777B2 (en) | 2008-08-11 | 2012-01-31 | Microsoft Corporation | Sections of a presentation having user-definable properties |
| US8954857B2 (en) | 2008-08-11 | 2015-02-10 | Microsoft Technology Licensing, Llc | Sections of a presentation having user-definable properties |
| WO2010019349A3 (fr) * | 2008-08-11 | 2010-04-15 | Microsoft Corporation | Sections d'une présentation ayant des propriétés pouvant être définies par l'utilisateur |
| JP2012018596A (ja) * | 2010-07-09 | 2012-01-26 | Konica Minolta Business Technologies Inc | プレゼンテーション支援装置 |
| US10620795B2 (en) | 2013-03-14 | 2020-04-14 | RELX Inc. | Computer program products and methods for displaying digital looseleaf content |
| JP2021168135A (ja) * | 2016-11-22 | 2021-10-21 | 株式会社インタラクティブソリューションズ | スライド説明練習情報管理装置、スライド説明練習情報管理システム、スライド説明練習情報管理装置の制御方法及びスライド説明練習情報管理装置の制御プログラム |
| JP2018084936A (ja) * | 2016-11-22 | 2018-05-31 | 株式会社インタラクティブソリューションズ | スライド情報管理装置、スライド情報管理システム、スライド情報管理装置の制御方法及びスライド情報管理装置の制御プログラム |
| JP7161720B2 (ja) | 2016-11-22 | 2022-10-27 | 株式会社インタラクティブソリューションズ | スライド説明練習情報管理装置、スライド説明練習情報管理システム、スライド説明練習情報管理装置の制御方法及びスライド説明練習情報管理装置の制御プログラム |
| CN109670047B (zh) * | 2018-11-19 | 2022-09-20 | 内蒙古大学 | 一种抽象笔记生成方法、计算机装置及可读存储介质 |
| CN109670047A (zh) * | 2018-11-19 | 2019-04-23 | 内蒙古大学 | 一种抽象笔记生成方法、计算机装置及可读存储介质 |
| CN110704573A (zh) * | 2019-09-04 | 2020-01-17 | 平安科技(深圳)有限公司 | 目录存储方法、装置、计算机设备及存储介质 |
| CN110704573B (zh) * | 2019-09-04 | 2023-12-22 | 平安科技(深圳)有限公司 | 目录存储方法、装置、计算机设备及存储介质 |
| CN116227441A (zh) * | 2022-12-23 | 2023-06-06 | 北京彩漩科技有限公司 | 一种pptx文件的切分、合并方法和装置 |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2009087999A1 (ja) | 2011-05-26 |
| JP5446877B2 (ja) | 2014-03-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US7823061B2 (en) | System and method for text segmentation and display | |
| JP3425408B2 (ja) | 文書読取装置 | |
| US8107727B2 (en) | Document processing apparatus, document processing method, and computer program product | |
| JP5663866B2 (ja) | 情報処理装置及び情報処理プログラム | |
| JP3952216B2 (ja) | 翻訳装置及び辞書検索装置 | |
| JP5446877B2 (ja) | 目次構造特定装置 | |
| CN101661465B (zh) | 图像处理装置及图像处理方法 | |
| US7853869B2 (en) | Creation of semantic objects for providing logical structure to markup language representations of documents | |
| WO2012057891A1 (fr) | Transformation d'un document en contenu multimédia interactif | |
| US12307197B2 (en) | Systems and methods for generating social assets from electronic publications | |
| JP2003288334A (ja) | 文書処理装置及び文書処理方法 | |
| JP4682284B2 (ja) | 文書差分検出装置 | |
| US9049400B2 (en) | Image processing apparatus, and image processing method and program | |
| JP2003186889A (ja) | 文書に注釈付けし、文書イメージから要約を生成する方法及び装置 | |
| Ramel et al. | Interactive layout analysis, content extraction, and transcription of historical printed books using Pattern Redundancy Analysis | |
| JP2011060268A (ja) | 画像処理装置及び画像処理プログラム | |
| JP2000250908A (ja) | 電子書籍の作成支援装置 | |
| CN114997138B (zh) | 一种化学品说明书解析方法、装置、设备及可读存储介质 | |
| US20230385540A1 (en) | Information processing method, information processing apparatus, and storage medium | |
| JPS6154569A (ja) | 文書画像処理方式 | |
| JP4462508B2 (ja) | 情報処理装置並びに定義情報生成方法 | |
| JP4256841B2 (ja) | 情報処理装置、情報処理方法、情報処理プログラム | |
| JP2024091186A (ja) | 電子文書の閲覧用電子機器 | |
| CN119807502A (zh) | 网页内容提取方法以及电子设备 | |
| CN115661300A (zh) | 基于深度学习的网络安全可视化仪表盘生成方法 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 09700548; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2009548918; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 09700548; Country of ref document: EP; Kind code of ref document: A1 |