
WO2010132062A1 - System and methods for sentiment analysis - Google Patents


Info

Publication number
WO2010132062A1
WO2010132062A1 (PCT/US2009/044197)
Authority
WO
WIPO (PCT)
Prior art keywords
sentences
comparative
entities
sentiment
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/US2009/044197
Other languages
French (fr)
Inventor
Liu Bing
Ding Xiaowen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Illinois at Urbana Champaign
University of Illinois System
Original Assignee
University of Illinois at Urbana Champaign
University of Illinois System
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Illinois at Urbana Champaign, University of Illinois System filed Critical University of Illinois at Urbana Champaign
Priority to PCT/US2009/044197 priority Critical patent/WO2010132062A1/en
Publication of WO2010132062A1 publication Critical patent/WO2010132062A1/en
Anticipated expiration legal-status: Critical
Current legal status: Ceased

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Definitions

  • the present disclosure relates generally to data mining techniques, and more specifically to a system and methods for sentiment analysis.
  • FIG. 1 depicts an illustrative embodiment of a method for assigning entities
  • FIG. 2 depicts an illustrative embodiment of a method for identifying entities using a variety of seeds
  • FIG. 3 depicts an illustrative diagrammatic representation of a machine in the form of a computer system within which a set of instructions, when executed, may cause the machine to perform any one or more of the methodologies disclosed herein;
  • Table 1 depicts an illustrative embodiment of a plurality of data sets associated with two forums
  • Table 2 depicts an illustrative embodiment of experimental results for entity identification
  • Table 3 depicts an illustrative embodiment of experimental results for entity assignment
  • Table 4 depicts an illustrative embodiment of part-of-speech (POS) tags.
  • One embodiment of the present disclosure entails identifying a plurality of entities in opinionated text generated by a plurality of users, each user expressing one or more opinions about at least one of the plurality of entities, identifying a plurality of comparative sentences and a plurality of non-comparative sentences in the opinionated text, identifying inferior and superior entities from the plurality of entities according to a plurality of comparative opinions determined from the plurality of comparative sentences, determining a semantic orientation for each of the plurality of non-comparative sentences, and assigning at least a portion of the superior and inferior entities to one of the plurality of comparative sentences and the plurality of non-comparative sentences according to the determined semantic orientation of the plurality of non-comparative sentences, the plurality of comparative opinions, and sentiment consistency between consecutive sentences in the opinionated text.
  • An embodiment of the present disclosure entails a computer-readable storage medium having computer instructions to identify a plurality of entities in opinionated text, identify a plurality of comparative sentences and a plurality of non-comparative sentences in the opinionated text, identify inferior and superior entities from the plurality of entities according to a plurality of comparative opinions determined from the plurality of comparative sentences, and assign at least a portion of the superior and inferior entities to one of the plurality of comparative sentences and the plurality of non-comparative sentences according to the plurality of comparative opinions, sentiment consistency between consecutive sentences in the opinionated text, and a semantic orientation of the plurality of non-comparative sentences.
  • Another embodiment of the present disclosure entails an evaluation system having a controller to identify a plurality of entities in opinionated text, identify inferior and superior entities from the plurality of entities according to a plurality of comparative opinions determined from a plurality of comparative sentences in the opinionated text, and assign at least a portion of the superior and inferior entities to one of the plurality of comparative sentences and a plurality of non-comparative sentences of the opinionated text according to the plurality of comparative opinions, sentiment consistency between consecutive sentences in the opinionated text, and a semantic orientation of the plurality of non-comparative sentences.
  • a popular semantic level analysis is sentiment analysis or opinion mining, which tries to discover user opinions about products and services.
  • Such studies are mainly conducted in the context of product reviews [4, 6, 8, 9, 18, 19, 20, 22] due to the fact that reviews are focused on the entities being reviewed and contain little irrelevant information.
  • the first problem is similar to a named entity recognition (NER) problem.
  • NER named entity recognition
  • common NER methods do not work well because of the ungrammatical nature of forum posts, over-capitalization and under-capitalization. Over-capitalization means that the user may capitalize every word in the sentence, and under-capitalization means that the first letters of many entity names are not capitalized. These characteristics cause serious problems for existing entity recognition methods.
  • the second problem bears some resemblance to pronoun resolution [3, 24, 25] in natural language processing (NLP), which identifies what each pronoun in a sentence refers to. Pronoun resolution is still a major challenge in NLP.
  • Example 1 "(1) I bought Camera-A yesterday. (2) I took some pictures in the evening in my living room. (3) The images are very clear.
  • a simple approach to identifying the entities talked about in each sentence is the following: The algorithm sequentially processes each sentence. Whenever an entity name is encountered in a sentence, it is assumed that the sentence talks about that entity. It is also assumed that the subsequent sentences talk about that entity as well until a new entity name occurs. Then the new entity is the one talked about in its sentence. The subsequent sentences also talk about the new entity, and so on. This simple strategy works reasonably well in practice. However, it breaks down when a comparative sentence is encountered.
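A minimal sketch of this sequential strategy in Python; the function name and data layout are illustrative, not from the disclosure:

```python
# Sketch of the simple sequential entity-assignment strategy.
# A sentence naming a known entity is assigned that entity; sentences
# naming none inherit the entity of the previous sentence.
def assign_entities(sentences, known_entities):
    assignments = []
    current = None  # entity currently under discussion
    for sentence in sentences:
        mentioned = [e for e in known_entities
                     if e.lower() in sentence.lower()]
        if mentioned:
            current = mentioned[-1]  # a newly named entity takes over
        assignments.append(current)
    return assignments

sents = ["I bought Camera-A yesterday.",
         "I took some pictures in the evening.",
         "The images are very clear."]
print(assign_entities(sents, ["Camera-A", "Camera-B"]))
# ['Camera-A', 'Camera-A', 'Camera-A']
```

As the disclosure notes, this strategy breaks down at comparative sentences, where the entity a following sentence refers to depends on sentiment rather than on which name appeared last.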
  • Example 2 "(1) I bought Camera-A yesterday. (2) I took a few pictures in the evening in my living room. (3) The images are very clear. (4) They are definitely better than those from my old Camera-B. (5) The pictures of that camera were blurring for night shots, but for day shots it was ok"
  • Example 2 is the same as example 1 except the last sentence. Obviously, sentence (5) of example 2 talks about Camera-B. The above algorithm does not work with example 2. Since the method does not rely on pronouns, it has two advantages.
  • sentence (5) in example 2 which expresses a negative sentiment in its first clause, should refer to the inferior product.
  • This phenomenon can be called sentiment consistency, which says that consecutive sentiment expressions should be consistent with each other. It would be ambiguous if this consistency were not observed in writing.
  • a sentiment analysis method for analyzing direct opinions can be adapted to solve the aforementioned problem.
  • By a direct opinion [17], what is meant is a sentence or a clause that directly expresses a positive or negative opinion on an entity or a feature of the entity, such as illustrated by sentence (3) of Example 1.
  • Direct opinions are in contrast to comparative opinions.
  • a comparative opinion does not directly express a positive or negative opinion on anything but expresses a preference of some entities.
  • sentence (4) expresses a comparative opinion, i.e., "Camera-A" is superior or preferred to "Camera-B" when comparing their images. It can be observed that sentence (4) alone does not say any camera is good or bad, but just states a comparison.
  • NER entity recognition
  • NER aims to identify entities such as persons, organizations and locations in natural language text.
  • references [6, 11] have studied the problem in the context of comparative sentences. The methods described in these references exploit specific structures of such sentences for extraction.
  • the present disclosure is more general and not focused on comparative sentences.
  • the present disclosure can also be different from the classic NER as there is only an interest in product type of entities.
  • Reference [2] provides good surveys of existing information extraction algorithms.
  • Conditional random fields (CRF) [16] have been shown to perform the best so far. It will be shown that the method described in the present disclosure outperforms CRF dramatically for our task.
  • sentiment classification investigates ways to classify whole product reviews as positive, negative, or neutral [19, 22]. Sentiment classification is not applicable to the sentences and clauses considered in the present disclosure. Sentence-level and clause-level sentiment classification has been studied in [e.g., 15, 21, 23].
  • the present disclosure relates to feature-based sentiment analysis or opinion mining [4, 9, 18, 20], which finds sentiments expressed on product features.
  • For example, in the above example, "photo quality" and "battery" are product features. The sentiment on "photo quality" is positive and the sentiment on "battery" is negative.
  • Sentiment words are words that express desired or undesired states. Positive words express desired states, e.g., "great" and "good". Negative words express undesired states, e.g., "bad" and "poor". Identifying sentiment words has been studied in [5, 9, 13, 14]. Several lists have been compiled.
  • Context dependent opinions are determined based on the pair.
  • the present disclosure does not use this context definition.
  • a specification language is disclosed to enable the user to add/delete complex sentiment indicators, which can be words, phrases or other language constructs without touching the underlying program.
  • the present disclosure shows that a sentiment analysis method for analyzing direct opinions can be adapted to analyzing comparative sentences to mine comparative opinions.
  • references [11, 12] propose a method to find comparative and superlative sentences.
  • the teachings in these references do not determine superior entities expressed in comparative sentences. They only extract some useful items from sentences. Such items alone are not sufficient in determining the superior entities.
  • Reference [1] proposes a method to extract items from superlative sentences. It does not study sentiments either.
  • reference [7] the authors tried to identify which entity has more of a certain property in a comparative sentence. Again, it is not concerned with the problem of identifying the superior entities.
  • Reference [8] studied the sentiment analysis of comparative sentences. However, it needs a large volume of external information, i.e., product reviews.
  • the basic information unit of forums, blogs and discussion boards consists of a start post and a list of follow-up posts or replies.
  • This basic information unit is often called a thread.
  • a thread t thus can be modeled as a sequence of posts, <p1, p2, ..., pn>, where p1 is the start post.
  • Each post consists of a sequence of sentences, <s1, s2, ..., sm>.
  • An entity can be a person, a product, an organization, an event, etc.
  • Entity identification: identify the set of entities E discussed in the posts of the threads.
  • Entity assignment: determine the entities in E that each sentence si of each post pj in t (∈ T) talks about.
  • Direct opinion: A direct opinion is a positive or negative opinion on an entity or some feature of the entity without mentioning any other similar entities.
  • The sentence "The picture quality of Camera-Y is better than that of Camera-X" is a comparative sentence, which states that "Camera-Y" is superior or preferred to "Camera-X" when comparing their "picture quality".
  • the algorithm is thus iterative. Pattern mining is employed at each iteration to find more entities based on already found entities. The iterative process ends when no new entity names are found. Pruning methods are also proposed to remove those unlikely entities.
  • Step 1 - Data preparation for sequential pattern mining: This step performs two tasks. It first finds all sentences that contain any one of the seed entities, e1, e2, ..., en, in the dataset, and then generates a sequence for each occurrence of ei for pattern mining.
  • the present disclosure can use only the window of 5 words before each entity name and 5 words after each entity name.
  • Each word of a seed entity name is replaced with a generic (unique) name "ENTITYXYZ”. Utilizing this generic word can ensure that generic patterns about any entities are found.
  • each entity name can consist of more than one word.
  • the part-of-speech (POS) tag of each word can also be used.
  • each element of the sequence can be a pair, POS tag of the word and the word.
  • Example 3 The sentence that follows has POS tags attached.
  • n95 is a phone model (an entity).
  • the window is (n95 has been replaced with ENTITYXYZ): mad/JJ everyone/NN doesnt/NN have/VBP a/DT ENTITYXYZ/CD phone/NN fetish/NN ducky/JJ
  • the resulting sequence is:
  • Table 4 depicts POS Tags used above and throughout the rest of the disclosure.
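Step 1 can be sketched as follows, assuming tokens are already POS-tagged; the helper name and the (word, tag) layout are illustrative:

```python
# Sketch of Step 1: generate a (POS, word) sequence from the +/- window
# around a seed entity occurrence, replacing the entity word with the
# generic name ENTITYXYZ so that mined patterns generalize.
def make_sequence(tagged_tokens, entity_index, window=5):
    """tagged_tokens: list of (word, pos) pairs."""
    start = max(0, entity_index - window)
    end = min(len(tagged_tokens), entity_index + window + 1)
    seq = []
    for i in range(start, end):
        word, pos = tagged_tokens[i]
        if i == entity_index:
            word = "ENTITYXYZ"  # generalize the seed entity word
        seq.append((pos, word))
    return seq

# The window from Example 3, pre-tagged:
tagged = [("mad", "JJ"), ("everyone", "NN"), ("doesnt", "NN"),
          ("have", "VBP"), ("a", "DT"), ("n95", "CD"),
          ("phone", "NN"), ("fetish", "NN"), ("ducky", "JJ")]
print(make_sequence(tagged, 5)[5])
# ('CD', 'ENTITYXYZ')
```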
  • Step 2 - Sequential pattern mining: Given the set of sequences generated from step 1, a sequential pattern mining algorithm is applied to generate sequential patterns. Sequential pattern mining is a popular data mining technique [17], which finds all patterns that appear frequently in the data. The frequency threshold is set by the user and is called the minimum support. The present disclosure uses 0.01 as the minimum support. In the present disclosure each pattern contains {POStag, ENTITYXYZ} and has a length greater than or equal to 2.
  • An example pattern is:
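As an illustration of Step 2, a brute-force miner over short subsequences is sketched below; a real system would use an efficient sequential pattern algorithm such as PrefixSpan, and the parameter names here are assumptions:

```python
from itertools import combinations
from collections import Counter

# Sketch of Step 2: count subsequences across the generated sequences
# and keep those whose support meets the minimum and that contain the
# generic entity marker ENTITYXYZ. Brute force, for illustration only.
def mine_patterns(sequences, min_support=0.01, max_len=3):
    counts = Counter()
    for seq in sequences:
        seen = set()  # count each subsequence once per sequence
        for length in range(2, max_len + 1):
            for combo in combinations(seq, length):  # order-preserving
                seen.add(combo)
        counts.update(seen)
    threshold = min_support * len(sequences)
    return {p for p, c in counts.items()
            if c >= threshold and any("ENTITYXYZ" in item for item in p)}

seqs = [[("DT", "a"), ("CD", "ENTITYXYZ"), ("NN", "phone")],
        [("DT", "the"), ("CD", "ENTITYXYZ"), ("NN", "phone")]]
print(mine_patterns(seqs, min_support=1.0))
# {(('CD', 'ENTITYXYZ'), ('NN', 'phone'))}
```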
  • Step 3 Pattern matching to extract candidate entities: For each sentence in the test dataset, a system can match the generated patterns to extract a set of candidate entities. The patterns can be sorted based on their supports. In order not to generate too many spurious candidates, the matching process in a sentence terminates after five patterns have been matched.
  • Example 4 The following sentence is presented with POS tags attached:
  • The/DT misses/VBZ has/VBZ currently/RB got/VBN a/DT Nokia/NNP 7390/CD at/IN the/DT end/NN of/IN the/DT day,/VBG all/DT she/PRP does/VBZ is/VBZ text/NN and/CC make/VB calls,/NN but/CC the/DT reception/NN is/VBZ serious,/VBG where/WRB my/PRP$ 6233/CD would/MD get/VB full/JJ bars/NNS hers/PRP would/MD only/RB get/VB 1/CD or/CC 2./CD
  • the pattern, <{DT}, {NNP, ENTITYXYZ}, {CD}>, can match the sentence segment "a/DT Nokia/NNP 7390/CD" to produce the candidate entity "Nokia".
  • the pattern, <{DT}, {NNP}, {CD, ENTITYXYZ}, {IN}>, can match the sentence segment "a/DT Nokia/NNP 7390/CD at/IN" to produce the candidate entity "7390".
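Step 3 can be sketched as follows; patterns are represented as lists of tag sets, and the word aligned with the ENTITYXYZ slot is extracted as the candidate (names are illustrative):

```python
# Sketch of Step 3: slide a mined pattern over a tagged sentence and
# extract the word aligned with the ENTITYXYZ slot as a candidate.
def match_pattern(pattern, tagged_tokens):
    """pattern: list of tag sets, one of which may contain 'ENTITYXYZ'.
    tagged_tokens: list of (word, pos). Returns candidate entities."""
    candidates = []
    n, m = len(tagged_tokens), len(pattern)
    for start in range(n - m + 1):
        candidate = None
        for offset, slot in enumerate(pattern):
            word, pos = tagged_tokens[start + offset]
            if pos not in slot:
                break  # pattern does not match at this position
            if "ENTITYXYZ" in slot:
                candidate = word
        else:
            if candidate:
                candidates.append(candidate)
    return candidates

tagged = [("a", "DT"), ("Nokia", "NNP"), ("7390", "CD"), ("at", "IN")]
print(match_pattern([{"DT"}, {"NNP", "ENTITYXYZ"}, {"CD"}], tagged))
# ['Nokia']
print(match_pattern([{"DT"}, {"NNP"}, {"CD", "ENTITYXYZ"}, {"IN"}], tagged))
# ['7390']
```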
  • Step 4 - Candidate pruning: The above pattern matching method can extract many wrong entities.
  • a pruning method based on POS check is proposed by the present disclosure. It remedies some errors made by a POS tagger system. Since an entity is always associated with a POS tag in the present patterns, this method checks in the dataset to see whether the POS tag is the most frequent one for this candidate. If it is not, the candidate entity can be eliminated (a possible POS tagging error).
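A sketch of the POS-check pruning, assuming a tagged corpus given as a list of (word, tag) sentences; helper names are illustrative:

```python
from collections import Counter

# Sketch of Step 4: prune a candidate whose pattern-assigned POS tag is
# not the candidate's most frequent tag in the corpus, which suggests a
# POS tagging error rather than a real entity occurrence.
def pos_check(candidate, pattern_tag, tagged_corpus):
    tags = Counter(pos for sent in tagged_corpus
                   for word, pos in sent if word == candidate)
    if not tags:
        return False  # candidate never observed; nothing to confirm
    most_common_tag, _ = tags.most_common(1)[0]
    return most_common_tag == pattern_tag

corpus = [[("Nokia", "NNP"), ("rocks", "VBZ")],
          [("my", "PRP$"), ("Nokia", "NNP")],
          [("Nokia", "JJ")]]  # one mis-tagged occurrence
print(pos_check("Nokia", "NNP", corpus))  # True: NNP is dominant, keep
print(pos_check("Nokia", "JJ", corpus))   # False: prune this candidate
```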
  • Example 5 Given the sentence:
  • Step 5 Finding additional entities using brand and model relation.
  • the second task in this step is to use the Brand to identify additional models.
  • a regular expression is used which assumes that a model name must have a digit.
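The disclosure does not give the regular expression itself, only the constraint that a model name must contain a digit; the pattern below is an illustrative stand-in:

```python
import re

# Illustrative model-name filter: accept alphanumeric tokens (with
# optional hyphens) that contain at least one digit, per the
# disclosure's assumption that a model name must have a digit.
MODEL_RE = re.compile(r"^[A-Za-z]*\d[A-Za-z0-9-]*$")

def looks_like_model(name):
    return bool(MODEL_RE.match(name))

print(looks_like_model("7390"))   # True
print(looks_like_model("n95"))    # True
print(looks_like_model("Nokia"))  # False: no digit, likely a brand
```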
  • Step 6 - Finding more entities using syntactic patterns: Using some syntactic patterns can help find competing entities (brands and models). The syntactic patterns exploit conjunctions and comparisons in sentences. In the present disclosure, C denotes a discovered entity and CN a competitor. The following eight patterns are used:
  • The/DT correct/JJ comparison/NN was/VBD made/VBN many/JJ times/NNS as/IN e398/CD vs./IN k700/CD ./.
  • Comparative sentences express similarity and differences of more than one entity. There can be three main types of comparatives:
  • Non-equal gradable: "greater or less than" relations that express a total ordering of some entities with regard to some shared features or attributes. For example, the sentence, "Camera-X's battery life is longer than that of Camera-Y", orders Camera-X and Camera-Y based on their shared feature "battery life".
  • Non-gradable: comparisons of two or more entities that do not grade them.
  • the sentence, "Camera-X and Camera-Y have different shapes”, expresses a comparison of the shapes of the two cameras but does not grade them.
  • a superlative sentence expresses a relation of the type "greater or less than all others", i.e., it ranks one entity over all other entities. For example, the sentence, "Among Camera-A, Camera-B and Camera-C, Camera-A is the best", ranks Camera-A over Camera-B and Camera-C.
  • FIG. 1 depicts an illustrative embodiment of an algorithm based on the above disclosure.
  • the flowchart of FIG. 1 follows the simple method provided above but with special handling of comparative sentences as discussed above.
  • the input is a post, and the output is the entities discussed in each sentence.
  • the algorithm is simplified for presentation clarity.
  • the start post and quotes in replies are also considered, as entities may be inherited from them.
  • Comparative sentences here also cover superlative sentences that contain more than one entity. For a superlative sentence with only a single entity, it is treated as a normal sentence.
  • the notations used in the algorithm are:
  • opinion(): the sentiment analysis function that analyzes a non-comparative sentence.
  • compOpinion(): the sentiment analysis function that finds superior and inferior entities from a comparative sentence.
  • SENTIMENT ANALYSIS
  • Sentiment orientations of opinions identify whether the opinions are positive, negative or neutral. Since the present disclosure is not concerned with entity features as in references [4, 9], entity features are not used in the analysis. In an application, entity features can be discovered in various ways if needed, e.g., via the methods in references [9, 20]. There are three main sentiment indicators, i.e., sentiment words and phrases, negations, and but-clauses. They are discussed below.
  • Sentiment Indicators
  • Sentiment words and phrases: In most cases, sentiments in sentences are expressed with sentiment (or opinion) words, e.g., "great", "good", "bad", and "poor". Although words that express sentiments are usually adjectives and adverbs, verbs and nouns can be used to express sentiments/opinions too. Researchers have compiled sets of such words. Such lists are collectively called the sentiment lexicon. Apart from individual words, there are sentiment phrases and idioms, e.g., "cost someone an arm and a leg". Furthermore, some phrases may involve sentiment words, but the phrases as a whole carry no opinion. For example, the phrase "a good deal of" does not express an opinion although it contains the positive sentiment word "good".
  • Such phrases are called non-sentiment phrases involving sentiment words.
  • Negations Sentiment words and phrases form the basis of opinions in a sentence. Negations reverse their orientations. Apart from “not”, many other words and phrases can be used to express negations. Furthermore, “not” may not express negation in some cases, e.g., in “not only ... but also”. Such phrases are called non- negations involving negation words.
  • But-clauses: "but" signals contrast. For example, the sentence, "The picture quality is great, but not the battery life" expresses a positive sentiment on "picture quality" but a negative sentiment on "battery life". The following rule states the effect of "but": the orientation before "but" is opposite to that after "but". Apart from the word "but", many other words and phrases behave similarly, e.g., "though" and "except that". Similar to opinions and negations, not every "but" changes sentiment direction. For example, "but" in the pattern "not only ... but also" does not. Such phrases are called non-but phrases involving "but".
  • Specification for Sentiment Indicators
  • each indicator word is represented as a rule.
  • Each rule consists of two parts, an item on the left and an action on the right.
  • the <item> is either an individual word or a word attached with a type, which may be any one of the part-of-speech (POS) tags.
  • POS part-of-speech
  • the specification can consist of a set of rules. Each rule has two parts, a phrase on the left and an action on the right. Each phrase can have a target word, indicated by [T], to which the action is applied.
  • the idea is that the left-hand- side of the rule is first matched in the sentence and then the action of the rule is applied to the target in the sentence.
  • A phrase in a rule can be composed of indicator symbols, words, and distances:
  • Indicator symbol: These are indicator symbols, Po, Ne, Neu, Ng and But, derived from the individual indicator words discussed above.
  • A "type" may also be attached, specifying the POS tag of the word.
  • Word: It can be any word, with an optional type.
  • Distance: It indicates the number of words (or gap) that can appear between two non-distance items in the phrase. "num1-num2" means the gap ranges from num1 to num2 words (each num is an integer).
  • Target: It is the core item of the phrase, indicating which word the rule is applied to.
  • the action on the right states that the action symbol should be associated with the target.
  • the action symbol can be any of the outcomes or their negations, i.e.,
  • the ordering of rules can be significant. When the first rule for a target word is matched and applied, the rest will not be tried.
  • Step 1 Part-of-speech tagging: The tags are used for matching ⁇ type>'s in the rules.
  • Step 2 Applying indicator word rules: All sentiment words, negation words and but-like words in the sentence are identified in this step. After this step, one can obtain
  • the picture quality is not[Ng] good[Po], reaction is too slow[Neu], but[But] the battery life is long[Neu].
  • Step 3 - Applying phrase rules This step identifies all phrases in the sentence and performs the actions specified in the rules. After this step, the running example sentence becomes:
  • the picture quality is not[Ng] good[Po], reaction is too slow[Ne], but[But] the battery life is long[Neu].
  • Step 4 - Handling negations A negation in a sentence reverses the orientation of an opinion. For neutral, it is turned to negative. After negation handling, the running example sentence becomes ("good" is now turned to negative from positive):
  • Step 5 - Aggregating opinions: This step first finds but-symbols ("But" or "BUT"), which indicate sentiment changes. The sentiments on the two sides of a but-symbol are opposite to each other. For illustration purposes, only the sentiment in the first clause of the sentence is used.
  • Opinion aggregation: All opinion indicators in the first clause of the sentence are aggregated to arrive at the final sentiment. The algorithm simply sums up all indicators [9]. A positive (or negative) indicator is assigned 1 (or -1). If the final sum is greater than 0, the clause is positive; if the sum is less than 0, the clause is negative; and it is neutral otherwise. For our example, the sentiment of the first part (before "but") is positive.
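Steps 4 and 5 can be sketched together as follows; the indicator markup is simplified to (word, symbol) pairs using the symbols Po, Ne, Neu, Ng and But from the specification, and the function name is illustrative:

```python
# Sketch of negation handling plus opinion aggregation over the first
# clause (everything before the first but-symbol).
def first_clause_sentiment(marked_words):
    """marked_words: list of (word, symbol) with symbol in
    {'Po', 'Ne', 'Neu', 'Ng', 'But', None}."""
    score = 0
    negate = False
    for word, symbol in marked_words:
        if symbol == "But":
            break  # only aggregate indicators before the but-symbol
        if symbol == "Ng":
            negate = True
            continue
        if symbol in ("Po", "Ne", "Neu"):
            value = {"Po": 1, "Ne": -1, "Neu": 0}[symbol]
            if negate:
                # negation reverses orientation; neutral turns negative
                value = -1 if value >= 0 else 1
            score += value
            negate = False
    if score > 0:
        return "positive"
    return "negative" if score < 0 else "neutral"

marked = [("The", None), ("picture", None), ("quality", None),
          ("is", None), ("great", "Po"), ("but", "But"),
          ("not", "Ng"), ("the", None), ("battery", None), ("life", None)]
print(first_clause_sentiment(marked))                      # 'positive'
print(first_clause_sentiment([("not", "Ng"), ("good", "Po")]))  # 'negative'
```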
  • Identifying superior and inferior entities as expressed in a comparative sentence is called comparative opinion mining.
  • the sentiment analysis method above can be adapted to find superior and inferior entities in comparative sentences. This is due to the following observation:
  • Positive and negative sentiment words have their corresponding comparative and superlative forms indicating superior and inferior states respectively.
  • the positive sentiment word "good" has the comparative and superlative forms "better" and "best", which indicate superior entities.
  • comparatives and superlatives are special forms of adjectives and adverbs. In general, comparatives are formed by adding the suffix "-er" and superlatives by adding the suffix "-est" to the base (or original) adjectives and adverbs. Adjectives and adverbs with two syllables or more and not ending in y do not form comparatives or superlatives this way.
  • the heuristic rules used in the present disclosure are as follows (if a sentence matches any one of the rules, it is considered a comparative or a superlative sentence): a) pronoun + compkey + prodname, b) prodname + compkey + pronoun, c) prodname + compkey + prodname, d) pronoun + superkey, e) prodname + superkey, f) as + JJ + as (except "as long as" and "as far as"), where compkey is a comparative keyword, prodname is a product name and superkey is a superlative keyword.
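A sketch of the heuristic detector; the keyword and pronoun lists are illustrative stand-ins, rule f) (as + JJ + as) is omitted, and the rules are matched loosely (an entity anywhere before/after the keyword) rather than strictly adjacently:

```python
# Sketch of the comparative/superlative sentence detector built from
# the rule shapes above. The word lists below are small illustrative
# samples, not the disclosure's actual lexicons.
COMPKEYS = {"better", "worse", "superior", "inferior"}
SUPERKEYS = {"best", "worst"}
PRONOUNS = {"it", "this", "that", "they", "these", "those"}

def is_comparative(tokens, prodnames):
    def kind(tok):
        t = tok.lower()
        if t in COMPKEYS:
            return "comp"
        if t in SUPERKEYS:
            return "super"
        if t in PRONOUNS:
            return "pron"
        if tok in prodnames:
            return "prod"
        return None

    kinds = [kind(t) for t in tokens]
    for i, k in enumerate(kinds):
        entity_before = any(x in ("pron", "prod") for x in kinds[:i])
        entity_after = any(x in ("pron", "prod") for x in kinds[i + 1:])
        if k == "comp" and entity_before and entity_after:
            return True  # rules a)-c): entity ... compkey ... entity
        if k == "super" and entity_before:
            return True  # rules d)-e): entity ... superkey
    return False

toks = "Camera-A is better than Camera-B".split()
print(is_comparative(toks, {"Camera-A", "Camera-B"}))  # True
```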
  • Identifying superior entities: As mentioned earlier, the above sentiment analysis method for mining direct opinions can be used to identify superior/preferred entities, since a gradable comparative sentence typically has entities on the two sides of the comparative keyword, e.g., "Camera-X is better than Camera-Y". Based on sentiment analysis, if the sentence is positive, then the entities before the comparative keyword are superior; otherwise they are inferior (with negation considered). Superlative sentences can be handled in a similar way. Note that equative and non-gradable comparisons do not express preferences.
  • EMPIRICAL EVALUATION
  • This section evaluates the proposed techniques for the two tasks, entity identification and entity assignment. The disclosure below presents the datasets and corresponding experimental results.
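A sketch of superior-entity identification for gradable comparatives, with a trivial stand-in for the sentiment step; the keyword lists are illustrative:

```python
# Sketch: read superior/inferior entities off the two sides of the
# comparative keyword, using the keyword's orientation as a trivial
# stand-in for the full direct-opinion sentiment analysis.
POSITIVE_COMP = {"better", "superior", "nicer"}
NEGATIVE_COMP = {"worse", "inferior"}

def superior_entity(tokens, prodnames):
    """Return (superior, inferior) for 'X <compkey> ... Y' sentences,
    or None if no gradable comparison is found."""
    for i, tok in enumerate(tokens):
        t = tok.lower()
        if t in POSITIVE_COMP or t in NEGATIVE_COMP:
            left = [w for w in tokens[:i] if w in prodnames]
            right = [w for w in tokens[i + 1:] if w in prodnames]
            if left and right:
                if t in POSITIVE_COMP:
                    return left[-1], right[0]   # left side is superior
                return right[0], left[-1]       # negative: sides swap
    return None

toks = "Camera-A is better than Camera-B".split()
print(superior_entity(toks, {"Camera-A", "Camera-B"}))
# ('Camera-A', 'Camera-B')
```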
  • HowardForums is a message board dedicated to mobile phones while AVSforum is a message board dedicated to Home Theater and the products used. Data from AVSforum are discussions about Plasma and LCD TVs.
  • Table 1 shows the characteristics of the two data sets.
  • NET is a Named Entity Tagger, which can be used in the present case as product names are named entities.
  • the CRF system used in the description below is from
  • The training data for the CRF system is the data obtained from step 2 of our algorithm. Recall that the data from step 2 is automatically generated. The entities in those sentences are regarded as positive data and all the other words in the sentences are regarded as negative data.
  • the test data is the whole set for all the systems. Using the whole set as the test data is reasonable because the present system does not use any manually labeled training data.
  • Table 3 gives the experimental results for entity assignment, which include the results of two baseline methods.
  • the disclosed method uses ED to denote the proposed technique. Below, the columns are explained one by one, and the results are discussed.
  • Two sets of experiments were conducted. The first set is denoted by “Next Sentences” in Table 3. "Next Sentences” means that only the comparative sentences and their subsequent sentences are considered. This set of experiments thus shows how effective the ED technique is in its intended task. The second set of experiments is denoted by "All Sentences", which considers all sentences. It shows how the ED method affects the overall implicit entity assignment task.
  • Column 1 (Baseline1 - next sentences): Baseline1 works as follows: if a sentence does not mention any product name, one simply takes the last product of the previous sentence. Note that the product of the previous sentence can be inherited from its previous sentence, and so on. The accuracy measure is used here because one can gauge how accurate the assignments of products to sentences are.
  • Column 2 (Baseline2 - next sentences): In the Baseline2 method, if a sentence does not mention a product name, it simply takes the first product of the previous sentence. One can observe that Baseline2 is always more accurate than Baseline1 because in most cases, the first product is the superior product in a comparative sentence and the next sentence also tends to talk about that product.
  • Column 3 (ED (k-com) - next sentences): It gives the result of each data set using the proposed ED method assuming that the comparative and superlative sentences are known, k-com denotes this assumption.
  • FIG. 3 depicts an exemplary diagrammatic representation of a machine in the form of a computer system 300 within which a set of instructions, when executed, may cause the machine to perform any one or more of the methodologies discussed above.
  • the machine operates as a standalone device.
  • the machine may be connected (e.g., using a network) to other machines.
  • the machine may operate in the capacity of a server or a client user machine in server-client user network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.
  • the machine may comprise a server computer, a client user computer, a personal computer (PC), a tablet PC, a laptop computer, a desktop computer, a control system, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine.
  • a device of the present disclosure includes broadly any electronic device that provides voice, video or data communication.
  • the term "machine” shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein.
  • the computer system 300 may include a processor 302 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 304 and a static memory 306, which communicate with each other via a bus 308.
  • the computer system 300 may further include a video display unit 310 (e.g., a liquid crystal display (LCD), a flat panel, a solid state display, or a cathode ray tube (CRT)).
  • the computer system 300 may include an input device 312 (e.g., a keyboard), a cursor control device 314 (e.g., a mouse), a disk drive unit 316, a signal generation device 318 (e.g., a speaker or remote control) and a network interface device 320.
  • the disk drive unit 316 may include a machine-readable medium 322 on which is stored one or more sets of instructions (e.g., software 324) embodying any one or more of the methodologies or functions described herein, including those methods illustrated above.
  • the instructions 324 may also reside, completely or at least partially, within the main memory 304, the static memory 306, and/or within the processor 302 during execution thereof by the computer system 300.
  • the main memory 304 and the processor 302 also may constitute machine-readable media.
  • Dedicated hardware implementations including, but not limited to, application specific integrated circuits, programmable logic arrays and other hardware devices can likewise be constructed to implement the methods described herein.
  • Applications that may include the apparatus and systems of various embodiments broadly include a variety of electronic and computer systems. Some embodiments implement functions in two or more specific interconnected hardware modules or devices with related control and data signals communicated between and through the modules, or as portions of an application-specific integrated circuit. Thus, the example system is applicable to software, firmware, and hardware implementations.
  • the methods described herein are intended for operation as software programs running on a computer processor.
  • software implementations can include, but are not limited to, distributed processing, component/object distributed processing, parallel processing, or virtual machine processing, which can also be constructed to implement the methods described herein.
  • the present disclosure contemplates a machine readable medium containing instructions 324, or that which receives and executes instructions 324 from a propagated signal so that a device connected to a network environment 326 can send or receive voice, video or data, and to communicate over the network 326 using the instructions 324.
  • the instructions 324 may further be transmitted or received over a network 326 via the network interface device 320.
  • machine-readable medium 322 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions.
  • the term “machine-readable medium” shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure.
  • machine-readable medium shall accordingly be taken to include, but not be limited to: solid-state memories such as a memory card or other package that houses one or more read-only (non-volatile) memories, random access memories, or other re-writable (volatile) memories; magneto-optical or optical medium such as a disk or tape; and/or a digital file attachment to e-mail or other self-contained information archive or set of archives is considered a distribution medium equivalent to a tangible storage medium. Accordingly, the disclosure is considered to include any one or more of a machine-readable medium or a distribution medium, as listed herein and including art-recognized equivalents and successor media, in which the software implementations herein are stored.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A system that incorporates teachings of the present disclosure may include, for example, an evaluation system having a controller to identify a plurality of entities in opinionated text, identify inferior and superior entities from the plurality of entities according to a plurality of comparative opinions determined from a plurality of comparative sentences in the opinionated text, and assign at least a portion of the superior and inferior entities to one of the plurality of comparative sentences and a plurality of non-comparative sentences of the opinionated text according to the plurality of comparative opinions, sentiment consistency between consecutive sentences in the opinionated text, and a semantic orientation of the plurality of non-comparative sentences. Additional embodiments are disclosed.

Description

SYSTEM AND METHODS FOR SENTIMENT ANALYSIS
Inventors
Bing Liu
Xiaowen Ding
FIELD OF THE DISCLOSURE
[0001] The present disclosure relates generally to data mining techniques, and more specifically to a system and methods for sentiment analysis.
BACKGROUND
[0002] User-generated content such as product reviews, forum discussions and blogs contains valuable information that can be exploited for many applications. Reviews contain customer opinions about products and services that can be used for marketing and competitive intelligence gathering. Forums also contain product questions and problems, which can be used for product improvements. Although many studies have been reported on user-generated content, the analysis of such content at the semantic level is in its early stages of investigation.
BRIEF DESCRIPTION OF THE DRAWINGS
[0003] FIG. 1 depicts an illustrative embodiment of a method for assigning entities;
[0004] FIG. 2 depicts an illustrative embodiment of a method for identifying entities using a variety of seeds;
[0005] FIG. 3 depicts an illustrative diagrammatic representation of a machine in the form of a computer system within which a set of instructions, when executed, may cause the machine to perform any one or more of the methodologies disclosed herein;
[0006] Table 1 depicts an illustrative embodiment of a plurality of data sets associated with two forums;
[0007] Table 2 depicts an illustrative embodiment of experimental results for entity identification;
[0008] Table 3 depicts an illustrative embodiment of experimental results for entity assignment; and
[0009] Table 4 depicts an illustrative embodiment of part-of-speech (POS) tags.
DETAILED DESCRIPTION
[00010] One embodiment of the present disclosure entails identifying a plurality of entities in opinionated text generated by a plurality of users, each user expressing one or more opinions about at least one of the plurality of entities, identifying a plurality of comparative sentences and a plurality of non-comparative sentences in the opinionated text, identifying inferior and superior entities from the plurality of entities according to a plurality of comparative opinions determined from the plurality of comparative sentences, determining a semantic orientation for each of the plurality of non-comparative sentences, and assigning at least a portion of the superior and inferior entities to one of the plurality of comparative sentences and the plurality of non-comparative sentences according to the determined semantic orientation of the plurality of non-comparative sentences, the plurality of comparative opinions, and sentiment consistency between consecutive sentences in the opinionated text.
[00011] An embodiment of the present disclosure entails a computer-readable storage medium having computer instructions to identify a plurality of entities in opinionated text, identify a plurality of comparative sentences and a plurality of non-comparative sentences in the opinionated text, identify inferior and superior entities from the plurality of entities according to a plurality of comparative opinions determined from the plurality of comparative sentences, and assign at least a portion of the superior and inferior entities to one of the plurality of comparative sentences and the plurality of non-comparative sentences according to the plurality of comparative opinions, sentiment consistency between consecutive sentences in the opinionated text, and a semantic orientation of the plurality of non-comparative sentences.
[00012] Another embodiment of the present disclosure entails an evaluation system having a controller to identify a plurality of entities in opinionated text, identify inferior and superior entities from the plurality of entities according to a plurality of comparative opinions determined from a plurality of comparative sentences in the opinionated text, and assign at least a portion of the superior and inferior entities to one of the plurality of comparative sentences and a plurality of non-comparative sentences of the opinionated text according to the plurality of comparative opinions, sentiment consistency between consecutive sentences in the opinionated text, and a semantic orientation of the plurality of non-comparative sentences.
[00013] A popular semantic level analysis is sentiment analysis or opinion mining, which tries to discover user opinions about products and services. Such studies are mainly conducted in the context of product reviews [4, 6, 8, 9, 18, 19, 20, 22] due to the fact that reviews are focused on the entities being reviewed and contain little irrelevant information.
[00014] However, this cannot be said about forum discussions and blogs because in such posts, the user may talk about multiple entities (e.g., products), and compare them. This raises two important issues: (1) how to identify the entities that are talked about in a post and (2) how to determine what entities that each sentence talks about because in many sentences entity names are not explicitly mentioned, but are implied. The first problem can be termed as entity identification and the second problem entity assignment. Without knowing the entities and which sentence talks about which entities, any sentence level analysis is of limited use. For example, if an algorithm finds that a sentence expresses a negative opinion about something, but it cannot determine on which product, then the extracted opinion is of no use. Also, if a product problem is discovered, but it is not known about which product, then the information is also meaningless.
[00015] The first problem is similar to a named entity recognition (NER) problem. However, common NER methods do not work well because of the ungrammatical nature of forum posts, over-capitalization and under-capitalization. Over-capitalization means that the user may capitalize every word in the sentence, and under-capitalization means that the first letters of many entity names are not capitalized. These cause serious problems for existing entity recognition methods.
[00016] The second problem bears some resemblance to pronoun resolution [3, 24, 25] in natural language processing (NLP), which identifies what each pronoun in a sentence refers to. Pronoun resolution is still a major challenge in NLP. The accuracy of current state-of-the-art systems is only about 60-70% on well-formed sentences such as those in news articles [3, 24, 25]. For user-generated content, the problem is harder due to ungrammatical sentences and missing or incorrect punctuation. Many sentences do not have pronouns, but it is helpful to know which entities these sentences talk about. For example, sentence (5) of the post in Example 1 below has no pronoun or any other reference to resolve. The question is how to discover that sentence (5) in Example 1 talks about Camera-A, and sentence (5) in Example 2 talks about Camera-B (for easy reference, a number is added before each sentence).
[00017] Example 1: "(1) I bought Camera-A yesterday. (2) I took some pictures in the evening in my living room. (3) The images are very clear. (4) They are definitely better than those from my old Camera-B. (5) The battery is very good too."
[00018] A simple approach to identifying the entities talked about in each sentence is the following: The algorithm sequentially processes each sentence. Whenever an entity name is encountered in a sentence, it is assumed that the sentence talks about that entity.
It is also assumed that the subsequent sentences talk about that entity as well until a new entity name occurs. Then the new entity is the one talked about in its sentence. The subsequent sentences also talk about the new entity, and so on. This simple strategy works reasonably well in practice. However, it breaks down when a comparative sentence is encountered.
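The simple strategy described above can be sketched as follows; the function name and the substring-based entity matching are illustrative assumptions, not the disclosure's actual implementation.

```python
def assign_entities_naive(sentences, known_entities):
    """Naive entity assignment: a sentence that mentions an entity
    talks about that entity; subsequent sentences inherit the most
    recently mentioned entity until a new entity name appears."""
    assignments = []
    current = None  # entity currently being talked about
    for sentence in sentences:
        mentioned = [e for e in known_entities if e.lower() in sentence.lower()]
        if mentioned:
            current = mentioned[0]  # a new explicit entity takes over
        assignments.append(current)
    return assignments

# Example 1 from the text: the strategy wrongly assigns sentence (5)
# to Camera-B after the comparative sentence (4).
sents = [
    "I bought Camera-A yesterday.",
    "I took some pictures in the evening in my living room.",
    "The images are very clear.",
    "They are definitely better than those from my old Camera-B.",
    "The battery is very good too.",
]
print(assign_entities_naive(sents, ["Camera-A", "Camera-B"]))
# ['Camera-A', 'Camera-A', 'Camera-A', 'Camera-B', 'Camera-B']
```

The last two assignments illustrate the failure described next: sentence (4) actually talks about both cameras, and sentence (5) talks about Camera-A.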
[00019] Clearly, sentences (1) - (3) of Example 1 talk about Camera-A because Camera-A is encountered in sentence (1), and sentences (2) and (3) contain no new product. Sentence (4) is a comparative sentence, which also introduces Camera-B. The simple strategy mentioned above is not applicable because although sentence (4) only mentions Camera-B, it actually talks about both cameras. What is more serious is that the strategy also infers that sentence (5) talks about Camera-B, which is clearly wrong. Human beings can readily ascertain that sentence (5) talks about Camera-A. The question is how to solve the problem using an automated technique. It can be said that the sentence after the comparative sentence should talk about the product mentioned before. Unfortunately, this is not right either. To illustrate this point, Example 1 can be changed to:
[00020] Example 2: "(1) I bought Camera-A yesterday. (2) I took a few pictures in the evening in my living room. (3) The images are very clear. (4) They are definitely better than those from my old Camera-B. (5) The pictures of that camera were blurring for night shots, but for day shots it was ok"
[00021] Example 2 is the same as Example 1 except the last sentence. Obviously, sentence (5) of Example 2 talks about Camera-B. The above algorithm does not work with Example 2. Since the method does not rely on pronouns, it has two advantages.
First, there is no need to solve the difficult problem of pronoun resolution in NLP. As a result, there is no need to worry about the situation where a pronoun does not refer to any previous entity, e.g., "it" in "it is great to have such a beautiful camera." Note that even deciding what to resolve itself is challenging, e.g., how to decide we need to resolve the meaning of "that camera" in sentence 5 of example 2. Second, sentences that do not use pronouns can be handled.
[00022] Comparative sentences (4) in both examples say that Camera-A is superior to Camera-B. The next sentence, sentence (5) in Example 1, expresses a positive sentiment. Intuitively, sentence (5) in Example 1 should refer to the superior product. Similarly, sentence (5) in Example 2, which expresses a negative sentiment in its first clause, should refer to the inferior product. This phenomenon can be called sentiment consistency, which says that consecutive sentiment expressions should be consistent with each other. It would be ambiguous if this consistency were not observed in writing.
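The sentiment-consistency heuristic can be sketched as follows; the tiny sentiment lexicon and the function names are illustrative assumptions, not the lexicons of references [9, 18].

```python
POSITIVE = {"good", "great", "clear", "sharp"}    # illustrative lexicon
NEGATIVE = {"bad", "poor", "blurry", "terrible"}  # illustrative lexicon

def clause_orientation(clause):
    """Crude lexicon lookup standing in for clause-level sentiment analysis."""
    words = {w.strip(".,!?").lower() for w in clause.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def assign_after_comparative(next_clause, superior, inferior):
    """Sentiment consistency: a positive clause following a comparison
    refers to the superior entity, a negative one to the inferior."""
    orientation = clause_orientation(next_clause)
    if orientation == "negative":
        return inferior
    return superior  # positive, or neutral defaulting to the preferred entity

# Sentence (5) of Example 1 vs. the first clause of sentence (5) of Example 2
print(assign_after_comparative("The battery is very good too.",
                               "Camera-A", "Camera-B"))  # Camera-A
print(assign_after_comparative("The pictures of that camera were blurry for night shots",
                               "Camera-A", "Camera-B"))  # Camera-B
```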
[00023] To solve this problem sentiment analysis of direct opinions can be employed. Two tasks are necessary: (1) for a comparative sentence, there is a need to identify which entity is superior (or preferred), and (2) for the subsequent sentence, there is a need to determine whether its first clause (sentence 5 of example 2) is positive, negative, or neutral.
[00024] A sentiment analysis method can be adapted for direct opinions to solve aforementioned problem. By a direct opinion [17], what is meant is a sentence or a clause that directly expresses a positive or negative opinion on an entity or a feature of the entity, such as illustrated by sentence (5) of Example 1. Direct opinions are in contrast to comparative opinions. A comparative opinion does not directly express a positive or negative opinion on anything but expresses a preference of some entities. For example, sentence (4) expresses a comparative opinion, i.e., "Camera- A" is superior or preferred to "Camera-B" when comparing their images. It can be observed that sentence (4) alone does not say any camera is good or bad, but just states a comparison.
[00025] To solve an entity assignment problem, two more contributions are made to sentiment analysis. First, it was discovered that a sentiment analysis method [9] for direct opinions can also be adapted to analyze comparative opinions. Thus, an entirely new algorithm is not needed although the two types of sentences look quite different (e.g., sentences (4) and (5) in Example 1). Second, a novel rule specification language can be defined to generalize the lexicon-based sentiment analysis technique [9, 18] so that it can consider complex phrases. The existing method is only able to handle individual words. Experimental results based on 753 forum posts from 63 threads with 4385 sentences show that the technique described in the present disclosure is effective.
[00026] Related works about entity identification can be found in the field of information extraction, specifically, named entity recognition (NER). NER aims to identify entities such as persons, organizations and locations in natural language text. On product extraction, references [6, 11] have studied the problem in the context of comparative sentences. The methods described in these references exploit specific structures of such sentences for extraction. The present disclosure is more general and not focused on comparative sentences. The present disclosure can also be different from classic NER as there is only an interest in product-type entities. Reference [2] provides a good survey of existing information extraction algorithms. Conditional random fields (CRF) [16] have been shown to perform the best so far. It will be shown that the method described in the present disclosure outperforms CRF dramatically for this task.
[00027] Related works are mainly in two areas, pronoun resolution and sentiment analysis. These works are reviewed below.
[00028] Pronoun resolution has been extensively studied in natural language processing [3, 24, 25]. However, it is still a major challenge. As discussed in the introduction, the technique described in the present disclosure is in fact quite different because many sentences do not have pronouns, but there is still a need to know which entities they discuss.
[00029] Although the objective of the proposed problem is not sentiment analysis, some sentiment analysis techniques can be used to solve the problem. The most widely studied sentiment analysis topic is sentiment classification, which investigates ways to classify whole product reviews as positive, negative, or neutral [19, 22]. Sentiment classification is not applicable to the sentences and clauses considered in the present disclosure. Sentence-level and clause-level sentiment classification has been studied in [e.g., 15, 21, 23].
[00030] The present disclosure relates to feature-based sentiment analysis or opinion mining [4, 9, 18, 20], which finds sentiments expressed on product features. For example, in the above example, "photo quality" and "battery" are product features. The sentiment on "photo quality" is positive and the sentiment on "battery" is negative. Existing techniques exploit sentiment words for the task. Sentiment words are words that express desired or undesired states. Positive words express desired states, e.g., "great" and "good". Negative words express undesired states, e.g., "bad" and "poor". Identifying sentiment words has been studied in [5, 9, 13, 14], and several lists have been compiled. Apart from individual words, there are also many sentiment phrases, e.g., "cost someone an arm and a leg". The present disclosure does not identify product features. Product features can be discovered by another system or given by the user of the technique described in the present disclosure.
[00031] The method in reference [9] can be adapted to perform sentiment analysis of each clause. The present disclosure also makes two contributions to sentiment analysis. First, reference [9] hard-codes all sentiment phrases in the system, which is undesirable because any addition/deletion of phrases will involve changing the program code. In reference [4], context-dependent opinions are considered. Although the opinion analysis method used in reference [4] is the same as that used in reference [9], the opinion aggregation method is different. A context is defined as a pair, a feature and an opinion, and context-dependent opinions are determined based on the pair. The present disclosure does not use this context definition.
To improve the algorithm for determining semantic orientation of an opinion in references [9, 4], a specification language is disclosed to enable the user to add/delete complex sentiment indicators, which can be words, phrases or other language constructs without touching the underlying program. Second, the present disclosure shows that a sentiment analysis method for analyzing direct opinions can be adapted to analyzing comparative sentences to mine comparative opinions.
[00032] On the study of comparative sentences, references [11, 12] propose a method to find comparative and superlative sentences. However, the teachings in these references do not determine superior entities expressed in comparative sentences. They only extract some useful items from sentences. Such items alone are not sufficient in determining the superior entities. Reference [1] proposes a method to extract items from superlative sentences. It does not study sentiments either. In reference [7], the authors tried to identify which entity has more of a certain property in a comparative sentence. Again, it is not concerned with the problem of identifying the superior entities. Reference [8] studied the sentiment analysis of comparative sentences. However, it needs a large volume of external information, i.e., product reviews.
[00033] The basic information unit of forums, blogs and discussion boards consists of a start post and a list of follow-up posts or replies. This basic information unit is often called a thread. A thread t thus can be modeled as a sequence of posts, <p_1, p_2, ..., p_n>. p_1 is the start post. Each post consists of a sequence of sentences, <s_1, s_2, ..., s_m>. Each sentence s_i describes something on a subset of entities ε = {e_h, ..., e_j | e_h, e_j ∈ E}, where E is the set of all entities. An entity can be a person, a product, an organization, an event, etc. If an entity name is explicitly mentioned in sentence s_i, one can say that the entity is an explicit entity in s_i. If the entity is not explicitly mentioned in s_i but it is implied, one can say that the entity is an implicit entity. For example, Camera-A in the first sentence below is an explicit entity. Camera-A is an implicit entity in sentence 2 as it is not explicitly mentioned there, but it is implied.
[00034] "Camera-A looks really pretty. The battery lasts very long."
[00035] Most sentences talk about a single entity, i.e., the size of ε is usually 1. If a sentence involves multiple entities (explicit and/or implicit), it is usually a comparative sentence, e.g., "Camera-A is better than Camera-B". A related type of sentence is the superlative sentence, e.g., "Camera-A is the best."
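The thread/post/sentence model above can be sketched with simple data classes; the class and field names are illustrative assumptions, not part of the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class Sentence:
    text: str
    entities: set = field(default_factory=set)  # ε: the subset of E this sentence talks about

@dataclass
class Post:
    sentences: list  # <s_1, s_2, ..., s_m>

@dataclass
class Thread:
    posts: list      # <p_1, p_2, ..., p_n>; posts[0] is the start post

# The two-sentence post of paragraph [00034]: Camera-A is explicit in the
# first sentence and implicit in the second.
t = Thread(posts=[Post(sentences=[
    Sentence("Camera-A looks really pretty.", {"Camera-A"}),  # explicit entity
    Sentence("The battery lasts very long.", {"Camera-A"}),   # implicit entity
])])
print(t.posts[0].sentences[1].entities)  # {'Camera-A'}
```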
[00036] This is a simplified model of the real world. For example, it does not cover irrelevant sentences, which are usually rare. It does not cover quotes (from previous posts) in replies. However, quotes are easy to handle because they are usually in a different format.
[00037] Given a set of threads T in a particular domain, two tasks are performed in the present disclosure:
1. Entity identification: identify the set of entities E discussed in the posts of the threads, and
2. Entity assignment: determine the entities in E that each sentence s_i of each post p_j in t (∈ T) talks about.
[00038] For entity assignment, the present disclosure uses sentiment analysis methods for both direct opinions and comparative opinions, which are defined below.
[00039] Definition (direct opinion): A direct opinion is a positive or negative opinion on an entity or some feature of the entity without mentioning any other similar entities.
[00040] For example, "The picture quality of this camera is good" expresses a direct opinion, which is positive on the product feature "picture quality". Note that positive, negative and neutral are called the semantic orientation (opinion orientation or sentiment orientation) of an opinion.
[00041] Definition (comparative opinion): A comparative opinion is expressed in a comparative or superlative sentence. It states that some entities are superior or preferred to some other entities with respect to some shared features or attributes of the entities.
[00042] For example, "The picture quality of Camera-Y is better than that of Camera-X." is a comparative sentence, which states that "Camera-Y" is superior or preferred to "Camera-X" when comparing their "picture quality".
[00043] The next two sections detail the proposed techniques for the two tasks. In the process, the present disclosure will discuss the new methods for mining direct opinions (which can also be referred to as non-comparative opinions) and comparative opinions, which are instrumental for entity assignment.
[00044] ENTITY IDENTIFICATION
[00045] The main idea for entity identification is to discover linguistic patterns through learning and then use the learnt patterns to extract entity names. However, traditional methods need a large number of manually labeled training examples, and labeling is very time-consuming. For a different domain, the labeling process may need to be repeated. This section proposes an automated pattern discovery method for the task, which is thus unsupervised.
[00046] The basic idea of the algorithm is that the user starts with a few seed entities. The system bootstraps from them to find more entities in a set of documents
(or posts). The algorithm is thus iterative. Pattern mining is employed at each iteration to find more entities based on already found entities. The iterative process ends when no new entity names are found. Pruning methods are also proposed to remove those unlikely entities.
[00047] Given a set of seed entities E = {e1, e2, ..., en}, the algorithm can consist of the following iterative steps:
[00048] Step 1 - Data preparation for sequential pattern mining: This step performs two tasks: it first finds all sentences that contain any one of the seed entities, e1, e2, ..., en, in the dataset, and then generates a sequence for each occurrence of e_i for pattern mining. In order to focus patterns on entities and not generate too many patterns, the present disclosure can use only a window of 5 words before each entity name and 5 words after it. Each word of a seed entity name is replaced with a generic (unique) name "ENTITYXYZ". Utilizing this generic word can ensure that generic patterns about any entities are found. Note that each entity name can consist of more than one word. The part-of-speech (POS) tag of each word can also be used. In the final sequence each element of the sequence can be a pair, the POS tag of the word and the word. An example follows below:
[00049] Example 3: The sentence that follows has POS tags attached. Here n95 is a phone model (an entity).
Hiiiiiiiii/NNP SK/NNP -/: ,/, dont/NN be/VB mad/JJ everyone/NN doesnt/NN have/VBP a/DT n95/CD phone/NN fetish/NN ducky/JJ
The window is (n95 has been replaced with ENTITYXYZ):
mad/JJ everyone/NN doesnt/NN have/VBP a/DT ENTITYXYZ/CD phone/NN fetish/NN ducky/JJ
The resulting sequence is:
<{JJ, mad} {NN, everyone} {NN, doesnt} {VBP, have} {DT, a} {CD, ENTITYXYZ} {NN, phone} {NN, fetish} {JJ, ducky}>
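The windowing and placeholder replacement of Step 1 can be sketched as follows; the function name is illustrative, and the sketch assumes single-word seeds and already-tagged input (multi-word seeds and POS tagging itself are omitted).

```python
def make_sequences(tagged, seed, window=5):
    """Step 1 sketch: replace each occurrence of a single-word seed
    entity with ENTITYXYZ and keep a window of `window` words on
    each side of it."""
    seqs = []
    for i, (tag, word) in enumerate(tagged):
        if word.lower() == seed.lower():
            replaced = list(tagged)
            replaced[i] = (tag, "ENTITYXYZ")      # generic placeholder
            seqs.append(replaced[max(0, i - window): i + window + 1])
    return seqs

# Example 3's window around "n95"
tagged = [("JJ", "mad"), ("NN", "everyone"), ("NN", "doesnt"), ("VBP", "have"),
          ("DT", "a"), ("CD", "n95"), ("NN", "phone"), ("NN", "fetish"),
          ("JJ", "ducky")]
print(make_sequences(tagged, "n95")[0])
```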
[00050] Table 4 depicts POS Tags used above and throughout the rest of the disclosure.
[00051] Step 2 - Sequential pattern mining: Given the set of sequences generated from step 1, a sequential pattern mining algorithm is applied to generate sequential patterns. Sequential pattern mining is a popular data mining technique [17], which finds all patterns that appear frequently in the data. The frequency threshold is set by the user, and is called the minimum support. The present disclosure uses 0.01 as the minimum support. In the present disclosure each pattern contains {POStag, ENTITYXYZ} and has a length greater than or equal to 2. An example pattern is:
<{IN}, {DT}, {NNP, ENTITYXYZ }, {is}>
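The mining step can be sketched as follows. This is a simplified stand-in: it counts only contiguous subsequences, whereas full sequential pattern mining algorithms also allow gaps, and the pattern encoding (a plain POS tag per non-entity element) and function name are illustrative assumptions.

```python
from collections import Counter

def mine_patterns(sequences, min_support=0.01, max_len=4):
    """Count contiguous subsequences over the generated sequences and
    keep those that contain the ENTITYXYZ placeholder, have length >= 2,
    and meet the minimum support (0.01 in the text)."""
    counts = Counter()
    for seq in sequences:
        # generalize non-entity words to their POS tag; keep the entity slot
        items = [(tag, "ENTITYXYZ") if word == "ENTITYXYZ" else tag
                 for tag, word in seq]
        for n in range(2, max_len + 1):
            for i in range(len(items) - n + 1):
                counts[tuple(items[i:i + n])] += 1
    threshold = max(1, min_support * len(sequences))
    return {p: c for p, c in counts.items()
            if c >= threshold and any(isinstance(e, tuple) for e in p)}

seqs = [[("DT", "a"), ("CD", "ENTITYXYZ"), ("NN", "phone")],
        [("DT", "the"), ("CD", "ENTITYXYZ"), ("IN", "at")]]
patterns = mine_patterns(seqs)
print(("DT", ("CD", "ENTITYXYZ")) in patterns)  # True
```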
[00052] In the example pattern, "IN", "DT" and "NNP" are POS tags which can match any word with that tag, while "is" is a concrete word which can only match this particular word.
[00053] Step 3 - Pattern matching to extract candidate entities: For each sentence in the test dataset, a system can match the generated patterns to extract a set of candidate entities. The patterns can be sorted based on their supports. In order not to generate too many spurious candidates, the matching process in a sentence terminates after five patterns have been matched.
[00054] Example 4: The following sentence is presented with POS tags attached:
The/DT misses/VBZ has/VBZ currently/RB got/VBN a/DT Nokia/NNP 7390/CD at/IN the/DT end/NN of/IN the/DT day,/VBG all/DT she/PRP does/VBZ is/VBZ text/NN and/CC make/VB calls,/NN but/CC the/DT reception/NN is/VBZ terrible,/VBG where/WRB my/PRP$ 6233/CD would/MD get/VB full/JJ bars/NNS hers/PRP would/MD only/RB get/VB 1/CD or/CC 2./CD
The pattern, <{DT}, {NNP, ENTITYXYZ}, {CD}>, can match the sentence segment:
a/DT Nokia/NNP 7390/CD
to produce the candidate entity: "Nokia".
The pattern, <{DT}, {NNP}, {CD, ENTITYXYZ}, {IN}>, can match the sentence segment:
a/DT Nokia/NNP 7390/CD at/IN
to produce the candidate entity: "7390".
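The matching procedure of Step 3 can be sketched as follows; the pattern encoding (plain POS-tag strings plus a (tag, 'ENTITYXYZ') slot) and the function name are illustrative assumptions, and concrete-word elements such as "is" are omitted from this sketch.

```python
def match_pattern(pattern, tagged):
    """Slide a pattern over a POS-tagged sentence; a plain-tag element
    matches any word with that tag, and the (tag, 'ENTITYXYZ') element
    marks the word extracted as a candidate entity."""
    candidates = []
    n = len(pattern)
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        candidate = None
        for elem, (tag, word) in zip(pattern, window):
            if isinstance(elem, tuple):      # the entity slot
                if tag != elem[0]:
                    break
                candidate = word
            elif elem != tag:                # a POS-tag element
                break
        else:
            if candidate:
                candidates.append(candidate)
    return candidates

# The two matches from Example 4
tagged = [("DT", "a"), ("NNP", "Nokia"), ("CD", "7390"), ("IN", "at")]
print(match_pattern(["DT", ("NNP", "ENTITYXYZ"), "CD"], tagged))        # ['Nokia']
print(match_pattern(["DT", "NNP", ("CD", "ENTITYXYZ"), "IN"], tagged))  # ['7390']
```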
[00055] Step 4 - Candidate pruning: The above pattern matching method can extract many incorrect entities. A pruning method based on a POS check is proposed by the present disclosure; it remedies some errors made by a POS tagger system. Since an entity is always associated with a POS tag in the present patterns, this method checks in the dataset whether that POS tag is the most frequent one for the candidate. If it is not, the candidate entity can be eliminated (a possible POS tagging error).
[00056] Example 5: Given the sentence:
You/PRP can/MD also/RB be/VB sure/JJ it/PRP will/MD work/VB with/IN all/PDT the/DT Sony/NNP Ericsson/NNP walkman/NN phone/NN accessories/CD
The pattern, <{IN} {DT} {CD, ENTITYXYZ}>, matches the sentence segment:
with/IN all/PDT the/DT Sony/NNP Ericsson/NNP walkman/NN phone/NN accessories/CD
to produce the candidate entity: "accessories", which is incorrect.
[00057] But when the algorithm goes over the sentences in the dataset again, it can find that "accessories" appears as "NNS" more often than as "CD". This candidate is therefore deleted.
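The POS-based pruning check of Step 4 can be sketched as follows; the function name and corpus layout are illustrative assumptions.

```python
from collections import Counter

def prune_by_pos(candidate, extracted_tag, tagged_corpus):
    """Step 4 sketch: keep a candidate only if the POS tag it was
    extracted with is its most frequent tag across the dataset."""
    tags = Counter(tag for sentence in tagged_corpus
                   for tag, word in sentence
                   if word.lower() == candidate.lower())
    return bool(tags) and tags.most_common(1)[0][0] == extracted_tag

# "accessories" appears as NNS more often than as CD, so the CD
# extraction from Example 5 is eliminated.
corpus = [[("NNS", "accessories")], [("NNS", "accessories")],
          [("CD", "accessories")]]
print(prune_by_pos("accessories", "CD", corpus))   # False -> eliminated
print(prune_by_pos("accessories", "NNS", corpus))  # True
```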
[00058] The algorithm so far is generic and applicable to any domain because no assumption has been made. The two steps below may be more applicable to manufactured products, which have brands and models. It is desirable to extract both. An assumption made below is that a model name has a digit in it. In the experimental section their results will be shown separately.
[00059] Step 5 - Finding additional entities using brand and model relation.
For most manufactured products, brands and models often appear together, e.g., "Nokia N95". Here the above digit assumption is used. Thus, based on the entities found so far (step 4), this step tries to find additional entities by matching the pattern "Brand Model". The first task is to identify brands from the entities identified so far. This is simple, as the example below shows.
[00060] Example 6: Given the following sentence:
As/RB far/RB as/IN I/PRP heard/VBD Nokia/NNP N95/CD seems/VBZ to/TO be/VB the/DT leader/NN in/IN this/DT sense./CD
In this sentence, if both "Nokia" and "N95" are in the entity list, "Nokia" is considered a "Brand".
The second task in this step is to use the Brand to identify additional models. A regular expression is used which assumes that a model name must have a digit.
[00061] Example 7: Given the following sentence,
Nokia/NNP 6280/CD is/VBZ the/DT best,/NN it's/NNS pictures/NNS are/VBP sharp/JJ and/CC cool./CD
"Nokia" is a Brand, and "6280" satisfies the digit requirement, so "6280" is added as a Model.
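The "Brand Model" matching of Step 5 can be sketched as below. The particular regular expression and the adjacency assumption (the model token immediately follows the brand token) are illustrative choices; the disclosure only states that a model name must contain a digit.

```python
import re

def find_models(entities, tagged_sentences):
    """Treat an already-discovered entity followed by a digit-bearing
    token as a 'Brand Model' pair and collect brands and models.
    Assumes (per the disclosure) that a model name contains a digit."""
    model_re = re.compile(r"^[A-Za-z]*\d[\w-]*$")  # token must contain a digit
    brands, models = set(), set()
    for sent in tagged_sentences:
        words = [w for w, _tag in sent]
        for i in range(len(words) - 1):
            if words[i] in entities and model_re.match(words[i + 1]):
                brands.add(words[i])      # e.g. "Nokia"
                models.add(words[i + 1])  # e.g. "6280"
    return brands, models
```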
[00062] Step 6 - Finding more entities using syntactic patterns. Some syntactic patterns can help find competing entities (brands and models). These patterns exploit conjunctions and comparisons in sentences. [00063] In the present disclosure, C denotes a discovered entity and CN a competitor. The following eight patterns are used:
C and CN          CN and C
C or CN           CN or C
C vs CN           CN vs C
C more than CN    CN more than C
[00064] Example 8: Given the following sentence,
The/DT correct/JJ comparison/NN was/VBD made/VBN many/JJ times/NNS as/IN e398/CD vs./IN k700/CD ./.
If "e398" has been found from the previous step, the pattern "C vs. CN" will find k700.
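The eight patterns of Step 6 can be applied with simple adjacency matching; a sketch under the assumption that the input is a tokenized sentence and a set of already-known entities:

```python
def find_competitors(sentence_words, known_entities):
    """Apply the conjunction/comparison patterns ('C and CN', 'CN and C',
    'C or CN', 'C vs CN', 'C more than CN', ...) to harvest competitor
    names appearing next to a known entity C."""
    connectors = {("and",), ("or",), ("vs",), ("vs.",), ("more", "than")}
    found = set()
    w = sentence_words
    for conn in connectors:
        n = len(conn)
        for i in range(1, len(w) - n):
            if tuple(w[i:i + n]) == conn:
                left, right = w[i - 1], w[i + n]
                if left in known_entities and right not in known_entities:
                    found.add(right)   # C <conn> CN
                elif right in known_entities and left not in known_entities:
                    found.add(left)    # CN <conn> C
    return found
```

On the sentence of Example 8, with "e398" known, the candidate "k700" is harvested.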
[00065] ENTITY ASSIGNMENT
[00066] The discussion below presents an entity assignment algorithm, which depends on sentiment analysis of comparative sentences. The disclosure below first introduces the concepts of comparatives and superlatives and then discusses their impacts on entity assignment. Based on the discussion, the algorithm is naturally derived.
[00067] Comparatives and Superlatives
[00068] Comparative and superlative sentences can be defined according to references [11, 12].
[00069] Comparative Sentences
[00070] Comparative sentences express similarity and differences of more than one entity. There can be three main types of comparatives:
1) Non-equal gradable: "greater or less than" relations that express a total ordering of some entities with regard to some shared features or attributes. For example, the sentence, "Camera-X's battery life is longer than that of Camera-Y", orders Camera-X and Camera-Y based on their shared feature "battery life".
2) Equative: "equal to" relations that state two entities are equal with respect to some features. For example, the sentence, "Camera-X and Camera-Y are of the same size", expresses that the two cameras are equal in terms of their shared feature "size".
3) Non-gradable: Comparisons of two or more entities that do not grade them. For example, the sentence, "Camera-X and Camera-Y have different shapes", expresses a comparison of the shapes of the two cameras but does not grade them.
[00071] Superlative Sentences
[00072] A superlative sentence expresses a relation of the type "greater or less than all others'", i.e., it ranks one entity over all other entities. For example, the sentence,
"Camera-X's battery life is the longest", expresses a superlative relation. Note that a superlative sentence can also contain more than one entity, e.g., "Among Camera- A,
Camera-B and Camera-C, Camera-A is the best."
[00073] Sentiment Consistency
[00074] Intuitively, in a post, if an author starts with a particular entity, s/he will likely continue with the entity. If s/he wants to introduce a new entity e, s/he will likely state the name of the entity explicitly in a sentence s0, which can be (1) a normal, (2) a comparative or (3) a superlative sentence. The question is what happens to the next sentence s1 if s1 is a normal sentence and does not mention any entity, or s1 is a comparative sentence and does not mention e.
[00075] For (1), when s0 is a normal sentence, if s1 is a normal sentence, it should talk about e. If s1 is a comparative sentence, it should compare e with a new entity, which should be explicitly mentioned. For (2), when s0 is a comparative sentence, if s1 is a normal sentence, there are a few cases:
[00076] s0 is non-equal gradable: If s1 has no entity name and it expresses a positive (respectively negative) sentiment, it should talk about the superior (respectively inferior) entity to satisfy sentiment consistency.
[00077] s0 is equative: In this case, it is unclear which entity is referred to in s1. One can assume that it is the previous entity discussed before s0.
[00078] s0 is non-gradable: In this case, it is also unclear which entity is referred to in s1. It can be assumed to be the previous entity discussed before s0.
[00079] For (3), when s0 is a superlative sentence, if s1 is a normal sentence, it refers to the superlative entity in s0. For both (2) and (3), if s1 is a comparative sentence, the entities in s1 are taken.
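The sentiment-consistency rules above can be sketched as a simple pass over the sentences of a post. This is a simplified illustration, not the algorithm of FIG. 1; the dict-based sentence representation and the restriction to one preceding comparative are assumptions made for the example.

```python
def assign_entities(sentences):
    """Assign an entity set to each sentence using sentiment consistency.
    Each sentence is a dict with keys:
      'entities' - explicitly mentioned entity names (may be empty)
      'kind'     - 'normal' or 'comparative'
      'opinion'  - +1 / -1 / 0, as from opinion() (normal sentences)
      'superior', 'inferior' - sets, as from compOpinion() (comparatives)
    Returns the list of entity sets assigned to the sentences in order."""
    assigned, current = [], set()
    prev = None  # the immediately preceding comparative sentence, if any
    for s in sentences:
        if s["entities"]:
            ents = set(s["entities"])          # explicit mention wins
        elif prev is not None and s["kind"] == "normal":
            # after a non-equal gradable comparative, pick the side that
            # matches this sentence's sentiment (sentiment consistency)
            if s["opinion"] > 0:
                ents = set(prev["superior"])
            elif s["opinion"] < 0:
                ents = set(prev["inferior"])
            else:
                ents = current
        else:
            ents = current                     # inherit the prior entities
        assigned.append(ents)
        current = ents
        prev = s if s["kind"] == "comparative" else None
    return assigned
```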
[00080] The Algorithm [00081] FIG. 1 depicts an illustrative embodiment of an algorithm based on the above disclosure. The flowchart of FIG. 1 follows the simple method provided above, but with special handling of comparative sentences as discussed above. The input is a post, and the output is the set of entities discussed in each sentence. Note that the algorithm is simplified for presentation clarity. In the implemented system, the start post and quotes in replies are also considered, as entities may be inherited from them. Comparative sentences here also cover superlative sentences that contain more than one entity. A superlative sentence with only a single entity is treated as a normal sentence. The notations used in the algorithm are:
[00082] si.Entity: It stores the names of the entities discussed in sentence si, which can be explicit or implicit.
[00083] si.superiorEntity: It stores the set of superior entities in the comparative sentence si. Note that a set is used here because the sentence may compare two sets of entities, e.g., "Camera-A is better than Camera-B and Camera-C". However, in practice, each set mostly contains only a single entity.
[00084] si.inferiorEntity: It stores the set of inferior entities in the comparative sentence si.
[00085] opinion(): It is the sentiment analysis function that analyzes a non-comparative sentence.
[00086] compOpinion(): It is the sentiment analysis function that finds superior and inferior entities from a comparative sentence. [00087] SENTIMENT ANALYSIS
[00088] The disclosure that follows describes the sentiment analysis method used in the algorithm (i.e., opinion(si)) for direct opinions. Below it will also be shown that comparative opinions can be mined in a similar way from comparative sentences (i.e., compOpinion()).
[00089] The disclosure below describes the use of sentiment indicators to determine the sentiment orientations of opinions expressed on entity features. Sentiment orientations of opinions can identify whether the opinions are positive, negative or neutral. Since the present disclosure is not concerned with entity features as in references [4, 9], entity features are not used in the analysis. In an application, entity features can be discovered in various ways if needed, e.g., the method in references [9, 20]. There are three main sentiment indicators, i.e., sentiment words and phrases, negations, and but-clauses. They are discussed below. [00090] Sentiment Indicators
[00091] Sentiment words and phrases: In most cases, sentiments in sentences are expressed with sentiment (or opinion) words, e.g., "great", "good", "bad", and "poor". Although words that express sentiments are usually adjectives and adverbs, verbs and nouns can be used to express sentiments/opinions too. Researchers have compiled sets of such words; such lists are collectively called the sentiment lexicon. Apart from individual words, there are sentiment phrases and idioms, e.g., "cost someone an arm and a leg". Furthermore, some phrases may involve sentiment words while the phrases as a whole express no opinion. For example, the phrase "a great deal of" does not express an opinion although it contains the positive sentiment word "great". Such phrases are called non-sentiment phrases involving sentiment words. [00092] While most adjectives/adverbs have explicit positive or negative orientations, there are also many words whose orientations depend on the contexts in which they appear. For example, the word "long" in the following two sentences has completely different orientations, one positive and one negative: "The battery of this camera lasts long" and "This program takes a long time to run." A method will be described below to deal with this.
[00093] Negations: Sentiment words and phrases form the basis of opinions in a sentence. Negations reverse their orientations. Apart from "not", many other words and phrases can be used to express negations. Furthermore, "not" may not express negation in some cases, e.g., in "not only ... but also". Such phrases are called non-negations involving negation words.
[00094] But-clauses: "but" signals a contrast. For example, the sentence, "The picture quality is great, but not the battery life", expresses a positive sentiment on "picture quality" but a negative sentiment on "battery life". The following rule states the effect of "but": the orientation before "but" is opposite to that after "but". [00095] Apart from the word "but", many other words and phrases behave similarly, e.g., "though" and "except that". As with opinions and negations, not every "but" changes sentiment direction. For example, "but" in the pattern "not only ... but also" does not. Such phrases are called non-but phrases involving "but". [00096] Specification for Sentiment Indicators
[00097] With a large number of indicators, one can hard-code them in a system, which is, however, very undesirable because whenever a new word or phrase is encountered the program needs to be changed, which is time consuming. In references [9, 4], all phrases are hard-coded in the system. In the present disclosure, a specification language is used to enable the user to specify indicators, which are (1) sentiment words and phrases, (2) negation words and phrases, (3) but-like words and phrases, (4) non-sentiment phrases involving sentiment words, (5) non-negation phrases involving negation words, and (6) non-but phrases involving but-like words. The system then automatically uses the indicators for sentiment analysis (see Section 6.3).
[00098] Two types of specifications can be used: one for individual words and one for phrases. The reason for the separation is that individual words express their default meanings, but their meanings can be changed by phrases, i.e., overwriting the defaults to express the indicators (4), (5) and (6).
[00099] Specification of Individual Words: The grammar of the language for expressing individual words, which include sentiment words, negation words and but-like words, is given below:
<rule>   := <item> "=>" <action>
<item>   := <word> | <word> "[" <type> "]"
<word>   := [a-z]+
<type>   := JJ | RB | NN | VB | ...
<action> := Po | Ne | Neu | Ng | But
[000100] The specification can consist of a set of rules, i.e., each indicator word is represented as a rule. Each rule consists of two parts, an item on the left and an action on the right. The <item> is either an individual word or a word attached with a type, which may be any one of the part-of-speech (POS) tags. <action> may be any one of the five symbols, Po (positive), Ne (negative), Neu (neutral), Ng (negation) and But (but-like word). For example, to express that "like" carries a positive sentiment when it is a verb, one can use: like[VB] => Po
[000101] Given a sentence, the system applies each rule by matching the word together with its type in the sentence and then associates the action symbol to the matched word.
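The matching of individual word rules can be sketched as below. The rule string format follows the grammar above; the parsing helper and first-match policy are illustrative assumptions.

```python
import re

def parse_word_rule(rule):
    """Parse a rule like 'like[VB] => Po' or 'great => Po' into
    (word, POS-type-or-None, action)."""
    item, action = [p.strip() for p in rule.split("=>")]
    m = re.match(r"^([a-z]+)(?:\[(\w+)\])?$", item)
    return m.group(1), m.group(2), action

def apply_word_rules(rules, tagged_sentence):
    """Attach each rule's action symbol to matching (word, POS) pairs;
    the first matching rule for a word wins."""
    parsed = [parse_word_rule(r) for r in rules]
    out = []
    for word, tag in tagged_sentence:
        mark = None
        for w, t, action in parsed:
            if word.lower() == w and (t is None or tag.startswith(t)):
                mark = action
                break
        out.append((word, tag, mark))
    return out
```

Note that "like" tagged as a preposition (IN) is left unmarked, since the rule requires a VB tag.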
[000102] Specification for Phrases: The grammar for expressing phrases is given below.
<rule>         := <pattern> "=>" <action>
<pattern>      := <exp> "+" <target> "+" <exp>
                | <exp> "+" <target> | <target> "+" <exp>
<exp>          := <element> | <exp> "+" <element>
                | <exp> "+" <distance> "+" <exp>
                | <exp> "+" <distance>
                | <distance> "+" <exp>
                | !<num> "+" !<item> "+" <exp>
                | <exp> "+" !<num> "+" !<item>
                | <exp> "+" !<num> "+" !<item> "+" <exp>
<element>      := <item> | <item> "/" <element>
<item>         := <indicator> | <word>
<indicator>    := <indicatorSym>
                | <indicatorSym> "[" <type> "]"
<target>       := <indicator> "[T]" | <word> "[T]"
<indicatorSym> := Po | Ne | Neu | Ng | But
<word>         := [a-z]+ | [a-z]+ "[" <type> "]"
<distance>     := <num> | <num> "-" <num>
<num>          := 0 | [1-9][0-9]*
<action>       := <outcome> | !<outcome>
<outcome>      := PO | NE | NEU | NG | BUT
<type>         := JJ | RB | NN | VB | ...
[000103] The specification can consist of a set of rules. Each rule has two parts, a phrase on the left and an action on the right. Each phrase can have a target word, indicated by [T], to which the action is applied. The idea is that the left-hand side of the rule is first matched in the sentence and then the action of the rule is applied to the target in the sentence.
[000104] Three kinds of items can appear on the left-hand side of a rule: indicator symbols (indicatorSym), words, and distances. [000105] indicatorSym: These are the indicator symbols, Po, Ne, Neu, Ng and But, from the individual indicator words discussed above. A "type" may also be attached, specifying the POS tag of the word.
[000106] Word: It can be any word with an optional type.
[000107] Distance: It indicates the number of words (or gap) that can appear between two non-distance items in the phrase. "<num>-<num>" means any gap from the first number of words to the second (each <num> is an integer).
[000108] Target: It is the core item of the phrase, indicating which word the rule is applied to.
[000109] Some additional notes about the grammar: "+" is the separator, "/" means "or", and "!<num> + !<item>" means that within a gap of <num> words, <item> does not appear.
[000110] The action on the right states that the action symbol should be associated with the target. The action symbol can be any of the outcomes or their negations, i.e., PO (positive), NE (negative), NEU (neutral), NG (negation), and BUT (but-like). "!" means "not". These action symbols cannot appear on the left-hand side, which prevents looping.
[000111] What follows are two example rules: The rule "too + Neu[JJ][T] => NE" says that "too" before a neutral adjective (Neu[JJ]) changes the orientation of the adjective to negative (NE); thus, the target word should be marked with NE. The rule "a + great[T] + deal + of => NEU" says that "great" has no opinion (NEU) in this context, which overwrites its default orientation of positive (see below also).
[000112] Some observations can be made about the language:
1. This is a linear language in the sense that the left-hand side simply specifies a linear sequence of words or symbols (e.g., Po and Ne) together with gaps or distances between words. This makes pattern matching very easy.
2. The ordering of rules can be significant. When the first rule for a target word is matched and applied, the rest will not be tried.
3. Choosing the right target is important in the situation where a phrase overwrites the default meaning of a single word. The target should be the word in question. For example, the rule "great => Po" specifies that "great" is positive. However, the phrase "a great deal of" overwrites the orientation of "great" because "a great deal of" has no opinion. In this case, the rule should be "a + great[T] + deal + of => NEU", as the opinion of "great" is nullified by the phrase. If instead "a + great + deal[T] + of => NEU" were used, "great" would still be treated as positive.
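The target-overwriting behavior can be sketched as below, for the simple case of a gap-free phrase. The sentence representation (mutable [word, mark] pairs produced by the word-rule step) and the early return implementing "first matching rule wins" are illustrative assumptions; distances and "!"-items are omitted.

```python
def apply_phrase_rule(pattern, action, marked_words):
    """Match a linear phrase pattern against a sentence and apply the
    action to the [T]-marked target position.
    pattern: list of tokens, exactly one of which ends in '[T]'
    marked_words: list of mutable [word, mark] pairs."""
    toks = [(t.replace("[T]", ""), t.endswith("[T]")) for t in pattern]
    n = len(toks)
    for start in range(len(marked_words) - n + 1):
        window = marked_words[start:start + n]
        if all(window[j][0].lower() == toks[j][0] for j in range(n)):
            for j in range(n):
                if toks[j][1]:             # the target position
                    window[j][1] = action  # overwrite the default mark
            return True                    # first matching rule wins
    return False

# "a great deal of" nullifies the default positive mark on "great"
words = [["a", None], ["great", "Po"], ["deal", None], ["of", None]]
apply_phrase_rule(["a", "great[T]", "deal", "of"], "NEU", words)
```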
[000113] Sentiment Analysis with an Example
[000114] What follows is a description of sentiment analysis, which utilizes the following running example sentence to show each step of the disclosed method:
"The picture quality of this camera is not good, reaction is too slow, but the battery life is long."
[000115] Step 1 - Part-of-speech tagging: The tags are used for matching <type>'s in the rules.
[000116] Step 2 - Applying indicator word rules: All sentiment words, negation words and but-like words in the sentence are identified in this step. After this step, one can obtain
The picture quality is not[Ng] good[Po], reaction is too slow[Neu], but[But] the battery life is long[Neu].
[000117] All the bold attachments are added in this step. The POS tags are omitted to improve readability.
[000118] Step 3 - Applying phrase rules: This step identifies all phrases in the sentence and performs the actions specified in the rules. After this step, the running example sentence becomes:
The picture quality is not[Ng] good[Po], reaction is too slow[NE], but[But] the battery life is long[Neu].
[000119] The orientation of "slow" is revised to negative ([NE]) due to the rule: "too + Neu[JJ][T] => NE".
[000120] Step 4 - Handling negations: A negation in a sentence reverses the orientation of an opinion. For neutral, it is turned to negative. After negation handling, the running example sentence becomes ("good" is now turned to negative from positive):
The picture quality is not[Ng] good[Negative], reaction is too slow[NE], but[But] the battery life is long[Neu]. [000121] Step 5 - Aggregating opinions: This step first finds but-symbols ("But" or "BUT"), which indicate sentiment changes. The sentiments on the two sides of a but- symbol are opposite to each other. For illustration purposes, only the sentiment in the first clause of the sentence is used.
[000122] Opinion aggregation: All opinion indicators in the first clause of the sentence are aggregated to arrive at the final sentiment. The algorithm simply sums up all indicators [9]. A positive (or negative) indicator is assigned 1 (or -1). If the final sum is greater than 0, the clause is positive; if the sum is less than 0, the clause is negative; otherwise it is neutral. For the running example, the sentiment of the first part (before "but") is negative, since both "not good" and "too slow" are negative.
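Step 5's but-splitting and summation can be sketched as follows. The set of but-like words and the "not only ... but also" exception test are simplified assumptions; the marks are assumed to come from the earlier steps (with negation already applied).

```python
BUT_WORDS = {"but", "though", "except"}

def split_on_but(words):
    """Split a token list at the first but-like word, skipping the
    non-but pattern 'not only ... but also'."""
    for i, w in enumerate(words):
        if w.lower() in BUT_WORDS:
            # 'but also' preceded by 'only' does not change direction
            if (w.lower() == "but" and i + 1 < len(words)
                    and words[i + 1].lower() == "also"
                    and "only" in (x.lower() for x in words[:i])):
                continue
            return words[:i], words[i + 1:]
    return words, []

def aggregate(marks):
    """Sum a clause's opinion indicators: +1 for each positive mark,
    -1 for each negative mark (negations already applied in step 4)."""
    score = sum(+1 if m in ("Po", "PO", "Positive")
                else -1 if m in ("Ne", "NE", "Negative") else 0
                for m in marks)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"
```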
[000123] Handling domain-dependent opinions: For sentences whose orientations the above process cannot determine, the algorithm checks whether it can detect domain-dependent opinions as in reference [10], which uses several rules. Only the conjunction rule is used in this work (the others are inaccurate). For example, in "The battery life is long", it is unclear whether "long" is positive or negative. The method tries to see whether another person said that "long" is positive (or negative). Suppose another person wrote "this camera takes great pictures and has a long battery life". From this sentence, one can infer that "long" is positive for "battery life" because it is conjoined with the positive word "great". This is the conjunction rule, which says that a sentence expresses only one sentiment unless there is a but-like word changing the direction.
[000124] Comparative Opinion Mining
[000125] Identifying superior and inferior entities as expressed in a comparative sentence is called comparative opinion mining. As mentioned earlier, the sentiment analysis method above can be adapted to find superior and inferior entities in comparative sentences. This is due to the following observation:
Positive and negative sentiment words have their corresponding comparative and superlative forms indicating superior and inferior states respectively.
[000126] For example, the positive sentiment word, "good", has the comparative and superlative forms "better" and "best", which indicate superior entities. [000127] In English, comparatives and superlatives are special forms of adjectives and adverbs. In general, comparatives are formed by adding the suffix "-er" and superlatives by adding the suffix "-est" to the base (or original) adjectives and adverbs. Adjectives and adverbs with two syllables or more and not ending in y do not form comparatives or superlatives this way. Instead, "more", "most", "less" and "least" are used before such words, e.g., "more interesting" and "most awful". These two types are called regular comparatives and superlatives. English also has irregular comparatives and superlatives that do not follow the above rules. These are "more", "most", "less", "least", "better", "best", "worse", "worst", "further/farther" and "furthest/farthest".
[000128] In order to use the sentiment analysis method above to find superior and/or inferior entities, one can first convert sentiment adjectives and adverbs to their comparative and superlative forms, which is done automatically using English grammar rules and WordNet. One can then regard the comparatives and superlatives as having the same positive or negative orientation as their base forms. For irregular comparatives and superlatives, "better" and "best" are treated as positive, and "worse" and "worst" are treated as negative. "more", "most", "less", and "least" require special handling; they are considered together with sentiment words using the following four rules:
more/most + Pos → Positive
more/most + Neg → Negative
less/least + Pos → Negative
less/least + Neg → Positive
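The four combination rules can be captured directly; a minimal sketch:

```python
def comparative_orientation(modifier, base_orientation):
    """Apply the four rules combining 'more/most' or 'less/least' with a
    sentiment word's base orientation ('Pos' or 'Neg')."""
    flips = modifier in ("less", "least")  # less/least reverse orientation
    if base_orientation == "Pos":
        return "Negative" if flips else "Positive"
    return "Positive" if flips else "Negative"
```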
[000129] Rule 1 says that "more/most" and a positive (Pos) sentiment word together mean positive, e.g., "more beautiful". The other rules have similar meanings. [000130] Non-standard words: Apart from the above comparatives and superlatives, many other words can also express comparisons, e.g., "win", "prefer", "superior" and "inferior". For example, the sentence, "In terms of battery life, Camera-X is superior to Camera-Y", expresses a comparison indicating that Camera-X is preferred with regard to "battery life". These words are treated as positive or negative based on their meanings. [000131] Identifying comparative and superlative sentences: Before one can identify superior entities from comparative sentences, one needs to identify such sentences. Reference [12] proposed a pattern mining approach to identifying comparative and superlative sentences. The present disclosure does not focus on this task; only several heuristic rules are designed to identify such sentences, and they perform quite well. [000132] Clearly, comparative and superlative sentences are signaled by various keywords. The present disclosure uses a list of 67 keywords (obtained from the authors of [12]), which includes 4 part-of-speech tags, i.e., JJR (comparative adjective), RBR (comparative adverb), JJS (superlative adjective) and RBS (superlative adverb). The heuristic rules used in the present disclosure are as follows (if a sentence matches any one of the rules, it is considered a comparative or a superlative sentence):
a) pronoun + compkey + prodname
b) prodname + compkey + pronoun
c) prodname + compkey + prodname
d) pronoun + superkey
e) prodname + superkey
f) as + JJ + as (except "as long as" and "as far as")
where compkey is a comparative keyword, prodname is a product name and superkey is a superlative keyword.
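The heuristic rules can be sketched as below. The two-word proximity window used to connect a keyword to its surrounding product names/pronouns is an assumption for illustration; the disclosure does not state the exact window.

```python
def is_comparative(tagged_words, prodnames, compkeys, superkeys):
    """Heuristic check (rules a-f): True if a comparative keyword (or a
    JJR/RBR tag) has a product name or pronoun within two words on each
    side, with at least one product name; if a superlative keyword (or
    JJS/RBS) follows a product name or pronoun; or if the sentence
    matches 'as JJ as' (excluding 'as long as' and 'as far as')."""
    words = [w.lower() for w, _ in tagged_words]
    tags = [t for _, t in tagged_words]
    prod = [w in prodnames for w in words]
    pron = [t == "PRP" for t in tags]

    def hit(flags, lo, hi):  # any flag set in index range [lo, hi)
        return any(flags[j] for j in range(max(0, lo), min(len(flags), hi)))

    for i in range(len(words)):
        comp = words[i] in compkeys or tags[i] in ("JJR", "RBR")
        sup = words[i] in superkeys or tags[i] in ("JJS", "RBS")
        if comp:
            left = hit(prod, i - 2, i) or hit(pron, i - 2, i)
            right = hit(prod, i + 1, i + 3) or hit(pron, i + 1, i + 3)
            if left and right and (hit(prod, i - 2, i) or hit(prod, i + 1, i + 3)):
                return True                       # rules (a)-(c)
        if sup and (hit(prod, i - 2, i) or hit(pron, i - 2, i)):
            return True                           # rules (d)-(e)
        if (words[i] == "as" and i + 2 < len(words) and tags[i + 1] == "JJ"
                and words[i + 2] == "as" and words[i + 1] not in ("long", "far")):
            return True                           # rule (f)
    return False
```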
[000133] Identifying superior entities: As mentioned earlier, the above sentiment analysis method for mining direct opinions can be used to identify superior/preferred entities. A gradable comparative sentence typically has entities on the two sides of the comparative keyword, e.g., "Camera-X is better than Camera-Y". Based on sentiment analysis, if the sentence is positive, then the entities before the comparative keyword are superior; otherwise they are inferior (with negation considered). Superlative sentences can be handled in a similar way. Note that equative and non-gradable comparisons do not express preferences. [000134] EMPIRICAL EVALUATION [000135] This section evaluates the proposed techniques for the two tasks, entity identification and entity assignment. The disclosure below presents the datasets and corresponding experimental results.
[000136] Experimental Data Collections
[000137] The experimental data collections were crawled from two forums, HowardForums and AVSForum. HowardForums is a message board dedicated to mobile phones, while AVSForum is a message board dedicated to home theater and the products used there. Data from AVSForum are discussions about plasma and LCD TVs, projectors and DVD players. Table 1 shows the characteristics of the two data sets.
Altogether, 64 threads were downloaded, which contain 753 individual posts with
1072 comparative and superlative sentences. The total number of sentences is 4385.
All the sentences and product names were annotated by two graduate students based on consensus. The conflicting cases were resolved after discussion with the faculty advisor.
[000138] Experimental Results
[000139] The experimental results for both tasks are presented below.
[000140] Entity Identification
[000141] The results of entity identification are given first. The present method is referred to as EI. It is compared with the NET system [27] from the University of Illinois at Urbana-Champaign, and with the Conditional Random Fields (CRF) method of reference [16].
NET is a Named Entity Tagger, which can be used in the present case as product names are named entities. The CRF system used in the description below is from Sunita Sarawagi [26], and is available as public-domain software. Table 2 shows the results.
[000142] Note that the NET system does not need training. The training data for
CRF is the data obtained from step 2 of our algorithm. Recall that the data from step 2 is automatically generated. The entities in those sentences are regarded as positive data and all the other words in the sentences are regarded as negative data. The test data is the whole set for all the systems. Using the whole set as the test data is reasonable because present system does not use any manually labeled training data.
Only a set of seed entities is supplied. The training data is automatically generated. [000143] In Table 2, EI is also compared when only the first 4 steps are used (EI(1-4)) and when all 6 steps are used (EI(1-6)). Using the first 4 steps basically means that the system only uses pattern mining for extraction. As expected, EI(1-4) produces high recall but low precision. However, EI(1-4)'s F-scores are already dramatically higher than those of CRF and NET. For NET, its results for organization entities are used; for other types of entities, the results are much worse. From Table 2, it is observed that the additional steps of EI improve the result further (EI(1-6)). Compared to EI(1-4), the precision increases dramatically with a small drop in recall, and the overall F-scores are much higher.
[000144] Recall that the disclosed method uses some seeds to start the process. The question is how the number of seeds affects the final results. A set of experiments was performed by varying the number of seeds to see their effects. FIG. 2 gives the results for 5, 10, 15 and 20 seeds. Clearly, when the number of seeds is small the precision is higher, but the recall is very low. With more seeds, more balanced results can be achieved. If still more seeds are selected, although the results are slightly better, it defeats the purpose of the method, which requires little user knowledge. The experimental results reported in Table 2 are based on 15 seeds. All results are the averages of 10 runs with randomly selected seeds. [000145] Entity Assignment
[000146] Table 3 gives the experimental results for entity assignment, including the results of two baseline methods. ED denotes the proposed technique. Below, the columns are explained one by one, and the results are discussed. Two sets of experiments were conducted. The first set is denoted by "Next Sentences" in Table 3; "Next Sentences" means that only the comparative sentences and their subsequent sentences are considered. This set of experiments thus shows how effective the ED technique is at its intended task. The second set of experiments is denoted by "All Sentences", which considers all sentences. It shows how the ED method affects the overall implicit entity assignment task.
[000147] Column 1 (Baseline1 - next sentences): Baseline1 works as follows: if a sentence does not mention any product name, it simply takes the last product of the previous sentence. Note that the product of the previous sentence can be inherited from its previous sentence, and so on. The accuracy measure is used here because it gauges how accurate the assignments of products to sentences are. [000148] Column 2 (Baseline2 - next sentences): In the Baseline2 method, if a sentence does not mention a product name, it simply takes the first product of the previous sentence. One can observe that Baseline2 is always more accurate than Baseline1 because in most cases the first product is the superior product in a comparative sentence, and the next sentence also tends to talk about that product. [000149] Column 3 (ED (k-com) - next sentences): It gives the result for each data set using the proposed ED method assuming that the comparative and superlative sentences are known; k-com denotes this assumption.
[000150] Column 4 (ED (unk-com) - next sentences): It gives the result for each data set using the proposed ED technique assuming that the comparative and superlative sentences are unknown; unk-com denotes this fact. This is the realistic situation, in which the system has to detect comparative and superlative sentences automatically using the method in Section 6.4. One can observe that the ED method outperforms the two baseline methods dramatically, i.e., on average from the best accuracy of the baselines, 82.1%, to an accuracy of 89.9% in the realistic situation of not knowing the comparatives. Knowing the comparative sentences (k-com) performs only slightly better than not knowing them. Note that accuracy here means the total number of sentences that have been correctly assigned products divided by the total number of sentences that need such assignments.
[000151] Columns 5-8 (all sentences): These results correspond to those in columns 1-4 except that all sentences are used in the experiments. In this case, the disclosed method assigns products to every sentence rather than only to the sentence after each comparative and superlative sentence. Again, one can observe substantial improvements, i.e., on average from the best of the baselines, 80.0%, to 85.9% in the realistic situation of not knowing the comparatives. In this case, ED improves slightly less because comparative sentences are only a small proportion of all sentences. The results are lower than those in columns 1-4 because of propagation: if the identification in one sentence is wrong, the implicit entity in the next sentence can be expected to be wrong as well, and so on.
[000152] Columns 9-11: They give the precision, recall and F-score for each data set on the task of identifying comparative and superlative sentences. The average result (F = 84.1%) is better than that given in [12], i.e., an average F of 79%. [000153] In summary, the experimental results demonstrate the effectiveness of the ED method.
[000154] It would be evident to an artisan with ordinary skill in the art that the aforementioned embodiments of the disclosed method can be modified, reduced, or enhanced without departing from the scope and spirit of the claims described below. Accordingly, the reader is directed to the claims for a fuller understanding of the breadth and scope of the present disclosure.
[000155] FIG. 3 depicts an exemplary diagrammatic representation of a machine in the form of a computer system 300 within which a set of instructions, when executed, may cause the machine to perform any one or more of the methodologies discussed above. In some embodiments, the machine operates as a standalone device. In some embodiments, the machine may be connected (e.g., using a network) to other machines. In a networked deployment, the machine may operate in the capacity of a server or a client user machine in a server-client user network environment, or as a peer machine in a peer-to-peer (or distributed) network environment. [000156] The machine may comprise a server computer, a client user computer, a personal computer (PC), a tablet PC, a laptop computer, a desktop computer, a control system, a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) that specify actions to be taken by that machine. It will be understood that a device of the present disclosure includes broadly any electronic device that provides voice, video or data communication. Further, while a single machine is illustrated, the term "machine" shall also be taken to include any collection of machines that individually or jointly execute a set (or multiple sets) of instructions to perform any one or more of the methodologies discussed herein. [000157] The computer system 300 may include a processor 302 (e.g., a central processing unit (CPU), a graphics processing unit (GPU), or both), a main memory 304 and a static memory 306, which communicate with each other via a bus 308. The computer system 300 may further include a video display unit 310 (e.g., a liquid crystal display (LCD), a flat panel, a solid state display, or a cathode ray tube (CRT)).
The computer system 300 may include an input device 312 (e.g., a keyboard), a cursor control device 314 (e.g., a mouse), a disk drive unit 316, a signal generation device 318 (e.g., a speaker or remote control) and a network interface device 320.

[000158] The disk drive unit 316 may include a machine-readable medium 322 on which is stored one or more sets of instructions (e.g., software 324) embodying any one or more of the methodologies or functions described herein, including those methods illustrated above. The instructions 324 may also reside, completely or at least partially, within the main memory 304, the static memory 306, and/or within the processor 302 during execution thereof by the computer system 300. The main memory 304 and the processor 302 also may constitute machine-readable media.

[000159] Dedicated hardware implementations including, but not limited to, application specific integrated circuits, programmable logic arrays and other hardware devices can likewise be constructed to implement the methods described herein. Applications that may include the apparatus and systems of various embodiments broadly include a variety of electronic and computer systems. Some embodiments implement functions in two or more specific interconnected hardware modules or devices with related control and data signals communicated between and through the modules, or as portions of an application-specific integrated circuit. Thus, the example system is applicable to software, firmware, and hardware implementations.

[000160] In accordance with various embodiments of the present disclosure, the methods described herein are intended for operation as software programs running on a computer processor. Furthermore, software implementations can include, but are not limited to, distributed processing or component/object distributed processing, parallel processing, or virtual machine processing, which can also be constructed to implement the methods described herein.
[000161] The present disclosure contemplates a machine readable medium containing instructions 324, or that which receives and executes instructions 324 from a propagated signal so that a device connected to a network environment 326 can send or receive voice, video or data, and to communicate over the network 326 using the instructions 324. The instructions 324 may further be transmitted or received over a network 326 via the network interface device 320.
[000162] While the machine-readable medium 322 is shown in an example embodiment to be a single medium, the term "machine-readable medium" should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions. The term "machine-readable medium" shall also be taken to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the present disclosure.
[000163] The term "machine-readable medium" shall accordingly be taken to include, but not be limited to: solid-state memories such as a memory card or other package that houses one or more read-only (non-volatile) memories, random access memories, or other re-writable (volatile) memories; magneto-optical or optical media such as a disk or tape; and/or a digital file attachment to e-mail or other self-contained information archive or set of archives, which is considered a distribution medium equivalent to a tangible storage medium. Accordingly, the disclosure is considered to include any one or more of a machine-readable medium or a distribution medium, as listed herein and including art-recognized equivalents and successor media, in which the software implementations herein are stored.
[000164] Although the present specification describes components and functions implemented in the embodiments with reference to particular standards and protocols, the disclosure is not limited to such standards and protocols. Each of the standards for Internet and other packet switched network transmission (e.g., TCP/IP, UDP/IP, HTML, HTTP) represent examples of the state of the art. Such standards are periodically superseded by faster or more efficient equivalents having essentially the same functions. Accordingly, replacement standards and protocols having the same functions are considered equivalents.
[000165] The illustrations of embodiments described herein are intended to provide a general understanding of the structure of various embodiments, and they are not intended to serve as a complete description of all the elements and features of apparatus and systems that might make use of the structures described herein. Many other embodiments will be apparent to those of skill in the art upon reviewing the above description. Other embodiments may be utilized and derived therefrom, such that structural and logical substitutions and changes may be made without departing from the scope of this disclosure. Figures are also merely representational and may not be drawn to scale. Certain proportions thereof may be exaggerated, while others may be minimized. Accordingly, the specification and drawings are to be regarded in an illustrative rather than a restrictive sense.
[000166] Such embodiments of the inventive subject matter may be referred to herein, individually and/or collectively, by the term "invention" merely for convenience and without intending to voluntarily limit the scope of this application to any single invention or inventive concept if more than one is in fact disclosed. Thus, although specific embodiments have been illustrated and described herein, it should be appreciated that any arrangement calculated to achieve the same purpose may be substituted for the specific embodiments shown. This disclosure is intended to cover any and all adaptations or variations of various embodiments. Combinations of the above embodiments, and other embodiments not specifically described herein, will be apparent to those of skill in the art upon reviewing the above description.

[000167] The Abstract of the Disclosure is provided to comply with 37 C.F.R. §1.72(b), requiring an abstract that will allow the reader to quickly ascertain the nature of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. In addition, in the foregoing Detailed Description, it can be seen that various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments require more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive subject matter lies in less than all features of a single disclosed embodiment. Thus the following claims are hereby incorporated into the Detailed Description, with each claim standing on its own as a separately claimed subject matter.

References
[1] Bos, J., and Nissim, M. An Empirical Approach to the Interpretation of Superlatives. EMNLP'06, 2006.
[2] Bunescu, R., Ge, R., Kate, R., Marcotte, E., Mooney, R., Ramani, A., and Wong, Y. Comparative Experiments on Learning Information Extractors for Proteins and Their Interactions. Artificial Intelligence in Medicine, 33(2), 2005.
[3] Denis, P., and Baldridge, J. A Ranking Approach to Pronoun Resolution. IJCAI'07, 2007.
[4] Ding, X., Liu, B., and Yu, P. A Holistic Lexicon-Based Approach to Opinion Mining. WSDM'08, 2008.
[5] Esuli, A., and Sebastiani, F. Determining Term Subjectivity and Term Orientation for Opinion Mining. EACL'06, 2006.
[6] Feldman, R., Fresko, M., Goldenberg, J., Netzer, O., and Ungar, L. Extracting Product Comparisons from Discussion Boards. ICDM'07, 2007.
[7] Fiszman, M., Demner-Fushman, D., Lang, F., Goetz, P., and Rindflesch, T. Interpreting Comparative Constructions in Biomedical Text. BioNLP, 2007.
[8] Ganapathibhotla, M., and Liu, B. Mining Opinions in Comparative Sentences. COLING'08, 2008.
[9] Hu, M., and Liu, B. Mining and Summarizing Customer Reviews. KDD'04, pp. 168-177, 2004.
[10] Hatzivassiloglou, V., and McKeown, K. Predicting the Semantic Orientation of Adjectives. ACL-EACL'97, 1997.
[11] Jindal, N., and Liu, B. Mining Comparative Sentences and Relations. AAAI'06, 2006.
[12] Jindal, N., and Liu, B. Identifying Comparative Sentences in Text Documents. SIGIR'06, 2006.
[13] Kaji, N., and Kitsuregawa, M. Building Lexicon for Sentiment Analysis from Massive Collection of HTML Documents. EMNLP'07, 2007.
[14] Kanayama, H., and Nasukawa, T. Fully Automatic Lexicon Expansion for Domain-Oriented Sentiment Analysis. EMNLP'06, 2006.
[15] Kim, S., and Hovy, E. Determining the Sentiment of Opinions. COLING'04, 2004.
[16] Lafferty, J., McCallum, A., and Pereira, F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. ICML'01, 2001.
[17] Liu, B. Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer, 2006.
[18] Liu, B., Hu, M., and Cheng, J. Opinion Observer: Analyzing and Comparing Opinions on the Web. WWW'05, 2005.
[19] Pang, B., Lee, L., and Vaithyanathan, S. Thumbs up? Sentiment Classification Using Machine Learning Techniques. EMNLP'02, 2002.
[20] Popescu, A.-M., and Etzioni, O. Extracting Product Features and Opinions from Reviews. EMNLP'05, 2005.
[21] Riloff, E., and Wiebe, J. Learning Extraction Patterns for Subjective Expressions. EMNLP'03, 2003.
[22] Turney, P. Thumbs Up or Thumbs Down? Semantic Orientation Applied to Unsupervised Classification of Reviews. ACL'02, 2002.
[23] Wilson, T., Wiebe, J., and Hwa, R. Just How Mad Are You? Finding Strong and Weak Opinion Clauses. AAAI'04, 2004.
[24] Yang, X., Su, J., and Tan, C.L. Improving Pronoun Resolution Using Statistics-Based Semantic Compatibility Information. ACL'05, 2005.
[25] Yang, X., Zhou, G., Su, J., and Tan, C.L. Coreference Resolution Using Competitive Learning Approach. ACL'03, 2003.
[26] http://crf.sourceforge.net/
[27] http://l2r.cs.uiuc.edu/~cogcomp/asoftware.php?skey=NE

Claims

What is claimed is:
1. A method, comprising:
identifying a plurality of entities in opinionated text generated by a plurality of users, each user expressing one or more opinions about at least one of the plurality of entities;
identifying a plurality of comparative sentences and a plurality of non-comparative sentences in the opinionated text;
identifying inferior and superior entities from the plurality of entities according to a plurality of comparative opinions determined from the plurality of comparative sentences;
determining a semantic orientation for each of the plurality of non-comparative sentences; and
assigning at least a portion of the superior and inferior entities to one of the plurality of comparative sentences and the plurality of non-comparative sentences according to the determined semantic orientation of the plurality of non-comparative sentences, the plurality of comparative opinions, and sentiment consistency between consecutive sentences in the opinionated text.
2. The method of claim 1, wherein each entity comprises one of a person, a product, a service, an event, an organization, and a topic, and wherein a superior entity is preferred over an inferior entity.
3. The method of claim 1, comprising identifying each of the plurality of entities as an explicit entity or an implicit entity.
4. The method of claim 1, comprising:
receiving a list of entities; and
identifying at least a portion of the plurality of entities according to the list of entities.
5. The method of claim 1, comprising classifying the comparative sentences as one of non-equal gradable sentences, equative sentences, non-gradable sentences, and superlative sentences, and non-comparative sentences in the opinionated text as normal.
6. The method of claim 1, comprising identifying a sentiment consistency between the identified comparative and non-comparative sentences.
7. The method of claim 1, comprising identifying sentiment indicators from sentences in the opinionated text.
8. The method of claim 7, comprising identifying the one or more semantic orientations according to at least a portion of the identified sentiment indicators.
9. The method of claim 7, wherein the identified sentiment indicators comprise at least one of words, phrases, negations and but-clauses.
10. The method of claim 7, comprising applying a specification language to each of the identified sentiment indicators.
11. The method of claim 10, wherein the specification language comprises grammatical rules.
12. The method of claim 11, wherein the grammatical rules comprise at least two of an indicator symbol, a word, a distance and a target.
13. The method of claim 12, wherein the indicator symbol comprises at least one of a positive indicator symbol, a negative indicator symbol, a neutral indicator symbol, a negation indicator symbol and a but indicator symbol, wherein the word comprises any word with an option type, wherein the distance indicates a gap of words, and wherein the target indicates which word a grammatical rule applies to.
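By way of illustration only, and not as part of the claims, the rule format recited in claims 12 and 13 (an indicator symbol, a word, a distance, and a target) can be sketched in code. The rule tuples, trigger words, lexicon, and matching policy below are assumptions of this sketch, not the claimed specification language:

```python
# Hypothetical encoding of a grammatical rule per claims 12-13: an
# indicator symbol (e.g., NG for negation, BUT for a but-clause), a
# trigger word, a distance (maximum gap in words), and a target note.
RULES = [
    ("NG", "not", 3, "flips the orientation of a word within the gap"),
    ("BUT", "but", 0, "starts a clause whose orientation dominates"),
]

POSITIVE = {"good", "great"}  # toy lexicon for this sketch only

def apply_negation(tokens, rule):
    """Return sentence orientation (+1, -1, or 0), flipping polarity when
    the negation trigger appears within `gap` words before a positive word."""
    symbol, trigger, gap, _target = rule
    score = 0
    for i, tok in enumerate(tokens):
        if tok in POSITIVE:
            window = tokens[max(0, i - gap):i]  # the allowed gap of words
            score += -1 if trigger in window else +1
    return (score > 0) - (score < 0)

negated = apply_negation("this camera is not very good".split(), RULES[0])
plain = apply_negation("this camera is good".split(), RULES[0])
```

Here the negated sentence yields a negative orientation and the plain sentence a positive one, showing how a distance-bounded negation indicator reverses the polarity assigned by the lexicon.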
14. A computer-readable storage medium, comprising computer instructions to:
identify a plurality of entities in opinionated text;
identify a plurality of comparative sentences and a plurality of non-comparative sentences in the opinionated text;
identify inferior and superior entities from the plurality of entities according to a plurality of comparative opinions determined from the plurality of comparative sentences; and
assign at least a portion of the superior and inferior entities to one of the plurality of comparative sentences and the plurality of non-comparative sentences according to the plurality of comparative opinions, sentiment consistency between consecutive sentences in the opinionated text, and a semantic orientation of the plurality of non-comparative sentences.
15. The storage medium of claim 14, wherein each entity comprises one of a person, a product, a service, an event, an organization, and a topic, and wherein the storage medium comprises computer instructions to identify each of the plurality of entities as an explicit entity or an implicit entity.
16. The storage medium of claim 14, comprising computer instructions to:
receive a list of entities; and
identify at least a portion of the plurality of entities according to the list of entities.
17. The storage medium of claim 14, comprising computer instructions to classify the comparative sentences as one of non-equal gradable sentences, equative sentences, non-gradable sentences, and superlative sentences.
18. The storage medium of claim 14, comprising computer instructions to:
identify a sentiment consistency between the identified comparative and non-comparative sentences; and
determine a semantic orientation for each of the plurality of non-comparative sentences.
19. The storage medium of claim 18, comprising computer instructions to:
identify sentiment indicators from sentences in the opinionated text;
identify the one or more semantic orientations according to at least a portion of the identified sentiment indicators, wherein the identified sentiment indicators comprise at least one of words, phrases, negations and but-clauses;
apply a specification language to each of the identified sentiment indicators, wherein the specification language comprises grammatical rules, wherein the grammatical rules comprise at least two of an indicator symbol, a word, a distance and a target, wherein the indicator symbol comprises at least one of a positive indicator symbol, a negative indicator symbol, a neutral indicator symbol, a negation indicator symbol and a but indicator symbol, wherein the word comprises any word with an option type, wherein the distance indicates a gap of words, and wherein the target indicates which word a grammatical rule applies to.
20. An evaluation system, comprising a controller to:
identify a plurality of entities in opinionated text;
identify inferior and superior entities from the plurality of entities according to a plurality of comparative opinions determined from a plurality of comparative sentences in the opinionated text; and
assign at least a portion of the superior and inferior entities to one of the plurality of comparative sentences and a plurality of non-comparative sentences of the opinionated text according to the plurality of comparative opinions, sentiment consistency between consecutive sentences in the opinionated text, and a semantic orientation of the plurality of non-comparative sentences.
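For illustration only, the method of claim 1 can be sketched as a minimal pipeline. The lexicons, comparative cues, and the entity-propagation policy below are simplifying assumptions of this sketch; the claimed method is not limited to them:

```python
# Hypothetical sketch of claim 1: split opinionated text into comparative
# and non-comparative sentences, derive superior/inferior entities from
# comparative opinions, and carry the last-mentioned entity forward as a
# crude form of sentiment consistency between consecutive sentences.
COMP_POS = {"better", "superior"}   # entity before "than" is preferred
COMP_NEG = {"worse", "inferior"}    # entity before "than" is dispreferred
POSITIVE = {"great", "good", "excellent"}
NEGATIVE = {"bad", "poor", "terrible"}

def tokenize(sentence):
    return sentence.lower().replace(",", " ").replace(".", " ").split()

def mentions(tokens, entities):
    """Entities named in the sentence, in order of appearance."""
    hits = [(tokens.index(e.lower()), e) for e in entities if e.lower() in tokens]
    return [e for _, e in sorted(hits)]

def analyze(sentences, entities):
    results, last = [], None
    for s in sentences:
        tokens = tokenize(s)
        named = mentions(tokens, entities)
        ents = named or ([last] if last else [])  # sentiment consistency
        if ents and any(t in COMP_POS | COMP_NEG for t in tokens):
            first_wins = any(t in COMP_POS for t in tokens)
            superior = ents[0] if first_wins else ents[-1]
            inferior = ents[-1] if first_wins else ents[0]
            results.append(("comparative", superior, inferior))
        else:
            score = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)
            orient = "positive" if score > 0 else "negative" if score < 0 else "neutral"
            results.append(("non-comparative", ents[0] if ents else None, orient))
        if named:
            last = named[0]
    return results

reviews = ["The Canon is excellent.",
           "The Nikon is worse than the Canon.",
           "It takes terrible pictures."]
labels = analyze(reviews, ["Canon", "Nikon"])
```

In this run the second sentence yields Canon as the superior and Nikon as the inferior entity, and the third sentence, which names no entity, inherits Nikon from the preceding sentence.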
PCT/US2009/044197 2009-05-15 2009-05-15 System and methods for sentiment analysis Ceased WO2010132062A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2009/044197 WO2010132062A1 (en) 2009-05-15 2009-05-15 System and methods for sentiment analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2009/044197 WO2010132062A1 (en) 2009-05-15 2009-05-15 System and methods for sentiment analysis

Publications (1)

Publication Number Publication Date
WO2010132062A1 true WO2010132062A1 (en) 2010-11-18

Family

ID=40962445

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2009/044197 Ceased WO2010132062A1 (en) 2009-05-15 2009-05-15 System and methods for sentiment analysis

Country Status (1)

Country Link
WO (1) WO2010132062A1 (en)


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
BING LIU: "Sentiment Analysis and Subjectivity", 24 August 2009 (2009-08-24), XP002542660, Retrieved from the Internet <URL:http://www.cs.uic.edu/~liub/FBS/NLP-handbook-sentiment-analysis.pdf> [retrieved on 20090824] *
MURTHY GANAPATHIBHOTLA AND BING LIU: "Mining Opinions in Comparative Sentences", PROCEEDINGS OF THE 22ND INTERNATIONAL CONFERENCE ON COMPUTATIONAL LINGUISTICS (COLING-2008), 18 August 2008 (2008-08-18) - 22 August 2008 (2008-08-22), Manchester, UK, pages 241 - 248, XP002542647 *
NITIN JINDAL AND BING LIU: "Mining Comparative Sentences and Relations", PROCEEDINGS OF 21ST NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE (AAAI-2006), 16 July 2006 (2006-07-16) - 20 July 2006 (2006-07-20), Boston, Massachusetts, USA, pages 1331 - 1336, XP002542644 *
TETSUYA NASUKAWA AND JEONGHEE YI: "Sentiment analysis: capturing favorability using natural language processing", PROCEEDINGS OF THE 2ND INTERNATIONAL CONFERENCE ON KNOWLEDGE CAPTURE, 23 October 2003 (2003-10-23) - 25 October 2003 (2003-10-25), Sanibel Island, FL, USA, pages 70 - 77, XP002542648 *
XIAOWEN DING, BING LIU AND LEI ZHANG: "Entity discovery and assignment for opinion mining applications", PROCEEDINGS OF THE 15TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 28 June 2009 (2009-06-28) - 1 July 2009 (2009-07-01), Paris, France, pages 1125 - 1133, XP002542645 *
XIAOWEN DING, BING LIU AND PHILIP S. YU: "A holistic lexicon-based approach to opinion mining", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON WEB SEARCH AND WEB DATA MINING, 11 February 2008 (2008-02-11), pages 231 - 239, XP002542646 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013168043A (en) * 2012-02-16 2013-08-29 Nec Corp Complaint extracting device, complaint extracting method, and complaint extracting program
WO2016066228A1 (en) * 2014-10-31 2016-05-06 Longsand Limited Focused sentiment classification
CN112446202A (en) * 2019-08-16 2021-03-05 阿里巴巴集团控股有限公司 Text analysis method and device
CN112199956A (en) * 2020-11-02 2021-01-08 天津大学 Entity emotion analysis method based on deep representation learning
CN113298365A (en) * 2021-05-12 2021-08-24 北京信息科技大学 LSTM-based cultural additional value assessment method
CN113298365B (en) * 2021-05-12 2023-12-01 北京信息科技大学 Cultural additional value assessment method based on LSTM
CN117973946A (en) * 2024-03-29 2024-05-03 云南与同加科技有限公司 A teaching-oriented data processing method and system
CN119783675A (en) * 2025-03-10 2025-04-08 北京博大网信股份有限公司 Internet online common sense extraction method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Ding et al. Entity discovery and assignment for opinion mining applications
US11379512B2 (en) Sentiment-based classification of media content
US9948595B2 (en) Methods and apparatus for inserting content into conversations in on-line and digital environments
US20220138404A1 (en) Browsing images via mined hyperlinked text snippets
US9626622B2 (en) Training a question/answer system using answer keys based on forum content
US9720904B2 (en) Generating training data for disambiguation
US20130060769A1 (en) System and method for identifying social media interactions
US20130311485A1 (en) Method and system relating to sentiment analysis of electronic content
AU2018383346A1 (en) Domain-specific natural language understanding of customer intent in self-help
WO2018149115A1 (en) Method and apparatus for providing search results
US10740406B2 (en) Matching of an input document to documents in a document collection
US20160179966A1 (en) Method and system for generating augmented product specifications
US9811515B2 (en) Annotating posts in a forum thread with improved data
Castellanos et al. LCI: a social channel analysis platform for live customer intelligence
WO2010132062A1 (en) System and methods for sentiment analysis
CN109960721A (en) Multi-compressed construct content based on source content
US20180150450A1 (en) Comment-centered news reader
US20090327877A1 (en) System and method for disambiguating text labeling content objects
CN107798622A (en) A kind of method and apparatus for identifying user view
Nigam et al. Towards a robust metric of polarity
US20230112385A1 (en) Method of obtaining event information, electronic device, and storage medium
CN112148988B (en) Method, apparatus, device and storage medium for generating information
US9305103B2 (en) Method or system for semantic categorization
CN111368036B (en) Method and device for searching information
CN111310016B (en) Label mining method, device, server and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 09789688

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 09789688

Country of ref document: EP

Kind code of ref document: A1